WO2025240193A1 - Signal synthesis using dynamic decoder complexity scaling based on device load - Google Patents

Signal synthesis using dynamic decoder complexity scaling based on device load

Info

Publication number
WO2025240193A1
Authority
WO
WIPO (PCT)
Prior art keywords
decoder
neural network
complexity
audio
neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/028254
Other languages
French (fr)
Inventor
Zisis Iason Skordilis
Daniel Jared Sinder
Vivek Rajendran
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Inc

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/22 Mode decision, i.e. based on audio signal content versus external parameters

Definitions

  • Audio coding (also referred to as voice coding and/or speech coding) is a technique used to represent a digitized audio signal using as few bits as possible (thus compressing the speech data), while attempting to maintain a certain level of audio quality.
  • An audio or voice encoder is used to encode (or compress) the digitized audio (e.g., speech, music, etc.) signal to a lower bit-rate stream of data.
  • the lower bit-rate stream of data can be input to an audio or voice decoder, which decodes the stream of data and constructs an approximation or reconstruction of the original signal.
  • the audio or voice encoder-decoder structure can be referred to as an audio coder (or voice coder or speech coder) or an audio/voice/speech coder-decoder (codec).
  • Audio coders exploit the fact that speech signals are highly correlated waveforms.
  • Some speech coding techniques are based on a source-filter model of speech production, which assumes that the vocal cords are the source of spectrally flat sound (an excitation signal), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech.
  • the different phonemes (e.g., vowels, fricatives, and voiced fricatives)
  • source excitation
  • spectral shape filter
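
The source-filter model described above can be illustrated with a minimal sketch. It assumes the spectral shaping filter is an all-pole (linear-prediction style) filter, which is one common realization and is not stated in the text above; the function name `all_pole_filter` and the coefficient values are hypothetical.

```python
def all_pole_filter(excitation, coeffs):
    """Shape an excitation signal with an all-pole filter.

    Implements s[n] = e[n] + sum_k coeffs[k-1] * s[n-k], where e is the
    excitation (the "source") and coeffs describe the "filter".
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


# A unit impulse is a spectrally flat excitation (hypothetical example input).
impulse = [1.0, 0.0, 0.0, 0.0]
print(all_pole_filter(impulse, [0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

The output is the impulse response of the filter: the flat excitation has been given a spectral shape by the filter coefficients.
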
  • the apparatus includes at least one memory and at least one processor coupled to the at least one memory and configured to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.
  • a method for processing audio data includes: obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of a decoder; determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models each having a respective complexity; processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and outputting the decoded audio frame within the configured processing time completion limit.
  • a non-transitory computer-readable medium includes instructions that, when executed by at least one processor, cause the at least one processor to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.
  • an apparatus for processing audio data includes: means for obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; means for obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; means for determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; means for processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and means for outputting the decoded audio frame within the configured processing time completion limit.
  • aspects may be implemented via integrated chip implementations or other non-module-component based devices (e.g., end-user devices, vehicles, communication devices, computing devices, industrial equipment, retail/purchasing devices, medical devices, and/or artificial intelligence devices).
  • aspects may be implemented in chip-level components, modular components, non-modular components, non-chip-level components, device-level components, and/or system-level components.
  • Devices incorporating described aspects and features may include additional components and features for implementation and practice of claimed and described aspects.
  • transmission and reception of wireless signals may include one or more components for analog and digital purposes (e.g., hardware components including antennas, radio frequency (RF) chains, power amplifiers, modulators, buffers, processors, interleavers, adders, and/or summers).
  • RF radio frequency
  • aspects described herein may be practiced in a wide variety of devices, components, systems, distributed arrangements, and/or end-user devices of varying size, shape, and constitution.
  • Other objects and advantages associated with the aspects disclosed herein will be apparent to those skilled in the art based on the accompanying drawings and detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. (Qualcomm Docket No. 2400178WO)
  • FIG. 1 is a block diagram illustrating an example speech processing system, in accordance with some examples.
  • FIG. 2A is a block diagram illustrating an example feature generator, in accordance with some examples.
  • FIG. 2B is a block diagram illustrating an example of a voice coding system, in accordance with some examples.
  • FIG. 3 is a block diagram illustrating an example of a voice coding signal synthesis system utilizing a linear time-varying filter generated using a neural network model and a separate linear predictive coding (LPC) filter, in accordance with some examples.
  • LPC linear predictive coding
  • FIG. 4 is a block diagram illustrating components of a user equipment (UE), in accordance with some examples.
  • FIG. 5 is a block diagram illustrating an example of a dynamic decoder complexity switching system that may be used with a decoder of a signal synthesis system, in accordance with some examples.
  • FIG. 6A is a diagram illustrating a plurality of example decoder architectures associated with a respective synthesis complexity, in accordance with some examples.
  • FIG. 6B is a diagram illustrating example decoder architectures each associated with a respective synthesis complexity, in accordance with some examples.
  • FIG. 7 is a diagram illustrating an example of dynamic decoder complexity adjustment based on determining an elapsed execution time for the current frame after processing by one or more subsets of a plurality of decoder layers, in accordance with some examples.
  • FIG. 8 is a block diagram illustrating an example of an audio codec system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer, in accordance with some examples.
  • FIG. 9 is a block diagram illustrating an example of a decoder of a voice coding signal synthesis system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer comprising a neural homomorphic vocoder (NHV), in accordance with some examples.
  • FIG.10 is a block diagram illustrating an example of a decoder of a voice coding signal synthesis system that can be used to generate reconstructed audio (e.g., synthesized
  • FIG. 11 is a flow chart illustrating an example of a process for processing one or more audio samples, in accordance with some examples.
  • FIG. 12 is a block diagram illustrating an example of a computing system for implementing certain aspects described herein.
  • DETAILED DESCRIPTION
  • Certain aspects of this disclosure are provided below for illustration purposes. Alternate aspects may be devised without departing from the scope of the disclosure. Additionally, well-known elements of the disclosure will not be described in detail or will be omitted so as not to obscure the relevant details of the disclosure. Some of the aspects described herein may be applied independently and some of them may be applied in combination as would be apparent to those of skill in the art.
  • Audio coding (e.g., speech coding, music signal coding, or other type of audio coding) can be performed on a digitized audio signal (e.g., a speech signal) to compress the amount of data for storage, transmission, and/or various other uses.
  • a voice coding system can be an audio coding system that is used to perform audio coding on a speech signal that represents the speech (e.g., voiced words or sounds) uttered by a user of the system.
  • a voice coding system can also be referred to as a voice or speech coder or a voice coder-decoder (codec).
  • a voice coding system may include one or more encoders (e.g., voice encoders) and one or more decoders (e.g., voice decoders).
  • a voice encoder can be used to process an input speech signal.
  • Input speech signals may include a digitized speech signal generated from an analog speech signal from a given source, where the resulting digitized speech signal is a discrete-time speech signal with sample values (e.g., also referred to as samples or audio samples) that are also discretized.
  • the samples of an input speech signal can be divided into blocks of N samples each, where a block of N samples is referred to as a frame.
  • each frame can be 10-20 milliseconds (ms) in length.
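
As an illustration of the framing step, the following minimal sketch splits a sample sequence into frames of N samples. The 16 kHz sampling rate, the helper name `split_into_frames`, and the choice to drop a trailing partial frame are assumptions made for the example, not details from the text.

```python
def split_into_frames(samples, frame_len):
    """Split samples into consecutive frames of frame_len samples each.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_full = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_full)]


SAMPLE_RATE_HZ = 16_000  # assumed sampling frequency
FRAME_MS = 20            # frame duration within the 10-20 ms range above
N = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per frame

signal = list(range(1000))  # stand-in for digitized speech samples
frames = split_into_frames(signal, N)
print(N, len(frames))  # 320 3
```
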
  • a voice encoder can generate a compressed signal corresponding to the input, digitized speech signal.
  • the compressed signal generated by the voice encoder can include a lower bitrate stream of audio data that represents the speech signal using as few bits as possible, while attempting (e.g., by the voice encoder) to maintain a certain quality level for the speech.
  • the compressed signal generated by the voice encoder can have a lower bitrate than the uncompressed digitized speech signal provided as input to the voice encoder.
  • voice encoders may be used to reduce the bitrate of an input speech signal.
  • the bitrate of a signal is based on the sampling frequency (e.g., the sampling frequency of a microphone used to sample a user’s speech) and the number of bits per sample.
  • the bitrate of a signal can be equal to S*b, where S represents the sampling frequency and b represents the number of bits per sample.
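
A short worked example of the S*b relationship follows. The sampling frequency, bit depth, and frame duration are hypothetical values chosen for the arithmetic, and the function names are illustrative only.

```python
def bitrate_bps(sampling_hz, bits_per_sample):
    """Uncompressed bitrate S*b, in bits per second."""
    return sampling_hz * bits_per_sample


def bits_per_frame(sampling_hz, bits_per_sample, frame_ms):
    """Number of uncompressed bits carried by one frame of frame_ms milliseconds."""
    return bitrate_bps(sampling_hz, bits_per_sample) * frame_ms // 1000


# Assumed example: 16 kHz sampling, 16 bits per sample, 20 ms frames.
print(bitrate_bps(16_000, 16))         # 256000
print(bits_per_frame(16_000, 16, 20))  # 5120
```

A voice encoder aims to represent each such frame with far fewer bits than this uncompressed figure.
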
  • the output of a voice encoder may be a compressed speech signal.
  • the compressed speech signal can be stored and/or transmitted to and processed by a corresponding voice decoder.
  • the corresponding voice decoder can be a voice decoder that is associated with the voice encoder and/or a voice decoder that is configured to decode the compressed speech signals generated by the voice encoder.
  • a voice decoder may communicate with a voice encoder, for example to request speech data, to transmit feedback information, and/or to provide other communications to or from the voice encoder and voice decoder.
  • the voice decoder can be configured to decode the data of the compressed speech signal and construct a reconstructed speech signal that approximates the original speech signal.
  • the reconstructed speech signal may include a digitized, discrete-time signal with a same or similar bitrate as the bitrate of the original speech signal.
  • the voice decoder may implement an inverse of a voice coding algorithm used by the voice encoder.
  • a voice decoder may be included in a voice calling system and/or a voice call processing system.
  • a voice call processing system can be used to provide voice communication services between different users and/or different devices.
  • Voice call processing systems can be used for telephony (e.g., audio-only calling), including 4G LTE voice calls (e.g., Voice over LTE (VoLTE)), 5G NR voice calls (e.g., Voice over New Radio (VoNR)), Voice over Internet Protocol (VoIP) voice calls, etc.
  • VoIP Voice over Internet Protocol
  • Voice call processing systems may also be used for video calls, where the voice call processing system is used for the audio portion of the video call and may be separate from a video coding system used for the video portion of the video call.
  • a voice call processing system can be implemented as a real-time system that is configured to manage the end-to-end flow of voice data packets between communicating parties.
  • a voice call processing system can implement various protocols, algorithms, and/or audio processing techniques to enhance a voice call signal and/or to meet a real-time delivery requirement (e.g., a real-time processing requirement) for the voice data packets of the voice call.
  • Voice call processing can be a time-sensitive and/or time-constrained process.
  • a decoder may receive audio frames and/or encoded representations thereof from an encoder, with a periodicity that is the same as or similar to the periodicity of the input frames of audio data to the encoder.
  • an encoder may receive a stream of audio data (e.g., an input speech signal) comprising a plurality of 20ms frames.
  • the encoder may generate features or other encoded representations of a respective 20ms frame, and transmit the features or other encoded information to the decoder.
  • the decoder may receive the features or encoded information at approximately the same 20ms periodicity that is associated with the plurality of 20ms audio frames at the input of the encoder.
  • a decoder may receive an encoded representation of a first speech frame. 20ms later, the decoder may receive an encoded representation of a second speech frame. An additional 20ms later, the decoder may receive an encoded representation of a third speech frame, etc.
  • a real-time processing requirement for speech decoders and/or speech synthesizers that receive a stream of encoded audio or speech frames may correspond to the length (e.g., duration) of each frame and/or may correspond to the periodicity between consecutive frames of the stream.
  • the decoder should complete processing (e.g., decoding and/or synthesis processing) of a respective speech frame before the next speech frame is received. Processing of the next frame cannot begin until processing has completed for the current frame. Exceeding the real-time processing time limit for the current frame may cause the decoder to delay the beginning of processing of the next frame (e.g., the decoder starts processing the next frame late).
  • the real-time requirement may correspond to the decoder needing to finish decoding and/or synthesis processing within 20ms after reception of each frame. If decoding and/or synthesis processing for a respective frame takes longer than the frame duration (e.g., 20ms), delay may accumulate at the decoder and for the playback of the decoded or synthesized audio.
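
The way delay accumulates when a frame overruns its time limit can be sketched with a small simulation. It assumes frames arrive exactly every 20 ms and that each frame is due one frame duration after it arrives; the function name `deadline_overrun_ms` and the decode times are hypothetical.

```python
def deadline_overrun_ms(decode_times_ms, frame_ms=20.0):
    """Return how far past its deadline each frame finishes (0.0 if on time).

    Frame i arrives at i*frame_ms and is due by (i+1)*frame_ms. Decoding of a
    frame starts once it has arrived and the previous frame has finished.
    """
    overruns = []
    finish = 0.0
    for i, decode_ms in enumerate(decode_times_ms):
        arrival = i * frame_ms
        start = max(arrival, finish)  # a late previous frame delays this one
        finish = start + decode_ms
        deadline = arrival + frame_ms
        overruns.append(max(0.0, finish - deadline))
    return overruns


# Two slow frames (25 ms each) push later frames past their deadlines too.
print(deadline_overrun_ms([15, 25, 25, 15, 10]))  # [0.0, 5.0, 10.0, 5.0, 0.0]
```

In this example the fourth frame decodes in only 15 ms yet still finishes late, because it inherits the delay built up by the two preceding frames.
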
  • one or more machine learning systems, networks, models, etc. can be used to perform audio coding (e.g., encoding and/or decoding of one or more audio samples).
  • machine learning systems using one or more neural network models can be used to implement a voice coder to encode and/or decode voice data (e.g., speech signals, etc.).
  • machine learning systems e.g., using one or more neural network models
  • a neural network-based voice decoder can generate coefficients for one or more linear filters, learned filters, etc., that may be used to perform voice decoding.
  • a neural network-based voice decoder can be trained to estimate the parameters (e.g., linear filter coefficients, etc.) of a speech synthesis pipeline, where learned filters that are generated or tuned using a neural network-based voice decoder may subsequently be used to generate a reconstructed speech signal (e.g., a synthesized speech signal, a speech synthesis signal, etc.).
  • Neural network-based voice or speech decoders may also be referred to as neural decoders. Neural decoders can be more computationally complex to implement than non-neural or non-machine learning-based voice or speech decoders.
  • a neural decoder may be associated with a longer frame processing time (e.g., a longer frame decoding time) than a non-neural decoder, when the neural decoder and the non-neural decoder are implemented or executed on the same computational device, platform, hardware, etc.
  • the use of neural decoders with resource-constrained devices such as UEs, smartphones, mobile computing devices, wearables, etc., may also be associated with increased frame processing times by the neural decoder, including increased frame processing times beyond the real-time processing requirement associated with the frame length (e.g., 20ms speech frames, etc.).
  • Systems, apparatuses, methods (also referred to as processes), and computer-readable media are described herein that can be used to provide dynamic complexity scaling for a neural network-based decoder and/or signal synthesis engine configured to process one or more frames of data (e.g., audio data, voice data, speech data, video data, etc.).
  • the frames of data processed by the decoder can be associated with a real-time processing requirement (e.g., a real-time processing time limit) for the decoder to finish decoding and/or processing of a respective frame.
  • the one or more frames of data can be speech frames, where the real-time processing requirement is equal to the length of each speech frame (e.g., 20ms).
  • the systems and techniques can be used to provide dynamic decoder complexity scaling to finish processing of a respective frame of audio data within a configured processing time limit (e.g., the real-time processing requirement for the frame of audio data, the duration of each respective frame of audio data, etc.).
  • the dynamic decoder complexity scaling can be based on device load information corresponding to a UE or other computing device used to implement the neural network-based decoder.
  • the dynamic decoder complexity scaling can be based on measured runtime information corresponding to one or more previously decoded frames processed by the neural network-based decoder or signal synthesizer.
  • the measured runtime information can be indicative of an elapsed execution time for decoding or processing the one or more previously decoded frames.
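
One way to obtain such measured runtime information is sketched below, under stated assumptions: elapsed time is taken with Python's `time.perf_counter`, the history covers a fixed number of recent frames, and headroom is defined as the time limit minus the slowest recent frame. The names `timed_decode`, `RuntimeHistory`, and `toy_decoder` are hypothetical.

```python
import time
from collections import deque


def timed_decode(decode_fn, frame):
    """Run decode_fn(frame) and return (output, elapsed time in milliseconds)."""
    start = time.perf_counter()
    output = decode_fn(frame)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return output, elapsed_ms


class RuntimeHistory:
    """Keeps the elapsed times of the most recently decoded frames."""

    def __init__(self, max_frames=5):
        self.elapsed_ms = deque(maxlen=max_frames)

    def record(self, elapsed_ms):
        self.elapsed_ms.append(elapsed_ms)

    def headroom_ms(self, limit_ms=20.0):
        """Time limit minus the slowest recent frame; negative means a miss."""
        if not self.elapsed_ms:
            return limit_ms
        return limit_ms - max(self.elapsed_ms)


def toy_decoder(frame):
    """Hypothetical stand-in for a neural decoder."""
    return [0.5 * x for x in frame]


history = RuntimeHistory()
for frame in ([1.0, 2.0], [3.0, 4.0]):
    decoded, elapsed = timed_decode(toy_decoder, frame)
    history.record(elapsed)
print(history.headroom_ms())  # remaining margin against the 20 ms limit
```
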
  • Device load information can be used to dynamically reduce the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively high current device load state.
  • the device load information can be used to dynamically increase the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively low current device load state.
  • the device load information can be used to dynamically reduce or increase the decoder complexity so that the decoder remains within the configured time limit for processing (e.g., decoding, synthesizing, etc.) each respective frame.
  • the device load information can be used to dynamically reduce or increase the decoder complexity so that the decoder remains within a 20ms real-time processing requirement associated with a stream of 20ms speech frames provided to the decoder.
  • the device load information can be obtained and/or determined by a device resource monitor engine configured to monitor the hardware resource utilization of a UE or other computing device used to implement the neural network-based decoder.
  • the device load information can include a current central processing unit (CPU) usage or utilization percentage and/or can include a current usage or utilization percentage for various other processors included in the device.
  • the processor utilization information may be instantaneous utilization information corresponding to the current state of the CPU or other processor.
  • the processor utilization information can be an average utilization percentage over a configured time window.
  • the device load information can include or otherwise be indicative of an average CPU usage over the previous 5ms, 10ms, 15ms, etc.
  • the device load information can include a memory (e.g., random-access memory (RAM), processor cache, video RAM (VRAM), etc.) usage of the UE or other computing device used to implement the neural network-based decoder.
  • the memory usage information can be indicative of an instantaneous memory percentage in use, an average memory percentage used over a configured time window (e.g., average memory usage over the previous 5ms, 10ms, 15ms, etc.), and/or a combination thereof.
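
The distinction between instantaneous and windowed-average utilization can be sketched as follows. The class name `WindowedUtilization`, the sample timestamps, and the percentages are hypothetical; a real device would obtain its samples from a resource monitor rather than from hard-coded values.

```python
from collections import deque


class WindowedUtilization:
    """Tracks utilization samples and averages those inside a time window."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.samples = deque()  # (timestamp_ms, percent) pairs, oldest first

    def add(self, timestamp_ms, percent):
        self.samples.append((timestamp_ms, percent))
        # Drop samples that are window_ms or more older than the newest one.
        while self.samples and self.samples[0][0] <= timestamp_ms - self.window_ms:
            self.samples.popleft()

    def instantaneous(self):
        """Most recent sample, or None if nothing has been recorded."""
        return self.samples[-1][1] if self.samples else None

    def average(self):
        """Mean of the samples currently inside the window, or None if empty."""
        if not self.samples:
            return None
        return sum(p for _, p in self.samples) / len(self.samples)


cpu = WindowedUtilization(window_ms=10)
for t, pct in [(0, 40.0), (5, 60.0), (10, 80.0), (15, 100.0)]:
    cpu.add(t, pct)
print(cpu.instantaneous(), cpu.average())  # 100.0 90.0
```

The same structure could be used for memory usage percentages; only the source of the samples would differ.
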
  • the device load information can be obtained from a device resource monitor or device resource manager that is included in and/or implemented by the UE or other computing device that implements the neural network-based decoder.
  • the device load information can be obtained from a performance monitoring or profiling library included in and/or implemented by an operating system of the device that implements the decoder.
  • the systems and techniques may monitor CPU state information and/or CPU statistics information to determine a dynamic complexity scaling adjustment for processing or decoding one or more upcoming frames at the neural network-based decoder or other neural network-based signal synthesis engine.
  • the device load information can be monitored, collected, obtained, and/or analyzed in real-time by a decoder complexity switching engine configured to generate an output indicative of a scaling adjustment corresponding to increased decoder complexity, a scaling adjustment corresponding to decreased decoder complexity, and/or a scaling adjustment corresponding to the same decoder complexity.
  • the decoder complexity switching engine can be included in the neural network-based decoder, can be implemented by the neural network-based decoder, and/or can be associated with the neural network-based decoder.
  • a complexity scaling indication can be generated and/or determined by the decoder complexity switching engine and may be used to configure a switch coupled between an input bitstream received at the decoder and a plurality of different decoder model instances, where each decoder model instance of the plurality of decoder model instances is associated with a respective synthesis complexity.
  • each decoder model instance of the plurality of decoder model instances is associated with a different respective synthesis complexity.
  • a first subset of the plurality of decoder model instances may be associated with a relatively high synthesis complexity
  • a second subset of the plurality of decoder model instances may be associated with a relatively low synthesis complexity, etc.
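
The switching behavior described above (increase, decrease, or keep the decoder complexity, then route the bitstream to the matching model instance) might be sketched as below. The threshold values, the three-level model list, and the names `DecoderModel`, `scaling_adjustment`, and `select_model` are all assumptions for illustration; the text does not specify a decision rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderModel:
    name: str
    complexity: int  # higher means more computation per frame


# Hypothetical model instances, ordered from lowest to highest complexity.
MODELS = [DecoderModel("low", 1), DecoderModel("mid", 2), DecoderModel("high", 3)]


def scaling_adjustment(cpu_percent, high_load=80.0, low_load=40.0):
    """Return -1 (decrease), +1 (increase), or 0 (keep the same complexity)."""
    if cpu_percent >= high_load:
        return -1
    if cpu_percent <= low_load:
        return +1
    return 0


def select_model(current_index, cpu_percent, models=MODELS):
    """Apply the adjustment and clamp the index to the available models."""
    index = current_index + scaling_adjustment(cpu_percent)
    return max(0, min(len(models) - 1, index))


index = 1  # start at the middle complexity
chosen = []
for load in [30, 30, 90, 60, 95, 95]:
    index = select_model(index, load)
    chosen.append(MODELS[index].name)
print(chosen)  # ['high', 'high', 'mid', 'mid', 'low', 'low']
```

Stepping one level at a time is a design choice of this sketch; a switching engine could equally jump directly to any level that fits the time limit.
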
  • the different decoder model instances can implement different complexity (e.g., synthesis complexity, decoding complexity, processing or processing time complexity, etc.) based on the respective decoder model instances utilizing different neural network and/or machine learning architectures.
  • a first neural network architecture may be used to implement a first level of complexity
  • a second neural network architecture may be used to implement a second level of complexity, etc.
  • the different decoder model instances can implement different complexity based on the respective decoder model instances utilizing different neural network or machine learning weights or parameters. For example, a first set of neural network weights can be used to parameterize a neural network architecture to a first level of complexity, a second set of neural network weights can be used to parameterize a same or different neural network architecture to a second level of complexity, etc.
  • the different complexity decoder model instances may share one or more neural network layers and can scale in complexity based on combining the shared neural network layers with one or more complexity level-specific subsets of corresponding neural network layers.
  • a first complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the first complexity level.
  • a second complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the second complexity level, etc.
  • lower complexity decoders can be implemented as subsets of the plurality of neural network layers included in the highest complexity decoder model instance.
  • one or more neural network layers and/or subsets of neural network layers of the highest complexity decoder model instance can be deactivated to implement lower complexity level decoder instances.
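
The two layer-sharing arrangements described above can be sketched structurally. The "layers" here are toy list transformations rather than neural network layers, and every name and value is hypothetical; the sketch only shows how shared layers combine with level-specific subsets, and how deactivating layers of the full stack yields a lower-complexity path.

```python
from typing import Callable, Dict, List

Layer = Callable[[List[float]], List[float]]


def scale(k: float) -> Layer:
    return lambda xs: [k * x for x in xs]


def shift(b: float) -> Layer:
    return lambda xs: [x + b for x in xs]


# Arrangement 1: shared layers plus a complexity level-specific subset.
SHARED_LAYERS: List[Layer] = [scale(2.0)]
LEVEL_LAYERS: Dict[str, List[Layer]] = {
    "low": [shift(1.0)],
    "high": [shift(1.0), scale(0.5), shift(-1.0)],
}


def run_shared_plus_level(features: List[float], level: str) -> List[float]:
    x = list(features)
    for layer in SHARED_LAYERS + LEVEL_LAYERS[level]:
        x = layer(x)
    return x


# Arrangement 2: the highest-complexity stack with some layers deactivated.
FULL_STACK: List[Layer] = [scale(2.0), shift(1.0), scale(0.5), shift(-1.0)]


def run_with_deactivation(features: List[float], active: List[bool]) -> List[float]:
    x = list(features)
    for layer, is_active in zip(FULL_STACK, active):
        if is_active:
            x = layer(x)
    return x


print(run_shared_plus_level([1.0, 2.0], "low"))                        # [3.0, 5.0]
print(run_with_deactivation([1.0, 2.0], [True, True, False, False]))  # [3.0, 5.0]
print(run_with_deactivation([1.0, 2.0], [True, True, True, True]))    # [0.5, 1.5]
```
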
  • FIG.1 illustrates an example implementation of a system-on-a-chip (SoC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU, configured to perform one or more of the functions described herein.
  • SoC system-on-a-chip
  • Parameters or variables e.g., neural signals and synaptic weights
  • system parameters associated with a computational device e.g., neural network with weights
  • delays e.g., frequency bin information, task information, among other information
  • NPU neural processing unit
  • GPU graphics processing unit
  • DSP digital signal processor
  • Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from a memory block 118.
  • the SoC 100 may also include additional processing blocks tailored to specific functions, such as a GPU 104, a DSP 106, a connectivity block 110, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 112 that may, for example, detect and recognize gestures, speech, and/or other interactive user action(s) or input(s).
  • the NPU 108 is implemented in the CPU 102, DSP 106, and/or GPU 104.
  • the SoC 100 may also include a sensor processor 114, image signal processors (ISPs) 116, and/or a signal synthesis system 120.
  • ISPs image signal processors
  • the signal synthesis system 120 can be implemented or configured as a speech synthesis system, including a neural speech decoder and/or a neural homomorphic vocoder (NHV) system, which can be used to generate speech (e.g., perform speech synthesis), can be implemented in a text-to-speech (TTS) system, etc.
  • the sensor processor 114 can be associated with or connected to one or more sensors for providing sensor input(s) to sensor processor 114.
  • the one or more sensors and the sensor processor 114 can be provided in, coupled to, or otherwise associated with a same computing device.
  • the one or more sensors can include one or more microphones for receiving sound (e.g., an audio input), including sound or audio inputs that can be used to perform various speech synthesis tasks and/or to generate a reconstructed speech signal from the audio input, etc.
  • the sound or audio input received by the one or more microphones (and/or other sensors) may be digitized into data packets for analysis and/or transmission.
  • the audio input may include ambient sounds in the vicinity of a computing device associated with the SoC 100 and/or may include speech from a user of the computing device associated with the SoC 100.
  • a computing device associated with the SoC 100 can additionally, or alternatively, be communicatively coupled to one or more peripheral devices (not shown) and/or configured to communicate with one or more remote computing devices or external resources, for example using a wireless transceiver and a communication network, such as a cellular communication network.
  • SoC 100, DSP 106, NPU 108 and/or signal synthesis (e.g., neural speech decoder, NHV, etc.) system 120 may be configured to perform audio signal processing.
  • the signal synthesis system 120 may be configured to perform steps for speech synthesis and/or neural homomorphic vocoding, etc.
  • FIG. 2A depicts an example of a feature generator 200, in accordance with aspects of the present disclosure. It should be understood that many techniques may be used to generate feature vectors for an audio input and that feature generator 200 is just a single example of a technique that may be used to generate feature vectors.
  • Feature generator 200 receives an audio signal at signal pre-processor 202.
  • the audio signal may be from an audio source of an electronic device, such as a microphone.
  • Signal pre-processor 202 may perform various pre-processing steps on the received audio signal. For example, signal pre-processor 202 may split the audio signal into parallel audio signals and delay one of the signals by a predetermined amount of time to prepare the audio signals for input into a Fast-Fourier Transform (FFT) circuit.
  • FFT Fast-Fourier Transform
  • signal pre-processor 202 may perform a windowing function, such as a Hamming, Hann, Blackman-Harris, Kaiser-Bessel window function, or other sine-based window function, which may improve the performance of further processing stages, such as signal domain transformer 204.
  • a windowing (or window) function may be used to reduce the amplitude of discontinuities at the boundaries of each finite sequence of received audio signal data to improve further processing.
  • signal pre-processor 202 may convert the audio signal data from parallel to serial, or vice versa, for further processing.
  • the pre-processed audio signal from the signal pre-processor 202 may be provided to signal domain transformer 204, which may transform the pre-processed audio signal from a first domain into a second domain, such as from a time domain into a frequency domain.
  • signal domain transformer 204 implements a Fourier transform, such as a Fast-Fourier transform (FFT).
  • FFT Fast-Fourier transform
  • the Fast Fourier transform may be a 16-band (or bin, channel, or point) FFT, which generates a compact feature set that may be efficiently processed by a model.
  • a Fourier transform provides fine spectral domain information about the incoming audio signal as compared to conventional single channel processing, such as conventional hardware SNR threshold detection.
  • the result of signal domain transformer 204 is a set of audio features, such as a set of voltages, powers, or energies per frequency band in the transformed data.
  • the set of audio features may then be provided to signal feature filter 206, which may reduce the size of or compress the feature set in the audio feature data.
  • signal feature filter 206 may discard certain features from the audio feature set, such as symmetric or redundant features from multiple bands of a multi-band FFT. Discarding this data reduces the overall size of the data stream for further processing and may be referred to as compressing the data stream. For example, in some cases, a 16-band FFT may include 8 symmetric or redundant bands after the powers are squared because audio signals are real. Thus, signal feature filter 206 may filter out the redundant or symmetric band information and output an audio feature vector 208. In some cases, output of the signal feature filter may be compressed or otherwise processed prior to output as the audio feature vector 208.
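  • As an illustration only, the windowing, transform, and filtering stages described above can be sketched as follows. The function name `audio_feature_vector`, the choice of a Hann window, and the decision to keep bands 0 through 7 are assumptions made for this sketch and are not taken from the disclosure.

```python
import numpy as np

def audio_feature_vector(frame):
    """Sketch: window a frame, take an FFT, and keep only non-redundant band powers."""
    frame = np.asarray(frame, dtype=float)
    n = frame.size                       # 16 in the example from the text
    windowed = frame * np.hanning(n)     # reduce discontinuities at the frame edges
    spectrum = np.fft.fft(windowed)      # time domain -> frequency domain
    power = np.abs(spectrum) ** 2        # power per frequency band
    # For a real input, power[k] == power[n - k], so the upper bands mirror the lower ones.
    # Assumption: keep the first n // 2 bands (8 of 16) and discard the rest.
    return power[: n // 2]

# Hypothetical input: two cycles of a sine wave across a 16-sample frame.
frame = np.sin(2 * np.pi * 2 * np.arange(16) / 16)
features = audio_feature_vector(frame)
```

  • The symmetry that makes half of the bands redundant follows from the input being real-valued, which is why the filter stage can drop them without losing information.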
  • the audio feature vector 208 may be provided to a speech synthesis system (e.g., such as the signal synthesis system 120 of FIG.
  • Audio coding (e.g., speech coding, music signal coding, or other type of audio coding) can be performed on a digitized audio signal (e.g., a speech signal) to compress the amount of data for storage, transmission, and/or other use.
  • FIG. 2B is a block diagram illustrating an example of a voice coding system 250 (which can also be referred to as a voice or speech coder or a voice coder-decoder (codec)).
  • a voice encoder 252 of the voice coding system 250 can use a voice coding algorithm to process a speech signal 251.
  • the speech signal 251 can include a digitized speech signal generated from an analog speech signal from a given source.
  • the digitized speech signal can be generated using a filter to eliminate aliasing, a sampler to convert to discrete-time, and an analog-to-digital converter for converting the analog signal to the digital domain.
  • the resulting digitized speech signal (e.g., speech signal 251) is a discrete-time speech signal with sample values (referred to herein as samples) that are also discretized.
  • the voice encoder 252 can generate a compressed signal (including a lower bit-rate stream of data) that represents the speech signal 251 using as few bits as possible, while attempting to maintain a certain quality level for the speech.
  • the voice encoder 252 can use any suitable voice coding algorithm, such as a linear prediction coding algorithm (e.g., Code-excited linear prediction (CELP), algebraic-CELP (ACELP), or other linear prediction technique) or other voice coding algorithm.
  • the voice encoder 252 can compress the speech signal 251 in an attempt to reduce the bit-rate of the speech signal 251.
  • the bit-rate of a signal is based on the sampling frequency and the number of bits per sample. For instance, the bit-rate of a speech signal can be determined as $BR = S \cdot b$, where BR is the bit-rate, S is the sampling frequency, and b is the number of bits per sample.
  • for example, at a sampling frequency (S) of 8 kilohertz (kHz) and at 16 bits per sample (b), the bit-rate (BR) of a signal would be 128 kilobits per second (kbps).
  • S sampling frequency
  • b 16 bits per sample
  • BR bit-rate
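  • The bit-rate relationship above can be checked with a few lines of arithmetic. The function name `bit_rate` is hypothetical, and the 8 kHz sampling frequency is an assumed value chosen because it yields 128 kbps at 16 bits per sample.

```python
def bit_rate(sampling_hz, bits_per_sample):
    """Uncompressed bit-rate in bits per second: BR = S * b."""
    return sampling_hz * bits_per_sample

# Assumed example values: S = 8000 Hz, b = 16 bits per sample.
br_bps = bit_rate(8000, 16)
br_kbps = br_bps / 1000
```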
  • the compressed speech signal can then be stored and/or sent to and processed by a voice decoder 254.
  • the voice decoder 254 can communicate with the voice encoder 252, such as to request speech data, send feedback information, and/or provide other communications to the voice encoder 252.
  • the voice encoder 252 or a channel encoder can perform channel coding on the compressed speech signal before the compressed speech signal is sent to the voice decoder 254.
  • channel coding can provide error protection to the bitstream of the compressed speech signal to protect the bitstream from noise and/or interference that can occur during transmission on a communication channel.
  • the voice decoder 254 can decode the data of the compressed speech signal and construct a reconstructed speech signal 255 that approximates the original speech signal 251.
  • the reconstructed speech signal 255 includes a digitized, discrete-time signal that can have the same or similar bit-rate as that of the original speech signal 251.
  • the voice decoder 254 can use an inverse of the voice coding algorithm used by the voice encoder 252, which as noted above can include any suitable voice coding algorithm, such as a linear prediction coding algorithm (e.g., CELP, ACELP, or other suitable linear prediction technique) or other voice coding algorithm.
  • the reconstructed speech signal 255 can be converted to a continuous-time analog signal, such as by performing digital-to-analog conversion and anti-aliasing filtering.
  • Voice coders can exploit the fact that speech signals are highly correlated waveforms.
  • the samples of an input speech signal can be divided into blocks of N samples each, where a block of N samples is referred to as a frame.
  • each frame can be 10-20 milliseconds (ms) in length.
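  • A minimal sketch of the framing step follows. The helper name `split_into_frames`, the 8 kHz sampling rate, and the choice to drop a trailing partial frame are assumptions for illustration.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Divide samples into consecutive frames of N samples; drop a trailing partial frame."""
    n = int(sample_rate_hz * frame_ms / 1000)   # N samples per frame
    count = len(samples) // n
    return [samples[i * n:(i + 1) * n] for i in range(count)]

# Assumed values: 8 kHz sampling and 20 ms frames give N = 160 samples per frame.
frames = split_into_frames(list(range(1000)), 8000, 20)
```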
  • Various voice coding algorithms can be used to encode a speech signal.
  • code-excited linear prediction CELP
  • the CELP model is based on a source-filter model of speech production, which assumes that the vocal cords are the source of spectrally flat sound (an excitation signal), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech.
  • the different phonemes e.g., vowels, fricatives, and voiced fricatives
  • filter spectral shape
  • CELP uses a linear prediction (LP) model to model the vocal tract, and uses entries of a fixed codebook (FCB) as input to the LP model.
  • LP linear prediction
  • FCB fixed codebook
  • long-term linear prediction can be used to model pitch of a speech signal
  • short-term linear prediction can be used to model the spectral shape (phoneme) of the speech signal.
  • Entries in the FCB are based on coding of a residual signal that remains after the long-term and short-term linear prediction modeling is performed.
  • long-term linear prediction and short-term linear prediction models can be used for speech synthesis, and a fixed codebook (FCB) can be searched during encoding to locate the best residual for input to the long-term and short-term linear prediction models.
  • the FCB provides the residual speech components not captured by the short-term and long-term linear prediction models.
  • a residual and a corresponding index can be selected at the encoder based on an analysis-by-synthesis process that is performed to choose the best parameters so as to match the original speech signal as closely as possible.
  • the index can be sent to the decoder, which can extract the corresponding LTP residual from the FCB based on the index.
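  • The analysis-by-synthesis search described above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses only a short-term all-pole synthesis filter (no long-term predictor and no gains), and the names `synthesize`, `search_codebook`, and `decode`, the codebook size, and the coefficient values are all hypothetical.

```python
import numpy as np

def synthesize(excitation, lp_coeffs):
    """All-pole synthesis: s[n] = e[n] + sum_k a_k * s[n - k]."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a in enumerate(lp_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out[n] = acc
    return out

def search_codebook(target, codebook, lp_coeffs):
    """Encoder side: synthesize every entry and keep the index that best matches the target."""
    errors = [np.sum((target - synthesize(entry, lp_coeffs)) ** 2) for entry in codebook]
    return int(np.argmin(errors))

def decode(index, codebook, lp_coeffs):
    """Decoder side: look up the residual by index and run the same synthesis filter."""
    return synthesize(codebook[index], lp_coeffs)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 40))   # hypothetical codebook: 32 entries of 40 samples
lp_coeffs = [0.9, -0.2]                    # hypothetical, stable LP coefficients
target = synthesize(codebook[7], lp_coeffs)
index = search_codebook(target, codebook, lp_coeffs)
```

  • Only the index needs to be transmitted, because the encoder and decoder are assumed to share the same codebook.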
  • FIG. 3 is a diagram illustrating an example of a voice decoding signal synthesis system 300 utilizing a linear time-varying filter 304 with coefficients generated using a neural network (NN) filter estimator 302 and a separate linear predictive coding (LPC) filter 306.
  • the voice decoding signal synthesis system 300 is configured to decode data of the compressed speech signal to generate a reconstructed speech signal (also referred to as a synthesized speech sample) for a current time instant n that approximates an original speech signal that was previously compressed by a voice encoder (not shown).
  • the voice encoder can be similar to and can perform some or all of the functions of the voice encoder 252 described above with respect to FIG. 2B, or other type of voice encoder.
  • the voice encoder can include a short-term LP engine, an LTP engine, and an FCB.
  • the voice encoder can include a magnitude spectrum generator (e.g., Mel-scale magnitude spectrum or full spectrum magnitude), a short-term linear prediction (LP) engine, and a pitch tracker that detects a fundamental pitch harmonic frequency of the speech and pitch correlation.
  • the voice encoder can extract (and in some cases quantize) a set of features (referred to as a feature set) from the speech signal, and can send the extracted (and in some cases quantized) feature set to the voice decoding signal synthesis system 300.
  • the features that are computed by the voice encoder can depend on a particular encoder implementation used.
  • feature sets are provided below according to different encoder implementations, which can be extracted by the voice encoder (and in some cases quantized), and sent to the voice decoding signal synthesis system 300.
  • the voice encoder can extract any set of features, can quantize that feature set, and can send the feature set to the voice decoding signal synthesis system 300.
  • As noted above, various combinations of features can be extracted as a feature set by the voice encoder.
  • a feature set can include one or any combination of the following features: Linear Prediction (LP) coefficients; Line Spectral Pairs (LSPs); Line Spectral Frequencies (LSFs); pitch lag with integer or fractional accuracy; pitch gain; pitch correlation; Mel-scale frequency cepstral coefficients (also referred to as Mel cepstrum) of the speech signal; Bark-scale frequency cepstral coefficients (also referred to as bark cepstrum) of the speech signal; Mel-scale frequency cepstral coefficients of the LTP residual; Bark-scale frequency cepstral coefficients of the LTP residual; a spectrum (e.g., Discrete Fourier Transform (DFT) or other spectrum) of the speech signal; and/or a spectrum (e.g., DFT or other spectrum) of the LTP residual; voicing level of each frequency band of each speech frame; fundamental frequency of pitch harmonics; pitch correlation of each speech frame; time domain pitch lag of each speech frame.
  • LP Linear Prediction
  • LSPs Line Spectral Pairs
  • the voice encoder can use any estimation and/or quantization method, such as an engine or algorithm from any suitable voice codec (e.g. EVS, AMR, or other voice codec) or a neural network-based estimation and/or quantization scheme (e.g., convolutional or fully-connected (dense) or recurrent Autoencoder, or other neural network-based estimation and/or quantization scheme).
  • the voice encoder can also use any frame size, frame overlap, and/or update rate for each feature.
  • the voice encoder can also include extra redundancies in the features to ensure robustness of operation against packet losses.
  • estimation and quantization methods for each example feature are provided below for illustrative purposes, where other examples of estimation and quantization methods can be used by the voice encoder.
  • one example of features that can be extracted from a voice signal by the voice encoder includes linear prediction (LP) coefficients and/or line spectral frequencies (LSFs).
  • LSFs line spectral frequencies
  • Various estimation techniques can be used to compute the LP coefficients and/or LSFs.
  • the voice encoder can estimate LP coefficients (and/or LSFs) from a speech signal using the Levinson-Durbin algorithm.
  • the LP coefficients and/or LSFs can be estimated using an autocovariance method for LP estimation.
  • the LP coefficients can be determined, and an LP to LSF conversion algorithm can be performed to obtain the LSFs. Any other LP and/or LSF estimation engine or algorithm can be used, such as an LP and/or LSF estimation engine or algorithm from an existing codec (e.g., EVS, AMR, or other voice codec).
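  • As one possible illustration of LP estimation, the Levinson-Durbin recursion can be sketched as below. The function names and the sign convention (a predictor of the form x_hat[n] = sum_k a_k x[n - k]) are assumptions of this sketch; a production codec would typically add windowing, lag weighting, and bandwidth expansion, which are omitted here.

```python
import numpy as np

def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of a frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order.

    Returns (a, err), where err is the final prediction error energy.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err
        prev = a[:i].copy()
        a[:i] = prev - k * prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Autocorrelation of an ideal first-order AR process with coefficient 0.5.
r = np.array([1.0, 0.5, 0.25])
coeffs, err = levinson_durbin(r, order=2)
```

  • For this input the recursion recovers a first coefficient of 0.5 and a second coefficient of 0, since a first-order process needs no second tap.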
  • Various quantization techniques can be used to quantize the LP coefficients and/or LSFs.
  • the voice encoder can use a single stage vector quantization (SSVQ) technique, a multi-stage vector quantization (MSVQ), or other vector quantization technique to quantize the LP coefficients and/or LSFs.
  • SSVQ single stage vector quantization
  • MSVQ multi-stage vector quantization
  • a predictive or adaptive SSVQ or MSVQ can be used to quantize the LP coefficients and/or LSFs.
  • an autoencoder or other neural network based technique can be used by the voice encoder to quantize the LP coefficients and/or LSFs. Any other LP and/or LSF quantization engine or algorithm can be used, such as an LP and/or LSF quantization engine or algorithm from an existing codec (e.g., EVS, AMR, or other voice codec).
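  • A multi-stage vector quantizer of the kind mentioned above can be sketched as follows: each stage quantizes the residual left by the previous stage. The function names and the tiny two-stage codebooks are hypothetical and chosen only to keep the example readable.

```python
import numpy as np

def nearest(codebook, vector):
    """Index of the codeword closest to vector in squared Euclidean distance."""
    return int(np.argmin(np.sum((codebook - vector) ** 2, axis=1)))

def msvq_encode(vector, stages):
    """Quantize stage by stage, passing the remaining residual to the next stage."""
    indices = []
    residual = np.asarray(vector, dtype=float)
    for codebook in stages:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = residual - codebook[i]
    return indices

def msvq_decode(indices, stages):
    """Reconstruct by summing the selected codeword from every stage."""
    return sum(codebook[i] for codebook, i in zip(stages, indices))

stages = [
    np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]),     # coarse stage
    np.array([[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]),   # refinement stage
]
indices = msvq_encode([1.1, 0.9], stages)
approx = msvq_decode(indices, stages)
```

  • A single-stage quantizer is the special case where `stages` contains one codebook.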
  • Autoencoders perform dimensionality reduction (e.g., an N-dimensional input vector is passed into the encoder of the autoencoder, and lossy compression is performed to represent the important aspects of the input vector in an M-dimensional vector, where M is smaller than N, e.g. by an order of magnitude).
  • a drawback of using an autoencoder for speech coding is that it cannot necessarily exploit the temporal relationship between sets of input data.
  • U.S. Patent No. 11,526,734 (“the ‘734 patent”, assigned to Qualcomm, Inc.), "Method and Apparatus for recurrent auto encoding", Yang et al.
  • FRAE feedback recurrent autoencoder
  • the FRAE in the '734 patent can be used for training and application of compression of sequential data with temporal correlation.
  • the recurrent structure of the FRAE can be used to efficiently extract the redundancy embedded along the time-dimension of sequential data and enables compact discrete representation of the data at the bottleneck in a sequential fashion.
  • Table 1 of the '734 patent illustrates the MSE (Mean Squared Error) used as Mel-scale mean-square-error configured as a reconstruction loss for training of the FRAE, where the MSE of each frequency bin is scaled according to its weight at Mel-frequency both for latent feedback and output feedback.
  • the FRAE described in the ‘734 patent has two advantageous features not present in autoencoders: (1) recurrent layers, e.g. LSTM or GRU layers, have memory of the past; and (2) feedback from the decoder of the autoencoder to the encoder of the autoencoder.
  • the ‘734 patent describes a second feedback connection (152) from the state ($h_t$) of the decoder for a first iteration of a series of inputs $X_t$ to the next iteration of the series of inputs $X_{t+1}$.
  • This second feedback connection (152) enables the decoder to learn from its previous reconstruction attempts, providing additional historical context about how reconstruction of a prior input has fared.
  • This second feedback loop is also present during both training and inference, which allows the decoder to be trained to respond to particular feedback conditions in a manner that improves reproduction of the input data by the decoder.
  • the ‘734 patent also describes other optional feedback connections.
  • an embedding vector (z) may be fed back via a third feedback connection (356 in Fig.3 of the ‘734 patent) to the encoder.
  • the FRAE can be a variational autoencoder.
  • an output of the decoder (754 in Fig. 7 of the ‘734 patent) is sampled and parameterized to generate an autoregressive prior which can be used as a fourth feedback connection to condition a prior model for the next latent space embedded vector ($z_{t+1}$).
  • each of these optional connections is present during both training and inference, and thus is trained as part of the training process, thereby improving functionality of the FRAE.
  • pitch lag integer and/or fractional
  • pitch gain e.g. 1
  • correlation estimation engine e.g. autocorrelation-based pitch lag estimation
  • voice encoder can use a pitch lag, gain, and/or correlation estimation engine (or algorithm) from any suitable voice codec (e.g. EVS, AMR, or other voice codec).
  • the voice encoder can quantize the pitch lag, pitch gain, and/or pitch correlation (or any combination thereof) from a speech signal using any pitch lag, gain, correlation quantization engine or algorithm from any suitable voice codec (e.g. EVS, AMR, or other voice codec).
  • an autoencoder or other neural network based technique can be used by the voice encoder to quantize the pitch lag, pitch gain, and/or pitch correlation features.
  • Another example of features that can be extracted from a voice signal by the voice encoder includes the Mel cepstrum coefficients and/or Bark cepstrum coefficients of the speech signal, and/or the Mel cepstrum coefficients and/or Bark cepstrum coefficients of the LTP residual.
  • Various estimation techniques can be used to compute the Mel cepstrum coefficients and/or Bark cepstrum coefficients.
  • the voice encoder can use a Mel or Bark frequency cepstrum technique that includes Mel or Bark frequency filter banks computation, filter bank energy computation, logarithm application, and discrete cosine transform (DCT) or truncation of the DCT.
  • DCT discrete cosine transform
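  • The cepstrum steps listed above (filter bank computation, filter bank energies, logarithm, and a truncated DCT) can be sketched as follows. The Hz-to-Mel conversion formula, the triangular filter shape, the filter and coefficient counts, and all function names are assumptions of this sketch; the disclosure does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    # One commonly used Mel-scale approximation (an assumption of this sketch).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters, fft_size, sample_rate):
    """Triangular filters whose centers are evenly spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), num_filters + 2))
    bin_freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, bin_freqs.size))
    for i in range(num_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_cepstrum(frame, sample_rate, num_filters=20, num_coeffs=13):
    power = np.abs(np.fft.rfft(frame)) ** 2                           # power spectrum
    energies = mel_filter_bank(num_filters, len(frame), sample_rate) @ power
    log_energies = np.log(energies + 1e-10)                           # logarithm step
    # DCT-II, truncated to the first num_coeffs coefficients.
    n = np.arange(num_filters)
    k = np.arange(num_coeffs)[:, None]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * num_filters))
    return dct @ log_energies

# Hypothetical input: a windowed 440 Hz tone sampled at 8 kHz.
t = np.arange(256) / 8000.0
frame = np.hanning(256) * np.sin(2 * np.pi * 440.0 * t)
coeffs = mel_cepstrum(frame, 8000)
```

  • A Bark cepstrum would follow the same steps with the filter centers spaced on the Bark scale instead.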
  • Various quantization techniques can be used to quantize the Mel cepstrum coefficients and/or Bark cepstrum coefficients. For example, vector quantization (single stage or multistage) or predictive/adaptive vector quantization can be used. In some cases, an autoencoder or other neural network based technique can be used by the voice encoder to quantize the Mel cepstrum coefficients and/or Bark cepstrum coefficients. Any other suitable cepstrum quantization methods can be used.
  • Another example of features that can be extracted from a voice signal by the voice encoder includes the spectrum of the speech signal and/or the spectrum of the LTP residual. Various estimation techniques can be used to compute the spectrum of the speech signal and/or the LTP residual.
  • a Discrete Fourier transform (DFT), a Fast Fourier Transform (FFT), or other transform of the speech signal can be determined.
  • Quantization techniques that can be used to quantize the spectrum of the voice signal can include vector quantization (single stage or multistage) or predictive/adaptive vector quantization.
  • an autoencoder or other neural network based technique can be used by the voice encoder to quantize the spectrum. Any other suitable spectrum quantization methods can be used.
  • any one of the above-described features or any combination of the above-described features can be estimated, quantized, and sent by the voice encoder to the voice decoding signal synthesis system 300 depending on the particular encoder implementation that is used.
  • the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, and pitch correlation. In another illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the Bark cepstrum of the speech signal. In another illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the spectrum (e.g., DFT, FFT, or other spectrum) of the speech signal.
  • the voice encoder can estimate, quantize, and send pitch lag with fractional accuracy, pitch gain, pitch correlation, and the Bark cepstrum of the speech signal.
  • the voice encoder can estimate, quantize, and send pitch lag with fractional accuracy, pitch gain, pitch correlation, and the spectrum (e.g., DFT, FFT, or other spectrum) of the speech signal.
  • the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the Bark cepstrum of the LTP residual.
  • the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the spectrum (e.g., DFT, FFT, or other spectrum) of the LTP residual.
  • the voice decoding signal synthesis system 300 includes a neural network filter estimator 302, a linear time-varying filter 304 generated by the neural network, and a linear predictive coding (LPC) filter 306.
  • the LPC filter 306 can include a time-varying LPC filter.
  • the neural network filter estimator 302 is trained to generate filter coefficients for the linear time-varying filter 304.
  • the neural network model of the neural network filter estimator 302 can include any neural network architecture that can be trained to model the filter coefficients for the linear time-varying filter 304.
  • Examples of neural network architectures that can be included in the neural network filter estimator 302 include a generative neural network (e.g., a generative-adversarial network (GAN)), convolutional neural networks (CNN), an autoencoder, and/or other type(s) of neural network architectures or models.
  • GAN generative-adversarial network
  • CNN convolutional neural networks
  • the voice decoding signal synthesis system 300 e.g., the neural network model of the neural network filter estimator 302
  • the neural network model of the neural network filter estimator 302 can be trained using supervised learning techniques based on backpropagation. For instance, corresponding input and target output pairs can be provided to the neural network filter estimator 302 for training.
  • the input to the neural network filter estimator 302 can include log-Mel-frequency spectrum features or coefficients (e.g., 80 log-Mel features S[m,c] 301 shown in FIG. 3).
  • the target output (or label or ground truth) for training the neural network filter estimator 302 can include the target speech sample 305 for the current time instant n, as shown in FIG. 3.
  • a loss 307 will be computed based on the reconstructed sample and the target output speech sample 305 for time instant n.
  • the target output can include a speech signal that is generated after passing the target speech through an LPC analysis filter (inverse of LPC filter 306).
  • the loss will be computed based on the output generated by the linear time-varying filter 304 generated by the neural network and the target output (e.g., where both are in speech residual domain).
  • Backpropagation can be performed to train the neural network filter estimator 302 using the inputs and the target output.
  • Backpropagation can include a forward pass, a loss function, a backward pass, and a parameter update to update one or more parameters (e.g., weight, bias, or other parameter).
  • the forward pass, loss function, backward pass, and parameter update are performed for one training iteration. The process can be repeated for a certain number of iterations for each set of inputs until the neural network filter estimator 302 is trained well enough so that the weights (and/or other parameters) of the various layers are accurately tuned.
  • training of the neural network filter estimator 302 and/or training of one or more of the machine learning systems or neural networks described herein e.g., such as the decoder model 510-1 with first complexity, decoder model 510-2 with second complexity, and/or decoder model 510-N with Nth complexity, etc., of FIG. 5; any of the decoder model instances 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, and/or 610-8 of FIGS.
  • online may refer to time periods during which the input data (e.g., such as the bitstream 502 of FIG. 5, the audio signal of FIG. 2A, the speech signal 251 of FIG. 2B, the excitation signal 303 and/or features 301 of FIG. 3, the features S[m,c] of FIG.
  • offline may refer to idle time periods or time periods during which input data is not being processed. Additionally, offline may be based on one or more time conditions (e.g., after a particular amount of time has expired, such as a day, a week, a month, etc.) and/or may be based on various other conditions such as network and/or server availability, etc., among various others.
  • offline training of a machine learning model can be performed by a first device (e.g., a server device) to generate a pre-trained model, and a second device can receive the trained model from the first device.
  • the second device e.g., a mobile device, an XR device, a vehicle or system/component of the vehicle, or other device
  • the second device can perform online (or on-device) training of the pre-trained model to further adapt or tune the parameters of the model.
  • the forward pass can include passing the input data (e.g., the log-Mel-frequency spectrum features or coefficients, such as the 80 log-Mel features S[m,c] 301 shown in FIG. 3) through the neural network filter estimator 302.
  • the weights of the neural network model are initially randomized before the neural network filter estimator 302 is trained.
  • the output will likely include values that do not give preference to any particular output due to the weights being randomly selected at initialization. With the initial weights, the neural network filter estimator 302 is unable to determine low level features and thus cannot make an accurate estimation of the filter coefficients for the linear time-varying filter 304.
  • a loss function can be used to analyze the loss 307 (or error) in the reconstructed or synthesized sample output. Any suitable loss function definition can be used.
  • a loss function includes a mean squared error (MSE).
  • MSE is defined as $E_{total} = \sum \frac{1}{2}(\text{target} - \text{output})^2$, which calculates the sum of one-half times the square of the difference between the actual answer and the predicted (output) answer.
  • the loss can be set to be equal to the value of $E_{total}$.
  • Other loss functions may include a difference of magnitude spectrums between the target and output signals, where the difference may be computed as absolute difference, squared difference, or logarithmic difference between the magnitude spectrum of each speech frame, and then aggregated over all speech frames.
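  • A spectral-magnitude loss of the kind described above can be sketched as follows. The function names, the non-overlapping framing, the frame size, and the small constant added before the logarithm are assumptions of this sketch.

```python
import numpy as np

def frame_magnitudes(signal, frame_size):
    """Magnitude spectrum of each non-overlapping frame."""
    signal = np.asarray(signal, dtype=float)
    count = len(signal) // frame_size
    frames = signal[: count * frame_size].reshape(count, frame_size)
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_loss(target, output, frame_size=64, kind="squared"):
    """Per-frame difference of magnitude spectra, aggregated over all frames."""
    t = frame_magnitudes(target, frame_size)
    o = frame_magnitudes(output, frame_size)
    if kind == "absolute":
        per_frame = np.sum(np.abs(t - o), axis=1)
    elif kind == "squared":
        per_frame = np.sum((t - o) ** 2, axis=1)
    elif kind == "log":
        eps = 1e-7   # avoids log(0); an assumption of this sketch
        per_frame = np.sum(np.abs(np.log(t + eps) - np.log(o + eps)), axis=1)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return float(np.sum(per_frame))

# Hypothetical signals: a tone and an attenuated copy of it.
x = np.sin(2 * np.pi * 5 * np.arange(256) / 64)
y = 0.5 * x
loss_value = spectral_loss(x, y, kind="absolute")
```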
  • the loss (or error) will be high for the first training iterations since the actual values will be much different than the predicted output.
  • the goal of training is to minimize the amount of loss so that the predicted output is the same as the training label.
  • the neural network filter estimator 302 can perform a backward pass by determining which inputs (weights) most contributed to the loss of the network, and can adjust the weights so that the loss decreases and is eventually minimized.
  • a derivative of the loss with respect to the weights (denoted as dL/dW, where W represents the weights at a particular layer) can be computed to determine the weights that contributed most to the loss of the network. After the derivative is computed, a weight update can be performed by updating all the weights of the filters.
  • the weights can be updated so that they change in the opposite direction of the gradient.
  • the weight update can be denoted as $w = w_i - \eta \frac{dL}{dW}$, where $w$ denotes a weight, $w_i$ denotes the initial weight, and $\eta$ denotes a learning rate. The learning rate can be set to any suitable value, with a higher learning rate producing larger weight updates and a lower learning rate producing smaller weight updates.
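  • The forward pass, loss function, backward pass, and parameter update described above can be sketched for a toy linear model. This is not the neural network filter estimator 302; the model, data, learning rate, iteration count, and function names are hypothetical and serve only to show the four steps of one training iteration.

```python
import numpy as np

def mse_loss(target, output):
    """Sum of one-half times (target - output) squared."""
    return np.sum(0.5 * (target - output) ** 2)

def training_step(w, x, target, learning_rate):
    output = x @ w                          # forward pass
    loss = mse_loss(target, output)         # loss function
    grad = -x.T @ (target - output)         # backward pass: dL/dW
    w_new = w - learning_rate * grad        # update opposite to the gradient
    return w_new, loss

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3))            # hypothetical inputs
true_w = np.array([1.0, -2.0, 0.5])         # hypothetical weights to recover
target = x @ true_w
w = rng.standard_normal(3)                  # weights are randomized before training
losses = []
for _ in range(200):
    w, loss = training_step(w, x, target, learning_rate=0.005)
    losses.append(loss)
```

  • The loss is high in the first iterations and shrinks as the weights move against the gradient, matching the behavior described in the text.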
  • a multi-resolution STFT loss and adversarial losses can be computed from the reconstructed signal and the target signal. Because linear time-varying filters are fully differentiable, gradients can propagate back to the neural network filter estimator 302.
  • the linear time-varying filter 304 can process an excitation signal 303 to generate another signal.
  • the signal generated by the linear time-varying filter 304 can be used as an excitation signal to excite the LPC filter 306.
  • the linear time-varying filter 304 is a linear filter, which preserves the linearity property between inputs and outputs.
  • a linear filter is associated with a mapping $L: \mathbb{R}^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$, $x \mapsto L(x)$, that has the following property: for any $x, y \in \mathbb{R}^{\mathbb{Z}}$ and any $a, b \in \mathbb{R}$, $L(ax + by) = aL(x) + bL(y)$.
  • the time-varying nature of the linear time-varying filter 304 indicates that the filter response depends on the time of excitation of the linear time-varying filter 304 (e.g., a new set of coefficients is used to filter each frame (block of time) of input at the time of excitation).
  • time-varying linear filters can be characterized by the set of impulse responses at each time lag, $h_k = L\delta_k$ for each $k \in \mathbb{Z}$.
  • the output of a time varying linear filter is L ⁇ ⁇ ⁇ ⁇ ⁇ h ⁇ (where ⁇ is a convolutional operator) or some heuristic combination of filter input and impulse responses, e.g. overlap-add on windowed and filtered signal segments, etc.
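As a rough sketch of frame-wise time-varying filtering, the following assumes one FIR impulse response per frame and adds each frame's filtered output (including its tail) into the output buffer. This is an unwindowed overlap-add. The function name and the frame layout are illustrative assumptions.

```python
def ltv_filter(x, impulse_responses, frame_len):
    """Filter each frame of x with its own FIR impulse response, then overlap-add.

    impulse_responses[m] is the (assumed) impulse response used for frame m.
    """
    n_frames = len(impulse_responses)
    max_h = max(len(h) for h in impulse_responses)
    y = [0.0] * (n_frames * frame_len + max_h - 1)
    for m, h in enumerate(impulse_responses):
        start = m * frame_len
        frame = x[start:start + frame_len]
        for i, xi in enumerate(frame):
            for j, hj in enumerate(h):
                # the tail of frame m spills into frame m + 1 (overlap-add)
                y[start + i + j] += xi * hj
    return y[:len(x)]
```

Because each output sample is a weighted sum of input samples, the mapping stays linear even though the weights change from frame to frame.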
  • the LPC filter 306 can use the signal ⁇ as input to generate the reconstructed or synthesized speech sample ⁇ for the current time instant n.
  • the LPC filter 306 is a linear filter and in some cases is time varying, as defined above with respect to the linear time-varying filter 304.
  • the LPC filter 306 can be a form of time-varying filter used for processing of speech.
  • the LPC filter 306 includes filter coefficients for each speech frame that can be computed using the autocorrelation of a speech or audio signal.
  • the LPC filter 306 can be used to model the spectral shape (or phoneme or envelope) of the speech signal.
  • a signal ⁇ can be filtered by an autoregressive (AR) process synthesizer to obtain an AR signal.
  • a linear predictor can be used to predict the AR signal (which can be denoted as prediction $p[n]$) as a linear combination of the previous $m$ samples as follows: $p[n] = \sum_{k=1}^{m} a_k\, s[n-k]$, where $s[n]$ denotes the AR signal and the $a_k$ terms are estimates of the AR parameters (also referred to as LP coefficients).
  • a residual signal can be the difference between the original AR signal and the predicted AR signal represented as prediction $p[n]$ (e.g., the difference between the actual sample and the predicted sample).
  • the linear prediction coding can be used to find the best linear prediction coefficients for minimizing a quadratic error function, and thus the error.
  • the linear prediction process removes the short-term correlation from the speech signal.
  • the linear prediction coefficients are an efficient way to represent the short-term spectrum of the speech signal.
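One common way to obtain linear prediction coefficients from the autocorrelation of a frame is the Levinson-Durbin recursion. The sketch below is an illustration under that assumption; the source states only that the coefficients can be computed using autocorrelation and that they minimize a quadratic error. The function names are hypothetical, and the frame is assumed to be non-silent so that the error term is nonzero.

```python
def autocorr(s, max_lag):
    # r[k] = sum_n s[n] * s[n - k]
    return [sum(s[n] * s[n - k] for n in range(k, len(s)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) where a[k-1] = a_k in p[n] = sum_k a_k * s[n-k]."""
    a = [0.0] * (order + 1)  # index 0 is unused
    err = r[0]               # prediction error energy for order 0
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err        # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def residual(s, a):
    # e[n] = s[n] - p[n], using only samples that exist (n - k >= 0)
    return [s[n] - sum(a[k] * s[n - 1 - k]
                       for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(s))]
```

For a decaying exponential such as `s[n] = 0.9 ** n`, a first-order predictor with a coefficient near 0.9 leaves a residual that is close to zero after the first sample, which illustrates how prediction removes short-term correlation.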
  • the LPC filter 306 determines the prediction ⁇ for the current sample n using computed or received coefficients and A transfer function ⁇ ⁇ ⁇ . For instance, in some examples, the LPC filter coefficients are received from the encoder.
  • the voice decoding signal synthesis system 300 can derive the LPC filter coefficients, such as using other features (e.g., Mel spectrum features) sent by an encoder to the voice decoding signal synthesis system 300.
  • the voice decoding signal synthesis system 300 can use Mel spectrum features 301 to derive the LPC filter coefficients for the LPC filter 306.
  • the LPC filter 306 can determine the final reconstructed (or predicted) sample ⁇ using the output ⁇ from the linear time-varying filter 304 (for the current sample n).
  • further components can be used along with the neural network filter estimator 302 and the linear time-varying filter 304, such as an impulse train generator 314 and a random noise generator 316.
  • the linear time-varying filter 304 can include a harmonic linear time-varying filter 318 and a noise linear time-varying filter 320.
  • a voice encoder can include a pitch tracker 310 and a feature extraction engine 312.
  • the feature extraction engine 312 can be the same as or similar to the feature generator 200 of FIG. 2A.
  • the original speech signal ⁇ and reconstructed signal ⁇ are divided into non-overlapping frames with frame length L.
  • the term $m$ can be defined as a frame index
  • the term $n$ can be defined as a discrete time index
  • the term $c$ can be defined as a feature index.
  • the total number of frames $M$ and total number of sampling points $N$ may follow $N = M \times L$.
  • the impulse train generator 314 can generate an impulse train ⁇ from a frame-wise fundamental frequency ⁇ ⁇ ⁇ output by the pitch tracker 310.
  • the impulse train generator 314 can generate alias-free discrete time impulse trains using additive synthesis. For instance, as illustrated in equation (1) below, the impulse train generator 314 can use a low-passed sum of sinusoids to generate an impulse train: $$p(t) = \begin{cases} \sum_{k \geq 1,\; 2 k f_0(t) < f_s} \cos\left( \int_0^t 2 \pi k f_0(\tau)\, d\tau \right), & f_0(t) > 0 \\ 0, & f_0(t) = 0 \end{cases} \quad (1)$$ where $f_0(t)$ is reconstructed from the frame-wise fundamental frequency with zero-order hold or linear interpolation, $p[n] = p(n / f_s)$, and $f_s$ is the sampling rate. In some cases, the computational complexity of additive synthesis can be reduced with approximations.
  • the impulse train generator 314 or other component (e.g., a processor) of the voice decoding signal synthesis system 300 can round the fundamental periods to the nearest multiples of the sampling period.
  • the discrete impulse train is sparse.
  • the impulse train generator 314 can then generate the impulse train sequentially (e.g., one pitch mark at a time).
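The rounded-period approximation can be sketched as follows. This is a simplified illustration: it assumes a zero-order hold of the frame-wise fundamental frequency, treats a frame with a fundamental frequency of zero as unvoiced, and restarts the pitch marks at the first voiced sample after an unvoiced frame. The function name and parameters are hypothetical.

```python
def impulse_train(f0_frames, frame_len, sample_rate):
    """Place unit impulses one pitch mark at a time.

    f0_frames[m] is the fundamental frequency (Hz) of frame m; 0 means unvoiced.
    """
    n_samples = len(f0_frames) * frame_len
    p = [0.0] * n_samples
    n = 0
    while n < n_samples:
        f0 = f0_frames[n // frame_len]
        if f0 <= 0:
            # unvoiced frame: jump to the start of the next frame
            n = (n // frame_len + 1) * frame_len
            continue
        p[n] = 1.0
        # round the fundamental period to a whole number of sampling periods
        n += max(1, round(sample_rate / f0))
    return p
```

Because the period is an integer number of samples, the result is sparse: only the pitch marks are nonzero, so no sinusoids need to be summed.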
  • the pitch tracker 310 can process the input ⁇ for the time instant n to generate the frame-wise fundamental frequency ⁇ ⁇ ⁇ ⁇ ⁇ output, which is provided to and processed by the impulse train generator 314 of the voice decoding signal synthesis system 300.
  • the random noise generator 316 of the voice decoding signal synthesis system 300 can sample a noise signal ⁇ from a Gaussian distribution.
  • the neural network filter estimator 302 can estimate impulse responses h ⁇ ⁇ , ⁇ and h ⁇ ⁇ , ⁇ for each frame, given the log-Mel spectrogram ⁇ , ⁇ extracted from the input ⁇ by the feature extraction engine 312 of the encoder.
  • complex cepstrums (h ⁇ ⁇ and h ⁇ ⁇ ) can be used as the internal description of impulse responses (h ⁇ and h ⁇ ) for the neural network filter estimator 302.
  • Complex cepstrums describe the magnitude response and the group delay of filters simultaneously. The group delay of filters affects the timbre of speech.
  • the neural network filter estimator 302 can use mixed-phase filters, with phase characteristics learned from the dataset.
  • the length of a complex cepstrum can be restricted, essentially restricting the levels of detail in the magnitude and phase response. Restricting the length of a complex cepstrum can be used to control the complexity of the filters.
  • the neural network filter estimator 302 can predict low-frequency coefficients, in which the high-frequency cepstrum coefficients can be set to zero. In one illustrative example, two 10 millisecond (ms) long complex cepstrums are predicted in each frame.
  • the neural network filter estimator 302 can use a discrete Fourier transform (DFT) and an inverse-DFT (IDFT) to generate the impulse responses h ⁇ and h ⁇ .
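By the definition of the complex cepstrum (the inverse DFT of the complex logarithm of the DFT of a signal), an impulse response can be recovered by taking the DFT of the cepstrum, exponentiating, and applying the IDFT. The sketch below illustrates that relationship with a naive O(N^2) DFT. It assumes only causal (non-negative quefrency) cepstrum coefficients are supplied, which is a simplification, and the function names are hypothetical.

```python
import cmath


def dft(x):
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts)) for k in range(n_pts)]


def idft(spec):
    n_pts = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * n / n_pts)
                for k in range(n_pts)) / n_pts for n in range(n_pts)]


def cepstrum_to_impulse_response(cepstrum, n_fft):
    # zero-pad: the unspecified high-quefrency coefficients are set to zero
    padded = list(cepstrum) + [0.0] * (n_fft - len(cepstrum))
    log_spectrum = dft(padded)                     # complex log spectrum
    spectrum = [cmath.exp(v) for v in log_spectrum]
    return [v.real for v in idft(spectrum)]        # finite (FIR) approximation
```

An all-zero cepstrum yields a unit impulse, and truncating the output to `n_fft` samples is what makes the result a finite impulse response approximation.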
  • the neural network filter estimator 302 can approximate an infinite impulse response (IIR) (h ⁇ ⁇ , ⁇ and h ⁇ ⁇ , ⁇ ) using finite impulse responses (FIRs).
  • the harmonic LTV filter 318 can filter the impulse train ⁇ from the impulse train generator 314 to generate a harmonic component ⁇ ⁇ ⁇ .
  • the noise LTV filter 320 can filter the noise signal ⁇ to generate a noise component ⁇ ⁇ ⁇ .
  • the voice decoding signal synthesis system 300 can combine (e.g., by summing/adding or otherwise combining) the output of the harmonic LTV filter 318 (the harmonic component ⁇ ⁇ ⁇) and the output of the noise LTV filter 320 (the noise component ⁇ ⁇ ⁇) to obtain the excitation signal ⁇.
  • FIG. 4 illustrates an example of a computing system 470 of a wireless device 407.
  • the wireless device 407 may include a client device such as a user equipment (UE) or other type of device that may be used by an end-user.
  • the wireless device 407 may include a mobile phone, router, tablet computer, laptop computer, tracking device, wearable device (e.g., a smart watch, glasses, an extended reality (XR) device such as a virtual reality (VR), augmented reality (AR), or mixed reality (MR) device, etc.), Internet of Things (IoT) device, a vehicle, an aircraft, and/or another device that is configured to communicate over a wireless communications network.
  • the wireless device 407 of FIG. 4 may include and/or implement the SoC 100 of FIG. 1.
  • the computing system 470 includes software and hardware components that may be electrically or communicatively coupled via a bus 489 (or may otherwise be in communication, as appropriate).
  • the computing system 470 includes one or more processors 484.
  • the one or more processors 484 may include one or more CPUs, ASICs, FPGAs, NPUs, DSPs, GPUs, VPUs, NSPs, microcontrollers, hardware accelerators, dedicated hardware, any combination thereof, and/or other processing device or system.
  • the bus 489 may be used by the one or more processors 484 to communicate between cores and/or with the one or more memory devices 486.
  • the computing system 470 may also include one or more memory devices 486, one or more digital signal processors (DSPs) 482, one or more SIMs 474, one or more modems 476, one or more wireless transceivers 478, an antenna 487, one or more input devices 472 (e.g., a camera, a mouse, a keyboard, a touch sensitive screen, a touch pad, a keypad, a microphone, and/or the like), and one or more output devices 480 (e.g., a display, a speaker, a printer, and/or the like).
  • computing system 470 may include one or more radio frequency (RF) interfaces configured to transmit and/or receive RF signals.
  • an RF interface may include components such as modem(s) 476, wireless transceiver(s) 478, and/or antennas 487.
  • the one or more wireless transceivers 478 may transmit and receive wireless signals (e.g., signal 488) via antenna 487 to and from one or more other devices, such as other wireless devices, network devices (e.g., base stations such as eNBs and/or gNBs, Wi-Fi access points (APs) such as routers, range extenders or the like, etc.), cloud networks, and/or the like.
  • the computing system 470 may include multiple antennas or an antenna array that may facilitate simultaneous transmit and receive functionality.
  • Antenna 487 may be an omnidirectional antenna such that radio frequency (RF) signals may be received from and transmitted in all directions.
  • the wireless signal 488 may be transmitted via a wireless network.
  • the wireless network may be any wireless network, such as a cellular or telecommunications network (e.g., 3G, 4G, 5G, etc.), wireless local area network (e.g., a Wi-Fi network), a Bluetooth™ network, and/or other network.
  • the wireless signal 488 may be transmitted directly to other wireless devices using sidelink communications (e.g., using a PC5 interface, using a DSRC interface, etc.).
  • Wireless transceivers 478 may be configured to transmit RF signals for performing sidelink communications via antenna 487 in accordance with one or more transmit power parameters that may be associated with one or more regulation modes. Wireless transceivers 478 may also be configured to receive sidelink communication signals having different signal parameters from other wireless devices.
  • the one or more wireless transceivers 478 may include an RF front end including one or more components, such as an amplifier, a mixer (e.g., also referred to as a signal multiplier) for signal down conversion, a frequency synthesizer (e.g., also referred to as an oscillator) that provides signals to the mixer, a baseband filter, an analog-to-digital converter (ADC), one or more power amplifiers, among other components.
  • the RF front-end may generally handle selection and conversion of the wireless signals 488 into a baseband or intermediate frequency and may convert the RF signals to the digital domain.
  • the computing system 470 may include a coding-decoding device (or CODEC) configured to encode and/or decode data transmitted and/or received using the one or more wireless transceivers 478.
  • the computing system 470 may include an encryption-decryption device or component configured to encrypt and/or decrypt data (e.g., according to the AES and/or DES standard) transmitted and/or received by the one or more wireless transceivers 478.
  • the one or more SIMs 474 may each securely store an international mobile subscriber identity (IMSI) number and related key assigned to the user of the wireless device 407.
  • the IMSI and key may be used to identify and authenticate the subscriber when accessing a network provided by a network service provider or operator associated with the one or more SIMs 474.
  • the one or more modems 476 may modulate one or more signals to encode information for transmission using the one or more wireless transceivers 478.
  • the one or more modems 476 may also demodulate signals received by the one or more wireless transceivers 478 in order to decode the transmitted information.
  • the one or more modems 476 may include a Wi-Fi modem, a 4G (or LTE) modem, a 5G (or NR) modem, and/or other types of modems.
  • the one or more modems 476 and the one or more wireless transceivers 478 may be used for communicating data for the one or more SIMs 474.
  • the computing system 470 may also include (and/or be in communication with) one or more non-transitory machine-readable storage media or storage devices (e.g., one or more memory devices 486), which may include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, a solid-state storage device such as a RAM and/or a ROM, which may be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data storage, including without limitation, various file systems, database structures, and/or the like.
  • functions may be stored as one or more computer-program products (e.g., instructions or code) in memory device(s) 486 and executed by the one or more processor(s) 484 and/or the one or more DSPs 482.
  • the computing system 470 may also include software elements (e.g., located within the one or more memory devices 486), including, for example, an operating system, device drivers, executable libraries, and/or other code, such as one or more application programs, which may comprise computer programs implementing the functions provided by various aspects, and/or may be designed to implement methods and/or configure systems, as described herein.
  • the frames of data processed by the decoder can be associated with a real-time processing requirement (e.g., a real-time processing time limit) for the decoder to finish decoding and/or processing of a respective frame.
  • the one or more frames of data can be speech frames, where the real-time processing requirement is equal to the length of each speech frame (e.g., 20ms).
  • the systems and techniques can be used to provide dynamic complexity scaling for a neural network-based decoder, vocoder, and/or signal synthesizer (e.g., speech synthesizer) associated with processing one or more frames of audio data (e.g., speech frames, etc.).
  • a “neural decoder” and/or “neural network-based decoder” may include one or more of a neural network-based audio decoder, a neural network-based voice decoder, a neural network-based speech decoder, a neural network-based vocoder, a neural network-based signal synthesizer, a neural network-based audio synthesizer, a neural network-based voice or speech synthesizer, etc.
  • the neural decoder can be included in a signal synthesis system, such as the signal synthesis system 120 of FIG. 1.
  • the neural decoder can be included in a neural vocoder system and/or a neural homomorphic vocoder (NHV) system.
  • a vocoder can refer to a voice coder that may be configured to perform voice encoding, voice decoding, or both.
  • Vocoders can be used for various voice and speech processing tasks, based on using the vocoder to analyze and/or synthesize a human voice signal.
  • vocoders may be used in the context of telecommunications to encode speech for transmission over bandwidth-limited and/or bitrate-limited channels.
  • a neural vocoder system can implement a voice encoder (e.g., vocoder) using one or more neural network models.
  • Some neural vocoders may be designed to include speech synthesis models to reduce computational complexity and further improve synthesis quality.
  • Many neural vocoder techniques may be based on using a source-filter model to improve source signal modeling. For example, source-filter model neural vocoders use neural networks to generate source signals (e.g., linear prediction residual signals), and offload spectral shaping processing tasks to time-varying filters.
  • neural source-filter (NSF)-based approaches replace linear filters in the classical vocoder model with one or more convolutional neural network-based filters.
  • NSF-based neural vocoders can be used to synthesize waveforms based on filtering a sine-based excitation signal and/or performing neural audio synthesis with one or more sinusoidal models.
  • a neural homomorphic vocoder (NHV) is a type of neural vocoder that can synthesize speech with source-filter models controlled by one or more neural networks.
  • an NHV-based neural vocoder may include one or more neural networks in a source-filter model that can synthesize speech based on filtering impulse trains and noise with linear time-varying (LTV) filters, with the one or more neural networks used to control the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features.
  • Traditional or non-neural vocoders may operate based on decomposing speech into various parameters such as pitch, timbre, rhythm, etc., which can subsequently be manipulated and resynthesized to generate a desired output audio or voice signal.
  • Neural vocoders can apply transformations and manipulations to speech signals directly within a learned feature space of the neural network.
  • the learned feature space used by neural vocoders may capture more complex relationships and characteristics of speech than non-neural network-based signal processing and/or vocoder techniques.
  • neural vocoders can be trained to learn a mapping between raw speech waveforms and the spectral or cepstral representations of the speech.
  • An NHV system can apply one or more homomorphic processing techniques within the learned space corresponding to the mapping.
  • neural vocoders (e.g., NHVs, etc.) and/or other signal synthesis systems may operate based on a stream of input features that are generated by an encoder and processed by a decoder.
  • the stream of input features may be transmitted from an encoder of a first system to a decoder of a second system (e.g., the encoder and decoder may be separate).
  • a decoder may receive a stream of input features that are transmitted over a wired or wireless network, over a channel, etc.
  • FIG. 5 is a block diagram illustrating an example of a dynamic decoder complexity switching system 500 that may be used with a decoder of a signal synthesis system, in accordance with some examples.
  • the dynamic decoder complexity switching system 500 can be implemented by and/or included within a neural decoder of a signal synthesis system, NHV system, etc.
  • the dynamic decoder complexity switching system 500 can be implemented by and/or included within a UE or other computing device that implements the neural decoder.
  • the dynamic decoder complexity switching system can be used to provide dynamic complexity switching between a plurality of different decoder model instances or configurations that correspond to different levels of complexity (e.g., processing complexity, synthesis complexity, etc.).
  • a decoder complexity switching engine 550 can be used to switch between respective decoder models of a plurality of decoder models 510-1, 510-2, ..., 510-N.
  • a switch 545 can be provided as a hardware switch, a logical switch, a virtual switch, a software switch, etc., that is configurable to switch an input bitstream 502 to a particular (e.g., selected or configured) decoder model of the plurality of decoder models 510-1, 510-2, ..., 510-N.
  • the switch 545 can be configured to switch the input bitstream 502 to couple to the respective input of one of the plurality of decoder models 510-1, 510-2, ..., 510-N having a complexity level corresponding to a dynamic complexity determination generated by the decoder complexity switching engine 550.
  • the bitstream 502 can be received from an encoder associated with the neural decoder and/or associated with the plurality of decoder models 510-1, 510-2, ..., 510-N.
  • the bitstream 502 can include and/or be indicative of a plurality of features or an encoded representation of audio samples of one or more frames of audio data.
  • the bitstream 502 can include features or an encoded representation for one or more 20ms speech frames, where the features or other encoded representation are generated by an encoder and transmitted (e.g., over a wired or wireless channel) to the neural decoder (e.g., to the dynamic decoder complexity switching system 500).
  • the bitstream 502 can be the same as or similar to the audio feature vector 208 generated by the feature generator 200 based on the audio signal of FIG. 2A. In some examples, the bitstream 502 can be the same as or similar to the output of the voice encoder 252 of FIG. 2B generated based on encoding the speech signal 251. In some aspects, the bitstream 502 can be the same as or similar to the input features 301 provided to the NN filter estimator 302 of the NHV system 300 of FIG. 3 (e.g., the bitstream 502 can be the same as or similar to a bitstream of features generated by the feature extraction engine 312 included in the encoder of FIG. 3, etc.).
  • the systems and techniques can be used to provide dynamic decoder complexity scaling to finish processing of a respective frame of audio data within a configured processing time limit (e.g., the real-time processing requirement for the frame of audio data, the duration of each respective frame of audio data, etc.).
  • the decoder complexity switching engine 550 can be configured to analyze device load information (e.g., obtained from a device resource monitor 520) and/or decoder processing time measurements or profile information (e.g., obtained from a decoder profiling engine 530).
  • the decoder complexity switching engine 550 can determine a decoder complexity configuration for processing one or more frames (e.g., audio frames) of the input bitstream 502.
  • the decoder complexity configuration determined by the decoder complexity switching engine 550 can be used to control the switch 545 to couple the input bitstream 502 to a particular decoder model of the plurality of decoder models 510-1, 510-2, ..., 510-N.
  • the particular decoder model activated by the switch 545 to receive the input bitstream 502 can be the particular decoder model that is associated with a respective complexity the same as (or most similar to) the decoder complexity configuration determined by the decoder complexity switching engine 550.
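The routing behavior of a software switch can be sketched as below. The numeric complexity scale, the callable decoder stand-ins, and the function name are assumptions for illustration only; the source does not specify how complexity levels are represented.

```python
def route_frame(frame, decoders, complexities, target_complexity):
    """Send one frame to the decoder whose complexity is closest to the target.

    decoders[i] is a callable standing in for decoder model i, and
    complexities[i] is its (assumed) numeric complexity level.
    """
    idx = min(range(len(decoders)),
              key=lambda i: abs(complexities[i] - target_complexity))
    return idx, decoders[idx](frame)
```

When two models are equally close to the target, `min` returns the first one, so listing models from lowest to highest complexity makes ties resolve toward the cheaper model.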
  • the decoder complexity switching engine 550 can be used to determine a decoder complexity configuration that corresponds to real-time completion of the current frame decoding (e.g., completion within the real-time processing requirement or limit for the decoder-side processing of the current frame audio data within input bitstream 502).
  • the decoder complexity switching engine 550 determines a decoder complexity configuration and/or selects a particular decoder model instance of the plurality of decoder model instances 510-1, 510-2, ..., 510-N that guarantees real-time completion of current frame decoding for the input bitstream 502.
  • the decoder complexity switching engine 550 determines a decoder complexity configuration and/or selects a particular decoder model instance of the plurality of decoder model instances 510-1, 510-2, ..., 510-N that has a maximum probability or likelihood of real-time completion of current frame decoding for the input bitstream 502 within the real-time processing requirement (e.g., within the 20ms processing time requirement for 20ms frames of audio data represented within the input bitstream 502).
  • the decoder complexity switching engine 550 can determine a decoder complexity configuration and/or can select a particular decoder model instance of the plurality of decoder model instances 510-1, 510-2, ..., 510-N that increases the probability or likelihood of real-time completion of current frame decoding for the input bitstream 502 within the real-time processing requirement (e.g., within the 20ms processing time requirement for 20ms frames of audio data represented within the input bitstream 502).
  • the decoder complexity switching engine 550 can determine a particular decoder model instance and corresponding decoder complexity level (e.g., decoder complexity configuration) that increases the probability of real-time completion of the current frame decoding over that of a currently or previously configured decoder model instance 510-1, 510-2, ..., 510-N.
  • the increased probability of real-time completion of current frame decoding may be relative to the probability of real-time completion associated with using a default decoder model instance included in the plurality of decoder model instances 510-1, 510-2, ..., 510-N.
  • the decoder complexity switching engine 550 can determine the decoder complexity configuration and/or can select from among the plurality of decoder model instances 510-1, 510-2, ..., 510-N based on analyzing one or both of device load information obtained from the device resource monitor 520 and decoder processing time measurements obtained from the decoder profiling engine 530.
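One possible selection policy is sketched below. It is a hypothetical heuristic, not the method of the described system: it assumes each model has a profiled per-frame decode time measured on an idle device, scales that time by the remaining processor headroom, and picks the most complex model that still fits a safety margin of the frame budget. All names, the scaling rule, and the 0.8 margin are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DecoderModel:
    name: str
    complexity: int      # higher means more compute (assumed scale)
    baseline_ms: float   # profiled decode time per frame on an idle device


def choose_model(models, cpu_load, budget_ms=20.0, margin=0.8):
    """Pick the most complex model predicted to finish within the budget.

    cpu_load is a utilization fraction in [0, 1]. The predicted time is
    baseline_ms / (1 - cpu_load), an assumed first-order load model.
    """
    headroom = max(1e-3, 1.0 - cpu_load)
    fitting = [m for m in models
               if m.baseline_ms / headroom <= budget_ms * margin]
    if fitting:
        return max(fitting, key=lambda m: m.complexity)
    # nothing fits: fall back to the cheapest model to maximize the chance
    # of real-time completion
    return min(models, key=lambda m: m.complexity)
```

With 20 ms frames, a lightly loaded device would keep the highest-complexity model, while a heavily loaded device would drop to a cheaper model for the next frame.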
  • the device resource monitor 520 can determine (e.g., monitor and output to the decoder complexity switching engine 550) device load information corresponding to a UE or other computing device used to implement the neural decoder (e.g., the selected one of the neural decoder instances 510-1, 510-2, ..., 510-N) and the dynamic decoder complexity switching system 500 of FIG. 5.
  • the device load information can be obtained and/or determined by the device resource monitor 520, which may be configured to monitor the hardware resource utilization of a UE or other computing device used to implement the neural network-based decoder.
  • the device load information can include a current central processing unit (CPU) usage or utilization percentage and/or can include a current usage or utilization percentage for various other processors included in the device.
  • the processor utilization information may be instantaneous utilization information corresponding to the current state of the CPU or other processor.
  • the processor utilization information can be an average utilization percentage over a configured time window.
  • the device load information can include or otherwise be indicative of an average CPU usage over the previous 5ms, 10ms, 15ms, etc.
  • the device load information determined, monitored, and/or obtained by the device resource monitor 520 can include a memory (e.g., random-access memory (RAM), processor cache, video RAM (VRAM), etc.) usage of the UE or other computing device used to implement the neural network-based decoder.
  • the memory usage information can be indicative of an instantaneous memory percentage in use, an average memory percentage used over a configured time window (e.g., average memory usage over the previous 5ms, 10ms, 15ms, etc.), and/or a combination thereof.
  • the device resource monitor can obtain CPU, processor, and/or memory usage information from a resource statistics engine associated with the UE or other computing device and/or an operating system or firmware thereof.
  • the CPU, processor, and/or memory usage information can include or comprise CPU, processor, and/or memory (respectively) resource statistics information representing an instantaneous utilization, an aggregate utilization, an average utilization over one or more configured time windows, etc.
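The distinction between instantaneous and window-averaged utilization can be illustrated with a small fixed-size window. The class name and the choice of a sample-count window (rather than a time-based window such as 5ms or 10ms) are assumptions for illustration.

```python
from collections import deque


class LoadWindow:
    """Keep the most recent utilization samples (percentages)."""

    def __init__(self, size):
        self.samples = deque(maxlen=size)  # old samples drop off automatically

    def add(self, pct):
        self.samples.append(pct)

    def instantaneous(self):
        # most recent sample
        return self.samples[-1]

    def average(self):
        # mean over the configured window
        return sum(self.samples) / len(self.samples)
```

The same structure could hold CPU, DSP, or memory utilization samples; only the source of the samples differs.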
  • the device load information obtained by the device resource monitor 520 can be monitoring information corresponding to one or more hardware components of a UE or other computing device used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5.
  • the device resource monitor 520 can be configured to obtain CPU resource statistics, metrics, usage information, history information, performance information, etc., based on transmitting one or more respective queries to corresponding CPU resources and/or corresponding communication interfaces thereof (e.g., included in and/or implemented by the UE or other computing device associated with the neural decoder and the dynamic decoder complexity switching system 500).
  • the device resource monitor 520 can determine resource load information corresponding to one or more DSP engines or DSP cores of a UE or other computing device used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5.
  • the device load information can include the DSP load and/or DSP resource monitoring information obtained responsive to the one or more DSP resource queries.
  • the device resource monitor 520 can transmit one or more DSP core queries to a DSP core and/or a DSP core communication interface.
  • the DSP resource queries can be transmitted from the device resource monitor 520 to a DSP OS or other DSP client associated with the DSP or DSP core(s).
  • the DSP core queries can correspond to DSP core load information.
  • the device resource monitor 520 can query a DSP core frequency and/or a DSP load from the DSP OS.
  • the DSP load information may correspond to MCPS (millions of clocks per second) information, MPPS (millions of packets per second) information, and/or CPP information obtained from the DSP OS and/or client.
  • the device resource monitor 520 can transmit one or more queries for DDR memory bandwidth and/or real-time DDR memory clock information to a memory subsystem of the UE or other computing device used to implement the neural decoder and/or dynamic decoder complexity switching system 500.
  • the memory subsystem may be associated with and/or accessed through a corresponding client and/or through a dedicated communication interface or bus corresponding to the memory subsystem.
  • the memory subsystem can provide, responsive to the memory load query, information indicative of DDR memory utilization, DDR memory bandwidth, etc.
  • the device resource monitor 520 may determine the device load information based on querying bus active clock frequency information from a DSP OS (e.g., a client interface to the DSP OS, etc.).
  • the bus active clock frequency information can be associated with a bus clock of the UE or other computing device.
  • the device resource monitor 520 can be configured to aggregate some (or all) of the resource information obtained from a CPU, NPU, DSP core, DDR bandwidth or memory subsystem, bus clock, etc., and may transmit the aggregated information to the decoder complexity switching engine 550 as the instantaneous and/or time window-averaged device load information.
  • the device load information can be obtained from a device resource monitor or device resource manager that is included in and/or implemented by the UE or other computing device that implements the neural network-based decoder.
  • the device load information can be obtained from a performance monitoring or profiling library included in and/or implemented by an operating system of the device that implements the neural decoder.
  • the device resource monitor 520 can obtain and/or determine device load information based on inspecting a /proc/stat special file maintained by the operating system or kernel of the UE or computing device used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5. In some cases, the device resource monitor 520 can be configured to utilize a perf_events system library to obtain and/or determine the device load information.
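One way the /proc/stat inspection could look on a Linux-based device is sketched below. The helper names `parse_cpu_times` and `cpu_utilization` are hypothetical, and the two sample strings are made-up values; on a real device the two texts would be read from /proc/stat a short interval apart. Counting idle plus iowait time as "not busy" is an assumption of this sketch.

```python
def parse_cpu_times(stat_text: str) -> tuple:
    """Return (busy, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
            values = [int(v) for v in fields[1:9]]
            total = sum(values)
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(prev_text: str, cur_text: str) -> float:
    """Fraction of non-idle CPU time between two /proc/stat snapshots."""
    prev_busy, prev_total = parse_cpu_times(prev_text)
    cur_busy, cur_total = parse_cpu_times(cur_text)
    delta_total = cur_total - prev_total
    if delta_total <= 0:
        return 0.0
    return (cur_busy - prev_busy) / delta_total


# Made-up snapshots for illustration only.
before = "cpu  100 0 50 800 50 0 0 0 0 0\n"
after = "cpu  160 0 90 850 50 0 0 0 0 0\n"
utilization = cpu_utilization(before, after)
print(round(utilization, 3))
```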
  • the decoder profiling engine 530 can be used to obtain (e.g., monitor and output to the decoder complexity switching engine 550) decoding time information corresponding to the total elapsed decoding time for completion of decoding processing for one or more previously decoded frames processed by the neural decoder (e.g., previously decoded frames processed by respective ones of the plurality of decoder model instances 510-1, 510-2, ..., 510-N).
  • the decoder measurement and profiling engine 530 may obtain decoding time measurement information for the previously decoded frames of the input bitstream 502, and/or can be used to obtain decoder implementation profiling information corresponding to one or more of the plurality of decoder model instances 510-1, 510-2, ..., 510-N.
  • the decoder implementation may include (e.g., and the decoder profiling engine 530 may utilize) one or more system calls configured to obtain a decoder start time (e.g., decoder start wall time) for one or more previously decoded frames of the bitstream 502.
  • one or more system calls can be configured to obtain a decoder end time (e.g., decoder end wall time) for the one or more previously decoded frames of the bitstream 502.
  • the obtained decoder end time(s) may correspond to respective decoder start times.
  • the decoder profiling engine 530 can determine a total elapsed processing time (e.g., a total elapsed decoding time) of the neural decoder in performing decoding processing for the respective one of the previously decoded frames.
  • the decoder profiling information obtained by the decoder profiling engine 530 and provided to the decoder complexity switching engine 550 can be indicative of a total decoder processing time elapsed (e.g., wall time elapsed to decode previous frame(s)) to complete respective decoding of each previously decoded frame of a plurality of previously decoded frames.
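The start-time and end-time measurement described above can be sketched as follows. `DecoderProfiler` and its methods are hypothetical names, and `time.perf_counter` is used here as a monotonic stand-in for the wall-clock system calls mentioned in the text; the 20 ms default limit follows the speech frame example used elsewhere in this description.

```python
import time


class DecoderProfiler:
    """Hypothetical profiler that records elapsed time per decoded frame."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.elapsed_ms = []  # one entry per previously decoded frame

    def timed_decode(self, decode_fn, frame):
        start = self._clock()          # decoder start time
        output = decode_fn(frame)
        end = self._clock()            # decoder end time
        self.elapsed_ms.append((end - start) * 1000.0)
        return output

    def exceeded(self, limit_ms: float = 20.0) -> bool:
        """True if the most recent frame took longer than the real-time limit."""
        return bool(self.elapsed_ms) and self.elapsed_ms[-1] > limit_ms


profiler = DecoderProfiler()
decoded = profiler.timed_decode(lambda frame: frame.upper(), "frame-0")
print(decoded, profiler.exceeded())
```

The injectable `clock` argument is a design convenience so that the timing logic can be exercised with a deterministic clock.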
  • the decoder complexity switching engine 550 can be used to provide dynamic decoder complexity scaling based on configuring the switch 545 to couple the input bitstream 502 (e.g., the input to the neural decoder for decoding or signal synthesis) to a particular (e.g., selected) one of the plurality of different decoder instances with different complexities (e.g., decoder 510-1 with a first synthesis complexity, decoder 510-2 with a second synthesis complexity, ..., decoder 510-N with an Nth synthesis complexity, etc.).
  • the complexity scaling performed and/or configured or controlled by the decoder complexity switching engine 550 can be based on device load information obtained from the device resource monitor 520, can be based on decoder measurement or profiling information from the decoder profiling engine 530 (e.g., including a decoder start time, decoder end time, decoder total elapsed processing time, etc., for one or more previously decoded frames, etc.), and/or can be based on both device load information from the device resource monitor and decoder measurement information from the decoder profiling engine 530.
  • device load information and/or decoding time measurement information can be used (e.g., by the decoder complexity switching engine 550) to dynamically reduce the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively high current device load state.
  • a relatively high device load may be associated with the device performing multiple tasks simultaneously or in parallel.
  • the device may be utilizing a portion of its computational resources and/or computational capacity for additional tasks such as a user browsing or consuming other media content in parallel to a voice call, audio call, video call, etc., that is associated with one or more speech frames within the bitstream 502 to be decoded by the neural decoder.
  • the device may be used to perform a video call, and the bitstream 502 includes speech frame features (or other encoded representations of speech frames or speech signals) corresponding to the ongoing video call.
  • the relatively high device load may be associated with the device simultaneously performing (e.g., performing in parallel) video decoding for the frames of video data of the video call, and performing audio decoding (e.g., speech or voice decoding) for the frames of audio data (e.g., frames of speech or voice data) that are also included in the video call and/or are associated with the frames of video data.
  • a respective neural decoder model may be associated with a frame processing completion time that is within the configured real-time requirement or real-time limit associated with the frame type.
  • each of the neural decoder model instances 510-1, 510-2, ..., 510-N may be associated with a speech frame processing completion time that is within the real-time requirement for decoding speech frames (e.g., each model may be able to decode a received speech frame in an amount of time less than or equal to the length of each speech frame and/or the periodicity between the decoder receiving consecutive speech frames in the bitstream 502).
  • each of the neural decoder model instances 510-1, 510-2, ..., 510-N may be able to complete speech frame processing for a respective speech frame within a 20ms real-time processing requirement that is equal to the length of each speech frame and/or is equal to the periodicity between consecutive speech frames included in the input bitstream 502 to the neural decoder.
  • the neural decoder model instances 510-1, 510-2, ..., 510-N may be associated with different complexities, and may each correspond to a different amount of device load or processing power for completing processing of a respective speech frame.
  • one or more of the neural decoder model instances 510-1, 510-2, ..., 510-N may be unable to complete processing (e.g., decoding) of a respective frame of the input bitstream 502 within the corresponding real-time processing requirement or limit for that frame.
  • a relatively high device load may correspond to fewer processing resources being made available to implement and run the neural decoder model instance, which can cause an increased (e.g., longer) frame processing time.
  • the decoder complexity switching engine 550 can be used to dynamically increase the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively low current device load state, based on the decoder complexity switching engine 550 analyzing the device load information from the device resource monitor 520. In some aspects, the decoder complexity switching engine 550 can be used to dynamically decrease the neural network-based decoder complexity based on a determination of a relatively high current device load state. Switching to a lower complexity neural decoder model instance can be performed to ensure real-time processing of the incoming frames of audio data (e.g., within bitstream 502) is still achieved.
  • a lower complexity neural decoder model instance may be used to complete audio frame decoding within the 20ms real-time requirement when the device load is relatively high and the neural decoder is able to use relatively few processing resources of the device.
  • a higher complexity neural decoder model instance may be used to complete audio frame decoding within the 20ms real-time requirement with greater or improved accuracy or quality, based on the higher complexity neural decoder model instance being able to utilize relatively larger amounts of processing resources of the device when the overall device load is relatively low.
  • the device load information can be used to dynamically reduce or increase the decoder complexity so that the decoder remains within the configured time limit for processing (e.g., decoding, synthesizing, etc.) each respective frame.
  • the device load information can be used by the decoder complexity switching engine 550 to configure or control the switch 545 to implement a particular selection of a decoder model instance and complexity out of the plurality of different decoder model instances and corresponding complexities available at the device (e.g., out of the plurality of decoder model instances 510-1, 510-2, ..., 510-N, etc.), to thereby dynamically reduce or increase the decoder complexity so that the decoder remains within a 20ms real-time processing requirement associated with a stream of 20ms speech frames provided to the decoder in the input bitstream 502.
  • the decoder complexity switching engine 550 can select a lower complexity neural decoder model instance 510-1, 510-2, ..., 510-N based on a determination that the current device load is relatively high and relatively fewer processing resources will be available to process the next frame of audio data using one of the neural decoder model instances 510-1, 510-2, ..., 510-N.
  • the decoder complexity switching engine 550 can select a lower complexity neural decoder model instance 510-1, 510-2, ..., 510-N based on a determination that the previously decoded frame (and/or multiple previously decoded frames) exceeded the real-time processing requirement or limit (e.g., one or more previous frames took longer than the 20ms speech frame real-time processing limit to complete processing and decoding).
  • switching to a lower complexity neural decoder model instance may be associated with an output quality degradation of the reconstructed or synthesized signal generated by the neural decoder.
  • a graceful quality drop may be preferable to the severe degradation and/or output signal discontinuity that may otherwise be caused by not switching to the lower complexity neural decoder model instance and violating (e.g., exceeding) the real-time decoding time limit by continuing to use the relatively higher complexity neural decoder model instance.
  • the decoder complexity switching engine 550 can select a higher complexity neural decoder model instance 510-1, 510-2, ..., 510-N based on a determination that the current device load is relatively low and relatively greater processing resources will be available to process the next frame of audio data using one of the neural decoder model instances 510-1, 510-2, ..., 510-N.
  • the decoder complexity switching engine 550 can select a higher complexity neural decoder model instance 510-1, 510-2, ..., 510-N based on a determination that the previously decoded frame (and/or multiple previously decoded frames) completed decoding processing before the real-time processing requirement or limit (e.g., one or more previous frames took less than the 20ms speech frame real-time processing limit to complete processing and decoding). In some cases, the decoder complexity switching engine 550 can increase the neural decoder model complexity based on a determination that one or more previous frames completed decoding processing early by an amount greater than or equal to a configured threshold.
  • the decoder complexity switching engine 550 can be configured to increase the neural decoder complexity based on one or more previous frames completing processing more than 10% ahead of the real-time processing requirement limit time (e.g., more than 2ms ahead of the 20ms real-time processing requirement limit time for 20ms speech or audio frames, etc.).
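The scale-down and scale-up conditions described above can be combined into a single decision function. The following is a minimal sketch, not the claimed method: the function name, the load thresholds (0.85 and 0.50), the one-level step size, and the choice to require both low load and early completion before scaling up are assumptions. The 20 ms limit and the 10% early-completion margin (2 ms) come from the examples in the text.

```python
def next_complexity(level, num_levels, load, last_decode_ms,
                    limit_ms=20.0, early_fraction=0.10,
                    high_load=0.85, low_load=0.50):
    """Return the complexity level (0 = lowest) to use for the next frame."""
    # Step down if the previous frame missed the deadline or the load is high.
    if last_decode_ms > limit_ms or load >= high_load:
        return max(level - 1, 0)
    # Step up only when the load is low and the previous frame finished early
    # by more than the configured margin (10% of 20 ms = 2 ms).
    margin_ms = early_fraction * limit_ms
    if load <= low_load and (limit_ms - last_decode_ms) > margin_ms:
        return min(level + 1, num_levels - 1)
    # Otherwise keep the same complexity.
    return level


print(next_complexity(level=1, num_levels=4, load=0.3, last_decode_ms=15.0))
print(next_complexity(level=2, num_levels=4, load=0.9, last_decode_ms=10.0))
```

Under these assumed thresholds, a frame that finished 5 ms early on a lightly loaded device moves up one level, while a heavily loaded device moves down one level even if the previous frame met its deadline.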
  • the decoder complexity switching engine 550 can monitor and/or analyze in real-time the device load information obtained by device resource monitor 520 and/or the previous frame processing time information obtained from the decoder profiling engine 530.
  • the decoder complexity switching engine 550 can be configured to generate an output indicative of a complexity scaling adjustment corresponding to increased decoder complexity (e.g., a selection of a decoder model instance 510-1, 510-2, ..., 510-N with a greater complexity than the decoder used to process the previous frame or one or more previous frames), a scaling adjustment corresponding to decreased decoder complexity (e.g., a selection of a decoder model instance 510-1, 510-2, ..., 510-N with a lesser complexity than the decoder used to process the previous frame or previous one or more frames), and/or a scaling adjustment corresponding to the same decoder complexity.
  • maintaining the same decoder complexity for the current frame as the previous frame can correspond to utilizing the same neural decoder model instance 510-1, 510-2, ..., 510-N.
  • the plurality of neural decoder model instances 510-1, 510-2, ..., 510-N may include multiple neural decoder model instances that are different from one another but have the same complexity, and maintaining the same decoder complexity can be associated with using the switch 545 to switch between different neural decoder model instances within the same level of complexity.
  • a complexity scaling indication can be generated and/or determined by the decoder complexity switching engine 550, and may be used to configure the switch 545 coupled between the input bitstream 502 received at the neural decoder system and the plurality of different neural decoder model instances 510-1, 510-2, ..., 510-N, where each decoder model instance of the plurality of decoder model instances is associated with a respective synthesis complexity.
  • each decoder model instance of the plurality of decoder model instances 510-1, 510-2, ..., 510-N is associated with a different respective synthesis complexity.
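The role of the switch can be pictured as a small router that applies a complexity scaling indication and forwards each frame to exactly one decoder instance. In this sketch, `DecoderSwitch` is a hypothetical stand-in for switch 545, the decoders are placeholder callables, and starting at the highest complexity is an assumption.

```python
class DecoderSwitch:
    """Hypothetical stand-in for switch 545: routes each frame to one decoder."""

    def __init__(self, decoders):
        # decoders: callables ordered from lowest to highest complexity.
        if not decoders:
            raise ValueError("at least one decoder instance is required")
        self._decoders = list(decoders)
        # Assumption: begin with the highest complexity instance.
        self.selected = len(self._decoders) - 1

    def configure(self, index: int) -> None:
        """Apply a complexity scaling indication by selecting a decoder index."""
        if not 0 <= index < len(self._decoders):
            raise IndexError("no decoder instance at that complexity index")
        self.selected = index

    def decode(self, frame):
        return self._decoders[self.selected](frame)


def low_complexity_decoder(frame):
    return ("low", frame)


def high_complexity_decoder(frame):
    return ("high", frame)


switch = DecoderSwitch([low_complexity_decoder, high_complexity_decoder])
first = switch.decode("frame-0")
switch.configure(0)  # scaling indication: move to the lower complexity
second = switch.decode("frame-1")
print(first, second)
```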
  • FIG. 6A is a diagram illustrating a plurality of example neural decoder architectures 600 each associated with a respective complexity, in accordance with some examples.
  • FIG. 6B is a diagram illustrating additional example neural decoder architectures 650 each associated with a respective complexity, in accordance with some examples.
  • the example neural decoder architectures 600 of FIG.6A and 650 of FIG.6B may be used to implement one or more neural decoder model instances of the plurality of neural decoder model instances 510-1, 510-2, ..., 510-N of FIG. 5.
  • one or more of the neural decoder model instances 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, and/or 610-8 of FIGS. 6A and 6B may be used to implement one or more neural decoder model instances of the plurality of neural decoder model instances 510-1, 510-2, ..., 510-N of FIG. 5.
  • the systems and techniques can implement dynamic decoder complexity scaling to increase or decrease a neural decoder complexity while remaining within a real-time processing limit associated with processing (e.g., decoding) one or more input frames received at the decoder.
  • the neural decoder complexity scaling can be implemented based on a selection of a particular neural decoder model instance with a particular complexity, where the selection is made from a plurality of neural decoder model instances associated with different respective complexities.
  • the different decoder model instances can implement different complexities (e.g., synthesis complexity, decoding complexity, processing or processing time complexity, etc.) based on the respective decoder model instances utilizing different neural network and/or machine learning architectures.
  • a first neural network architecture may be used to implement a first level of complexity
  • a second neural network architecture may be used to implement a second level of complexity, etc.
  • the first neural decoder 610-1 and the second neural decoder 610-2 of FIG. 6A may be associated with different respective complexities, based on the first neural decoder 610-1 having a different neural network architecture than that of the second neural decoder 610-2.
  • the first neural decoder architecture 610-1 may be associated with a higher complexity than the second neural decoder architecture 610-2.
  • the first neural decoder architecture 610-1 may be associated with a lesser complexity than the second neural decoder architecture 610-2.
  • the first neural decoder architecture 610-1 and the second neural decoder architecture 610-2 may be different from one another, but may be associated with the same complexity.
  • the different decoder model instances can implement different complexity based on the respective decoder model instances utilizing different neural network or machine learning weights or parameters.
  • a first set of neural network weights can be used to parameterize a neural network architecture to a first level of complexity
  • a second set of neural network weights can be used to parameterize the same or a different neural network architecture to a second level of complexity, etc.
  • the third neural decoder model instance 610-3 of FIG.6A utilizes a set of weights wi
  • the fourth neural decoder model instance 610-4 of FIG.6A utilizes a set of weights wk that is different from the set of weights wi.
  • the third neural decoder 610-3 and fourth neural decoder 610-4 may utilize the same neural network architecture, parameterized by the respective different sets of weight values wi and wk.
  • the neural decoder architecture 610-3 may be associated with a higher complexity than the neural decoder architecture 610-4.
  • the neural decoder architecture 610-3 may be associated with a lesser complexity than the neural decoder architecture 610-4.
  • the different complexity decoder model instances may share one or more neural network layers and can scale in complexity based on combining the shared neural network layers with one or more complexity level-specific subsets of corresponding neural network layers.
  • a first complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the first complexity level.
  • a second complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the second complexity level, etc.
  • the neural decoder model instance 610-5 of FIG.6A represents a first configuration using a combination of a set of shared neural network layers and a subset of complexity-specific neural network layers “NN Layers A.” Additional subsets of complexity-specific neural network layers “NN Layers B”, ..., “NN Layers N” may be included in the neural decoder and associated with the shared NN layers, but are not activated or used for the Complexity A configuration.
  • the neural decoder model instance 610-6 of FIG. 6A can be the same as or similar to the neural decoder model instance 610-5, with a second configuration of neural network layers implemented to combine the same set of shared neural network layers with the second subset of complexity-specific neural network layers (“NN Layers B”).
  • the first NN Layers A complexity-specific subset, ..., and the Nth NN Layers N complexity-specific subset are deactivated or not used, while the second NN Layers B complexity-specific subset is selected and configured as the active subset of complexity-specific neural network layers.
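The shared-layer arrangement described above can be sketched with plain callables standing in for neural network layers. The class name, the toy arithmetic "layers", and the ordering (shared layers first, then the active complexity-specific subset) are assumptions for illustration; the source does not state where the shared layers sit in the pipeline.

```python
class SharedLayerDecoder:
    """Hypothetical decoder: shared layers plus one active specific subset."""

    def __init__(self, shared_layers, specific_layers):
        self.shared_layers = shared_layers      # list of callables
        self.specific_layers = specific_layers  # dict: name -> list of callables
        self.active = next(iter(specific_layers))

    def set_complexity(self, name):
        """Activate one complexity-specific subset; all others stay unused."""
        if name not in self.specific_layers:
            raise KeyError(name)
        self.active = name

    def forward(self, x):
        for layer in self.shared_layers:
            x = layer(x)
        for layer in self.specific_layers[self.active]:
            x = layer(x)
        return x


# Toy stand-ins for layers; a real decoder would use tensors and learned weights.
shared = [lambda x: x + 1]
specific = {
    "A": [lambda x: x * 2],
    "B": [lambda x: x * 2, lambda x: x - 3],
}
decoder = SharedLayerDecoder(shared, specific)
out_a = decoder.forward(1)   # Complexity A configuration
decoder.set_complexity("B")
out_b = decoder.forward(1)   # Complexity B configuration
print(out_a, out_b)
```

Because the shared layers are stored once, switching complexity changes only which subset is traversed, not what is loaded.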
  • one or more lower complexity neural decoder model instances can be implemented using respective subsets of a plurality of neural network layers included in a highest complexity neural decoder model instance associated with the UE or other computing device.
  • one or more neural network layers and/or subsets of neural network layers of the highest complexity decoder model instance can be deactivated and bypassed to implement lower complexity level decoder model instances.
  • lower complexity decoders can be implemented by bypassing one or more neural network layers, subsets of neural network layers, components, processing steps, etc., of the highest complexity decoder model instance at the device.
  • the highest complexity neural decoder 610-7 can include one or more processing blocks and/or groups or subsets of neural network layers, where respective processing blocks and/or subsets or groups of neural network layers may be deactivated and bypassed to reduce the decoder complexity and implement a reduced complexity decoder model instance.
  • the neural decoder 610-7 includes a feature dequantization engine 642, a feature upsampling engine 644, and a neural network synthesis engine 646.
  • the decoding processing pipeline can include each of the feature dequantization engine 642, the feature upsampling engine 644, and the neural network synthesis engine 646.
  • the neural decoder model instance 610-8 can be lower complexity than the higher complexity neural decoder 610-7, and may be implemented based on deactivating one or more of the processing blocks and/or subsets of neural network layers that are activated in the higher complexity neural decoder 610-7. For example, the complexity of the neural decoder 610-7 can be reduced based on deactivating, within the lower complexity model instance 610-8, the neural network layers corresponding to the feature upsampling engine 644.
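Deactivating and bypassing a processing block can be sketched as a pipeline with per-stage enable flags. The class and stage names below are hypothetical labels for the feature dequantization engine 642, the feature upsampling engine 644, and the neural network synthesis engine 646, and the toy stage functions are placeholders. In a real decoder the synthesis stage would have to accept features at the lower rate when upsampling is bypassed, which this sketch does not model.

```python
class BypassableDecoder:
    """Hypothetical pipeline whose stages can be deactivated and bypassed."""

    STAGES = ("dequantize", "upsample", "synthesize")

    def __init__(self, dequantize, upsample, synthesize):
        self._fns = {
            "dequantize": dequantize,
            "upsample": upsample,
            "synthesize": synthesize,
        }
        self.enabled = {name: True for name in self.STAGES}

    def deactivate(self, stage):
        if stage not in self.enabled:
            raise KeyError(stage)
        self.enabled[stage] = False

    def decode(self, features):
        x = features
        for name in self.STAGES:
            if self.enabled[name]:  # bypass any deactivated stage
                x = self._fns[name](x)
        return x


def make_decoder():
    return BypassableDecoder(
        dequantize=lambda q: [v * 0.5 for v in q],
        upsample=lambda f: [v for v in f for _ in range(2)],
        synthesize=lambda f: sum(f),
    )


full = make_decoder()       # analogous to the higher complexity instance
reduced = make_decoder()    # analogous to the lower complexity instance
reduced.deactivate("upsample")
full_out = full.decode([2, 4])
reduced_out = reduced.decode([2, 4])
print(full_out, reduced_out)
```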
  • FIG. 7 is a diagram illustrating an example of a dynamic decoder complexity scaling system 700 that can be used to implement a dynamic decoder complexity adjustment based on determining the currently elapsed execution time at one or more checkpoints during processing of a current frame (e.g., based on determining the elapsed time at one or more points in time prior to completion of the processing of the current frame).
  • the dynamic decoder complexity adjustment can be based on determining the currently elapsed execution time (e.g., from the beginning of processing for the current frame) at one or more checkpoints corresponding to different layers and/or subsets or groups of layers of a plurality of layers included in the neural decoder model.
  • a neural decoder model of FIG. 7 may include a plurality of neural network layers, including a first subset of decoder NN layers 715-1, a second subset of decoder NN layers 715-2, ..., and an Nth subset of decoder NN layers 715-N.
  • the subsets of decoder NN layers 715-1, 715-2, ..., 715-N may be included in a respective neural decoder model instance of a plurality of neural decoder model instances, such as the plurality of neural decoder model instances 510-1, 510-2, ..., 510-N of FIG. 5.
  • the systems and techniques can be configured to monitor the execution time of processing and/or decoding of the current frame with increased granularity during the current frame execution. For example, after the current frame data is processed by the first subset 715-1 of NN layers, the elapsed execution time can be determined at block 730-1.
  • the elapsed execution time determined at block 730-1 may correspond to the elapsed time between beginning processing of the current frame data by the neural decoder (e.g., beginning processing with the first subset 715-1 of NN decoder layers) and completing processing by the first subset 715-1 of NN decoder layers.
  • the systems and techniques can determine a remaining time budget based on the elapsed execution time determined at block 730-1.
  • the remaining time budget can be based on the difference between the configured real-time requirement or real-time processing limit for decoding each frame by the neural decoder (e.g., 20ms limit for a 20ms frame duration, etc.), and the elapsed execution time at block 730-1.
  • the remaining time budget can be indicative of a remaining amount of time before the current frame execution (e.g., current frame decoding processing) equals or exceeds the configured real-time processing limit duration.
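The remaining time budget can be written compactly. The symbols below are introduced here for illustration and do not appear in the source: $T_{\text{limit}}$ is the configured real-time processing limit, $T_{\text{elapsed}}$ is the elapsed execution time measured at a checkpoint such as block 730-1, and $T_{\text{rem}}$ is the remaining time budget.

```latex
T_{\text{rem}} = T_{\text{limit}} - T_{\text{elapsed}}
```

As an illustrative calculation with a made-up measurement, a 20 ms limit and 7 ms elapsed at the first checkpoint would leave a budget of 13 ms for all remaining subsets of decoder layers.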
  • the systems and techniques can compare the remaining time budget to an estimated processing time associated with processing the current frame data by each of the remaining NN decoder layer subsets 715-2, ..., 715-N.
  • the estimated processing time can include an estimated processing time for each respective subset of the remaining NN decoder layer subsets 715-2, ..., 715-N. In some examples, the estimated processing time can include an aggregate processing time that is estimated as a single value for the one or more remaining NN decoder layer subsets 715-2, ..., 715-N that have not yet begun or completed execution to process the current frame data. [0167] In some aspects, the determination at block 740-1 can compare the remaining time budget to one or more configured thresholds. For example, block 740-1 may determine if the remaining time budget is greater than 0 (e.g., the real-time processing requirement limit has not yet been reached or exceeded).
  • subsequent NN decoder layer subsets may be skipped, to expedite processing of the current frame data to remain within the real-time processing limit or to minimize the duration by which the real-time processing limit is exceeded for the current frame.
  • the remaining time budget can be used to determine whether the neural decoder should continue with processing of the current frame data using the next subset of NN decoder layers in the sequential order, or if the neural decoder should skip one or more subsequent subsets of NN decoder layers and/or exit early.
  • the remaining time budget can be used to determine whether the neural decoder should continue to process the current frame data using the sequential order of each subset of NN decoder layers associated with the processing pipeline that includes all of the NN decoder layer subsets 715-1, 715-2, ..., 715-N, or if insufficient time remains to perform the full processing.
  • processing can proceed from the subset 1 decoder NN layers 715-1 to the subset 2 decoder NN layers 715-2 based on a determination at block 740-1 that the estimated time remaining is less than or equal to the remaining time budget (e.g., processing can be completed by all decoder NN layer subsets within the remaining time budget and within the real-time processing limit for the current frame).
  • block 740-1 may correspond to a determination that the estimated time remaining is greater than the remaining time budget (e.g., the remaining time budget is insufficient to perform full processing using each subset of NN decoder layers while remaining within and not exceeding the real-time processing requirement time limit of 20ms).
  • one or more subsequent NN decoder layer subsets 715-2, ..., 715-N can be skipped based on the remaining time budget being less than the estimated time to complete processing.
  • the second subset 715-2 of NN decoder layers can be skipped, and processing can proceed to the final (e.g., output layer) subset N of decoder NN layers 715-N.
  • all remaining NN decoder layers subsets may be skipped based on the remaining time budget being less than the estimated time to complete processing.
  • the next NN decoder layer subset can be skipped, and the elapsed execution time can be compared again to an updated remaining time budget after the next NN decoder layer subset processing checkpoint is reached.
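The checkpoint logic of FIG. 7 can be sketched as a loop that measures elapsed time before each intermediate subset and skips that subset when the estimated time for the remaining subsets exceeds the remaining budget. The function name, the per-subset estimates `est_ms`, and the rule that the first and last subsets always run are assumptions of this sketch; a deterministic fake clock is used so the behavior is reproducible, and the toy arithmetic "layers" stand in for real layer subsets whose input and output shapes would need to be compatible when one is skipped.

```python
import time


def decode_with_checkpoints(frame, subsets, est_ms, limit_ms=20.0,
                            clock=time.perf_counter):
    """Run layer subsets in order, skipping intermediate ones if time is short.

    subsets: list of callables; the first and last always run (assumption).
    est_ms:  est_ms[i] is the estimated time in ms to run subsets[i].
    Returns (output, indices_of_executed_subsets).
    """
    start = clock()
    x = frame
    executed = []
    last = len(subsets) - 1
    for i, subset in enumerate(subsets):
        if 0 < i < last:
            elapsed_ms = (clock() - start) * 1000.0  # analogous to block 730-x
            remaining_ms = limit_ms - elapsed_ms     # remaining time budget
            needed_ms = sum(est_ms[i:])              # estimate for remaining subsets
            if needed_ms > remaining_ms:             # analogous to block 740-x
                continue                             # skip this subset
        x = subset(x)
        executed.append(i)
    return x, executed


subsets = [lambda x: x + 1, lambda x: x * 10, lambda x: x * 2, lambda x: x - 1]
estimates = [5.0, 8.0, 8.0, 2.0]

# Fake clock: start at 0 s, then 6 ms elapsed at both intermediate checkpoints.
ticks = iter([0.0, 0.006, 0.006])
output, executed = decode_with_checkpoints(1, subsets, estimates,
                                           clock=lambda: next(ticks))
print(output, executed)
```

In this run the second subset is skipped because 18 ms of estimated work does not fit in the 14 ms budget, while the third subset runs because the re-evaluated estimate of 10 ms does fit.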
  • the elapsed execution time can be determined after each NN decoder layer subset.
  • block 730-2 can be the same as or similar to block 730-1, where block 730-2 is associated with determining the elapsed execution time for processing of the current frame data after finishing processing by the second NN decoder layer subset 715-2.
  • block 740-2 can be the same as or similar to block 740-1, where block 740-2 determines an updated time remaining budget corresponding to a decoder processing checkpoint that is after the first NN decoder layer subset 715-1 and the second NN decoder layer subset 715-2, and before the remaining NN decoder layer subsets 715-3, ..., 715-N.
  • the elapsed execution time determinations of block 730-1 and/or 730-2 can be performed after completing processing for each subset of NN decoder layers 715-1, 715-2, ..., etc., prior to the final (e.g., output) subset of NN decoder layers 715-N.
  • the elapsed execution time determinations of block 730-1 and/or 730-2 can be performed after completing processing for each subset of NN decoder layers that is immediately prior to a skippable subset of NN decoder layers, where the determined elapsed execution time is used to determine whether or not the skippable subset of NN decoder layers will be skipped or will be used to process the current frame data.
  • the elapsed execution time determinations of block 730-1 and/or 730-2 can be performed using the decoder profiling engine 530 of FIG.5.
  • the respective elapsed execution time measurements corresponding to particular intermediate NN decoder layers or intermediate subsets of NN decoder layers between the first (e.g., input) subset 715-1 and the last (e.g., output) subset 715-N of decoder NN layers can be obtained by the decoder profiling engine 530 and/or can be included in decoder measurement time information or decoder profile information obtained by or associated with the decoder profiling engine 530 of FIG. 5.
  • the determination of whether to skip one or more subsequent NN decoder layers or subsets of NN decoder layers can be associated with and/or performed by the decoder complexity switching engine 550 of FIG.5.
  • the systems and techniques described herein may be self-contained and/or implemented by a single UE or other mobile computing device associated with an audio call and/or audio frame processing (e.g., a voice-only call, a video call including video frames and audio frames, etc.).
  • the UE or mobile computing device may be unable to plan for or control the device load associated with the device load information and device resource monitor 520 of FIG. 5.
  • the device load, CPU and/or other processor utilization, memory utilization, etc. may be factors that a voice call application executing the neural decoder and dynamic decoder complexity scaling system 500 on the UE cannot control or schedule.
  • the dynamic decoder complexity scaling system 500 can be used to react in real-time to the observed device load information, CPU clock frequency or utilization, etc., that exist for the UE at the time a current frame begins processing in the neural decoder and/or at the time a current frame is scheduled to begin processing in the neural decoder.
  • the complexity scaling for the neural decoder and the selection of a particular neural decoder model instance (e.g., from the plurality of different neural decoder model instances 510-1, 510-2, ..., 510-N associated with different respective complexities) can be performed for each frame of a plurality of frames received by the neural decoder in the bitstream 502.
  • the decoder complexity may be scaled down to a lower complexity neural decoder model instance for one or more frames until the device load decreases. Based on the device load decreasing, the neural decoder complexity can be scaled up to a higher complexity neural decoder model instance in response.
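  • As an illustrative, non-limiting sketch of the per-frame complexity scaling described above, the following Python example selects a decoder model instance from the device load observed when a frame is about to be processed. The `DecoderInstance` type, the instance names, and the load thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderInstance:
    name: str
    max_load: float  # highest device load (0..1) at which this instance is used


# Hypothetical instances, ordered from highest to lowest complexity.
INSTANCES = [
    DecoderInstance("high", 0.50),
    DecoderInstance("medium", 0.80),
    DecoderInstance("low", 1.00),
]


def select_instance(device_load: float) -> DecoderInstance:
    """Return the most complex instance whose load ceiling covers device_load."""
    for instance in INSTANCES:
        if device_load <= instance.max_load:
            return instance
    return INSTANCES[-1]  # fall back to the lowest complexity


# One load observation per frame: complexity scales down as load rises
# and scales back up once the load decreases.
loads = [0.30, 0.60, 0.95, 0.70, 0.20]
chosen = [select_instance(load).name for load in loads]
print(chosen)
```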
  • the systems and techniques described herein can be used to provide dynamic complexity adjustment for one or more of a decoder and/or an encoder included in an audio codec system, such as the audio codec system 800 of FIG. 8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG.10, etc.
  • the dynamic decoder complexity scaling system 500 of FIG. 5 can be included in a decoder 810 of the audio codec system 800 of FIG. 8, a decoder 910 of the audio codec system 900 of FIG. 9, a decoder 1010 of the audio codec system 1000 of FIG. 10, etc.
  • one or more of the device resource monitor 520 of FIG.5 and/or the decoder profiling engine 530 can be included in the decoder 810 of the audio codec system 800 of FIG.8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc.
  • one or more of the decoder 810 of the audio codec system 800 of FIG.8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc., can be included in the plurality of decoders with different synthesis complexity 510-1, 510-2, ..., 510-N of FIG.5.
  • one or more of the decoder 810 of the audio codec system 800 of FIG.8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc. can be included in the plurality of decoders with different synthesis complexity 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, 610-8 of FIGS.6A-6B.
  • FIG.8 is a block diagram illustrating an example of an audio codec system 800 that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer.
  • the audio codec system 800 can also be referred to as a voice coding system (e.g., vocoder), a signal synthesis system, a voice coding signal synthesis system, etc.
  • the audio codec system 800 can be a generative audio codec (e.g., a generative voice codec) including one or more feedback recurrent autoencoders (FRAEs) that may be used to generate encoded audio data, and one or more neural synthesizers that may be used to generate reconstructed audio data based on the encoded audio data.
  • the audio codec system 800 can include an encoder 805 (e.g., a transmitter) and a decoder 810 (e.g., a receiver).
  • the encoder 805 can be used to generate encoded features that are transmitted to the decoder 810 over a channel 840.
  • the encoder 805 can receive audio 815 and generate encoded features corresponding to the audio 815.
  • the encoded features can be generated using a feedback recurrent autoencoder (FRAE) 825 included in the encoder 805.
  • the encoder 805 can include one or more FRAEs, where each FRAE of the one or more FRAEs is configured to generate a corresponding one or more encoded features associated with the audio 815.
  • the audio 815 may include and/or comprise a speech signal, a voice signal, etc.
  • the decoder 810 can receive encoded features (e.g., from the encoder 805, over the channel 840) and generate reconstructed audio 855 based at least in part on the encoded features received from the encoder 805.
  • the reconstructed audio 855 may include and/or comprise a reconstructed speech signal, a reconstructed voice signal, etc.
  • the reconstructed audio 855 can be generated using a neural network-based speech synthesizer 850 (e.g., also referred to as a neural speech synthesizer and/or a neural synthesizer) included in the decoder 810.
  • the reconstructed audio 855 generated using the neural speech synthesizer 850 may also be referred to as synthesized speech.
  • the decoder 810 can include an FRAE decoder 845, which can be used to decode the encoded features received from the encoder 805.
  • the FRAE decoder 845 can be the same as or similar to a decoder implemented by the FRAE autoencoder 825 included in the encoder 805.
  • the decoder 810 can include one or more FRAE decoders 845, which can correspond to one or more FRAE autoencoders 825 included in the encoder 805.
  • the number of FRAE decoders 845 included in the decoder 810 can be equal to the number of FRAE autoencoders 825 included in the encoder 805.
  • the audio 815 includes speech.
  • the audio 815 may be an example of the speech signal 251 of FIG.2B, and/or the speech signal 303 of FIG.3, etc.
  • the encoder 805 of the generative voice codec system 800 can be configured to generate one or more encoded representations of the input audio 815.
  • the one or more encoded representations can include spectral envelope information associated with the audio 815, and/or can include pitch information associated with the audio 815, etc.
  • the encoder 805 can extract spectral envelope features from the audio 815 using spectral envelope feature extraction 820, and can encode the extracted spectral envelope features using the FRAE 825.
  • the encoded spectral features z can be transmitted from the FRAE 825 to the decoder 810, using the channel 840.
  • the encoder 805 can extract pitch information from the audio 815 using pitch extraction engine 830, and can perform quantization 835 to generate quantized pitch information associated with the audio 815.
  • the quantized pitch information can be transmitted from the pitch quantizer 835 to the decoder 810, using the channel 840.
  • the encoder 805 can be configured to extract spectral features (e.g., spectral envelope features) from the audio 815 using the spectral envelope feature extraction engine 820.
  • the spectral envelope feature extraction engine 820 can perform a cepstrum computation to determine cepstrum information and/or one or more cepstral coefficients corresponding to the audio 815.
  • the spectral features extracted from the audio 815 using the spectral envelope feature extraction 820 can include mel-frequency cepstral coefficients (MFCC), such as MFCC-24 features (e.g., 24-dimensional MFCC features).
  • the spectral envelope feature extraction 820 applies one or more mel-scaled filter(s) (e.g., a mel-scaled filterbank), a logarithmic compression, and/or a discrete cosine transform (DCT) to the audio 815 (e.g., and/or to a magnitude spectrum associated with the audio 815).
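  • As an illustrative, non-limiting sketch of the mel-scaled filterbank, logarithmic compression, and DCT chain described above, the following Python example computes 24 MFCC-style coefficients for one frame. The mel-scale formula, the triangular filter shape, the Hann window, the FFT size, and all function names are assumptions made for illustration and are not specified by this disclosure.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct_ii(x):
    """Unnormalized DCT-II of a 1-D array."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n) @ x


def mfcc_frame(frame, sample_rate, n_filters=24, n_fft=512):
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed, n_fft))         # magnitude spectrum
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ magnitude
    log_energies = np.log(energies + 1e-10)                  # log compression
    return dct_ii(log_energies)                              # cepstral coefficients


sample_rate = 16000
t = np.arange(320) / sample_rate                             # one 20 ms frame
coeffs = mfcc_frame(np.sin(2.0 * np.pi * 200.0 * t), sample_rate)
print(coeffs.shape)
```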
  • the encoder 805 processes the extracted spectral features using a feedback recurrent autoencoder (FRAE) 825 to generate encoded features z.
  • the FRAE 825 includes a decoder and an encoder.
  • the FRAE 825 can implement feedback of state information h between the decoder and the encoder included in the FRAE 825.
  • the state information h can be determined or obtained at the decoder of the FRAE 825, and feedback of the state information h can be performed for the decoder of the FRAE 825 (e.g., the state information h is fed back to the decoder of the FRAE 825) and for the encoder of the FRAE 825 (e.g., the state information h is fed back from the decoder of the FRAE 825 to the encoder of the FRAE 825).
  • the spectral envelope feature extraction engine 820 can be configured to generate and/or determine (e.g., extract) one or more types of spectral envelope features.
  • the extracted spectral envelope features may include cepstrum information and/or cepstral coefficients.
  • cepstral liftering can be performed based on DCT truncation to exclude or remove pitch information and only capture spectral envelope information in the extracted cepstrum or cepstral coefficients.
  • the extracted spectral envelope features can include companded (e.g., log) filterbank energies associated with the input audio 815.
  • companded filterbank energies can be determined based on applying an inverse DCT (e.g., IDCT) to a liftered cepstrum.
  • the extracted spectral envelope features can include filterbank energies determined based on uncompanding (e.g., exp) the companded filterbank energies.
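  • As an illustrative, non-limiting sketch of the liftering steps described above, the following Python example truncates a DCT-domain cepstrum, applies an inverse DCT to obtain companded (log) filterbank energies, and uncompands them with an exponential. The orthonormal DCT scaling, the function names, and the example energies are assumptions made for illustration.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * k * (m + 0.5) / n)
    mat[0] /= np.sqrt(2.0)
    return mat


def lifter_envelope(log_energies, n_keep):
    """Keep the first n_keep cepstral coefficients and return the smoothed
    companded energies together with the uncompanded energies."""
    dct = dct_matrix(len(log_energies))
    cepstrum = dct @ log_energies
    cepstrum[n_keep:] = 0.0                  # DCT truncation (liftering)
    smoothed_log = dct.T @ cepstrum          # IDCT of the liftered cepstrum
    return smoothed_log, np.exp(smoothed_log)


log_e = np.log(np.array([1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 4.0, 2.0]))
smooth_all, _ = lifter_envelope(log_e, 8)    # nothing removed
smooth_one, energies_one = lifter_envelope(log_e, 1)  # only the mean level kept
```

  • Keeping fewer coefficients yields a smoother envelope; keeping all of them returns the original log energies unchanged.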
  • the extracted spectral envelope features can include a full resolution spectrum (e.g., DFT domain with companded (e.g., log, linear amplitude, etc.) information, etc.). The full resolution spectrum can be smoothed to the envelope of the spectrum, for example based on interpolating the companded or uncompanded filterbank energies.
  • the extracted spectral envelope features can include one or more linear prediction (LP) coefficients, for example determined based on processing one or more speech frames of the input audio 815 using the Levinson-Durbin algorithm (e.g., autocorrelation technique), and/or using a covariance technique.
  • the extracted spectral envelope features can include one or more of line spectral frequency (LSF) information and/or line spectral pair (LSP) information.
  • the spectral envelope feature extraction engine 820 can be configured to extract a first set of one or more spectral features comprising cepstrum information and/or cepstral coefficients, and to extract a second set of one or more spectral features comprising LP coefficients, LSF information, and/or LSP information.
  • the second set of one or more spectral features can include LP coefficients (LPCs) estimated by the spectral envelope feature extraction engine 820 based on the Levinson-Durbin algorithm or other speech autocorrelation techniques.
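  • As an illustrative, non-limiting sketch of the Levinson-Durbin estimation described above, the following Python example computes LP coefficients from an autocorrelation sequence. The sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and the function names are assumptions made for illustration.

```python
import numpy as np


def autocorrelation(frame, order):
    """Autocorrelation of a frame at lags 0..order."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    Returns (a[1..order], residual energy) for A(z) = 1 + sum_k a[k] z^-k.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:], err


# Theoretical autocorrelation of a first-order process with pole at 0.9.
r = 0.9 ** np.arange(3)
lp_coeffs, residual = levinson_durbin(r, order=2)
print(lp_coeffs, residual)
```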
  • the spectral envelope feature extraction engine 820 can estimate one or more LP coefficients from spectral envelope features (e.g., the first set of one or more spectral features comprising cepstrum information and/or cepstral coefficients, etc.), using a neural network-based technique or a DSP-based technique.
  • LSP coefficients can be determined by the spectral envelope feature extraction engine 820 based on autocorrelation of the speech signal 815.
  • the spectral envelope feature extraction engine 820 can determine one or more LSPs, and can convert the one or more LSPs to a corresponding one or more LSFs.
  • the encoded features or encoded feature information transmitted from the generative voice codec system encoder 805 can comprise a latent representation z between the encoder of the FRAE 825 and the decoder of the FRAE 825.
  • the encoder 805 can be configured to pass (e.g., transmit) the encoded features z through a channel 840 to the decoder 810.
  • the encoder 805 also processes the audio 815 using a pitch extraction engine 830 to extract pitch information from the audio 815.
  • the audio 815 can be processed in parallel by the spectral envelope feature extraction engine 820 and the pitch extraction engine 830.
  • the encoder 805 can include a denoiser 818 that processes the input audio 815 and provides a de-noised audio to the spectral envelope feature extraction engine 820 and/or to the pitch extraction engine 830.
  • the pitch extraction engine 830 can generate one or more types of pitch information.
  • the pitch extraction engine 830 can output pitch information in the frequency-domain (e.g., f0 pitch information of the audio 815, in units of Hertz (Hz)) and/or can output pitch information in the time-domain (e.g., pitch lag in samples, with or without a fractional component, and/or pitch lag in milliseconds).
  • the pitch extraction engine 830 can be used to generate a pitch estimate indicative of a pitch lag (or pitch delay) from the audio 815, and/or to identify a pitch correlation from the audio 815.
  • the pitch information generated by the pitch extraction engine 830 can include a pitch lag and a pitch correlation.
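  • As an illustrative, non-limiting sketch of the pitch representations described above, the following Python example converts between pitch frequency and pitch lag and estimates an integer pitch lag and a pitch correlation from a frame. The normalized-autocorrelation search, the search range, and the function names are assumptions made for illustration and are not the pitch extraction technique of this disclosure.

```python
import numpy as np


def f0_to_lag_samples(f0_hz, sample_rate):
    """Pitch lag in samples (possibly fractional) for a pitch frequency in Hz."""
    return sample_rate / f0_hz


def lag_samples_to_ms(lag, sample_rate):
    return 1000.0 * lag / sample_rate


def estimate_pitch(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return the integer lag with the highest normalized correlation,
    together with that correlation value."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_corr = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        current, delayed = frame[lag:], frame[:-lag]
        norm = np.sqrt(np.dot(current, current) * np.dot(delayed, delayed)) + 1e-12
        corr = np.dot(current, delayed) / norm
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr


sample_rate = 8000
n = np.arange(640)
frame = np.sin(2.0 * np.pi * 100.0 * n / sample_rate)    # 100 Hz tone
lag, corr = estimate_pitch(frame, sample_rate)
print(lag, lag_samples_to_ms(lag, sample_rate), sample_rate / lag)
```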
  • the encoder 805 can be configured to process the pitch information of the audio 815 (e.g., pitch, pitch lag, and/or pitch correlation, determined using the pitch extraction engine 830) using a quantizer 835 Q() to generate a quantized pitch signal.
  • the encoder 805 passes the quantized pitch signal through the channel 840 to the decoder 810.
  • the quantizer 835 Q() may also be referred to as a pitch quantizer.
  • the quantizer 835 Q() can perform vector quantization (VQ) to generate the quantized pitch signal for transmission to the decoder 810 over the channel 840.
  • the quantizer 835 Q() can perform single-stage VQ and/or can perform multi-stage VQ (e.g., MSVQ).
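  • As an illustrative, non-limiting sketch of single-stage and multi-stage VQ, the following Python example quantizes a two-element pitch vector (pitch lag, pitch correlation) with two cascaded codebooks and reconstructs it by codebook lookup. The codebook entries and function names are hypothetical; a practical codebook would be trained.

```python
import numpy as np


def nearest(codebook, x):
    """Index of the codebook entry closest to x in squared error."""
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))


def msvq_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stage."""
    residual = np.asarray(x, dtype=float)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices


def msvq_decode(indices, codebooks):
    """Sum the selected entry of every stage (codebook lookup)."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))


# Hypothetical codebooks: a coarse stage followed by a refinement stage.
coarse = np.array([[40.0, 0.2], [80.0, 0.5], [120.0, 0.8]])
fine = np.array([[-10.0, -0.1], [0.0, 0.0], [10.0, 0.1]])

indices = msvq_encode([88.0, 0.62], [coarse, fine])
decoded = msvq_decode(indices, [coarse, fine])
print(indices, decoded)
```

  • Single-stage VQ corresponds to passing a list containing one codebook.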
  • the quantizer 835 Q() can be implemented using one or more FRAEs.
  • the quantizer 835 Q() can be configured to implement forward error correction (FEC) for the quantized pitch signal that is transmitted to the decoder 810 over the channel 840.
  • the quantizer 835 Q() can implement multiple description coding (MDC) for the transmission of the quantized pitch signal over the channel 840, can implement full-redundancy FEC for the transmission of the quantized pitch signal over the channel 840, etc.
  • the generative voice codec system decoder 810 can be configured to receive the encoded features z from the FRAE 825 included in the generative voice codec system encoder 805.
  • the decoder 810 can receive the encoded features z via the channel 840.
  • the decoder 810 decodes the encoded features z using a FRAE 845 to generate decoded features.
  • the FRAE 845 of the decoder 810 includes only a decoder, without an encoder.
  • the FRAE 845 of the decoder 810 can include an encoder.
  • the decoded features generated by the FRAE 845 include mel-frequency cepstral coefficients (MFCC), such as MFCC-24 features (e.g., 24-dimensional MFCC features).
  • the decoded features determined using the FRAE decoder 845 can be provided to a neural speech synthesizer 850, which can be configured to generate a reconstructed audio 855 (e.g., synthesized speech) based at least in part on the decoded spectral envelope features from the FRAE decoder 845.
  • the neural speech synthesizer 850 receives as input the decoded spectral envelope features (e.g., from the FRAE decoder 845) and the received pitch encoding information (e.g., the quantized pitch signal received by the decoder 810 over the channel 840 and from the pitch quantizer 835 of the encoder 805).
  • the neural speech synthesizer 850 can generate the reconstructed audio 855 (e.g., synthesized speech) based on processing the decoded spectral envelope features along with a pitch signal (e.g., the quantized pitch signal or a reconstructed variant thereof).
  • the decoder 810 can receive the quantized pitch signal (e.g., indicative of the pitch, the pitch lag, and/or the pitch correlation of the audio 815) from the encoder 805 via the channel 840.
  • the decoder 810 passes the quantized pitch signal to the neural speech synthesizer 850, and the neural speech synthesizer 850 processes the decoded spectral envelope features along with the quantized pitch signal to generate reconstructed audio 855 (e.g., synthesized speech).
  • the decoder 810 can receive the quantized pitch signal (e.g., indicative of the pitch, the pitch lag, and/or the pitch correlation of the audio 815) from the encoder 805 via the channel 840, and can perform dequantization of the quantized pitch signal to obtain a reconstructed pitch signal.
  • the reconstructed pitch signal can be passed to the neural speech synthesizer, and used in combination with the decoded spectral envelope features from the FRAE decoder 845 to generate the reconstructed audio 855 (e.g., synthesized speech).
  • the decoder 810 can include a reconstruction engine that is separate from the neural speech synthesizer 850.
  • the reconstruction engine processes the quantized pitch signal to reconstruct the pitch signal (e.g., including the pitch, the pitch lag, and/or the pitch correlation) before the neural speech synthesizer 850 uses the reconstructed pitch signal (e.g., the pitch, the pitch lag, and/or the pitch correlation) to generate the reconstructed audio 855.
  • the output of the neural speech synthesizer 850 can be provided to one or more linear predictive coding (LPC) layers 852 included in the decoder 810.
  • LPC 852 can be used to perform linear prediction analysis and/or linear prediction synthesis, based on the output of the neural speech synthesizer 850.
  • the LPC 852 can perform linear prediction based on the output of the neural speech synthesizer 850, to generate the reconstructed audio 855 (e.g., synthesized speech).
  • the decoder 810 does not include the LPC 852, and the neural speech synthesizer 850 can be trained and/or configured to generate the reconstructed audio 855 directly (e.g., the output of the neural speech synthesizer 850 can be the reconstructed audio 855).
  • the LPC layers 852 can be used to implement a linear prediction (LP) synthesis filter, based on: $y[n] = x[n] - \sum_{k=1}^{p} a_k \, y[n-k]$, where $x[n]$ represents the input signal to the LP filter (e.g., the input signal to the LPC layers 852), $y[n]$ represents the output signal (e.g., the output of the LPC layers 852, which can be the reconstructed audio 855), $a_k$ represents the linear prediction coefficients associated with implementing the LP synthesis filter, and $p$ is a value corresponding to the LP filter order.
  • the LP filter order p can have a value of 16 or 12, etc., among various other LP filter order values.
  • the LP synthesis filter associated with the LPC layers 852 can be implemented in the time domain or in the frequency domain.
  • the LP synthesis filter can be implemented based on a difference equation.
  • the LP synthesis filter can be implemented in the time domain by convolving the input signal x[n] with the LP filter impulse response or an approximation of the LP filter impulse response.
  • the LP filter can be associated with an infinite impulse response (IIR), which can be approximated using a finite segment of the IIR (e.g., such as the first N samples of the IIR for a configured integer value of N, etc.).
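  • As an illustrative, non-limiting sketch of the time-domain implementations described above, the following Python example runs an LP synthesis filter as a difference equation and compares it with convolution against the first N samples of the filter impulse response. The sign convention y[n] = x[n] - sum_k a_k y[n-k], the single-pole coefficient, and the function names are assumptions made for illustration.

```python
import numpy as np


def lp_synthesis(x, a):
    """Difference equation y[n] = x[n] - sum_{k=1..p} a[k-1] * y[n-k]."""
    p = len(a)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k - 1] * y[n - k]
        y[n] = acc
    return y


def truncated_impulse_response(a, n_taps):
    """First n_taps samples of the (infinite) impulse response."""
    impulse = np.zeros(n_taps)
    impulse[0] = 1.0
    return lp_synthesis(impulse, a)


a = np.array([-0.9])              # hypothetical LP coefficient (order p = 1)
x = np.zeros(200)
x[::40] = 1.0                     # sparse excitation, one pulse every 40 samples

y_exact = lp_synthesis(x, a)
h = truncated_impulse_response(a, 64)
y_approx = np.convolve(x, h)[: len(x)]
max_error = np.max(np.abs(y_exact - y_approx))
print(max_error)
```

  • The error comes only from the discarded tail of the impulse response, so it shrinks as N grows or as the filter decays faster.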
  • the LP synthesis filter associated with the LPC layers 852 can be implemented in the frequency domain based on multiplying the FFT of the input signal x[n] with the frequency response of the LP filter, and then determining the IFFT of the result to convert the output of the frequency domain LP synthesis filter into a time domain signal corresponding to the reconstructed audio 855.
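  • As an illustrative, non-limiting sketch of the frequency-domain implementation described above, the following Python example multiplies the FFT of the input by the frequency response 1/A of the LP filter and applies an inverse FFT. The sign convention A(z) = 1 + sum_k a_k z^-k, the FFT size, and the function names are assumptions made for illustration. Because FFT-domain multiplication is a circular convolution, the result is an approximation whose error depends on how far the impulse response has decayed within the FFT length.

```python
import numpy as np


def lp_synthesis_direct(x, a):
    """Reference recursion y[n] = x[n] - sum_k a[k-1] * y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] - sum(a[k - 1] * y[n - k] for k in range(1, min(len(a), n) + 1))
    return y


def lp_synthesis_fft(x, a, n_fft):
    """Apply the LP synthesis filter as X(w) / A(w) on an n_fft-point grid."""
    a_poly = np.concatenate(([1.0], a))
    A = np.fft.rfft(a_poly, n_fft)        # A(e^jw) sampled at the FFT bins
    X = np.fft.rfft(x, n_fft)             # input zero-padded to n_fft
    y = np.fft.irfft(X / A, n_fft)        # back to the time domain
    return y[: len(x)]


a = np.array([-0.9])                       # hypothetical LP coefficient
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
y_fft = lp_synthesis_fft(x, a, n_fft=256)
y_ref = lp_synthesis_direct(x, a)
print(np.max(np.abs(y_fft - y_ref)))
```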
  • the LP coefficients $a_k$ can be estimated from the decoded spectral envelope features obtained using the FRAE decoder 845.
  • the spectral envelope features can be LP coefficients (e.g., determined by the spectral envelope feature extraction engine 820 based on using the Levinson-Durbin algorithm or other autocorrelation technique, or using a covariance technique, to process the input speech frames of the audio 815), and the decoded LP coefficients on the decoder side can be used for the LP filter implemented by the LPC layers 852.
  • the extracted spectral envelope features may be LSFs or LSPs, and the decoded LSFs or LSPs obtained using the FRAE decoder 845 can be converted to the corresponding LP coefficients $a_k$.
  • the LP coefficients $a_k$ of the LPC layers 852 can be estimated from the spectral envelope features using a neural network-based and/or a DSP-based technique.
  • the extracted spectral envelope features may comprise a cepstrum, and a DSP-based technique can be used to convert the decoded cepstrum into filterbank energies, and subsequently interpolate the filterbank energies to estimate an FFT square magnitude at all FFT bins (e.g., including bins beyond the centers of filterbank filters).
  • the DSP-based technique can include applying an IFFT to obtain an estimate of the autocorrelation sequence, and utilizing the Levinson-Durbin algorithm on the estimated autocorrelation sequence to obtain the LP coefficients $a_k$.
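  • As an illustrative, non-limiting sketch of the DSP-based technique described above, the following Python example interpolates filterbank energies onto every FFT bin, applies an IFFT to the squared-magnitude spectrum to estimate the autocorrelation sequence, and runs the Levinson-Durbin algorithm on the result. The linear interpolation, the sign convention A(z) = 1 + sum_k a_k z^-k, the first-order test envelope, and the function names are assumptions made for illustration.

```python
import numpy as np


def interpolate_energies(center_bins, energies, n_bins):
    """Linearly interpolate filterbank energies onto FFT bins 0..n_bins-1."""
    return np.interp(np.arange(n_bins), center_bins, energies)


def levinson_durbin(r, order):
    """Return a[1..order] for A(z) = 1 + sum_k a[k] z^-k."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def power_spectrum_to_lpc(power_spectrum, order):
    """power_spectrum holds squared magnitudes at rfft bins 0..N/2."""
    autocorr = np.fft.irfft(power_spectrum)   # IFFT of |X|^2 gives autocorrelation
    return levinson_durbin(autocorr, order)


# Test envelope: squared magnitude response of 1 / (1 - 0.9 z^-1).
n_fft = 512
w = 2.0 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
power = 1.0 / np.abs(1.0 - 0.9 * np.exp(-1j * w)) ** 2
est_a = power_spectrum_to_lpc(power, order=1)
print(est_a)
```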
  • the spectral envelope feature extraction 820 can be performed based at least in part on spectral feature learning.
  • spectral feature learning can be performed by one or more machine learning models configured and/or trained to generate as output the one or more extracted spectral envelope features.
  • a cepstrum or cepstrum information may be an optional input to a machine learning spectral feature learning model used to implement the spectral envelope feature extraction 820.
  • the spectral envelope feature extraction 820 can be implemented using one or more machine learning spectral feature learning models or engines, which can be jointly trained with the neural speech synthesizer 850, an NHV used to implement the neural speech synthesizer (e.g., such as the NHV-based neural speech synthesizer 950 of FIG. 9), an LPC network used to implement the neural speech synthesizer (e.g., such as the LPC network-based neural speech synthesizer 1050 of FIG.10), etc.
  • joint training of the spectral envelope feature extraction 820 and the neural speech synthesizer 850 can be performed based on a short-time Fourier transform (STFT) loss and/or STFT loss function.
  • joint training of the spectral envelope feature extraction 820 and the neural speech synthesizer 850 can be performed based on an adversarial and/or generative adversarial network (GAN) loss.
  • joint training can be performed using a multi-resolution STFT loss, based on calculating STFT amplitude spectrograms from ground truth speech information and the synthesized speech information 855.
  • the multi-resolution STFT loss can then be determined as a sum of mean absolute error and mean absolute error in the log domain.
  • the STFT amplitude spectrograms can be calculated at different window lengths to capture representations of the error and/or loss at different time and/or frequency resolutions.
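  • As an illustrative, non-limiting sketch of the multi-resolution STFT loss described above, the following Python example sums, over several window lengths, the mean absolute error between amplitude spectrograms and the mean absolute error between their logarithms. The window lengths, the hop size of one quarter window, the Hann window, the `eps` floor, and the function names are assumptions made for illustration; a training implementation would typically use a differentiable framework rather than NumPy.

```python
import numpy as np


def stft_magnitude(x, win_len, hop):
    """Amplitude spectrogram with a Hann window, shape (frames, bins)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_stft_loss(reference, synthesized,
                               win_lengths=(128, 256, 512), eps=1e-7):
    total = 0.0
    for win_len in win_lengths:
        ref = stft_magnitude(reference, win_len, win_len // 4)
        syn = stft_magnitude(synthesized, win_len, win_len // 4)
        linear_term = np.mean(np.abs(ref - syn))
        log_term = np.mean(np.abs(np.log(ref + eps) - np.log(syn + eps)))
        total += linear_term + log_term
    return total


rng = np.random.default_rng(0)
speech = rng.standard_normal(2048)           # stand-in for ground truth speech
loss_same = multi_resolution_stft_loss(speech, speech)
loss_scaled = multi_resolution_stft_loss(speech, 0.5 * speech)
print(loss_same, loss_scaled)
```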
  • joint training based on adversarial or GAN loss can be used to learn temporal fine structures in speech signals.
  • An adversarial or GAN loss can be used to match the distribution of real speech and the synthesized speech 855.
  • the neural speech synthesizer 850 (and/or an NHV-based neural speech synthesizer) can be used as the generator network.
  • a separate discriminator network can be used to train based on the adversarial loss, with both the NHV and the discriminator jointly trained using adversarial training techniques.
  • the discriminator network can be implemented as a classifier configured to classify whether an input audio represents real speech or synthesized speech.
  • the generator network can attempt to fool the discriminator network to classify synthesized speech as real speech. Over the course of training, the generator network improves and begins producing (e.g., generating or synthesizing) speech that is very similar to the real speech.
  • the discriminator network can be implemented using WaveNet.
  • the neural speech synthesizer 850 can be implemented using one or more trained neural networks.
  • the one or more trained neural networks can be trained to perform speech synthesis and/or signal synthesis.
  • the neural speech synthesizer 850 can be implemented using or based on an LPCNet machine learning architecture, a WaveNet machine learning architecture, a WaveRNN machine learning architecture, etc.
  • the neural speech synthesizer 850 can be provided as a neural homomorphic vocoder (NHV).
  • FIG.9 illustrates an example of an audio codec system 900 (e.g., generative voice codec) that includes a decoder 910 configured to implement an NHV-based neural speech synthesizer.
  • an NHV is a type of neural vocoder that can synthesize speech with source-filter models controlled by one or more neural networks.
  • An NHV-based neural vocoder may include one or more neural networks in a source-filter model that can synthesize speech based on filtering impulse trains and noise with linear time-varying (LTV) filters, with the one or more neural networks used to control the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features.
  • Traditional or non-neural vocoders may operate based on decomposing speech into various parameters such as pitch, timbre, rhythm, etc., which can subsequently be manipulated and resynthesized to generate a desired output audio or voice signal.
  • Neural vocoders can apply transformations and manipulations to speech signals directly within a learned feature space of the neural network.
  • the learned feature space used by neural vocoders may capture more complex relationships and characteristics of speech than non-neural network-based signal processing and/or vocoder techniques.
  • neural vocoders can be trained to learn a mapping between raw speech waveforms and the spectral or cepstral representations of the speech.
  • An NHV system can apply one or more homomorphic processing techniques within the learned space corresponding to the mapping.
  • FIG. 9 is a block diagram illustrating an example of an audio codec system 900 (e.g., a generative voice codec, a voice coding signal synthesis system, etc.) that includes a decoder 910 that can be used to generate reconstructed audio (e.g., synthesized speech) using an NHV-based neural speech synthesizer 950, in accordance with some examples.
  • the decoder 910 may also be referred to as an NHV synthesizer decoder.
  • the audio codec system 900 of FIG.9 can be the same as or similar to the audio codec system 800 of FIG. 8.
  • the channel 940 of FIG. 9 can be the same as or similar to the channel 840 of FIG. 8.
  • the NHV synthesizer decoder 910 can include the FRAE decoder 945, which may be the same as or similar to the FRAE decoder 845 of FIG.8, and may include a pitch de-quantization engine 938 (e.g., de Q()).
  • the pitch de-quantization engine 938 can be used to process a received pitch encoding obtained by the decoder 910 over the channel 940.
  • the received pitch encoding information can be quantized pitch information or a quantized pitch signal transmitted to the decoder 910 over the channel 940 by a corresponding encoder (e.g., an encoder associated with the decoder 910, which may be the same as or similar to the encoder 805 of FIG.8, etc.).
  • the pitch dequantization engine 938 can generate reconstructed pitch information based on performing a codebook lookup to dequantize the received pitch encoding obtained from the channel 940 (e.g., to dequantize the quantized pitch encoding received by the decoder 910 over the channel 940).
  • the dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 938 can include pitch information in the frequency-domain (e.g., f0 pitch frequency information in Hz), pitch information in the time-domain (e.g., pitch lag or pitch delay information in samples, milliseconds, etc.), and/or pitch correlation information.
  • the dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 938 can include information indicative of a voiced or unvoiced (V/UV) classification.
  • the neural speech synthesizer 950 can be implemented as a neural homomorphic vocoder (NHV) speech synthesizer and/or an NHV-based neural speech synthesizer.
  • the neural speech synthesizer 950 can include a neural filter estimator 952 configured to generate respective filter specification or filter configuration information to parameterize one or more linear time-varying (LTV) filters of the NHV speech synthesizer 950.
  • the neural filter estimator 952 can be trained to generate respective filter coefficients for a noise linear time-varying filter (LTVF) 975 and to generate respective filter coefficients for a harmonic LTVF 965.
  • the neural network model of the neural filter estimator 952 can include any neural network architecture that can be trained to model the filter coefficients for the noise LTVF 975 and the harmonic LTVF 965 (e.g., and/or that can be trained to model the filter coefficients for one or more additional filters implemented by the NHV speech synthesizer 950, in either the time-domain, the frequency-domain, or combinations thereof).
  • Examples of neural network architectures that may be included in the neural network filter estimator 952 can include a generative neural network (e.g., a generative-adversarial network (GAN)), convolutional neural networks (CNN), an autoencoder, and/or other type(s) of neural networks, etc.
  • the neural filter estimator 952 can be configured to receive a stream of decoded spectral envelope features from the FRAE decoder 945. Based on the decoded spectral envelope features, the neural filter estimator 952 can generate corresponding filter characterization parameters for each filter of one or more filters included in the NHV speech synthesizer 950. The respective filter characterization parameters can be output from the neural filter estimator 952 and used to parameterize and/or configure corresponding learned filters (e.g., learned linear filters, etc.) for each respective set of filter characterization parameters.
  • the neural filter estimator 952 can generate a first set of filter characterization parameters corresponding to the noise LTVF 975, can generate a second set of filter characterization parameters corresponding to the harmonic LTVF 965, etc.
  • the one or more filters included in the NHV speech synthesizer 950 e.g., the noise LTVF 975, the harmonic LTVF 965, etc.
  • the noise LTVF 975, the harmonic LTVF 965, and/or various other filters that may be included in the NHV speech synthesizer 950 can be specified or characterized (e.g., using the filter characterization parameters determined by the neural filter estimator 952) as cepstrums, as frequency responses, as time-domain impulse responses, as difference equation coefficients, etc.
  • the neural filter estimator 952 can be configured to receive as input the decoded spectral envelope features from the FRAE decoder 945, and may additionally receive as input at least a portion of the dequantized pitch information generated by the pitch dequantization engine 938.
  • the neural filter estimator 952 may receive as input the decoded spectral envelope features and pitch dequantization information (e.g., such as f0 pitch information in the frequency domain, pitch lag or pitch delay in the time domain (e.g., in units of samples or milliseconds, etc.), etc.).
  • the pitch dequantization information provided as input to the neural filter estimator 952 may include voiced/unvoiced (V/UV) classification information indicating whether the underlying audio represented in the encoded information received by the decoder 910 over the channel 940 corresponds to voiced or unvoiced sounds, speech, etc.
  • V/UV voiced/unvoiced
  • the neural filter estimator 952 can generate the corresponding filter characterization parameters for each NHV filter (e.g., noise LTVF 975, harmonic LTVF 965, etc.) based on the decoded spectral envelope features and the dequantized pitch information.
  • the noise LTVF 975 and the harmonic LTVF 965 can be implemented as time domain filters or frequency domain filters. In some examples, the noise LTVF 975 and the harmonic LTVF 965 can be implemented in the time domain or the frequency domain, independent of whether the neural filter estimator 952 is configured to generate the corresponding filter characterization parameters in the time domain or the frequency domain.
  • the neural filter estimator 952 can output impulse response-based filter characterization parameters, and the operation of noise LTVF 975 and/or harmonic LTVF 965 can be implemented in the time domain as convolutions with the impulse response.
  • the operation of noise LTVF 975 and/or harmonic LTVF 965 may be implemented in the frequency domain based on converting the impulse response to frequency response (e.g., using an FFT transform) and multiplying with the FFT of the input signal, and subsequently converting back to the time domain using an IFFT transform.
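  • As an illustration of the two implementation options described above, the following minimal sketch filters one frame with a fixed impulse response, once by time-domain convolution and once by FFT multiplication followed by an inverse FFT. The function names, the frame length of 160 samples, and the 32-tap impulse response are assumptions for illustration only; an actual LTVF would update its response over time using the parameters from the neural filter estimator 952.

```python
import numpy as np

def filter_time_domain(x, h):
    """Apply impulse response h to signal x by direct linear convolution."""
    return np.convolve(x, h)

def filter_frequency_domain(x, h):
    """Apply the same filter by multiplying spectra and inverting the FFT."""
    n = len(x) + len(h) - 1        # length of the full linear convolution
    X = np.fft.rfft(x, n)          # zero-padded FFT of the input signal
    H = np.fft.rfft(h, n)          # impulse response -> frequency response
    return np.fft.irfft(X * H, n)  # back to the time domain

# Hypothetical data: one frame of excitation and one impulse response.
rng = np.random.default_rng(0)
x = rng.standard_normal(160)
h = rng.standard_normal(32)

y_time = filter_time_domain(x, h)
y_freq = filter_frequency_domain(x, h)
```

  • Zero-padding both transforms to the full convolution length avoids circular wrap-around, so the two outputs agree up to floating-point rounding.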
  • the NHV speech synthesizer 950 can include a pulse train generator 960.
  • the dequantized pitch information (e.g., generated using the pitch dequantization engine 938) can be provided to the pulse train generator 960.
  • the pulse train generator 960 can generate a pulse train based at least in part on the pitch frequency (e.g., f0) and/or pitch lag or pitch delay information obtained from the pitch dequantization engine 938 of the decoder 910.
  • the pulse train generator 960 can be implemented as a cosine sum pulse generator, which can be configured to process the dequantized pitch information (e.g., obtained from the pitch dequantization engine 938) to generate a pulse train p[n].
  • the pulse train may also be referred to as an impulse train.
  • the pulse train generator 960 can be a differentiable cosine sum pulse generator, and/or can be a non-differentiable cosine sum pulse generator (e.g., among various other pulse generators).
  • the pulse train generated by the pulse train generator 960 can be processed by the harmonic LTVF 965, using the corresponding harmonic filter characterization parameters determined by the neural filter estimator 952 for the harmonic LTVF 965.
  • the harmonic LTVF 965 can generate a harmonic output based on processing the pulse train from the pulse train generator 960.
  • the pulse train generator 960 may receive an additional input from the pitch dequantization engine 938, indicative of a voiced or unvoiced (e.g., V/UV) classification.
  • in response to an unvoiced (UV) indication or classification from the pitch dequantization engine 938, the pulse train generator 960 can be configured to generate a 0 output or a noise output that is provided to the harmonic LTVF 965 instead of the pulse train p[n] (e.g., instead of the pulse train p[n] provided from the pulse train generator 960 to the harmonic LTVF 965 in response to a voiced (V) indication or classification from the pitch dequantization engine 938).
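  • The pulse generation and V/UV gating described above can be sketched as follows. This is a minimal, non-differentiable cosine sum pulse generator written under stated assumptions: the function name, the 16 kHz sampling rate, the per-sample f0 input, and the choice of a zero output (rather than a noise output) for unvoiced samples are all illustrative and are not taken from the source.

```python
import numpy as np

def cosine_sum_pulse_train(f0, voiced, fs=16000):
    """Map per-sample f0 (Hz) and V/UV flags to a band-limited pulse train p[n]."""
    f0 = np.asarray(f0, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)

    # Running phase of the fundamental; unvoiced samples do not advance it.
    phase = 2.0 * np.pi * np.cumsum(np.where(voiced, f0, 0.0) / fs)

    # Number of harmonics at or below the Nyquist frequency (0 when unvoiced).
    f0_safe = np.where(voiced, f0, fs)  # placeholder value, masked out below
    n_harm = np.where(voiced, np.floor(fs / (2.0 * f0_safe)), 0).astype(int)

    # Sum cos(k * phase) over the harmonics that are allowed at each sample.
    p = np.zeros_like(f0)
    for k in range(1, int(n_harm.max()) + 1):
        p += np.where(k <= n_harm, np.cos(k * phase), 0.0)

    # Unvoiced classification: emit a 0 output instead of the pulse train.
    return np.where(voiced, p, 0.0)

# Hypothetical usage: 10 ms of a 200 Hz voiced segment at 16 kHz.
f0_track = np.full(160, 200.0)
pulses = cosine_sum_pulse_train(f0_track, np.ones(160, dtype=bool))
```

  • With a constant 200 Hz pitch at 16 kHz, the harmonics align every 80 samples, which produces one pulse per pitch period.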
  • the NHV speech synthesizer 950 can include a noise generator 970 that is configured to generate a noise signal (e.g., white noise, etc.) for processing by the noise LTVF 975.
  • the noise LTVF 975 can be parameterized based on the respective noise filter characterization parameters generated by the neural filter estimator 952 for the noise LTVF 975, and can subsequently be used to process the noise signal generated by the noise generator 970. Based on processing the noise signal from the noise generator 970, the noise LTVF 975 can generate a noise-filtered output.
  • the NHV speech synthesizer 950 can include a combination function 980 to combine the harmonic output (e.g., from the harmonic LTVF 965) and the noise-filtered output (e.g., from the noise LTVF 975) to generate the reconstructed audio 955.
  • the reconstructed audio 955 includes synthesized speech.
  • the reconstructed audio 955 can be the same as or similar to the reconstructed audio 855 of FIG. 8.
  • the combined noise LTVF 975 filter output and harmonic LTVF 965 filter output (e.g., from the combination operation 980) can comprise the reconstructed audio 955 (e.g., synthesized speech).
  • the output of the combination operation 980 can be provided to an LPC 954, which may be the same as or similar to the LPC 852 of FIG.8.
  • the LPC 954 can perform linear predictive coding on the LTVF filter output from the NHV speech synthesizer 950, and may generate as output the reconstructed audio 955 (e.g., synthesized speech).
  • FIG.10 is a block diagram illustrating an example of a decoder 1010 of a voice coding signal synthesis system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer comprising a linear prediction coding (LPC) network.
  • the decoder 1010 can be included in an audio codec system (e.g., generative voice codec system) 1000 that can be used to generate reconstructed audio (e.g., synthesized speech) 1055.
  • the neural speech synthesizer 1050 can receive a first input comprising decoded spectral envelope features (e.g., determined by the FRAE decoder 1045).
  • the neural speech synthesizer 1050 can receive a second input comprising a received pitch encoding (e.g., received pitch information), that is the same as or similar to the received pitch encoding provided to the neural speech synthesizer 850 of FIG.8, etc.
  • the decoder 1010 may include a pitch de-quantization engine 1038 (e.g., de Q()).
  • the pitch de-quantization engine 1038 can be used to process a received pitch encoding obtained by the decoder 1010 over the channel 1040.
  • the received pitch encoding information can be quantized pitch information or a quantized pitch signal transmitted to the decoder 1010 over the channel 1040 by a corresponding encoder (e.g., an encoder associated with the decoder 1010, which may be the same as or similar to the encoder 805 of FIG. 8, etc.).
  • the pitch dequantization engine 1038 can generate reconstructed pitch information based on performing a codebook lookup to dequantize the received pitch encoding obtained from the channel 1040 (e.g., to dequantize the quantized pitch encoding received by the decoder 1010 over the channel 1040).
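  • A codebook lookup of the kind described above can be sketched as follows. The codebook contents, its 64-entry size, the 16 kHz sampling rate, and the function name are hypothetical; the source does not specify the codebook used by the pitch dequantization engine 1038.

```python
import numpy as np

# Hypothetical scalar codebook: 64 pitch-lag entries, in samples at 16 kHz.
# The values are illustrative only (32, 36, ..., 284 samples).
PITCH_LAG_CODEBOOK = np.linspace(32, 284, 64)

def dequantize_pitch(index, fs=16000):
    """Look up a received codebook index and derive related pitch quantities."""
    lag = float(PITCH_LAG_CODEBOOK[index])  # time-domain pitch lag in samples
    return {
        "lag_samples": lag,
        "lag_ms": 1000.0 * lag / fs,        # the same lag in milliseconds
        "f0_hz": fs / lag,                  # frequency-domain pitch in Hz
    }

# Hypothetical usage: the encoder transmitted index 0.
pitch = dequantize_pitch(0)
```

  • The same lookup yields both the time-domain lag and the frequency-domain f0, since one is the reciprocal of the other scaled by the sampling rate.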
  • the dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 1038 can include pitch information in the frequency-domain (e.g., f0 pitch frequency information in Hz), pitch information in the time-domain (e.g., pitch lag or pitch delay information in samples, milliseconds, etc.), and/or pitch correlation information.
  • the dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 1038 can include information indicative of a voiced or unvoiced (V/UV) classification.
  • the neural speech synthesizer 1050 can be implemented as a linear predictive coding (LPC) network.
  • the LPC network-based neural speech synthesizer 1050 can be used to implement a linear prediction (LP) synthesis filter, based on $s[n] = p[n] + e[n]$.
  • the LPC-based neural speech synthesizer 1050 can include a frame rate network 1052 configured to process inputs comprising the decoded features obtained using the FRAE decoder 1045 and the decoded pitch information obtained using the pitch dequantization engine 1038.
  • An LPC estimation engine 1062 can process the decoded features from the FRAE decoder 1045 to determine one or more estimated LP coefficients.
  • the LPC estimation engine 1062 can generate estimated LP coefficients $a_k$, based on an input comprising the decoded features determined using the FRAE decoder 1045.
  • the LP coefficients estimated using the LPC estimation engine 1062 can be provided to an LP prediction engine 1064, configured to generate as output a prediction p[n], where $p[n] = \sum_{k=1}^{M} a_k\, s[n-k]$ (with M denoting the LP order).
  • the LP prediction engine 1064 can generate the LP prediction p[n] based on a first input comprising the LP coefficients $a_k$ from the LPC estimation engine 1062, and a feedback input s[n-1].
  • the LP prediction p[n] can be provided as input to a sample rate network 1072 and a downstream combination or summation operation 1080.
  • the sample rate network 1072 can receive additional inputs comprising the output of the frame rate network 1052, the feedback s[n-1] from the previous step n-1, and an intermediate feedback value e[n-1] from the same previous step n-1.
  • the output of the sample rate network 1072 can be the probability distribution P(e[n]), which is a probability distribution for e[n].
  • a sampling engine 1074 can perform sampling from the probability distribution P(e[n]) to obtain a realization or representation of e[n].
  • the representation of e[n] determined by the sampling engine 1074 can be combined with the LP prediction p[n] (e.g., determined by the LP prediction engine 1064), using the combination or summation operation 1080 to thereby generate as output the signal s[n].
  • the output signal s[n] can be the same as the reconstructed audio signal 1055 of the LPC network-based neural speech synthesizer 1050 and/or decoder 1010.
  • the representation of e[n] determined by the sampling engine 1074 can additionally be provided to a first feedback calculation 1078, which generates as output e[n-1] provided as an additional input to the sample rate network 1072.
  • the output signal s[n] of the combination or summation operation 1080 can be output as the reconstructed audio 1055 and may additionally be provided to a second feedback calculation 1079, which generates the representation s[n-1] based on the input s[n].
  • the representation s[n-1] can be provided as a feedback input to the LP prediction engine 1064 and to the sample rate network 1072.
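  • The sample-by-sample feedback loop described above can be sketched as follows. The sketch assumes the standard linear prediction form, in which p[n] is a weighted sum of previous output samples, and replaces the sample rate network 1072 and sampling engine 1074 with a caller-supplied function `sample_excitation`; that function, the toy Gaussian sampler, and the coefficient values are hypothetical stand-ins rather than the described networks. Conditioning on the output of the frame rate network 1052 is omitted for brevity.

```python
import numpy as np

def lp_prediction(a, history):
    """p[n] = sum over k of a[k-1] * s[n-k], where history[0] holds s[n-1]."""
    return float(np.dot(a, history))

def synthesize_frame(a, sample_excitation, n_samples, history=None):
    """Run the LP synthesis loop s[n] = p[n] + e[n] for one frame."""
    a = np.asarray(a, dtype=float)
    order = len(a)
    history = np.zeros(order) if history is None else np.array(history, dtype=float)
    s = np.zeros(n_samples)
    e_prev = 0.0
    for n in range(n_samples):
        p = lp_prediction(a, history)                 # LP prediction engine 1064
        e = sample_excitation(p, history[0], e_prev)  # stand-in for 1072 and 1074
        s[n] = p + e                                  # summation operation 1080
        history = np.concatenate(([s[n]], history[:-1]))  # s[n] becomes s[n-1]
        e_prev = e                                    # e[n] becomes e[n-1]
    return s

# Hypothetical usage: a toy sampler that ignores its inputs.
rng = np.random.default_rng(0)

def toy_sampler(p, s_prev, e_prev):
    return rng.normal(0.0, 0.01)

frame = synthesize_frame([1.2, -0.5], toy_sampler, n_samples=160)
```

  • Because each output sample depends on the previous outputs, the loop cannot be parallelized across samples, which is one reason the sample rate network tends to dominate the decoding cost in this style of synthesizer.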
  • the systems and techniques described herein can be used to provide dynamic complexity adjustment for one or more of a decoder and/or an encoder included in an audio codec system, such as the audio codec system 800 of FIG.8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG.10, etc.
  • the systems and techniques can be used to provide dynamic decoder complexity adjustment for one or more encoders and/or decoders included in the audio codec system 800 of FIG.8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG. 10, etc.
  • the systems and techniques can be used to provide dynamic decoder complexity adjustment, based on the decoder complexity switching engine 550 of FIG.5 being included in the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG. 9, the decoder 1010 of the audio codec system 1000 of FIG. 10, etc.
  • one or more of the device resource monitor 520 of FIG. 5 and/or the decoder profiling engine 530 can be included in the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc.
  • the systems and techniques can be used to provide dynamic encoder complexity adjustment for an encoder included in an audio codec system, such as the audio codec system 800 of FIG.8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG.10, etc.
  • the decoder complexity switching engine 550 of FIG. 5 (or an encoder complexity switching engine the same as or similar to the decoder complexity switching engine 550 of FIG. 5) can be used to provide dynamic encoder complexity switching for one or more of the encoder 805 of FIG.8, the encoder 905 of FIG.9, the encoder 1005 of FIG.10, etc.
  • one or more of the decoder 810 of the audio codec system 800 of FIG.8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc. can be included in the plurality of decoders with different synthesis complexity 510-1, 510-2, ..., 510-N of FIG. 5.
  • the decoder 1010 of the audio codec system 1000 of FIG.10, etc. can be included in the plurality of decoders with different synthesis complexity 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, 610-8 of FIGS.6A-6B.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of the neural speech synthesizer 850 included in the decoder 810 of the audio codec system 800 of FIG. 8.
  • the decoder 810 of FIG.8 can include the decoder complexity switching engine 550 of FIG.
  • the decoder complexity switching engine 550 can switch between different decoder models or decoder instances for the neural speech synthesizer 850, where each decoder model or decoder instance is associated with a different respective complexity.
  • the different decoder models or decoder instances with different complexity 510-1, 510-2, ..., 510-N of FIG.5 can correspond to different decoder models or decoder instances configured to implement the neural speech synthesizer 850 of FIG. 8.
  • the first decoder model 510-1 can be a first neural speech synthesizer 850 model with a first complexity level
  • the second decoder model 510-2 can be a second neural speech synthesizer 850 model with a second complexity level
  • the Nth decoder model 510-N can be an Nth neural speech synthesizer 850 model with an Nth complexity level.
  • the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 810 and/or the neural speech synthesizer 850 of FIG.8, based on using the decoder complexity switching engine 550 of FIG.5 to add or remove layers from a neural network model used for the neural speech synthesizer 850, and/or to switch between different models for the neural speech synthesizer 850.
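  • One way to picture adding or removing layers from a neural network model is a stack whose active depth can be changed between frames. The following sketch is an assumption-laden illustration: the class `DepthScalableNet`, its residual layer form, and the multiply-accumulate estimate are hypothetical and do not describe the actual architecture of the neural speech synthesizer 850.

```python
import numpy as np

class DepthScalableNet:
    """Toy residual stack whose number of active layers can be adjusted."""

    def __init__(self, width, max_layers, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for trained parameters.
        self.weights = [0.1 * rng.standard_normal((width, width))
                        for _ in range(max_layers)]
        self.active_layers = max_layers

    def set_complexity(self, n_layers):
        """Clamp the requested depth to the range [1, max_layers]."""
        self.active_layers = max(1, min(n_layers, len(self.weights)))

    def forward(self, x):
        # Residual form keeps the output shape fixed at any depth.
        for w in self.weights[: self.active_layers]:
            x = x + np.tanh(w @ x)
        return x

    def macs_per_call(self):
        """Rough multiply-accumulate count for the active layers."""
        width = self.weights[0].shape[0]
        return self.active_layers * width * width

# Hypothetical usage: halve the depth when the device is under load.
net = DepthScalableNet(width=8, max_layers=4)
full_cost = net.macs_per_call()
net.set_complexity(2)
reduced_cost = net.macs_per_call()
```

  • Keeping the input and output shapes independent of depth lets the surrounding decoder stay unchanged while the per-frame cost scales with the number of active layers.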
  • the decoder 810 of FIG.8 can include one or more (or all) of the components of the dynamic decoder complexity switching system 500 of FIG. 5.
  • the decoder 810 of FIG.8 can include and/or implement the dynamic decoder complexity switching system 500 of FIG.
  • the decoder 810 of FIG. 8 can include the device resource monitor 520 of FIG. 5 and can perform dynamic decoder complexity switching for the neural speech synthesizer 850 based at least in part on the device resource monitoring information obtained from the device resource monitor 520.
  • the decoder 810 of FIG.8 can include the decoder profiling engine 530 of FIG. 5, and can perform dynamic decoder complexity switching for the neural speech synthesizer 850 based at least in part on the decoder profiling information obtained from the decoder profiling engine 530.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of the NHV-based neural speech synthesizer 950 of FIG. 9.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of the neural filter estimator 952 included in the NHV-based neural speech synthesizer 950 of FIG. 9.
  • the decoder complexity switching engine 550 of FIG.5 can be included in the decoder 910 of FIG.9, and used to switch between different decoder models or decoder instances for the neural filter estimator 952 of the NHV-based neural speech synthesizer 950 of FIG.9.
  • Each decoder model or decoder instance used to perform complexity switching for the neural filter estimator 952 can be associated with a different respective complexity.
  • the different decoder models or decoder instances with different complexity 510-1, 510-2, ..., 510-N of FIG.5 can correspond to different decoder models or decoder instances configured to implement the neural filter estimator 952 of FIG.9.
  • the first decoder model 510-1 of FIG.5 can be a first neural filter estimator 952 model with a first complexity level
  • the second decoder model 510-2 of FIG.5 can be a second neural filter estimator 952 model with a second complexity level
  • the Nth decoder model 510-N of FIG.5 can be an Nth neural filter estimator 952 model with an Nth complexity level.
  • the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 910 and/or the NHV-based neural speech synthesizer 950 of FIG.9 and/or the neural filter estimator 952 of FIG.9, based on using the decoder complexity switching engine 550 of FIG.
  • the decoder 910 of FIG. 9 can include one or more (or all) of the components of the dynamic decoder complexity switching system 500 of FIG. 5.
  • the decoder 910 of FIG.9 can include and/or implement the dynamic decoder complexity switching system 500 of FIG. 5, to perform complexity switching for the NHV-based neural speech synthesizer 950 and/or the neural filter estimator 952 of FIG. 9.
  • the decoder 910 of FIG. 9 can include the device resource monitor 520 of FIG.
  • the decoder 910 of FIG.9 can include the decoder profiling engine 530 of FIG.5, and can perform dynamic decoder complexity switching for the neural filter estimator 952 and/or the neural speech synthesizer 950, based at least in part on the decoder profiling information obtained from the decoder profiling engine 530.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of the LPC-based neural speech synthesizer 1050 of FIG.10.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of the frame rate network 1052 of the LPC-based neural speech synthesizer 1050 of FIG.10.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG. 10.
  • the systems and techniques for decoder complexity switching can be used to adjust the complexity of both the frame rate network 1052 and the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050 of FIG.10.
  • the decoder complexity switching engine 550 of FIG. 5 can be included in the decoder 1010 of FIG.
  • the decoder 1010 can include one instance of the decoder complexity switching engine 550 of FIG.5, and may use the same instance of the decoder complexity switching engine 550 to perform complexity switching for both the frame rate network 1052 and the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050 of FIG.10.
  • In some examples, the decoder 1010 of FIG.
  • each decoder model or decoder instance used to perform complexity switching for the frame rate network 1052 and/or the sample rate network 1072 can be associated with a different respective complexity.
  • the different decoder models or decoder instances with different complexity 510-1, 510-2, ..., 510-N of FIG. 5 can correspond to different decoder models or decoder instances configured to implement the sample rate network 1072 of FIG.10.
  • the first decoder model 510-1 of FIG. 5 can be a first frame rate network 1052 model with a first complexity level
  • the second decoder model 510-2 of FIG. 5 can be a second frame rate network 1052 model with a second complexity level, ...
  • the first decoder model 510-1 of FIG.5 can be a first sample rate network 1072 model with a first complexity level
  • the second decoder model 510-2 of FIG.5 can be a second sample rate network 1072 model with a second complexity level
  • the Nth decoder model 510-N of FIG. 5 can be an Nth sample rate network 1072 model with an Nth complexity level.
  • the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 1010 and/or the LPC-based neural speech synthesizer 1050 of FIG.10 (e.g., one or more, or both, of the frame rate network 1052 and/or the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050), based on using the decoder complexity switching engine 550 of FIG. 5 to add or remove layers from a neural network model used for the frame rate network 1052, and/or to switch between different models for the frame rate network 1052.
  • the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 1010 and/or the LPC-based neural speech synthesizer 1050 of FIG.10 (e.g., one or more, or both, of the frame rate network 1052 and/or the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050), based on using the decoder complexity switching engine 550 of FIG. 5 to add or remove layers from a neural network model used for the sample rate network 1072, and/or to switch between different models for the sample rate network 1072.
  • the decoder 1010 of FIG. 10 can include one or more (or all) of the components of the dynamic decoder complexity switching system 500 of FIG. 5.
  • the decoder 1010 of FIG. 10 can include and/or implement the dynamic decoder complexity switching system 500 of FIG.5, to perform complexity switching for the LPC-based neural speech synthesizer 1050 and one or more (or both) of the frame rate network 1052 and/or the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050.
  • the decoder 1010 of FIG.10 can include the device resource monitor 520 of FIG.5 and can perform dynamic decoder complexity switching for the frame rate network 1052 and/or the sample rate network 1072, based at least in part on the device resource monitoring information obtained from the device resource monitor 520.
  • the decoder 1010 of FIG.10 can include the decoder profiling engine 530 of FIG. 5, and can perform dynamic decoder complexity switching for the frame rate network 1052 and/or the sample rate network 1072 of FIG. 10, based at least in part on the decoder profiling information obtained from the decoder profiling engine 530.
  • the systems and techniques can be configured to use the decoder complexity switching engine 550 of FIG.5 to switch between one or more neural network models associated with the NHV-based neural speech synthesizer 950 of FIG.9, and one or more neural network models associated with the LPC-based neural speech synthesizer 1050 of FIG.10. For example, the plurality of different decoder models with different respective complexity levels of FIG.
  • the decoder models 510-1, 510-2, ..., 510-N each associated with a respective complexity level can include a first subset of different neural network models each with a different respective complexity for implementing the neural filter estimator 952 of the NHV-based neural speech synthesizer 950 of FIG.9, can include a second subset of different neural network models each with a different respective complexity for implementing the frame rate network 1052 of the LPC-based neural speech synthesizer 1050 of FIG. 10, and/or can include a third subset of different neural network models each with a different respective complexity for implementing the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG.10, etc.
  • the NHV-based synthesizer 950 of FIG. 9 can be associated with a lower complexity than the LPC-based synthesizer 1050 of FIG. 10 (e.g., the LPC-based synthesizer 1050 of FIG. 10 can be higher complexity than the NHV-based synthesizer 950 of FIG. 9).
  • the systems and techniques can use the decoder complexity switching engine 550 to switch between a relatively high complexity decoder configuration corresponding to the LPC-based synthesizer 1050 of FIG.
  • FIG. 11 is a flowchart diagram illustrating an example of a process 1100 for processing one or more audio samples.
  • the process 1100 can be performed by a computing device or apparatus or a component or system (e.g., one or more chipsets, one or more processors such as one or more CPUs, DSPs, NPUs, NSPs, microcontrollers, ASICs, FPGAs, programmable logic devices, discrete gates or transistor logic components, discrete hardware components, etc., any combination thereof, and/or other component or system) of the computing device or apparatus.
  • the operations of the process 1100 may be implemented as software components that are executed and run on one or more processors (e.g., processor 1210 of FIG. 12 or other processor(s)).
  • the process 1100 can be performed by an audio coding system or component thereof, including any of the audio coding systems and/or components thereof of FIGS. 1-10.
  • the process 1100 can be performed by a UE, smartphone, mobile computing device, user computing device, etc.
  • the process 1100 may be performed by an apparatus that may be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device such as a virtual reality (VR) device or augmented reality (AR) device, a vehicle or component or system of a vehicle, or other type of computing device.
  • the apparatus can obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit.
  • the audio frame can be included in the bitstream 502 of FIG.5.
  • the configured processing time completion limit can be the same as or similar to the configured processing time completion limit associated with the remaining time budget and the real-time processing limit comparisons 740-1, 740-2, ..., etc., of FIG.7.
  • the configured processing time completion limit is equal to a periodicity between consecutive frames included in the sequence of the one or more audio frames. In some examples, the configured processing time completion limit is equal to a duration of each audio frame included in the sequence of the one or more audio frames. For example, the duration of each audio frame can be 20 milliseconds. In some cases, the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames.
  • At block 1104, the apparatus (or component thereof) can obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus.
  • the monitoring information can be obtained from one or more of the device resource monitor 520 of FIG.5 and/or the decoder profiling engine 530 of FIG. 5.
  • the monitoring information can be obtained from both the device resource monitor 520 of FIG.5 and the decoder profiling engine 530 of FIG.5.
  • monitoring information associated with one or more previously decoded frames can be obtained using the decoder profiling engine 530 of FIG. 5, and the monitoring information associated with a resource utilization of the apparatus can be obtained from the device resource monitor 520 of FIG.5.
  • the previously decoded frame is decoded using a neural network-based decoder model of the plurality of neural network-based decoder models different from the selected neural network-based decoder model.
  • the apparatus or component thereof can be configured to obtain device load information indicative of the resource utilization of the apparatus, where the device load information is obtained from a resource monitor of the apparatus.
  • the device load information can be obtained from the device resource monitor 520 of FIG.5.
  • the apparatus (or component thereof) can be configured to obtain one or more decoding time measurements corresponding to the previously decoded frame.
  • the one or more decoding time measurements can be obtained using the decoder profiling engine 530 of FIG.5.
  • the apparatus (or component thereof) can be configured to obtain device load information indicative of the resource utilization of the apparatus, where the device load information is obtained from a resource monitor of the apparatus, and to obtain one or more decoding time measurements corresponding to the previously decoded frame, where the one or more decoding time measurements are obtained from a decoder system of the apparatus.
  • the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame.
  • the one or more decoding time measurements can include one or more of a decoder start time associated with decoding the previously decoded frame, a decoder end time associated with decoding the previously decoded frame, or a decoder elapsed time associated with decoding the previously decoded frame.
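  • The decoder start time, end time, and elapsed time described above can be captured with a small wrapper around the decode call. The sketch below is illustrative only: `DecodeTiming`, `profile_decode`, and `within_limit` are hypothetical names, and the 20 millisecond limit is the example frame duration mentioned in the text.

```python
import time
from dataclasses import dataclass

FRAME_LIMIT_S = 0.020  # example limit: one 20 ms audio frame

@dataclass
class DecodeTiming:
    start_s: float  # decoder start time
    end_s: float    # decoder end time

    @property
    def elapsed_s(self):
        """Decoder elapsed time for the frame."""
        return self.end_s - self.start_s

def profile_decode(decode_fn, frame):
    """Run decode_fn on a frame and record its timing measurements."""
    start = time.perf_counter()
    decoded = decode_fn(frame)
    end = time.perf_counter()
    return decoded, DecodeTiming(start, end)

def within_limit(timing, limit_s=FRAME_LIMIT_S):
    """True if the frame was decoded within the processing time limit."""
    return timing.elapsed_s <= limit_s

# Hypothetical usage with a trivial stand-in decoder.
decoded, timing = profile_decode(lambda frame: frame * 2, 21)
```

  • A monotonic clock such as `time.perf_counter` is used so that the elapsed time is not affected by wall-clock adjustments.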
  • the monitoring information is indicative of a respective one or more decoding time measurements corresponding to each previously decoded frame of multiple previously decoded frames included in the sequence.
  • the monitoring information is indicative of a current utilization percentage associated with the one or more processors.
  • the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the apparatus.
  • the monitoring information is indicative of a memory usage associated with the apparatus.
  • the resource utilization of the apparatus is an instantaneous resource utilization.
  • the resource utilization of the apparatus is an average resource utilization over a configured length of time.
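  • The distinction between an instantaneous resource utilization and an average over a configured length of time can be sketched with a fixed-size window of recent samples. The class name `LoadTracker` and the window of three samples are assumptions for illustration; the source does not specify how the device resource monitor 520 computes either value.

```python
from collections import deque

class LoadTracker:
    """Keeps the most recent utilization samples in a fixed-size window."""

    def __init__(self, window):
        # Oldest samples are discarded automatically once the window is full.
        self.samples = deque(maxlen=window)

    def update(self, utilization_pct):
        self.samples.append(float(utilization_pct))

    def instantaneous(self):
        """Most recent utilization sample."""
        return self.samples[-1]

    def average(self):
        """Mean utilization over the configured window."""
        return sum(self.samples) / len(self.samples)

# Hypothetical usage: a brief spike after a quiet period.
tracker = LoadTracker(window=3)
for pct in (10, 20, 30, 100):
    tracker.update(pct)
```

  • The instantaneous value reacts immediately to a spike, while the windowed average changes more gradually, which can reduce rapid back-and-forth switching between decoder models.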
  • the apparatus (or component thereof) can determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity.
  • the selected neural network-based decoder model is selected from the plurality of neural network-based neural decoder models 510-1, 510-2, ..., 510-N of FIG.5.
  • the plurality of neural network-based decoder models can each have a respective complexity, and can be the same as or similar to one or more of the decoder models each with respective complexity 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, 610-8 of FIGS.6A-6B.
  • the selected neural network-based decoder model is a selected neural speech synthesizer model, for example corresponding to the neural speech synthesizer 850 of FIG.8, 950 of FIG.9, and/or 1050 of FIG.10, etc. In some cases, the selected neural network-based decoder model is selected from a plurality of neural speech synthesizer neural network models.
  • In some cases, the selected neural network-based decoder model is a selected neural homomorphic vocoder (NHV) model for neural speech synthesis, and the selected neural network-based decoder model is selected from a plurality of NHV-based neural network models for neural speech synthesis.
  • the selected NHV model for neural speech synthesis can correspond to the NHV-based neural speech synthesizer 950 of FIG. 9.
  • the selected neural network-based decoder model is a selected neural network model corresponding to a neural filter estimator of a neural homomorphic vocoder (NHV)-based neural speech synthesizer.
  • the selected neural network-based decoder model can be a selected neural network model corresponding to the neural filter estimator 952 of FIG.9.
  • the selected neural network-based decoder model is a selected linear predictive coding (LPC) model for neural speech synthesis, and the selected neural network-based decoder model is selected from a plurality of LPC-based neural network models for neural speech synthesis.
  • the selected neural network-based decoder model can correspond to the LPC-based neural speech synthesizer 1050 of FIG. 10.
  • the selected neural network-based decoder model is a selected neural network model corresponding to a frame rate network of a linear predictive coding (LPC)-based neural speech synthesizer.
  • the selected neural network-based decoder model can correspond to the frame rate network 1052 of the LPC-based neural speech synthesizer 1050 of FIG.10.
  • the selected neural network-based decoder model is a selected neural network model corresponding to a sample rate network of a linear predictive coding (LPC)-based neural speech synthesizer.
  • the selected neural network-based decoder model can correspond to the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG. 10.
  • the corresponding complexity is an increased decoder complexity
  • the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the apparatus, or a processing time of the previously decoded frame being less than the configured processing time completion limit.
  • the corresponding complexity is a decreased decoder complexity
  • the selected neural network-based decoder model is selected based on one or more of an indication of an increase in the resource utilization of the apparatus, or a processing time of the previously decoded frame exceeding the configured processing time completion limit.
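  • The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `Monitoring` and `select_complexity`, the one-level step size, and the choice to let a "decrease" condition take priority over an "increase" condition are hypothetical. The default limit of 20 ms reflects the frame duration mentioned elsewhere in this disclosure, and the eight models in the usage note mirror the eight complexity levels of FIGS. 6A-6B.

```python
from dataclasses import dataclass


@dataclass
class Monitoring:
    """Hypothetical container for the monitoring information."""
    prev_load: float      # earlier resource utilization, 0.0 to 1.0
    curr_load: float      # current resource utilization, 0.0 to 1.0
    prev_frame_ms: float  # processing time of the previously decoded frame


def select_complexity(current: int, num_models: int, mon: Monitoring,
                      limit_ms: float = 20.0) -> int:
    """Return the index of the decoder model to use (0 = lowest complexity)."""
    load_increased = mon.curr_load > mon.prev_load
    limit_exceeded = mon.prev_frame_ms > limit_ms
    # Assumption: stepping down wins when the signals disagree, so that the
    # next frame is more likely to finish within the limit.
    if load_increased or limit_exceeded:
        return max(current - 1, 0)

    load_decreased = mon.curr_load < mon.prev_load
    finished_early = mon.prev_frame_ms < limit_ms
    if load_decreased or finished_early:
        return min(current + 1, num_models - 1)

    # No change in load and the frame finished exactly at the limit.
    return current
```

  • For example, with eight models and the decoder currently at index 3, `select_complexity(3, 8, Monitoring(0.9, 0.5, 10.0))` steps up to index 4, while `select_complexity(3, 8, Monitoring(0.5, 0.9, 10.0))` steps down to index 2. A practical system would likely add hysteresis or averaging to avoid switching on every frame; that is left out here for brevity.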
  • the apparatus (or component thereof) can process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame.
  • the apparatus (or component thereof) can be configured to switch from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, where the first neural network-based decoder model is included in the plurality of neural network-based decoder models.
  • switching between the first and second neural network-based decoder models can be performed using the switch 545 of FIG. 5, and based on decoder complexity switching information determined by the decoder complexity switching engine 550 of FIG.5.
  • the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model. In some examples, the corresponding complexity of the selected neural network-based decoder model is less than a first complexity associated with the first neural network-based decoder model.
  • the apparatus can be configured to determine an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model, and to determine a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time.
  • the apparatus can be configured to skip processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold.
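  • The layer-skipping behavior described above can be sketched as follows. This is an illustrative sketch only: the function name `decode_with_layer_skipping`, the representation of layers as plain Python callables, and the injectable `clock` parameter are assumptions made for this example. It also assumes that every intermediate layer preserves the shape of its input, so that the output layer can consume the result of any intermediate layer.

```python
import time
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def decode_with_layer_skipping(
    frame: List[float],
    input_layer: Layer,
    intermediate_layers: Sequence[Layer],
    output_layer: Layer,
    limit_s: float,
    threshold_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[float], int]:
    """Decode one frame, skipping remaining intermediate layers if time runs low.

    Returns the decoded frame and the number of intermediate layers executed.
    """
    start = clock()
    x = input_layer(frame)
    executed = 0
    for layer in intermediate_layers:
        x = layer(x)
        executed += 1
        # Elapsed execution time measured after this intermediate layer.
        elapsed = clock() - start
        # Remaining time = processing time completion limit - elapsed time.
        remaining = limit_s - elapsed
        if remaining < threshold_s:
            # Skip the remaining subset of intermediate layers.
            break
    return output_layer(x), executed
```

  • Passing the clock as a parameter is a design choice made here so the timing logic can be exercised deterministically; on a device, the default monotonic timer (or a platform-specific cycle counter) would be used instead.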
  • the apparatus (or component thereof) can output the decoded audio frame within the configured processing time completion limit.
  • the apparatus comprises an audio decoder or a voice decoder of a signal synthesis system.
  • the apparatus further comprises a microphone configured to obtain the one or more audio frames.
  • the apparatus further comprises one or more speakers configured to output the decoded audio frame within the configured processing time completion limit.
  • an apparatus used to implement the process 1100 can include a microphone configured to obtain the one or more audio samples.
  • the apparatus further comprises one or more microphones configured to capture the one or more audio samples for speech synthesis.
  • the processes described herein may be performed by a computing device or apparatus.
  • the process 1100 and/or other technique or process described herein can be performed by a computing system having an architecture according to any of FIGS. 1-12B.
  • the process 1100 and/or other technique or process described herein can be performed by the computing system 1200 shown in FIG. 12.
  • a computing device with the computing device architecture of the computing system 1200 shown in FIG.12 can implement the operations of the process 1100, and/or can implement one or more of the components and/or operations described herein with respect to any of FIGS.1-10.
  • the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of processes described herein.
  • the computing device may include a display, one or more network interfaces configured to communicate and/or receive the data, any combination thereof, and/or other component(s).
  • the one or more network interfaces may be configured to communicate and/or receive wired and/or wireless data, including data according to the 3G, 4G, 5G, and/or other cellular standard, data according to the WiFi (802.11x) standards, data according to the Bluetooth™ standard, data according to the Internet Protocol (IP) standard, and/or other types of data.
  • the components of the computing device may be implemented in circuitry.
  • the components may include and/or may be implemented using electronic circuits or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.
  • the process 1100 is illustrated as a logical flow diagram, the operation of which represents a sequence of operations that may be implemented in hardware, computer instructions, or a combination thereof.
  • the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations.
  • computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types.
  • the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement the processes.
  • FIG. 12 is a diagram illustrating an example of a system for implementing certain aspects of the present technology.
  • FIG. 12 illustrates an example of computing system 1200, which may be, for example, any computing device making up an internal computing system, a remote computing system, a camera, or any component thereof in which the components of the system are in communication with each other using connection 1205.
  • Connection 1205 may be a physical connection using a bus, or a direct connection into processor 1210, such as in a chipset architecture.
  • Connection 1205 may also be a virtual connection, networked connection, or logical connection.
  • computing system 1200 is a distributed system in which the functions described in this disclosure may be distributed within a datacenter, multiple data centers, a peer network, etc.
  • one or more of the described system components represents many such components each performing some or all of the function for which the component is described.
  • Example system 1200 includes at least one processing unit (CPU or processor) 1210 and connection 1205 that communicatively couples various system components including system memory 1215, such as read-only memory (ROM) 1220 and random access memory (RAM) 1225 to processor 1210.
  • Computing system 1200 may include a cache 1215 of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 1210.
  • Processor 1210 may include any general-purpose processor and a hardware service or software service, such as services 1232, 1234, and 1236 stored in storage device 1230, configured to control processor 1210 as well as a special-purpose processor where software instructions are incorporated into the actual processor design.
  • Processor 1210 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc.
  • a multi-core processor may be symmetric or asymmetric.
  • computing system 1200 includes an input device 1245, which may represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, etc.
  • Computing system 1200 may also include output device 1235, which may be one or more of a number of output mechanisms.
  • multimodal systems may enable a user to provide multiple types of input/output to communicate with computing system 1200.
  • Computing system 1200 may include communications interface 1240, which may generally govern and manage the user input and system output.
  • the communication interface may perform or facilitate receipt and/or transmission of wired or wireless communications using wired and/or wireless transceivers, including those making use of an audio jack/plug, a microphone jack/plug, a universal serial bus (USB) port/plug, an Apple™ Lightning™ port/plug, an Ethernet port/plug, a fiber optic port/plug, a proprietary wired port/plug, 3G, 4G, 5G and/or other cellular data network wireless signal transfer, a Bluetooth™ wireless signal transfer, a Bluetooth™ low energy (BLE) wireless signal transfer, an IBEACON™ wireless signal transfer, a radio-frequency identification (RFID) wireless signal transfer, near-field communications (NFC) wireless signal transfer, dedicated short range communication (DSRC) wireless signal transfer, 802.11 Wi-Fi wireless signal transfer, wireless local area network (WLAN) signal transfer, Visible Light Communication (VLC), Worldwide Inter
  • the communications interface 1240 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers that are used to determine a location of the computing system 1200 based on receipt of one or more signals from one or more satellites associated with one or more GNSS systems.
  • GNSS systems include, but are not limited to, the US-based Global Positioning System (GPS), the Russia-based Global Navigation Satellite System (GLONASS), the China-based BeiDou Navigation Satellite System (BDS), and the Europe-based Galileo GNSS.
  • Storage device 1230 may be a non-volatile and/or non-transitory and/or computer-readable memory device and may be a hard disk or other types of computer readable media which may store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, a floppy disk, a flexible disk, a hard disk, magnetic tape, a magnetic strip/stripe, any other magnetic storage medium, flash memory, memristor memory, any other solid-state memory, a compact disc read only memory (CD-ROM) optical disc, a rewritable compact disc (CD) optical disc, digital video disk (DVD) optical disc, a Blu-ray disc (BDD) optical disc, a holographic optical disk, another optical medium, a secure digital (SD) card, a micro secure digital (microSD) card, a Memory Stick® card, a smartcard chip, an EMV chip, a subscriber
  • the storage device 1230 may include software services, servers, services, etc., that when the code that defines such software is executed by the processor 1210, it causes the system to perform a function.
  • a hardware service that performs a particular function may include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 1210, connection 1205, output device 1235, etc., to carry out the function.
  • computer-readable medium includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data.
  • a computer-readable medium may include a non-transitory medium in which data may be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections.
  • Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices.
  • a computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
  • a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
  • Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.
  • circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the aspects in unnecessary detail.
  • well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.
  • those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality.
  • a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
  • when a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
  • Processes and methods according to the above-described examples may be implemented using computer-executable instructions that are stored or otherwise available from computer-readable media. Such instructions may include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used may be accessible over a network.
  • the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code.
  • Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.
  • [0291] the computer-readable storage devices, mediums, and memories may include a cable or wireless signal containing a bitstream and the like.
  • non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
  • Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques.
  • data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof, in some cases depending in part on the particular application, in part on the desired design, in part on the corresponding technology, etc.
  • the various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed using hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors.
  • the program code or code segments to perform the necessary tasks may be stored in a computer-readable or machine-readable medium.
  • a processor(s) may perform the necessary tasks. Examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also may be embodied in peripherals or add-in cards. Such functionality may also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
  • the instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.
  • the techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices.
  • the techniques may be realized at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods, algorithms, and/or operations described above.
  • the computer-readable data storage medium may form part of a computer program product, which may include packaging materials.
  • the computer-readable medium may comprise memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like.
  • the techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that may be accessed, read, and/or executed by a computer, such as propagated signals or waves.
  • the program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • Such a processor may be configured to perform any of the techniques described in this disclosure.
  • a general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.
  • “Coupled to” or “communicatively coupled to” refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly.
  • Claim language or other language reciting “at least one of” a set and/or “one or more” of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B.
  • claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, or A and B and C, or any duplicate information or data (e.g., A and A, B and B, C and C, A and A and B, and so on), or any other ordering, duplication, or combination of A, B, and C.
  • the language “at least one of” a set and/or “one or more” of a set does not limit the set to the items listed in the set.
  • claim language reciting “at least one of A and B” or “at least one of A or B” may mean A, B, or A and B, and may additionally include items not listed in the set of A and B.
  • the phrases “at least one” and “one or more” are used interchangeably herein.
  • Claim language or other language reciting “at least one processor configured to,” “at least one processor being configured to,” “one or more processors configured to,” “one or more processors being configured to,” or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s).
  • claim language reciting “at least one processor configured to: X, Y, and Z” means a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z; or that a group of multiple processors work together to perform operations X, Y, and Z.
  • claim language reciting “at least one processor configured to: X, Y, and Z” can mean that any single processor may only perform at least a subset of operations X, Y, and Z.
  • one element may perform all functions, or more than one element may collectively perform the functions.
  • each function need not be performed by each of those elements (e.g., different functions may be performed by different elements) and/or each function need not be performed in whole by only one element (e.g., different elements may perform different sub-functions of a function).
  • one element may be configured to cause the other element to perform all functions, or more than one element may collectively be configured to cause the other element to perform the functions.
  • an entity e.g., any entity or device described herein
  • the entity may be configured to cause one or more elements (individually or collectively) to perform the functions.
  • the one or more components of the entity may include at least one memory, at least one processor, at least one communication interface, another component configured to perform one or more (or all) of the functions, and/or any combination thereof.
  • the entity may be configured to cause one component to perform all functions, or to cause more than one component to collectively perform the functions.
  • each function need not be performed by each of those components (e.g., different functions may be performed by different components) and/or each function need not be performed in whole by only one component (e.g., different components may perform different sub-functions of a function).
  • Illustrative aspects of the disclosure include:
  • [0305] Aspect 1. An apparatus configured to process one or more audio frames, the apparatus comprising: one or more memories configured to store the one or more audio frames; and one or more processors coupled to the one or more memories, the one or more processors being configured to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.
  • Aspect 2. The apparatus of Aspect 1, wherein, to process the audio frame using the selected neural network-based decoder model, the one or more processors are configured to: switch from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, wherein the first neural network-based decoder model is included in the plurality of neural network-based decoder models.
  • Aspect 3. The apparatus of Aspect 2, wherein: the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model.
  • Aspect 5. The apparatus of any of Aspects 1 to 4, wherein: the corresponding complexity is an increased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the apparatus, or a processing time of the previously decoded frame being less than the configured processing time completion limit.
  • [0310] Aspect 6.
  • Aspect 7. The apparatus of any of Aspects 1 to 6, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain device load information indicative of the resource utilization of the apparatus, wherein the device load information is obtained from a resource monitor of the apparatus.
  • Aspect 9. The apparatus of any of Aspects 1 to 8, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain device load information indicative of the resource utilization of the apparatus, wherein the device load information is obtained from a resource monitor of the apparatus; and obtain one or more decoding time measurements corresponding to the previously decoded frame, wherein the one or more decoding time measurements are obtained from a decoder system of the apparatus.
  • Aspect 10. The apparatus of any of Aspects 1 to 9, wherein the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame.
  • Aspect 11. The apparatus of Aspect 10, wherein the one or more decoding time measurements include one or more of a decoder start time associated with decoding the previously decoded frame, a decoder end time associated with decoding the previously decoded frame, or a decoder elapsed time associated with decoding the previously decoded frame.
  • Aspect 12. The apparatus of any of Aspects 1 to 11, wherein the monitoring information is indicative of a respective one or more decoding time measurements corresponding to each previously decoded frame of multiple previously decoded frames included in the sequence.
  • Aspect 13.
  • Aspect 14. The apparatus of any of Aspects 1 to 13, wherein the monitoring information is indicative of a current utilization percentage associated with the one or more processors.
  • Aspect 15. The apparatus of any of Aspects 1 to 14, wherein the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the apparatus.
  • Aspect 16. The apparatus of any of Aspects 1 to 15, wherein the monitoring information is indicative of a memory usage associated with the apparatus.
  • Aspect 17. The apparatus of any of Aspects 1 to 16, wherein the resource utilization of the apparatus is an instantaneous resource utilization.
  • Aspect 18. The apparatus of any of Aspects 1 to 17, wherein the resource utilization of the apparatus is an average resource utilization over a configured length of time.
  • Aspect 19. The apparatus of any of Aspects 1 to 18, wherein the one or more processors are configured to: determine an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model; determine a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time; and skip processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold.
  • Aspect 21 The apparatus of any of Aspects 1 to 19, wherein the configured processing time completion limit is equal to a periodicity between consecutive frames included in the sequence of the one or more audio frames.
  • Aspect 21 The apparatus of any of Aspects 1 to 20, wherein the configured processing time completion limit is equal to a duration of each audio frame included in Qualcomm Docket No.2400178WO the sequence of the one or more audio frames, and wherein the duration of each audio frame is 20 milliseconds.
  • Aspect 22 The apparatus of any of Aspects 1 to 21, wherein the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames.
  • Aspect 23 The apparatus of any of Aspects 1 to 22, wherein the apparatus comprises an audio decoder or a voice decoder of a signal synthesis system.
  • Aspect 24 The apparatus of any of Aspects 1 to 23, further comprising a microphone configured to obtain the one or more audio frames.
  • Aspect 25 The apparatus of any of Aspects 1 to 24, further comprising one or more speakers configured to output the decoded audio frame within the configured processing time completion limit.
  • Aspect 26 The apparatus of any of Aspects 1 to 25, wherein: the selected neural network-based decoder model is a selected neural speech synthesizer model; and the selected neural network-based decoder model is selected from a plurality of neural speech synthesizer neural network models.
  • Aspect 27 The apparatus of any of Aspects 1 to 26, wherein: the selected neural network-based decoder model is a selected neural homomorphic vocoder (NHV) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of NHV-based neural network models for neural speech synthesis.
  • Aspect 28 The apparatus of any of Aspects 1 to 27, wherein: the selected neural network-based decoder model is a selected neural network model corresponding to a neural filter estimator of a neural homomorphic vocoder (NHV)-based neural speech synthesizer.
  • the selected neural network-based decoder model is a selected linear predictive coding (LPC) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of LPC-based neural network models for neural speech synthesis.
  • LPC linear predictive coding
  • Aspect 31 The apparatus of any of Aspects 1 to 30, wherein the selected neural network-based decoder model is a selected neural network model corresponding to a sample rate network of a linear predictive coding (LPC)-based neural speech synthesizer.
  • a method for processing one or more audio frames comprising: obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of a decoder; determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models each having a respective complexity; processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and outputting the decoded audio frame within the configured processing time completion limit.
  • Aspect 33 The method of Aspect 32, wherein processing the audio frame using the selected neural network-based decoder model comprises: switching from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, wherein the first neural network-based decoder model is included in the plurality of neural network-based decoder models.
  • Aspect 34 The method of Aspect 33, wherein: the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model.
  • Aspect 36 The method of any of Aspects 32 to 35, wherein: the corresponding complexity is an increased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the decoder, or a processing time of the previously decoded frame being less than the configured processing time completion limit.
  • Aspect 38 The method of any of Aspects 32 to 37, wherein obtaining the monitoring information comprises: obtaining device load information indicative of the resource utilization of the decoder, wherein the device load information is obtained from a resource monitor of the decoder.
  • Aspect 40 The method of any of Aspects 32 to 39, wherein obtaining the monitoring information comprises: obtaining device load information indicative of the resource utilization of the decoder, wherein the device load information is obtained from a resource monitor of the decoder; and obtaining one or more decoding time measurements corresponding to the previously decoded frame.
  • Aspect 41 The method of any of Aspects 32 to 40, wherein the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame.
  • Aspect 43 The method of any of Aspects 32 to 42, wherein the monitoring information is indicative of a respective one or more decoding time measurements corresponding to each previously decoded frame of multiple previously decoded frames included in the sequence.
  • Aspect 45 The method of any of Aspects 32 to 44, wherein the monitoring information is indicative of a current utilization percentage associated with the one or more processors.
  • Aspect 46 The method of any of Aspects 32 to 45, wherein the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the decoder.
  • Aspect 47 The method of any of Aspects 32 to 43, wherein the previously decoded frame is decoded using a neural network-based decoder model of the plurality of neural network-based decoder models different from the selected neural network-based decoder model.
  • Aspect 48 The method of any of Aspects 32 to 47, wherein the resource utilization of the decoder is an instantaneous resource utilization.
  • Aspect 49 The method of any of Aspects 32 to 47, wherein the resource utilization of the decoder is an average resource utilization over a configured length of time.
  • Aspect 50 The method of any of Aspects 32 to 46, wherein the monitoring information is indicative of a memory usage associated with the decoder.
  • processing the audio frame using the selected neural network-based decoder model comprises: determining an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model; determining a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time; and skipping processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold.
  • Aspect 52 The method of any of Aspects 32 to 51, wherein the configured processing time completion limit is equal to a duration of each audio frame included in the sequence of the one or more audio frames, and wherein the duration of each audio frame is 20 milliseconds.
  • Aspect 53 The method of any of Aspects 32 to 52, wherein the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames.
  • Aspect 55 The method of any of Aspects 32 to 54, wherein: the selected neural network-based decoder model is a selected neural homomorphic vocoder (NHV) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of NHV-based neural network models for neural speech synthesis.
  • the selected neural network-based decoder model is a selected neural network model corresponding to a neural filter estimator of a neural homomorphic vocoder (NHV)-based neural speech synthesizer.
  • Aspect 57 The method of any of Aspects 32 to 56, wherein: the selected neural network-based decoder model is a selected linear predictive coding (LPC) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of LPC-based neural network models for neural speech synthesis.
  • the selected neural network-based decoder model is a selected neural network model corresponding to a frame rate network of a linear predictive coding (LPC)-based neural speech synthesizer.
  • a non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.
  • Aspect 61 A method for processing one or more audio samples, comprising performing operations according to any of Aspects 1 to 31.
  • Aspect 62. A non-transitory computer-readable storage medium comprising instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform operations according to any of Aspects 1 to 59.
  • Aspect 63. An apparatus for processing one or more audio samples, the apparatus comprising one or more means for performing operations according to any of Aspects 1 to 60.
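The layer-skipping aspects above (checking the elapsed execution time after an intermediate layer and skipping the remaining intermediate layers when too little time is left) can be sketched roughly as follows. This is only an illustrative sketch, not the disclosed implementation: the function name `decode_with_layer_skipping`, its parameters, the injectable `clock`, and the assumption that every intermediate layer preserves the shape of its input (so the output layer can run after any of them) are all hypothetical.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Hypothetical layer type: each layer maps a feature vector to a feature vector.
Layer = Callable[[List[float]], List[float]]


def decode_with_layer_skipping(
    frame: List[float],
    input_layer: Layer,
    intermediate_layers: Sequence[Layer],
    output_layer: Layer,
    time_limit_s: float,
    threshold_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[float], int]:
    """Decode one frame, skipping trailing intermediate layers when time runs short.

    Returns the decoded frame and the number of intermediate layers actually run.
    Assumes (hypothetically) that intermediate layers are shape-preserving.
    """
    start = clock()
    x = input_layer(frame)
    layers_run = 0
    for layer in intermediate_layers:
        x = layer(x)
        layers_run += 1
        elapsed = clock() - start            # elapsed execution time so far
        remaining = time_limit_s - elapsed   # time left before the completion limit
        if remaining < threshold_s:
            break                            # skip the remaining intermediate layers
    return output_layer(x), layers_run
```

Passing the clock as a parameter is a design choice for this sketch only; it lets the time check be exercised deterministically without relying on real wall-clock timing.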

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Systems and techniques are provided for processing audio data. For example, a process can include obtaining an audio frame included in a sequence of audio frames associated with a configured processing time completion limit. Monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus can be obtained. Based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit can be selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity. The audio frame can be processed using the selected neural network-based decoder model to thereby generate a decoded audio frame, and the decoded audio frame can be output within the configured processing time completion limit.
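As a loose illustration of the selection step summarized in the abstract, the sketch below picks the most complex decoder model that is predicted to finish within the processing time completion limit. The prediction rule is an assumption made for this example, not something stated in the source: it scales each model's nominal per-frame cost by the slowdown observed on the previously decoded frame. The names `DecoderModel`, `nominal_ms`, `select_decoder_model`, and `headroom` are likewise hypothetical, and `nominal_ms` is used as a stand-in for model complexity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DecoderModel:
    name: str
    nominal_ms: float  # hypothetical per-frame decode time on an otherwise idle device


def select_decoder_model(
    models: Sequence[DecoderModel],
    previous_model: DecoderModel,
    previous_decode_ms: float,
    limit_ms: float = 20.0,
    headroom: float = 0.9,
) -> DecoderModel:
    """Pick the most complex model predicted to finish within the time limit.

    Assumed heuristic: the ratio of the measured decode time of the previous
    frame to that model's nominal cost estimates the current device slowdown.
    """
    slowdown = previous_decode_ms / previous_model.nominal_ms
    budget_ms = limit_ms * headroom            # keep a safety margin below the limit
    by_cost = sorted(models, key=lambda m: m.nominal_ms)
    chosen = by_cost[0]                        # fall back to the least complex model
    for model in by_cost:
        if model.nominal_ms * slowdown <= budget_ms:
            chosen = model                     # a more complex model still fits
    return chosen
```

With three hypothetical models costing 4, 8, and 12 ms, a previous frame that took 30 ms on the 12 ms model implies a 2.5x slowdown, so only the 4 ms model fits an 18 ms budget; once the measured time drops again, the selection moves back up to a more complex model.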

Description

Qualcomm Docket No.2400178WO

SIGNAL SYNTHESIS USING DYNAMIC DECODER COMPLEXITY SCALING BASED ON DEVICE LOAD

FIELD

[0001] Aspects of the present disclosure generally relate to audio coding (e.g., audio encoding and/or decoding). In some implementations, examples are described for performing audio coding using dynamic decoder complexity scaling based on device load information.

BACKGROUND

[0002] Audio coding (also referred to as voice coding and/or speech coding) is a technique used to represent a digitized audio signal using as few bits as possible (thus compressing the speech data), while attempting to maintain a certain level of audio quality. An audio or voice encoder is used to encode (or compress) the digitized audio (e.g., speech, music, etc.) signal to a lower bit-rate stream of data. The lower bit-rate stream of data can be input to an audio or voice decoder, which decodes the stream of data and constructs an approximation or reconstruction of the original signal. The audio or voice encoder-decoder structure can be referred to as an audio coder (or voice coder or speech coder) or an audio/voice/speech coder-decoder (codec).

[0003] Audio coders exploit the fact that speech signals are highly correlated waveforms. Some speech coding techniques are based on a source-filter model of speech production, which assumes that the vocal cords are the source of spectrally flat sound (an excitation signal), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech. The different phonemes (e.g., vowels, fricatives, and voiced fricatives) can be distinguished by their excitation (source) and spectral shape (filter).

SUMMARY

[0004] The following presents a simplified summary relating to one or more aspects disclosed herein.
Thus, the following summary should not be considered an extensive overview relating to all contemplated aspects, nor should the following summary be considered to identify key or critical elements relating to all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the following summary has the sole purpose to present certain concepts relating to one or more aspects relating to the mechanisms disclosed herein in a simplified form to precede the detailed description presented below.

[0005] Disclosed are systems, methods, apparatuses, and computer-readable media for audio coding (e.g., encoding and/or decoding audio data). According to at least one illustrative example, an apparatus for processing audio data is provided. The apparatus includes at least one memory and at least one processor coupled to the at least one memory and configured to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.

[0006] In another example, a method for processing audio data is provided.
The method includes: obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of a decoder; determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models each having a respective complexity; processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and outputting the decoded audio frame within the configured processing time completion limit.

[0007] In another example, a non-transitory computer-readable medium is provided that includes instructions that, when executed by at least one processor, cause the at least one processor to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.

[0008] In another example, an apparatus for processing audio data is provided. The apparatus includes: means for obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; means for obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; means for determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; means for processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and means for outputting the decoded audio frame within the configured processing time completion limit.

[0009] Aspects generally include a method, apparatus, system, computer program product, non-transitory computer-readable medium, user equipment, base station, wireless communication device, and/or processing system as substantially described herein with reference to and as illustrated by the drawings and specification.

[0010] The foregoing has outlined rather broadly the features and technical advantages of examples according to the disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter.
The conception and specific examples disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Such equivalent constructions do not depart from the scope of the appended claims. Characteristics of the concepts disclosed herein, both their organization and method of operation, together with associated advantages, will be better understood from the following description when considered in connection with the accompanying figures. Each of the figures is provided for the purposes of illustration and description, and not as a definition of the limits of the claims.

[0011] While aspects are described in the present disclosure by illustration to some examples, those skilled in the art will understand that such aspects may be implemented in many different arrangements and scenarios. Techniques described herein may be implemented using different platform types, devices, systems, shapes, sizes, and/or packaging arrangements. For example, some aspects may be implemented via integrated chip implementations or other non-module-component based devices (e.g., end-user devices, vehicles, communication devices, computing devices, industrial equipment, retail/purchasing devices, medical devices, and/or artificial intelligence devices). Aspects may be implemented in chip-level components, modular components, non-modular components, non-chip-level components, device-level components, and/or system-level components. Devices incorporating described aspects and features may include additional components and features for implementation and practice of claimed and described aspects. For example, transmission and reception of wireless signals may include one or more components for analog and digital purposes (e.g., hardware components including antennas, radio frequency (RF) chains, power amplifiers, modulators, buffers, processors, interleavers, adders, and/or summers).
It is intended that aspects described herein may be practiced in a wide variety of devices, components, systems, distributed arrangements, and/or end-user devices of varying size, shape, and constitution.

[0012] Other objects and advantages associated with the aspects disclosed herein will be apparent to those skilled in the art based on the accompanying drawings and detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all drawings, and each claim.

[0013] The foregoing, together with other features and aspects, will become more apparent upon referring to the following specification, claims, and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The accompanying drawings are presented to aid in the description of various aspects of the disclosure and are provided solely for illustration of the aspects and not limitation thereof.

[0015] FIG. 1 is a block diagram illustrating an example speech processing system, in accordance with some examples;

[0016] FIG. 2A is a block diagram illustrating an example feature generator, in accordance with some examples;

[0017] FIG. 2B is a block diagram illustrating an example of a voice coding system, in accordance with some examples;

[0018] FIG. 3 is a block diagram illustrating an example of a voice coding signal synthesis system utilizing a linear time-varying filter generated using a neural network model and a separate linear predictive coding (LPC) filter, in accordance with some examples;

[0019] FIG. 4 is a block diagram illustrating components of a user equipment (UE), in accordance with some examples;

[0020] FIG. 5 is a block diagram illustrating an example of a dynamic decoder complexity switching system that may be used with a decoder of a signal synthesis system, in accordance with some examples;

[0021] FIG. 6A is a diagram illustrating a plurality of example decoder architectures associated with a respective synthesis complexity, in accordance with some examples;

[0022] FIG. 6B is a diagram illustrating example decoder architectures each associated with a respective synthesis complexity, in accordance with some examples;

[0023] FIG. 7 is a diagram illustrating an example of dynamic decoder complexity adjustment based on determining an elapsed execution time for the current frame after processing by one or more subsets of a plurality of decoder layers, in accordance with some examples;

[0024] FIG. 8 is a block diagram illustrating an example of an audio codec system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer, in accordance with some examples;

[0025] FIG. 9 is a block diagram illustrating an example of a decoder of a voice coding signal synthesis system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer comprising a neural homomorphic vocoder (NHV), in accordance with some examples;

[0026] FIG. 10 is a block diagram illustrating an example of a decoder of a voice coding signal synthesis system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer comprising a linear predictive coding (LPC) network, in accordance with some examples;

[0027] FIG. 11 is a flow chart illustrating an example of a process for processing one or more audio samples, in accordance with some examples; and

[0028] FIG. 12 is a block diagram illustrating an example of a computing system for implementing certain aspects described herein.
DETAILED DESCRIPTION

[0029] Certain aspects of this disclosure are provided below for illustration purposes. Alternate aspects may be devised without departing from the scope of the disclosure. Additionally, well-known elements of the disclosure will not be described in detail or will be omitted so as not to obscure the relevant details of the disclosure. Some of the aspects described herein may be applied independently and some of them may be applied in combination as would be apparent to those of skill in the art. In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of aspects of the application. However, it will be apparent that various aspects may be practiced without these specific details. The figures and description are not intended to be restrictive.

[0030] The ensuing description provides example aspects only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the example aspects will provide those skilled in the art with an enabling description for implementing an example aspect. It should be understood that various changes may be made in the function and arrangement of elements without departing from the scope of the application as set forth in the appended claims.

[0031] Audio coding (e.g., speech coding, music signal coding, or other type of audio coding) can be performed on a digitized audio signal (e.g., a speech signal) to compress the amount of data for storage, transmission, and/or various other uses. A voice coding system can be an audio coding system that is used to perform audio coding on a speech signal that represents the speech (e.g., voiced words or sounds) uttered by a user of the system. A voice coding system can also be referred to as a voice or speech coder or a voice coder-decoder (codec).
A voice coding system may include one or more encoders (e.g., voice encoders) and one or more decoders (e.g., voice decoders). For example, a voice encoder can be used to process an input speech signal. Input speech signals may include a digitized speech signal generated from an analog speech signal from a given source, where the resulting digitized speech signal is a discrete-time speech signal with sample values (e.g., also referred to as samples or audio samples) that are also discretized.

[0032] Voice coders (e.g., voice encoders and/or voice decoders) can be designed and implemented to exploit the fact that speech signals are highly correlated waveforms. The samples of an input speech signal can be divided into blocks of N samples each, where a block of N samples is referred to as a frame. In some examples, each frame can be 10-20 milliseconds (ms) in length. A voice encoder can generate a compressed signal corresponding to the input, digitized speech signal. The compressed signal generated by the voice encoder can include a lower bitrate stream of audio data that represents the speech signal using as few bits as possible, while attempting (e.g., by the voice encoder) to maintain a certain quality level for the speech. The compressed signal generated by the voice encoder can have a lower bitrate than the uncompressed digitized speech signal provided as input to the voice encoder. In some cases, voice encoders may be used to reduce the bitrate of an input speech signal. The bitrate of a signal is based on the sampling frequency (e.g., the sampling frequency of a microphone used to sample a user’s speech) and the number of bits per sample. For example, the bitrate of a signal (BR) can be equal to S*b, where S represents the sampling frequency and b represents the number of bits per sample.

[0033] The output of a voice encoder may be a compressed speech signal.
The compressed speech signal can be stored and/or transmitted to and processed by a corresponding voice decoder. The corresponding voice decoder can be a voice decoder that is associated with the voice encoder and/or a voice decoder that is configured to decode the compressed speech signals generated by the voice encoder. In some cases, a voice decoder may communicate with a voice encoder, for example to request speech data, to transmit feedback information, and/or to provide other communications to or from the voice encoder and voice decoder. The voice decoder can be configured to decode the data of the compressed speech signal and construct a reconstructed speech signal that approximates the original speech signal. For example, the reconstructed speech signal may include a digitized, discrete-time signal with a same or similar bitrate as the bitrate of the original speech signal. In some cases, the voice decoder may implement an inverse of a voice coding algorithm used by the voice encoder.

[0034] In some examples, a voice decoder (e.g., also referred to as a speech decoder) may be included in a voice calling system and/or a voice call processing system. For example, a voice call processing system can be used to provide voice communication services between different users and/or different devices. Voice call processing systems can be used for telephony (e.g., audio-only calling), including 4G LTE voice calls (e.g., Voice over LTE (VoLTE)), 5G NR voice calls (e.g., Voice over New Radio (VoNR)), Voice over Internet Protocol (VoIP) voice calls, etc. Voice call processing systems may also be used for video calls, where the voice call processing system is used for the audio portion of the video call and may be separate from a video coding system used for the video portion of the video call.
In some cases, a voice call processing system can be implemented as a real-time system that is configured to manage the end-to-end flow of voice data packets between communicating parties. A voice call processing system can implement various protocols, algorithms, and/or audio processing techniques to enhance a voice call signal and/or to meet a real-time delivery requirement (e.g., a real-time processing requirement) for the voice data packets of the voice call.

[0035] Voice call processing can be a time-sensitive and/or time-constrained process. A decoder may receive audio frames and/or encoded representations thereof from an encoder, with a periodicity that is the same as or similar to the periodicity of the input frames of audio data to the encoder. For example, an encoder may receive a stream of audio data (e.g., an input speech signal) comprising a plurality of 20ms frames. The encoder may generate features or other encoded representations of a respective 20ms frame, and transmit the features or other encoded information to the decoder. The decoder may receive the features or encoded information at approximately the same 20ms periodicity that is associated with the plurality of 20ms audio frames at the input of the encoder. For example, at time 0 a decoder may receive an encoded representation of a first speech frame. 20ms later, the decoder may receive an encoded representation of a second speech frame. An additional 20ms later, the decoder may receive an encoded representation of a third speech frame, etc.

[0036] A real-time processing requirement for speech decoders and/or speech synthesizers that receive a stream of encoded audio or speech frames may correspond to the length (e.g., duration) of each frame and/or may correspond to the periodicity between consecutive frames of the stream.
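As a rough illustration of this timing, the following sketch simulates frames that arrive every 20 ms and are decoded one after another, and reports how far each frame finishes past the end of its own frame period. It is a simplified model built on assumptions that are not from the source: a single sequential decoder, known per-frame decode times, and the hypothetical names `playback_lag_ms`, `decode_times_ms`, and `frame_period_ms`.

```python
from typing import List, Sequence


def playback_lag_ms(
    decode_times_ms: Sequence[float],
    frame_period_ms: float = 20.0,
) -> List[float]:
    """Return, per frame, how late decoding finishes relative to its deadline.

    Frame i is assumed to arrive at i * frame_period_ms, and its deadline is
    the arrival time of the next frame. Decoding is strictly sequential.
    """
    lags: List[float] = []
    decoder_free_at = 0.0
    for i, decode_ms in enumerate(decode_times_ms):
        arrival = i * frame_period_ms
        start = max(arrival, decoder_free_at)   # wait if the previous frame is still decoding
        decoder_free_at = start + decode_ms
        deadline = arrival + frame_period_ms
        lags.append(max(0.0, decoder_free_at - deadline))
    return lags


# Two slow frames (25 ms each) push completion past the deadline; faster
# frames afterwards let the decoder catch up again.
print(playback_lag_ms([15.0, 25.0, 25.0, 10.0, 10.0]))
```

In this toy model the lateness grows while decode times exceed the frame period and shrinks only when later frames decode faster than the period.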
For example, to provide real-time processing and playback of the decoded or synthesized audio, the decoder should complete processing (e.g., decoding and/or synthesis processing) of a respective speech frame before the next speech frame is received. Processing of the next frame cannot begin until processing has completed for the current frame. Exceeding the real-time processing time limit for the current frame may cause the decoder to delay the beginning of processing of the next frame (e.g., the decoder starts processing the next frame late).
[0037] In examples where the decoder receives frames every 20ms (e.g., corresponding to a frame length or frame duration of 20ms), the real-time requirement may correspond to the decoder needing to finish decoding and/or synthesis processing within 20ms after reception of each frame. If decoding and/or synthesis processing for a respective frame takes longer than the frame duration (e.g., 20ms), delay may accumulate at the decoder and for the playback of the decoded or synthesized audio.
[0038] In some examples, one or more machine learning systems, networks, models, etc., can be used to perform audio coding (e.g., encoding and/or decoding of one or more audio samples). For example, machine learning systems using one or more neural network models can be used to implement a voice coder to encode and/or decode voice data (e.g., speech signals, etc.). In some cases, machine learning systems (e.g., using one or more neural network models) can be used to generate reconstructed voice or audio signals using a process of neural synthesis. For example, using features extracted from one or more frames of audio data, a neural network-based voice decoder can generate coefficients for one or more linear filters, learned filters, etc., that may be used to perform voice decoding. In some cases, a neural network-based voice decoder can be trained to estimate the parameters (e.g., linear filter coefficients, etc.) of a speech synthesis pipeline, where learned filters that are generated or tuned using a neural network-based voice decoder may subsequently be used to generate a reconstructed speech signal (e.g., a synthesized speech signal, a speech synthesis signal, etc.).
[0039] Neural network-based voice or speech decoders may also be referred to as neural decoders. Neural decoders can be more computationally complex to implement than non-neural or non-machine learning-based voice or speech decoders. For example, a neural decoder may be associated with a longer frame processing time (e.g., a longer frame decoding time) than a non-neural decoder, when the neural decoder and the non-neural decoder are implemented or executed on the same computational device, platform, hardware, etc. In some cases, the use of neural decoders with resource-constrained devices such as UEs, smartphones, mobile computing devices, wearables, etc., may also be associated with increased frame processing times by the neural decoder, including increased frame processing times beyond the real-time processing requirement associated with the frame length (e.g., 20ms speech frames, etc.).
[0040] There is a need for systems and techniques that can be used to implement neural decoding and/or neural network-based signal synthesis within a frame duration associated with a real-time processing requirement. For example, there is a need for systems and techniques that can be used to provide neural decoders configured to complete speech frame processing within the 20ms real-time requirement corresponding to the 20ms duration of the speech frames.
There is a further need for systems and techniques that can be used to provide dynamic decoder and/or signal synthesis complexity scaling to complete decoding of a frame within the frame duration or other configured real-time processing requirement, where the dynamic complexity scaling is based on current device load information of one or more processors used to implement the decoder.
[0041] Systems, apparatuses, methods (also referred to as processes), and computer-readable media (collectively referred to herein as “systems and techniques”) are described herein that can be used to provide dynamic complexity scaling for a neural network-based decoder and/or signal synthesis engine configured to process one or more frames of data (e.g., audio data, voice data, speech data, video data, etc.). In some aspects, the frames of data processed by the decoder can be associated with a real-time processing requirement (e.g., a real-time processing time limit) for the decoder to finish decoding and/or processing of a respective frame. For example, the one or more frames of data can be speech frames, where the real-time processing requirement is equal to the length of each speech frame (e.g., 20ms).
[0042] In some examples, the systems and techniques can be used to provide dynamic decoder complexity scaling to finish processing of a respective frame of audio data within a configured processing time limit (e.g., the real-time processing requirement for the frame of audio data, the duration of each respective frame of audio data, etc.). In some cases, the dynamic decoder complexity scaling can be based on device load information corresponding to a UE or other computing device used to implement the neural network-based decoder. In some cases, the dynamic decoder complexity scaling can be based on measured runtime information corresponding to one or more previously decoded frames processed by the neural network-based decoder or signal synthesizer.
For example, the measured runtime information can be indicative of an elapsed execution time for decoding or processing the one or more previously decoded frames.
[0043] Device load information can be used to dynamically reduce the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively high current device load state. The device load information can be used to dynamically increase the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively low current device load state. In some aspects, the device load information can be used to dynamically reduce or increase the decoder complexity so that the decoder remains within the configured time limit for processing (e.g., decoding, synthesizing, etc.) each respective frame. For example, the device load information can be used to dynamically reduce or increase the decoder complexity so that the decoder remains within a 20ms real-time processing requirement associated with a stream of 20ms speech frames provided to the decoder.
[0044] In some examples, the device load information can be obtained and/or determined by a device resource monitor engine configured to monitor the hardware resource utilization of a UE or other computing device used to implement the neural network-based decoder. For example, the device load information can include a current central processing unit (CPU) usage or utilization percentage and/or can include a current usage or utilization percentage for various other processors included in the device. The processor utilization information may be instantaneous utilization information corresponding to the current state of the CPU or other processor. In some cases, the processor utilization information can be an average utilization percentage over a configured time window.
For example, the device load information can include or otherwise be indicative of an average CPU usage over the previous 5ms, 10ms, 15ms, etc. In some examples, the device load information can include a memory (e.g., random-access memory (RAM), processor cache, video RAM (VRAM), etc.) usage of the UE or other computing device used to implement the neural network-based decoder. The memory usage information can be indicative of an instantaneous memory percentage in use, an average memory percentage used over a configured time window (e.g., average memory usage over the previous 5ms, 10ms, 15ms, etc.), and/or a combination thereof.
[0045] In some examples, the device load information can be obtained from a device resource monitor or device resource manager that is included in and/or implemented by the UE or other computing device that implements the neural network-based decoder. In some cases, the device load information can be obtained from a performance monitoring or profiling library included in and/or implemented by an operating system of the device that implements the decoder. In some examples, the systems and techniques may monitor CPU state information and/or CPU statistics information to determine a dynamic complexity scaling adjustment for processing or decoding one or more upcoming frames at the neural network-based decoder or other neural network-based signal synthesis engine.
[0046] The device load information can be monitored, collected, obtained, and/or analyzed in real-time by a decoder complexity switching engine configured to generate an output indicative of a scaling adjustment corresponding to increased decoder complexity, a scaling adjustment corresponding to decreased decoder complexity, and/or a scaling adjustment corresponding to the same decoder complexity.
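As a minimal sketch of how such a switching decision could be made, the following Python example averages recent CPU utilization samples over a short window, optionally checks the measured runtime of the previous frame against the frame budget, and emits one of the three adjustments. The class and method names, the 0.85 and 0.50 load thresholds, the three-sample window, and the one-level-per-frame step are all illustrative assumptions and are not taken from the disclosure.

```python
from collections import deque
from enum import Enum


class Adjustment(Enum):
    DECREASE = "decrease"
    SAME = "same"
    INCREASE = "increase"


class ComplexitySwitchingEngine:
    """Hypothetical switching logic; all thresholds are illustrative assumptions."""

    def __init__(self, num_levels, frame_budget_ms=20.0,
                 high_load=0.85, low_load=0.50, window=3):
        self.num_levels = num_levels
        self.level = num_levels - 1          # start at the highest complexity level
        self.frame_budget_ms = frame_budget_ms
        self.high_load = high_load
        self.low_load = low_load
        self.loads = deque(maxlen=window)    # recent CPU utilization samples in [0, 1]

    def update(self, cpu_utilization, last_runtime_ms=None):
        """Record one load sample and return the adjustment for the next frame."""
        self.loads.append(cpu_utilization)
        avg_load = sum(self.loads) / len(self.loads)
        # The previous frame overran if it took longer than the frame duration.
        overran = (last_runtime_ms is not None
                   and last_runtime_ms > self.frame_budget_ms)
        if (avg_load >= self.high_load or overran) and self.level > 0:
            self.level -= 1
            return Adjustment.DECREASE
        if (avg_load <= self.low_load and not overran
                and self.level < self.num_levels - 1):
            self.level += 1
            return Adjustment.INCREASE
        return Adjustment.SAME
```

In this sketch, `engine.level` would index the decoder model instance used for the next frame. Stepping one level at a time is one possible design choice to avoid abrupt quality changes; other policies are equally plausible.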
The decoder complexity switching engine can be included in the neural network-based decoder, can be implemented by the neural network-based decoder, and/or can be associated with the neural network-based decoder.
[0047] In some examples, a complexity scaling indication can be generated and/or determined by the decoder complexity switching engine and may be used to configure a switch coupled between an input bitstream received at the decoder and a plurality of different decoder model instances, where each decoder model instance of the plurality of decoder model instances is associated with a respective synthesis complexity. In some examples, each decoder model instance of the plurality of decoder model instances is associated with a different respective synthesis complexity. For example, a first subset of the plurality of decoder model instances may be associated with a relatively high synthesis complexity, a second subset of the plurality of decoder model instances may be associated with a relatively low synthesis complexity, etc.
[0048] In some cases, the different decoder model instances can implement different complexity (e.g., synthesis complexity, decoding complexity, processing or processing time complexity, etc.) based on the respective decoder model instances utilizing different neural network and/or machine learning architectures. For example, a first neural network architecture may be used to implement a first level of complexity, a second neural network architecture may be used to implement a second level of complexity, etc. In some cases, the different decoder model instances can implement different complexity based on the respective decoder model instances utilizing different neural network or machine learning weights or parameters.
For example, a first set of neural network weights can be used to parameterize a neural network architecture to a first level of complexity, a second set of neural network weights can be used to parameterize a same or different neural network architecture to a second level of complexity, etc.
[0049] In some examples, the different complexity decoder model instances may share one or more neural network layers and can scale in complexity based on combining the shared neural network layers with one or more complexity level-specific subsets of corresponding neural network layers. For example, a first complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the first complexity level. A second complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the second complexity level, etc.
[0050] In some cases, lower complexity decoders can be implemented as subsets of the plurality of neural network layers included in the highest complexity decoder model instance. For example, one or more neural network layers and/or subsets of neural network layers of the highest complexity decoder model instance can be deactivated to implement lower complexity level decoder instances. In some examples, lower complexity decoders can be implemented by bypassing one or more neural network layers, subsets of neural network layers, components, processing steps, etc., of the highest complexity decoder model instance at the device.
[0051] Further aspects of the systems and techniques will be described with respect to the figures.
[0052] FIG. 1 illustrates an example implementation of a system-on-a-chip (SoC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU, configured to perform one or more of the functions described herein.
Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., neural network with weights), delays, frequency bin information, and task information, among other information, may be stored in a memory block associated with a neural processing unit (NPU) 108, in a memory block associated with a CPU 102, in a memory block associated with a graphics processing unit (GPU) 104, in a memory block associated with a digital signal processor (DSP) 106, in a memory block 118, and/or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from a memory block 118.
[0053] The SoC 100 may also include additional processing blocks tailored to specific functions, such as a GPU 104, a DSP 106, a connectivity block 110, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 112 that may, for example, detect and recognize gestures, speech, and/or other interactive user action(s) or input(s). In one implementation, the NPU 108 is implemented in the CPU 102, DSP 106, and/or GPU 104. The SoC 100 may also include a sensor processor 114, image signal processors (ISPs) 116, and/or a signal synthesis system 120. For example, the signal synthesis system 120 can be implemented or configured as a speech synthesis system, including a neural speech decoder and/or a neural homomorphic vocoder (NHV) system, which can be used to generate speech (e.g., perform speech synthesis), can be implemented in a text-to-speech (TTS) system, etc. In some examples, the sensor processor 114 can be associated with or connected to one or more sensors for providing sensor input(s) to sensor processor 114.
For example, the one or more sensors and the sensor processor 114 can be provided in, coupled to, or otherwise associated with a same computing device.
[0054] In some examples, the one or more sensors can include one or more microphones for receiving sound (e.g., an audio input), including sound or audio inputs that can be used to perform various speech synthesis tasks and/or to generate a reconstructed speech signal from the audio input, etc. In some cases, the sound or audio input received by the one or more microphones (and/or other sensors) may be digitized into data packets for analysis and/or transmission. The audio input may include ambient sounds in the vicinity of a computing device associated with the SoC 100 and/or may include speech from a user of the computing device associated with the SoC 100. In some cases, a computing device associated with the SoC 100 can additionally, or alternatively, be communicatively coupled to one or more peripheral devices (not shown) and/or configured to communicate with one or more remote computing devices or external resources, for example using a wireless transceiver and a communication network, such as a cellular communication network.
[0055] SoC 100, DSP 106, NPU 108, and/or signal synthesis system 120 (e.g., a neural speech decoder, NHV, etc.) may be configured to perform audio signal processing. For example, the signal synthesis system 120 may be configured to perform steps for speech synthesis and/or neural homomorphic vocoding, etc. As another example, one or more portions of the steps, such as feature generation, for speech synthesis and/or NHV may be performed by the signal synthesis system 120 while the DSP 106/NPU 108 performs other steps, such as steps using one or more machine learning networks and/or machine learning techniques according to aspects of the present disclosure and as described herein.
[0056] FIG. 2A depicts an example of a feature generator 200, in accordance with aspects of the present disclosure. It should be understood that many techniques may be used to generate feature vectors for an audio input and that feature generator 200 is just a single example of a technique that may be used to generate feature vectors.
[0057] Feature generator 200 receives an audio signal at signal pre-processor 202. As noted above, the audio signal may be from an audio source of an electronic device, such as a microphone. Signal pre-processor 202 may perform various pre-processing steps on the received audio signal. For example, signal pre-processor 202 may split the audio signal into parallel audio signals and delay one of the signals by a predetermined amount of time to prepare the audio signals for input into a Fast Fourier Transform (FFT) circuit. As another example, signal pre-processor 202 may perform a windowing function, such as a Hamming, Hann, Blackman-Harris, Kaiser-Bessel window function, or other sine-based window function, which may improve the performance of further processing stages, such as signal domain transformer 204. Generally, a windowing (or window) function may be used to reduce the amplitude of discontinuities at the boundaries of each finite sequence of received audio signal data to improve further processing. As another example, signal pre-processor 202 may convert the audio signal data from parallel to serial, or vice versa, for further processing. The pre-processed audio signal from the signal pre-processor 202 may be provided to signal domain transformer 204, which may transform the pre-processed audio signal from a first domain into a second domain, such as from a time domain into a frequency domain.
[0058] In some aspects, signal domain transformer 204 implements a Fourier transform, such as a Fast Fourier Transform (FFT).
For example, in some cases, the Fast Fourier transform may be a 16-band (or bin, channel, or point) FFT, which generates a compact feature set that may be efficiently processed by a model. In some cases, a Fourier transform provides fine spectral domain information about the incoming audio signal as compared to conventional single channel processing, such as conventional hardware SNR threshold detection. The result of signal domain transformer 204 is a set of audio features, such as a set of voltages, powers, or energies per frequency band in the transformed data.
[0059] The set of audio features may then be provided to signal feature filter 206, which may reduce the size of or compress the feature set in the audio feature data. In some aspects, signal feature filter 206 may discard certain features from the audio feature set, such as symmetric or redundant features from multiple bands of a multi-band FFT. Discarding this data reduces the overall size of the data stream for further processing and may be referred to as compressing the data stream. For example, in some cases, a 16-band FFT may include 8 symmetric or redundant bands after the powers are squared, because audio signals are real. Thus, signal feature filter 206 may filter out the redundant or symmetric band information and output an audio feature vector 208. In some cases, output of the signal feature filter may be compressed or otherwise processed prior to output as the audio feature vector 208. The audio feature vector 208 may be provided to a speech synthesis system (e.g., such as the signal synthesis system 120 of FIG. 1) for processing by a speech synthesis or NHV model.
[0060] Audio coding (e.g., speech coding, music signal coding, or other type of audio coding) can be performed on a digitized audio signal (e.g., a speech signal) to compress the amount of data for storage, transmission, and/or other use.
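The feature generation steps described for feature generator 200 (windowing, a 16-point transform, per-band powers, and discarding symmetric bands) can be illustrated with the following minimal Python sketch. It uses a direct DFT from the standard library rather than an optimized FFT, and the function names are hypothetical. Keeping only the first N/2 bins matches the 8-of-16 split described above; treating the Nyquist bin as discardable is an assumption of this sketch.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann window of length n (n must be at least 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x):
    """Direct O(N^2) discrete Fourier transform; an FFT yields the same values."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def power_spectrum(frame):
    """Window the frame (pre-processor 202) and compute per-bin power (transformer 204)."""
    windowed = [s * w for s, w in zip(frame, hann_window(len(frame)))]
    return [abs(c) ** 2 for c in dft(windowed)]


def audio_feature_vector(frame):
    """Drop the mirrored upper half of the power spectrum (feature filter 206)."""
    power = power_spectrum(frame)
    return power[: len(frame) // 2]
```

For a real-valued 16-sample frame, the power in bin k equals the power in bin 16 - k, which is why the upper bins carry no additional information.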
FIG. 2B is a block diagram illustrating an example of a voice coding system 250 (which can also be referred to as a voice or speech coder or a voice coder-decoder (codec)). A voice encoder 252 of the voice coding system 250 can use a voice coding algorithm to process a speech signal 251. The speech signal 251 can include a digitized speech signal generated from an analog speech signal from a given source. For instance, the digitized speech signal can be generated using a filter to eliminate aliasing, a sampler to convert to discrete-time, and an analog-to-digital converter for converting the analog signal to the digital domain. The resulting digitized speech signal (e.g., speech signal 251) is a discrete-time speech signal with sample values (referred to herein as samples) that are also discretized.
[0061] Using the voice coding algorithm, the voice encoder 252 can generate a compressed signal (including a lower bit-rate stream of data) that represents the speech signal 251 using as few bits as possible, while attempting to maintain a certain quality level for the speech. The voice encoder 252 can use any suitable voice coding algorithm, such as a linear prediction coding algorithm (e.g., Code-excited linear prediction (CELP), algebraic-CELP (ACELP), or other linear prediction technique) or other voice coding algorithm.
[0062] The voice encoder 252 can compress the speech signal 251 in an attempt to reduce the bit-rate of the speech signal 251. The bit-rate of a signal is based on the sampling frequency and the number of bits per sample. For instance, the bit-rate of a speech signal can be determined as $BR = S \times b$, where BR is the bit-rate, S is the sampling frequency, and b is the number of bits per sample. In one illustrative example, at a sampling frequency (S) of 8 kilohertz (kHz) and at 16 bits per sample (b), the bit-rate (BR) of a signal would be a bit-rate of 128 kilobits per second (kbps).
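The bit-rate arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and the per-frame figure assumes the 20ms frame duration used as an example elsewhere in this description.

```python
def bit_rate_bps(sampling_hz, bits_per_sample):
    """Uncompressed bit-rate in bits per second: BR = S * b."""
    return sampling_hz * bits_per_sample


# Illustrative example from the text: S = 8 kHz, b = 16 bits per sample.
uncompressed_bps = bit_rate_bps(8_000, 16)

# Assumed 20 ms frame: number of uncompressed bits carried by one frame.
bits_per_20ms_frame = uncompressed_bps * 20 // 1000

print(uncompressed_bps / 1000, "kbps")       # 128.0 kbps
print(bits_per_20ms_frame, "bits per frame")  # 2560 bits per frame
```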
[0063] The compressed speech signal can then be stored and/or sent to and processed by a voice decoder 254. In some examples, the voice decoder 254 can communicate with the voice encoder 252, such as to request speech data, send feedback information, and/or provide other communications to the voice encoder 252. In some examples, the voice encoder 252 or a channel encoder can perform channel coding on the compressed speech signal before the compressed speech signal is sent to the voice decoder 254. For instance, channel coding can provide error protection to the bitstream of the compressed speech signal to protect the bitstream from noise and/or interference that can occur during transmission on a communication channel.
[0064] The voice decoder 254 can decode the data of the compressed speech signal and construct a reconstructed speech signal 255 that approximates the original speech signal 251. The reconstructed speech signal 255 includes a digitized, discrete-time signal that can have the same or similar bit-rate as that of the original speech signal 251. The voice decoder 254 can use an inverse of the voice coding algorithm used by the voice encoder 252, which as noted above can include any suitable voice coding algorithm, such as a linear prediction coding algorithm (e.g., CELP, ACELP, or other suitable linear prediction technique) or other voice coding algorithm. In some cases, the reconstructed speech signal 255 can be converted to a continuous-time analog signal, such as by performing digital-to-analog conversion and anti-aliasing filtering.
[0065] Voice coders can exploit the fact that speech signals are highly correlated waveforms. The samples of an input speech signal can be divided into blocks of N samples each, where a block of N samples is referred to as a frame. In one illustrative example, each frame can be 10-20 milliseconds (ms) in length. Various voice coding algorithms can be used to encode a speech signal.
For instance, code-excited linear prediction (CELP) is one example of a voice coding algorithm. The CELP model is based on a source-filter model of speech production, which assumes that the vocal cords are the source of spectrally flat sound (an excitation signal), and that the vocal tract acts as a filter to spectrally shape the various sounds of speech. The different phonemes (e.g., vowels, fricatives, and voiced fricatives) can be distinguished by their excitation (source) and spectral shape (filter).
[0066] In general, CELP uses a linear prediction (LP) model to model the vocal tract, and uses entries of a fixed codebook (FCB) as input to the LP model. For instance, long-term linear prediction can be used to model pitch of a speech signal, and short-term linear prediction can be used to model the spectral shape (phoneme) of the speech signal. Entries in the FCB are based on coding of a residual signal that remains after the long-term and short-term linear prediction modeling is performed. For example, long-term linear prediction and short-term linear prediction models can be used for speech synthesis, and a fixed codebook (FCB) can be searched during encoding to locate the best residual for input to the long-term and short-term linear prediction models. The FCB provides the residual speech components not captured by the short-term and long-term linear prediction models. A residual, and a corresponding index, can be selected at the encoder based on an analysis-by-synthesis process that is performed to choose the best parameters so as to match the original speech signal as closely as possible. The index can be sent to the decoder, which can extract the corresponding LTP residual from the FCB based on the index.
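The source-filter structure described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm of any particular codec: the codebook contents, gains, single-tap pitch predictor, zero initial filter state, and function names are all hypothetical, and the short-term predictor uses the convention s[n] = e[n] + sum over k of a_k * s[n - k].

```python
def celp_synthesize(fcb, index, gain, lp_coeffs, pitch_lag, pitch_gain):
    """Toy CELP-style synthesis: FCB entry -> long-term predictor -> short-term LP filter."""
    residual = [gain * v for v in fcb[index]]

    # Long-term (pitch) synthesis: e[n] = r[n] + g_p * e[n - T]
    excitation = []
    for n, r in enumerate(residual):
        past = excitation[n - pitch_lag] if n >= pitch_lag else 0.0
        excitation.append(r + pitch_gain * past)

    # Short-term LP synthesis: s[n] = e[n] + sum_k a_k * s[n - k]
    speech = []
    for n, e in enumerate(excitation):
        prediction = sum(a * speech[n - k]
                         for k, a in enumerate(lp_coeffs, start=1) if n - k >= 0)
        speech.append(e + prediction)
    return speech


def search_fcb(target, fcb, gain, lp_coeffs, pitch_lag, pitch_gain):
    """Analysis-by-synthesis: pick the FCB index whose synthesized output best matches the target."""
    def squared_error(index):
        synth = celp_synthesize(fcb, index, gain, lp_coeffs, pitch_lag, pitch_gain)
        return sum((t - s) ** 2 for t, s in zip(target, synth))
    return min(range(len(fcb)), key=squared_error)
```

In this sketch, the encoder would run `search_fcb` and transmit only the winning index, and the decoder would call `celp_synthesize` with that index to regenerate the frame. Practical codecs also search over gains and use perceptual weighting, which are omitted here.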
[0067] FIG. 3 is a diagram illustrating an example of a voice decoding signal synthesis system 300 utilizing a linear time-varying filter 304 with coefficients generated using a neural network (NN) filter estimator 302 and a separate linear predictive coding (LPC) filter 306. The voice decoding signal synthesis system 300 is configured to decode data of the compressed speech signal to generate a reconstructed speech signal $\hat{s}(n)$ (also referred to as a synthesized speech sample) for a current time instant n that approximates an original speech signal that was previously compressed by a voice encoder (not shown). The voice encoder can be similar to and can perform some or all of the functions of the voice encoder 252 described above with respect to FIG. 2B, or other type of voice encoder. For example, the voice encoder can include a short-term LP engine, an LTP engine, and an FCB. In another example, the voice encoder can include a magnitude spectrum generator (e.g., Mel-scale magnitude spectrum or full spectrum magnitude), a short-term linear prediction (LP) engine, and a pitch tracker that detects a fundamental pitch harmonic frequency of the speech and pitch correlation.
[0068] The voice encoder can extract (and in some cases quantize) a set of features (referred to as a feature set) from the speech signal, and can send the extracted (and in some cases quantized) feature set to the voice decoding signal synthesis system 300. The features that are computed by the voice encoder can depend on a particular encoder implementation used. Various illustrative examples of feature sets are provided below according to different encoder implementations, which can be extracted by the voice encoder (and in some cases quantized), and sent to the voice decoding signal synthesis system 300. However, one of ordinary skill will appreciate that other feature sets can be extracted by the voice encoder.
For example, the voice encoder can extract any set of features, can quantize that feature set, and can send the feature set to the voice decoding signal synthesis system 300.
[0069] As noted above, various combinations of features can be extracted as a feature set by the voice encoder. For example, a feature set can include one or any combination of the following features: Linear Prediction (LP) coefficients; Line Spectral Pairs (LSPs); Line Spectral Frequencies (LSFs); pitch lag with integer or fractional accuracy; pitch gain; pitch correlation; Mel-scale frequency cepstral coefficients (also referred to as Mel cepstrum) of the speech signal; Bark-scale frequency cepstral coefficients (also referred to as bark cepstrum) of the speech signal; Mel-scale frequency cepstral coefficients of the LTP residual; Bark-scale frequency cepstral coefficients of the LTP residual; a spectrum (e.g., Discrete Fourier Transform (DFT) or other spectrum) of the speech signal; a spectrum (e.g., DFT or other spectrum) of the LTP residual; voicing level of each frequency band of each speech frame; fundamental frequency of pitch harmonics; pitch correlation of each speech frame; and/or time domain pitch lag of each speech frame.
[0070] For any one or more of the other features listed above, the voice encoder can use any estimation and/or quantization method, such as an engine or algorithm from any suitable voice codec (e.g., EVS, AMR, or other voice codec) or a neural network-based estimation and/or quantization scheme (e.g., convolutional or fully-connected (dense) or recurrent Autoencoder, or other neural network-based estimation and/or quantization scheme). The voice encoder can also use any frame size, frame overlap, and/or update rate for each feature. The voice encoder can also include extra redundancies in the features to ensure robustness of operation against packet losses.
Examples of estimation and quantization methods for each example feature are provided below for illustrative purposes, where other examples of estimation and quantization methods can be used by the voice encoder.
[0071] As noted above, one example of features that can be extracted from a voice signal by the voice encoder includes linear prediction (LP) coefficients and/or line spectral frequencies (LSFs). Various estimation techniques can be used to compute the LP coefficients and/or LSFs. For example, the voice encoder can estimate LP coefficients (and/or LSFs) from a speech signal using the Levinson-Durbin algorithm. In some examples, the LP coefficients and/or LSFs can be estimated using an autocovariance method for LP estimation. In some cases, the LP coefficients can be determined, and an LP to LSF conversion algorithm can be performed to obtain the LSFs. Any other LP and/or LSF estimation engine or algorithm can be used, such as an LP and/or LSF estimation engine or algorithm from an existing codec (e.g., EVS, AMR, or other voice codec).
[0072] Various quantization techniques can be used to quantize the LP coefficients and/or LSFs. For example, the voice encoder can use a single stage vector quantization (SSVQ) technique, a multi-stage vector quantization (MSVQ), or other vector quantization technique to quantize the LP coefficients and/or LSFs. In some cases, a predictive or adaptive SSVQ or MSVQ (or other vector quantization technique) can be used to quantize the LP coefficients and/or LSFs. In another example, an autoencoder or other neural network-based technique can be used by the voice encoder to quantize the LP coefficients and/or LSFs. Any other LP and/or LSF quantization engine or algorithm can be used, such as an LP and/or LSF quantization engine or algorithm from an existing codec (e.g., EVS, AMR, or other voice codec).
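As an illustration of the LP estimation step mentioned above, the following Python sketch implements the textbook Levinson-Durbin recursion on autocorrelation values (the autocorrelation method, not the autocovariance method also mentioned above). The function names are hypothetical, and practical codecs typically add windowing, lag weighting, and bandwidth expansion, which are omitted here. The returned coefficients follow the predictor convention s[n] ≈ sum over k of a_k * s[n - k].

```python
def autocorrelation(x, order):
    """Biased autocorrelation values r[0..order] of the sequence x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; return (coefficients a_1..a_order, residual energy)."""
    a = [0.0] * (order + 1)   # a[0] is unused so indices match the math
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break             # degenerate (e.g., all-zero) input
        # Reflection coefficient for model order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err
```

For example, the autocorrelation sequence [1.0, 0.5, 0.25] of a first-order process yields coefficients [0.5, 0.0] and a residual energy of 0.75, so the second-order term correctly vanishes.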
[0073] Autoencoders have gained popularity in recent years as they are able to learn efficient representations of input data without the need for labels (e.g., based on performing unsupervised learning). Various types of autoencoders exist and are well explained in "Autoencoder and its various variants," 2018 IEEE International Conference on Systems, Man, and Cybernetics, Zhang et al. In speech coding, autoencoders conditioned on log Mel Cepstrum inputs and/or spectrogram inputs have been used for speech compression in "Enhancing into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders," arXiv:2102.06610v1 [eess.AS] 12 Feb 2021, Casebeer et al. [0074] Speech coding is a lossy compression process. Autoencoders perform dimensionality reduction (e.g., an N-dimensional input vector is passed into the encoder of the autoencoder, and lossy compression is performed to represent the important aspects of the input vector in an M-dimensional vector, where M is smaller than N, e.g., by an order of magnitude). A drawback of using an autoencoder for speech coding is that it cannot necessarily exploit the temporal relationship between sets of input data. U.S. Patent No. 11,526,734 ("the '734 patent", assigned to Qualcomm, Inc.), "Method and Apparatus for recurrent auto encoding", Yang et al., improves upon an autoencoder to exploit temporal redundancies and correlations and describes a feedback recurrent autoencoder ("FRAE"). The FRAE in the '734 patent can be used for training and application of compression of sequential data with temporal correlation. The recurrent structure of the FRAE can be used to efficiently extract the redundancy embedded along the time-dimension of sequential data and enables compact discrete representation of the data at the bottleneck in a sequential fashion.
For example, Table 1 of the '734 patent illustrates a Mel-scale mean squared error (MSE) configured as a reconstruction loss for training of the FRAE, where the MSE of each frequency bin is scaled according to its weight at Mel-frequency, both for latent feedback and output feedback. [0075] The FRAE described in the '734 patent has two advantageous features not present in autoencoders: (1) recurrent layers (e.g., LSTM or GRU layers) that have memory of the past; and (2) feedback from the decoder of the autoencoder to the encoder of the autoencoder. The feedback connection 150 in Fig. 2, Fig. 3, and Fig. 7 of the '734 patent provides additional historical information from a state ($h_t$) in the decoder to the encoder indicative of how reconstruction of a prior input has fared. This feedback loop is present during training and inference, which allows the encoder to be trained to respond to particular feedback conditions in a manner that improves reproduction of the input data by the decoder. Thus, the feedback loop is analogous to a mode switch input in that it is not encoded by the encoder, but it is used as an input that influences how the encoder operates on the next input vector. [0076] In addition, the '734 patent describes a second feedback connection (152) from the state ($h_t$) of the decoder for a first iteration of a series of inputs $X_t$ to the next iteration of the series of inputs $X_{t+1}$. This second feedback connection (152) enables the decoder to learn from its previous reconstruction attempts, providing additional historical context about how reconstruction of a prior input has fared. This second feedback loop is also present during both training and inference, which allows the decoder to be trained to respond to particular feedback conditions in a manner that improves reproduction of the input data by the decoder. [0077] The '734 patent also describes other optional feedback connections.
For example, an embedding vector (z) may be fed back via a third feedback connection (356 in Fig. 3 of the '734 patent) to the encoder. As another optional example, the '734 patent describes that the FRAE can be a variational autoencoder. In this example, an output of the decoder (754 in Fig. 7 of the '734 patent) is sampled and parameterized to generate an autoregressive prior, which can be used as a fourth feedback connection to condition a prior model for the next latent space embedded vector ($z_{t+1}$). Similar to the previously described feedback connections, each of these optional connections is present during both training and inference, and thus is trained as part of the training process, thereby improving functionality of the FRAE. [0078] Another example of features that can be extracted from a voice signal by the voice encoder includes pitch lag (integer and/or fractional), pitch gain, and/or pitch correlation. Various estimation techniques can be used to compute the pitch lag, pitch gain, and/or pitch correlation. For example, the voice encoder can estimate the pitch lag, pitch gain, and/or pitch correlation (or any combination thereof) from a speech signal using any pitch lag, gain, and/or correlation estimation engine or algorithm (e.g., autocorrelation-based pitch lag estimation). For example, the voice encoder can use a pitch lag, gain, and/or correlation estimation engine (or algorithm) from any suitable voice codec (e.g., EVS, AMR, or other voice codec). Various quantization techniques can be used to quantize the pitch lag, pitch gain, and/or pitch correlation. For example, the voice encoder can quantize the pitch lag, pitch gain, and/or pitch correlation (or any combination thereof) from a speech signal using any pitch lag, gain, and/or correlation quantization engine or algorithm from any suitable voice codec (e.g., EVS, AMR, or other voice codec).
In some cases, an autoencoder or other neural network based technique can be used by the voice encoder to quantize the pitch lag, pitch gain, and/or pitch correlation features. [0079] Another example of features that can be extracted from a voice signal by the voice encoder includes the Mel cepstrum coefficients and/or Bark cepstrum coefficients of the speech signal, and/or the Mel cepstrum coefficients and/or Bark cepstrum coefficients of the LTP residual. Various estimation techniques can be used to compute the Mel cepstrum coefficients and/or Bark cepstrum coefficients. For example, the voice encoder can use a Mel or Bark frequency cepstrum technique that includes Mel or Bark frequency filter banks computation, filter bank energy computation, logarithm application, and discrete cosine transform (DCT) or truncation of the DCT. Various quantization techniques can be used to quantize the Mel cepstrum coefficients and/or Bark cepstrum coefficients. For example, vector quantization (single stage or multistage) or predictive/adaptive vector quantization can be used. In some cases, an autoencoder or other neural network based technique can be used by the voice encoder to quantize the Mel cepstrum coefficients and/or Bark cepstrum coefficients. Any other suitable cepstrum quantization methods can be used. [0080] Another example of features that can be extracted from a voice signal by the voice encoder includes the spectrum of the speech signal and/or the spectrum of the LTP residual. Various estimation techniques can be used to compute the spectrum of the speech signal and/or the LTP residual. For example, a Discrete Fourier transform (DFT), a Fast Fourier Transform (FFT), or other transform of the speech signal can be determined. Quantization techniques that can be used to quantize the spectrum of the voice signal can include vector quantization (single stage or multistage) or predictive/adaptive vector quantization.
In some cases, an autoencoder or other neural network based technique can be used by the voice encoder to quantize the spectrum. Any other suitable spectrum quantization methods can be used. [0081] As noted above, any one of the above-described features or any combination of the above-described features can be estimated, quantized, and sent by the voice encoder to the voice decoding signal synthesis system 300 depending on the particular encoder implementation that is used. In one illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, and pitch correlation. In another illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the Bark cepstrum of the speech signal. In another illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the spectrum (e.g., DFT, FFT, or other spectrum) of the speech signal. In another illustrative example, the voice encoder can estimate, quantize, and send pitch lag with fractional accuracy, pitch gain, pitch correlation, and the Bark cepstrum of the speech signal. In another illustrative example, the voice encoder can estimate, quantize, and send pitch lag with fractional accuracy, pitch gain, pitch correlation, and the spectrum (e.g., DFT, FFT, or other spectrum) of the speech signal. In another illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the Bark cepstrum of the LTP residual. In another illustrative example, the voice encoder can estimate, quantize, and send LP coefficients, pitch lag with fractional accuracy, pitch gain, pitch correlation, and the spectrum (e.g., DFT, FFT, or other spectrum) of the LTP residual. 
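The single stage and multistage vector quantization mentioned above for LP coefficients, cepstra, and spectra can be sketched as follows. This is a minimal greedy MSVQ in Python, assuming tiny hand-written codebooks (hypothetical values; real codebooks would be trained) and a squared-error nearest-neighbor search. With a single codebook it reduces to single stage VQ. Practical codecs may use more elaborate searches and predictive or adaptive variants.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def msvq_encode(vec, codebooks):
    """Greedy multistage VQ: each stage quantizes the previous residual."""
    indices = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def msvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks for a 2-dimensional feature vector.
stage1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
stage2 = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
indices = msvq_encode([1.1, 0.9], [stage1, stage2])
decoded = msvq_decode(indices, [stage1, stage2])
```

Only the stage indices would need to be transmitted; the decoder holds the same codebooks and sums the selected codewords.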
[0082] The voice decoding signal synthesis system 300 includes a neural network filter estimator 302, a linear time-varying filter 304 generated by the neural network, and a linear predictive coding (LPC) filter 306. The LPC filter 306 can include a time-varying LPC filter. The neural network filter estimator 302 is trained to generate filter coefficients for the linear time-varying filter 304. The neural network model of the neural network filter estimator 302 can include any neural network architecture that can be trained to model the filter coefficients for the linear time-varying filter 304. Examples of neural network architectures that can be included in the neural network filter estimator 302 include a generative neural network (e.g., a generative-adversarial network (GAN)), convolutional neural networks (CNN), an autoencoder, and/or other type(s) of neural network architectures or models. [0083] The voice decoding signal synthesis system 300 (e.g., the neural network model of the neural network filter estimator 302) can be trained using any suitable neural network training technique. In some examples, the neural network model of the neural network filter estimator 302 can be trained using supervised learning techniques based on backpropagation. For instance, corresponding input and target output pairs can be provided to the neural network filter estimator 302 for training. In one example, for each time instant n, the input to the neural network filter estimator 302 can include log-Mel-frequency spectrum features or coefficients (e.g., 80 log-Mel features $S[m, c]$ 301 shown in FIG. 3). In some examples, the target output (or label or ground truth) for training the neural network filter estimator 302 can include the target speech sample $x[n]$ 305 for the current time instant n, as shown in FIG. 3.
In such examples, a loss 307 will be computed based on the reconstructed sample $\hat{x}[n]$ and the target output speech sample $x[n]$ 305 for time instant n. In some examples, the target output can include a signal that is generated after passing the target speech through an LPC analysis filter (the inverse of the LPC filter 306). In this case, the loss will be computed based on the output generated by the linear time-varying filter 304 and that target output (e.g., where both are in the speech residual domain). [0084] Backpropagation can be performed to train the neural network filter estimator 302 using the inputs and the target output. Backpropagation can include a forward pass, a loss function, a backward pass, and a parameter update to update one or more parameters (e.g., weight, bias, or other parameter). The forward pass, loss function, backward pass, and parameter update are performed for one training iteration. The process can be repeated for a certain number of iterations for each set of inputs until the neural network filter estimator 302 is trained well enough so that the weights (and/or other parameters) of the various layers are accurately tuned. [0085] In some aspects, training of the neural network filter estimator 302 and/or training of one or more of the machine learning systems or neural networks described herein (e.g., such as the decoder model 510-1 with first complexity, decoder model 510-2 with second complexity, and/or decoder model 510-N with Nth complexity, etc., of FIG. 5; any of the decoder model instances 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, and/or 610-8 of FIGS. 6A-6B; the subset 1 of decoder neural network layers 715-1, the subset 2 of decoder neural network layers 715-2, and/or the subset N of decoder neural network layers 715-N of FIG. 7; etc.)
can be performed using online training (e.g., in some cases on-device training), offline training, and/or various combinations of online and offline training. [0086] In some cases, online may refer to time periods during which the input data (e.g., such as the bitstream 502 of FIG. 5, the audio signal of FIG. 2A, the speech signal 251 of FIG. 2B, the excitation signal 303 and/or features 301 of FIG. 3, the features S[m,c] of FIG. 3, and/or the dequantized features output by the feature dequantization engine 642 of FIG. 6B, etc.) is processed, for instance for performance of the neural network-based temporal feature upsampling processing implemented by the systems and techniques described herein. In some examples, offline may refer to idle time periods or time periods during which input data is not being processed. Additionally, offline may be based on one or more time conditions (e.g., after a particular amount of time has expired, such as a day, a week, a month, etc.) and/or may be based on various other conditions such as network and/or server availability, etc., among various others. In some aspects, offline training of a machine learning model (e.g., a neural network model) can be performed by a first device (e.g., a server device) to generate a pre-trained model, and a second device can receive the trained model from the first device. In some cases, the second device (e.g., a mobile device, an XR device, a vehicle or system/component of the vehicle, or other device) can perform online (or on-device) training of the pre-trained model to further adapt or tune the parameters of the model. [0087] In some cases, the forward pass can include passing the input data (e.g., the log-Mel-frequency spectrum features or coefficients, such as the 80 log-Mel features $S[m, c]$ 301 shown in FIG. 3) through the neural network filter estimator 302. The weights of the neural network model are initially randomized before the neural network filter estimator 302 is trained.
For a first training iteration for the neural network filter estimator 302, the output will likely include values that do not give preference to any particular output due to the weights being randomly selected at initialization. With the initial weights, the neural network filter estimator 302 is unable to determine low level features and thus cannot make an accurate estimation of the filter coefficients for the linear time-varying filter 304. A loss function can be used to analyze the loss 307 (or error) in the reconstructed or synthesized sample output $\hat{x}[n]$. Any suitable loss function definition can be used. One example of a loss function includes a mean squared error (MSE). The MSE is defined as $E_{total} = \sum \frac{1}{2}(\text{target} - \text{output})^2$, which calculates the sum of one-half times the actual answer minus the predicted (output) answer squared. The loss can be set to be equal to the value of $E_{total}$. Other loss functions may include a difference of magnitude spectrums between the target and output signals, where the difference may be computed as absolute difference, squared difference, or logarithmic difference between the magnitude spectrum of each speech frame, and then aggregated over all speech frames. [0088] The loss (or error) will be high for the first training iterations since the actual values will be much different than the predicted output. The goal of training is to minimize the amount of loss so that the predicted output is the same as the training label. The neural network filter estimator 302 can perform a backward pass by determining which inputs (weights) most contributed to the loss of the network, and can adjust the weights so that the loss decreases and is eventually minimized. A derivative of the loss with respect to the weights (denoted as dL/dW, where W represents the weights at a particular layer) can be computed to determine the weights that contributed most to the loss of the network.
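The training iteration outlined in paragraphs [0084] to [0088] (forward pass, loss function, backward pass, parameter update) can be illustrated with a deliberately small Python sketch. It assumes a hypothetical two-parameter linear model in place of the neural network filter estimator 302, the one-half squared-error loss, and plain gradient descent with a fixed learning rate; the model, data, initial values, and learning rate are all illustrative assumptions.

```python
def forward(w, b, inputs):
    """Forward pass of a toy linear model: output = w * x + b."""
    return [w * x + b for x in inputs]


def total_loss(targets, outputs):
    """E_total = sum of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))


def train(inputs, targets, lr=0.05, iterations=500):
    """Repeat forward pass, loss, backward pass, and parameter update."""
    w, b = 0.3, -0.1  # arbitrary initial values (stand-in for random init)
    history = []
    for _ in range(iterations):
        outputs = forward(w, b, inputs)                # forward pass
        history.append(total_loss(targets, outputs))   # loss function
        # Backward pass: derivatives of E_total with respect to w and b.
        dw = sum(-(t - o) * x for t, o, x in zip(targets, outputs, inputs))
        db = sum(-(t - o) for t, o in zip(targets, outputs))
        # Parameter update: move against the gradient, scaled by lr.
        w -= lr * dw
        b -= lr * db
    return w, b, history


# Hypothetical data generated from target = 2 * x + 1.
w, b, history = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

On this toy data the loss decreases over the iterations and the parameters approach w = 2 and b = 1. A learning rate that is too large for the data would make the same loop diverge.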
After the derivative is computed, a weight update can be performed by updating all the weights of the filters. For example, the weights can be updated so that they change in the opposite direction of the gradient. The weight update can be denoted as $w = w_i - \eta \frac{dL}{dW}$, where $w$ denotes a weight, $w_i$ denotes the initial weight, and $\eta$ denotes a learning rate. The learning rate can be set to any suitable value, with a high learning rate producing larger weight updates and a lower value producing smaller weight updates. In some examples, to train the neural network filter estimator 302, a multi-resolution STFT loss and adversarial losses can be computed from the target signal and the synthesized signal. Because linear time-varying filters are fully differentiable, gradients can propagate back to the neural network filter estimator 302. [0089] Using the filter coefficients generated by the neural network filter estimator 302, the linear time-varying filter 304 can process an excitation signal 303 to generate another signal. That signal can be used as an excitation signal to excite the LPC filter 306. The linear time-varying filter 304 is a linear filter, which preserves the linearity property between inputs and outputs. For instance, a linear filter is associated with a mapping $\mathcal{L}: \mathbb{R}^{\mathbb{Z}} \to \mathbb{R}^{\mathbb{Z}}$, $x[n] \mapsto y[n] = \mathcal{L}(x[n])$, that has the following property: for any $x_1[n], x_2[n] \in \mathbb{R}^{\mathbb{Z}}$ and any $a, b \in \mathbb{R}$, $\mathcal{L}(a\,x_1[n] + b\,x_2[n]) = a\,\mathcal{L}(x_1[n]) + b\,\mathcal{L}(x_2[n])$. In one illustrative example, if input1 produces output1 and input2 produces output2, then a combined input of input1 plus input2 produces output1 plus output2. The time-varying nature of the linear time-varying filter 304 indicates that the filter response depends on the time of excitation of the linear time-varying filter 304 (e.g., a new set of coefficients is used to filter each frame (block of time) of input at the time of excitation).
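The two properties described here, linearity and dependence of the response on the time of excitation, can be checked with a short Python sketch. It assumes a frame-wise FIR realization in which each input sample launches the impulse response of the frame it falls in; the function name, the frame length, and the coefficient values are hypothetical and are not taken from this disclosure.

```python
def ltv_filter(x, frame_coeffs, frame_len):
    """Filter x with a different FIR impulse response in each frame.

    Sample x[m] excites the impulse response of frame m // frame_len, and
    the scaled, delayed responses are summed into the output.
    """
    taps = max(len(h) for h in frame_coeffs)
    y = [0.0] * (len(x) + taps - 1)
    for m, sample in enumerate(x):
        h = frame_coeffs[m // frame_len]
        for k, hk in enumerate(h):
            y[m + k] += sample * hk
    return y


# Hypothetical coefficients: one 2-tap response per 2-sample frame.
coeffs = [[1.0, 0.5], [1.0, -0.5]]

# Time variance: the same impulse gives a different response in each frame.
early = ltv_filter([1.0, 0.0, 0.0, 0.0], coeffs, frame_len=2)
late = ltv_filter([0.0, 0.0, 1.0, 0.0], coeffs, frame_len=2)

# Linearity: filtering a*x1 + b*x2 equals a*L(x1) + b*L(x2).
x1 = [1.0, -2.0, 0.5, 3.0]
x2 = [0.25, 4.0, -1.0, 2.0]
mixed = ltv_filter([2.0 * u + 3.0 * v for u, v in zip(x1, x2)], coeffs, 2)
summed = [2.0 * u + 3.0 * v
          for u, v in zip(ltv_filter(x1, coeffs, 2), ltv_filter(x2, coeffs, 2))]
```

Here `early` and `late` differ in shape as well as position, which a time-invariant filter would not allow, while `mixed` and `summed` agree up to floating-point rounding.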
In some cases, time-varying linear filters can be characterized by the set of impulse responses at each time lag, $h_m[n] = \mathcal{L}(\delta[n - m])$ for each $m \in \mathbb{Z}$. In some examples, the output of a time-varying linear filter is $\mathcal{L}(x[n]) = \sum_{m} x[m] \star h_m[n]$ (where $\star$ is a convolutional operator) or some heuristic combination of filter input and impulse responses, e.g., overlap-add on windowed and filtered signal segments, etc. [0090] The LPC filter 306 can use the signal generated by the linear time-varying filter 304 as input to generate the reconstructed or synthesized speech sample $\hat{x}[n]$ for the current time instant n. The LPC filter 306 is a linear filter and in some cases is time varying, as defined above with respect to the linear time-varying filter 304. The LPC filter 306 can be a form of time-varying filter used for processing of speech. The LPC filter 306 includes filter coefficients for each speech frame that can be computed using the autocorrelation of a speech or audio signal. [0091] In some examples, the LPC filter 306 can be used to model the spectral shape (or phoneme or envelope) of the speech signal. For example, at the voice encoder, a signal can be filtered by an autoregressive (AR) process synthesizer to obtain an AR signal. As described above, a linear predictor can be used to predict the AR signal (which can be denoted as prediction $\hat{y}[n]$) as a linear combination of the previous m samples as follows: $\hat{y}[n] = \sum_{k=1}^{m} a_k\, y[n-k]$, [0092] where $y[n]$ denotes the AR signal and the terms $a_1, \ldots, a_m$ are estimates of the AR parameters (also referred to as LP coefficients). A residual signal can be the difference between the original AR signal and the predicted AR signal represented as prediction $\hat{y}[n]$ (e.g., the difference between the actual sample and the predicted sample). At the voice encoder, the linear prediction coding can be used to find the best linear prediction coefficients for minimizing a quadratic error function, and thus the error.
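The relationship between the linear predictor, the residual, and the LPC synthesis filter can be sketched in Python as follows. The sketch assumes fixed (not time-varying) LP coefficients, zero initial filter memory, and the convention that the prediction is a weighted sum of past samples; the function names and coefficient values are illustrative only.

```python
def lpc_analysis(x, a):
    """Residual e[n] = x[n] - sum_{k=1..m} a[k-1] * x[n-k]."""
    e = []
    for n in range(len(x)):
        pred = sum(a[k - 1] * x[n - k]
                   for k in range(1, len(a) + 1) if n - k >= 0)
        e.append(x[n] - pred)
    return e


def lpc_synthesis(e, a):
    """Inverse of lpc_analysis: x[n] = e[n] + sum_{k=1..m} a[k-1] * x[n-k]."""
    x = []
    for n in range(len(e)):
        pred = sum(a[k - 1] * x[n - k]
                   for k in range(1, len(a) + 1) if n - k >= 0)
        x.append(e[n] + pred)
    return x


# Hypothetical second-order coefficients and a short test signal.
a = [0.9, -0.2]
signal = [1.0, 0.5, -0.3, 0.8, 0.1]
residual = lpc_analysis(signal, a)
rebuilt = lpc_synthesis(residual, a)
```

Passing the residual back through the synthesis filter recovers the original samples up to rounding, which is the sense in which the analysis filter is the inverse of the LPC synthesis filter. An excitation generated at the decoder takes the place of the residual in the same recursion.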
The linear prediction process removes the short-term correlation from the speech signal. The linear prediction coefficients are an efficient way to represent the short-term spectrum of the speech signal. At the voice decoding signal synthesis system 300, the LPC filter 306 determines the prediction for the current sample n using computed or received coefficients and a transfer function. For instance, in some examples, the LPC filter coefficients are received from the encoder. In other examples, the voice decoding signal synthesis system 300 can derive the LPC filter coefficients, such as using other features (e.g., Mel spectrum features) sent by an encoder to the voice decoding signal synthesis system 300. For instance, the voice decoding signal synthesis system 300 can use Mel spectrum features 301 to derive the LPC filter coefficients for the LPC filter 306. The LPC filter 306 can determine the final reconstructed (or predicted) sample $\hat{x}[n]$ using the output from the linear time-varying filter 304 (for the current sample n). [0093] In some examples, further components can be used along with the neural network filter estimator 302 and the linear time-varying filter 304, such as an impulse train generator 314 and a random noise generator 316. In some examples, the linear time-varying filter 304 can include a harmonic linear time-varying filter 318 and a noise linear time-varying filter 320. A voice encoder can include a pitch tracker 310 and a feature extraction engine 312. In some examples, the feature extraction engine 312 can be the same as or similar to the feature generator 200 of FIG. 2A. In some cases, the original speech signal $x$ and reconstructed signal $s$ are divided into non-overlapping frames with frame length $L$. The term $m$ can be defined as a frame index, the term $n$ can be defined as a discrete time index, and the term $c$ can be defined as a feature index.
The total number of frames $M$ and total number of sampling points $N$ may follow $N = M \times L$. In $f_0$, $S$, $h_h$, and $h_n$, $0 \le m \le M - 1$. The terms $x$, $s$, $p$, $u$, $s_h$, and $s_n$ are finite duration signals, in which $0 \le n \le N - 1$. Impulse responses $h_h$ and $h_n$ may be infinitely long, in which $n \in \mathbb{Z}$. Impulse response $h$ may be causal, in which $n \in \mathbb{Z}$ and $n \ge 0$. [0094] To perform the speech synthesis process, the impulse train generator 314 can generate an impulse train $p[n]$ from a frame-wise fundamental frequency $f_0[m]$ output by the pitch tracker 310. In one illustrative example, the impulse train generator 314 can generate alias-free discrete time impulse trains using additive synthesis. For instance, as illustrated in equation (1) below, the impulse train generator 314 can use a low-passed sum of sinusoids to generate an impulse train: $p(t) = \sum_{k=1}^{2 k f_0(t) < f_s} \cos\left(\int_0^t 2\pi k f_0(\tau)\, d\tau\right)$ if $f_0(t) > 0$, and $p(t) = 0$ if $f_0(t) = 0$ (1) [0095] where $f_0(t)$ is reconstructed from $f_0[m]$ with zero-order hold or linear interpolation, $p[n] = p(n / f_s)$, and $f_s$ is the sampling rate. In some cases, the computational complexity of additive synthesis can be reduced with approximations. For example, the impulse train generator 314 or other component (e.g., a processor) of the voice decoding signal synthesis system 300 can round the fundamental periods to the nearest multiples of the sampling period. In such an example, the discrete impulse train is sparse. The impulse train generator 314 can then generate the impulse train sequentially (e.g., one pitch mark at a time). [0096] The pitch tracker 310 can process the input $x[n]$ for the time instant n to generate the frame-wise fundamental frequency $f_0[m]$ output, which is provided to and processed by the impulse train generator 314 of the voice decoding signal synthesis system 300. The random noise generator 316 of the voice decoding signal synthesis system 300 can sample a noise signal $u[n]$ from a Gaussian distribution.
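The two source signals described in paragraphs [0094] to [0096] can be sketched in Python as follows: a band-limited impulse train built by additive synthesis from a frame-wise fundamental frequency, and a Gaussian noise signal. The sketch assumes zero-order hold of the frame-wise frequency, a running phase sum in place of the integral, and unit-variance noise; the function names, frame length, and sampling rate are hypothetical choices for illustration.

```python
import math
import random


def impulse_train(f0_frames, frame_len, fs):
    """Additive-synthesis impulse train from frame-wise f0 in Hz.

    Each sample sums cos(k * phase) over harmonics k with 2 * k * f0 < fs,
    where phase accumulates 2 * pi * f0 / fs. Unvoiced frames (f0 == 0)
    produce zeros.
    """
    p = []
    phase = 0.0
    for f0 in f0_frames:              # zero-order hold within each frame
        for _ in range(frame_len):
            if f0 > 0.0:
                sample, k = 0.0, 1
                while 2 * k * f0 < fs:   # keep harmonics below Nyquist
                    sample += math.cos(k * phase)
                    k += 1
                p.append(sample)
                phase += 2.0 * math.pi * f0 / fs
            else:
                p.append(0.0)
    return p


def gaussian_noise(num_samples, seed=0):
    """Noise source sampled from a zero-mean, unit-variance Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


# Hypothetical input: two voiced 10 ms frames at 100 Hz, one unvoiced frame.
p = impulse_train([100.0, 100.0, 0.0], frame_len=80, fs=8000)
u = gaussian_noise(len(p), seed=1)
```

With these values the train peaks once per 80-sample pitch period and is zero in the unvoiced frame. The cheaper approximation mentioned above would instead place single nonzero samples at rounded pitch marks.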
[0097] The neural network filter estimator 302 can estimate impulse responses $h_h[m, n]$ and $h_n[m, n]$ for each frame, given the log-Mel spectrogram $S[m, c]$ extracted from the input $x[n]$ by the feature extraction engine 312 of the encoder. In some aspects, complex cepstrums ($\hat{h}_h$ and $\hat{h}_n$) can be used as the internal description of impulse responses ($h_h$ and $h_n$) for the neural network filter estimator 302. Complex cepstrums describe the magnitude response and the group delay of filters simultaneously. The group delay of filters affects the timbre of speech. In some cases, instead of using linear-phase or minimum-phase filters, the neural network filter estimator 302 can use mixed-phase filters, with phase characteristics learned from the dataset. [0098] In some examples, the length of a complex cepstrum can be restricted, essentially restricting the levels of detail in the magnitude and phase response. Restricting the length of a complex cepstrum can be used to control the complexity of the filters. In some cases, the neural network filter estimator 302 can predict low-frequency coefficients, in which the high-frequency cepstrum coefficients can be set to zero. In one illustrative example, two 10 millisecond (ms) long complex cepstrums are predicted in each frame. In some cases, the neural network filter estimator 302 can use a discrete Fourier transform (DFT) and an inverse-DFT (IDFT) to generate the impulse responses $h_h$ and $h_n$. In some cases, the neural network filter estimator 302 can approximate an infinite impulse response (IIR) ($h_h[m, n]$ and $h_n[m, n]$) using finite impulse responses (FIRs). The DFT size can be set to at least a threshold size (e.g., N=1024) to avoid aliasing. [0099] Using the impulse response $h_h[m, n]$, the harmonic LTV filter 318 can filter the impulse train $p[n]$ from the impulse train generator 314 to generate a harmonic component $s_h[n]$.
Using the impulse response $h_n[m, n]$, the noise LTV filter 320 can filter the noise signal $u[n]$ to generate a noise component $s_n[n]$. The voice decoding signal synthesis system 300 can combine (e.g., by summing/adding or otherwise combining) the output of the harmonic LTV filter 318 (the harmonic component $s_h[n]$) and the output of the noise LTV filter 320 (the noise component $s_n[n]$) to obtain the excitation signal. [0100] FIG. 4 illustrates an example of a computing system 470 of a wireless device 407. The wireless device 407 may include a client device such as a user equipment (UE) or other type of device that may be used by an end-user. For example, the wireless device 407 may include a mobile phone, router, tablet computer, laptop computer, tracking device, wearable device (e.g., a smart watch, glasses, an extended reality (XR) device such as a virtual reality (VR), augmented reality (AR), or mixed reality (MR) device, etc.), Internet of Things (IoT) device, a vehicle, an aircraft, and/or another device that is configured to communicate over a wireless communications network. The wireless device 407 of FIG. 4 may include and/or implement the SoC 100 of FIG. 1. The computing system 470 includes software and hardware components that may be electrically or communicatively coupled via a bus 489 (e.g., or may otherwise be in communication, as appropriate). For example, the computing system 470 includes one or more processors 484. The one or more processors 484 may include one or more CPUs, ASICs, FPGAs, NPUs, DSPs, GPUs, VPUs, NSPs, microcontrollers, hardware accelerators, dedicated hardware, any combination thereof, and/or other processing device or system. The bus 489 may be used by the one or more processors 484 to communicate between cores and/or with the one or more memory devices 486.
[0101] The computing system 470 may also include one or more memory devices 486, one or more digital signal processors (DSPs) 482, one or more SIMs 474, one or more modems 476, one or more wireless transceivers 478, an antenna 487, one or more input devices 472 (e.g., a camera, a mouse, a keyboard, a touch sensitive screen, a touch pad, a keypad, a microphone, and/or the like), and one or more output devices 480 (e.g., a display, a speaker, a printer, and/or the like). [0102] In some aspects, computing system 470 may include one or more radio frequency (RF) interfaces configured to transmit and/or receive RF signals. In some examples, an RF interface may include components such as modem(s) 476, wireless transceiver(s) 478, and/or antennas 487. The one or more wireless transceivers 478 may transmit and receive wireless signals (e.g., signal 488) via antenna 487 to and/or from one or more other devices, such as other wireless devices, network devices (e.g., base stations such as eNBs and/or gNBs, Wi-Fi access points (APs) such as routers, range extenders or the like, etc.), cloud networks, and/or the like. In some examples, the computing system 470 may include multiple antennas or an antenna array that may facilitate simultaneous transmit and receive functionality. Antenna 487 may be an omnidirectional antenna such that radio frequency (RF) signals may be received from and transmitted in all directions. The wireless signal 488 may be transmitted via a wireless network. The wireless network may be any wireless network, such as a cellular or telecommunications network (e.g., 3G, 4G, 5G, etc.), wireless local area network (e.g., a Wi-Fi network), a Bluetooth™ network, and/or other network. [0103] In some examples, the wireless signal 488 may be transmitted directly to other wireless devices using sidelink communications (e.g., using a PC5 interface, using a DSRC interface, etc.).
Wireless transceivers 478 may be configured to transmit RF signals for performing sidelink communications via antenna 487 in accordance with one or more transmit power parameters that may be associated with one or more regulation modes. Wireless transceivers 478 may also be configured to receive sidelink communication signals having different signal parameters from other wireless devices. [0104] In some examples, the one or more wireless transceivers 478 may include an RF front end including one or more components, such as an amplifier, a mixer (e.g., also referred to as a signal multiplier) for signal down conversion, a frequency synthesizer (e.g., also referred to as an oscillator) that provides signals to the mixer, a baseband filter, an analog-to-digital converter (ADC), one or more power amplifiers, among other components. The RF front-end may generally handle selection and conversion of the wireless signals 488 into a baseband or intermediate frequency and may convert the RF signals to the digital domain. [0105] In some cases, the computing system 470 may include a coding-decoding device (or CODEC) configured to encode and/or decode data transmitted and/or received using the one or more wireless transceivers 478. In some cases, the computing system 470 may include an encryption-decryption device or component configured to encrypt and/or decrypt data (e.g., according to the AES and/or DES standard) transmitted and/or received by the one or more wireless transceivers 478. [0106] The one or more SIMs 474 may each securely store an international mobile subscriber identity (IMSI) number and related key assigned to the user of the wireless device 407. The IMSI and key may be used to identify and authenticate the subscriber when accessing a network provided by a network service provider or operator associated with the one or more SIMs 474.
The one or more modems 476 may modulate one or more signals to encode information for transmission using the one or more wireless transceivers 478. The one or more modems 476 may also demodulate signals received by the one or more wireless transceivers 478 in order to decode the transmitted information. In some examples, the one or more modems 476 may include a Wi-Fi modem, a 4G (or LTE) modem, a 5G (or NR) modem, and/or other types of modems. The one or more modems 476 and the one or more wireless transceivers 478 may be used for communicating data for the one or more SIMs 474. [0107] The computing system 470 may also include (and/or be in communication with) one or more non-transitory machine-readable storage media or storage devices (e.g., one or more memory devices 486), which may include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, a solid-state storage device such as a RAM and/or a ROM, which may be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data storage, including without limitation, various file systems, database structures, and/or the like. [0108] In various aspects, functions may be stored as one or more computer-program products (e.g., instructions or code) in memory device(s) 486 and executed by the one or more processor(s) 484 and/or the one or more DSPs 482. The computing system 470 may also include software elements (e.g., located within the one or more memory devices 486), including, for example, an operating system, device drivers, executable libraries, and/or other code, such as one or more application programs, which may comprise computer programs implementing the functions provided by various aspects, and/or may be designed to implement methods and/or configure systems, as described herein.
[0109] As noted previously, systems and techniques are provided that can be used to provide dynamic complexity scaling for a neural network-based decoder and/or signal synthesis engine configured to process one or more frames of data (e.g., audio data, voice data, speech data, video data, etc.). In some aspects, the frames of data processed by the decoder can be associated with a real-time processing requirement (e.g., a real-time processing time limit) for the decoder to finish decoding and/or processing of a respective frame. [0110] For example, the one or more frames of data can be speech frames, where the real-time processing requirement is equal to the length of each speech frame (e.g., 20ms). In some cases, the systems and techniques can be used to provide dynamic complexity scaling for a neural network-based decoder, vocoder, and/or signal synthesizer (e.g., speech synthesizer) associated with processing one or more frames of audio data (e.g., speech frames, etc.). As used herein, a “neural decoder” and/or “neural network-based decoder” may include one or more of a neural network-based audio decoder, a neural network-based voice decoder, a neural network-based speech decoder, a neural network-based vocoder, a neural network-based signal synthesizer, a neural network-based audio synthesizer, a neural network-based voice or speech synthesizer, etc. [0111] In some examples, the neural decoder can be included in a signal synthesis system, such as the signal synthesis system 120 of FIG. 1. In some aspects, the neural decoder can be included in a neural vocoder system and/or a neural homomorphic vocoder (NHV) system. A vocoder can refer to a voice coder that may be configured to perform voice encoding, voice decoding, or both. Vocoders can be used for various voice and speech processing tasks, based on using the vocoder to analyze and/or synthesize a human voice signal.
For example, vocoders may be used in the context of telecommunications to encode speech for transmission over bandwidth-limited and/or bitrate-limited channels. The use of vocoders can extend beyond telecommunications and may include various other speech synthesis, voice transformation, and audio processing tasks, among others. A neural vocoder system can implement a voice encoder (e.g., vocoder) using one or more neural network models. Some neural vocoders may be designed to include speech synthesis models to reduce computational complexity and further improve synthesis quality. Many neural vocoder techniques may be based on using a source-filter model to improve source signal modeling. For example, source-filter model neural vocoders use neural networks to generate source signals (e.g., linear prediction residual signals), and offload spectral shaping processing tasks to time-varying filters. [0112] In another technique, neural source-filter (NSF)-based approaches replace linear filters in the classical vocoder model with one or more convolutional neural network-based filters. NSF-based neural vocoders can be used to synthesize waveforms based on filtering a sine-based excitation signal and/or performing neural audio synthesis with one or more sinusoidal models. A neural homomorphic vocoder (NHV) is a type of neural vocoder that can synthesize speech with source-filter models controlled by one or more neural networks. For example, an NHV-based neural vocoder may include one or more neural networks in a source-filter model that can synthesize speech based on filtering impulse trains and noise with linear time-varying (LTV) filters, with the one or more neural networks used to control the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features.
[0113] Traditional or non-neural vocoders may operate based on decomposing speech into various parameters such as pitch, timbre, rhythm, etc., which can subsequently be manipulated and resynthesized to generate a desired output audio or voice signal. Neural vocoders (e.g., such as NHVs, etc.) can apply transformations and manipulations to speech signals directly within a learned feature space of the neural network. The learned feature space used by neural vocoders may capture more complex relationships and characteristics of speech than non-neural network-based signal processing and/or vocoder techniques. For example, neural vocoders can be trained to learn a mapping between raw speech waveforms and the spectral or cepstral representations of the speech. An NHV system can apply one or more homomorphic processing techniques within the learned space corresponding to the mapping. As noted above, neural vocoders (e.g., NHVs, etc.) and/or other signal synthesis systems may operate based on a stream of input features that are generated by an encoder and processed by a decoder. The stream of input features may be transmitted from an encoder of a first system to a decoder of a second system (e.g., the encoder and decoder may be separate). For example, a decoder may receive a stream of input features that are transmitted over a wired or wireless network, over a channel, etc. [0114] FIG. 5 is a block diagram illustrating an example of a dynamic decoder complexity switching system 500 that may be used with a decoder of a signal synthesis system, in accordance with some examples. For example, the dynamic decoder complexity switching system 500 can be implemented by and/or included within a neural decoder of a signal synthesis system, NHV system, etc. In some aspects, the dynamic decoder complexity switching system 500 can be implemented by and/or included within a UE or other computing device that implements the neural decoder.
[0115] In one illustrative example, the dynamic decoder complexity switching system can be used to provide dynamic complexity switching between a plurality of different decoder model instances or configurations that correspond to different levels of complexity (e.g., processing complexity, synthesis complexity, etc.). For example, a decoder complexity switching engine 550 can be used to switch between respective decoder models of a plurality of decoder models 510-1, 510-2, …, 510-N. In some aspects, a switch 545 can be provided as a hardware switch, a logical switch, a virtual switch, a software switch, etc., that is configurable to switch an input bitstream 502 to a particular (e.g., selected or configured) decoder model of the plurality of decoder models 510-1, 510-2, …, 510-N. [0116] For example, the switch 545 can be configured to switch the input bitstream 502 to couple to the respective input of one of the plurality of decoder models 510-1, 510-2, …, 510-N having a complexity level corresponding to a dynamic complexity determination generated by the decoder complexity switching engine 550. [0117] In some cases, the bitstream 502 can be received from an encoder associated with the neural decoder and/or associated with the plurality of decoder models 510-1, 510-2, …, 510-N. For example, the bitstream 502 can include and/or be indicative of a plurality of features or an encoded representation of audio samples of one or more frames of audio data. For example, the bitstream 502 can include features or an encoded representation for one or more 20ms speech frames, where the features or other encoded representation are generated by an encoder and transmitted (e.g., over a wired or wireless channel) to the neural decoder (e.g., to the dynamic decoder complexity switching system 500).
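As an illustration only, the routing role of the switch 545 can be sketched in Python. The class name `DecoderSwitch`, the callable-based decoder interface, and the default selection of the first model are assumptions made for this sketch; none of them appears in the description above.

```python
from typing import Callable, List, Sequence

# Hypothetical decoder interface: one encoded frame in, synthesized samples out.
Decoder = Callable[[bytes], List[float]]


class DecoderSwitch:
    """Software analogue of a switch that couples the bitstream to one decoder model."""

    def __init__(self, decoders: Sequence[Decoder]) -> None:
        if not decoders:
            raise ValueError("at least one decoder model is required")
        self._decoders = list(decoders)
        self._selected = 0  # assumed default: first model in the list

    @property
    def selected(self) -> int:
        return self._selected

    def select(self, index: int) -> None:
        """Couple the input to the decoder model at `index`."""
        if not 0 <= index < len(self._decoders):
            raise IndexError(f"no decoder model at index {index}")
        self._selected = index

    def decode(self, frame: bytes) -> List[float]:
        """Route one frame to the currently selected decoder model only."""
        return self._decoders[self._selected](frame)
```

In this sketch, a complexity determination would be applied by calling `select()` before the next frame is passed to `decode()`, so that only the selected model runs for that frame.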
[0118] In some cases, the bitstream 502 can be the same as or similar to the audio feature vector 208 generated by the feature generator 200 based on the audio signal of FIG. 2A. In some examples, the bitstream 502 can be the same as or similar to the output of the voice encoder 252 of FIG. 2B generated based on encoding the speech signal 251. In some aspects, the bitstream 502 can be the same as or similar to the input features 301 provided to the NN filter estimator 302 of the NHV system 300 of FIG. 3 (e.g., the bitstream 502 can be the same as or similar to a bitstream of features generated by the feature extraction engine 312 included in the encoder of FIG. 3, etc.). [0119] In one illustrative example, the systems and techniques can be used to provide dynamic decoder complexity scaling to finish processing of a respective frame of audio data within a configured processing time limit (e.g., the real-time processing requirement for the frame of audio data, the duration of each respective frame of audio data, etc.). For example, the decoder complexity switching engine 550 can be configured to analyze device load information (e.g., obtained from a device resource monitor 520) and/or decoder processing time measurements or profile information (e.g., obtained from a decoder profiling engine 530). [0120] Based on the device load information and/or the decoder processing time measurements provided as input to the decoder complexity switching engine 550, the decoder complexity switching engine 550 can determine a decoder complexity configuration for processing one or more frames (e.g., audio frames) of the input bitstream 502. In one illustrative example, the decoder complexity configuration determined by the decoder complexity switching engine 550 can be used to control the switch 545 to couple the input bitstream 502 to a particular decoder model of the plurality of decoder models 510-1, 510-2, …, 510-N.
The particular decoder model activated by the switch 545 to receive the input bitstream 502 can be the particular decoder model that is associated with a respective complexity the same as (or most similar to) the decoder complexity configuration determined by the decoder complexity switching engine 550. [0121] In some aspects, the decoder complexity switching engine 550 can be used to determine a decoder complexity configuration that corresponds to real-time completion of the current frame decoding (e.g., completion within the real-time processing requirement or limit for the decoder-side processing of the current frame audio data within input bitstream 502). [0122] In some cases, the decoder complexity switching engine 550 determines a decoder complexity configuration and/or selects a particular decoder model instance of the plurality of decoder model instances 510-1, 510-2, …, 510-N that guarantees real-time completion of current frame decoding for the input bitstream 502. [0123] In some examples, the decoder complexity switching engine 550 determines a decoder complexity configuration and/or selects a particular decoder model instance of the plurality of decoder model instances 510-1, 510-2, …, 510-N that has a maximum probability or likelihood of real-time completion of current frame decoding for the input bitstream 502 within the real-time processing requirement (e.g., within the 20ms processing time requirement for 20ms frames of audio data represented within the input bitstream 502).
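One possible way to estimate the probability of real-time completion is sketched below. The description does not state how this probability is computed; the empirical estimate used here (the fraction of recently timed frames that finished within the limit), the function names, and the tie-break toward the higher model index are all assumptions of this sketch.

```python
from typing import Dict, Sequence


def completion_probability(times_ms: Sequence[float], limit_ms: float) -> float:
    """Fraction of past frames that finished within the real-time limit."""
    if not times_ms:
        return 0.0  # assumption: no history means no evidence of on-time completion
    return sum(1 for t in times_ms if t <= limit_ms) / len(times_ms)


def most_likely_on_time(history: Dict[int, Sequence[float]],
                        limit_ms: float = 20.0) -> int:
    """Return the model index with the highest estimated on-time probability.

    `history` maps a hypothetical model index to its recent decode times in
    milliseconds. Ties are broken in favour of the higher index, which is
    assumed here to denote higher complexity.
    """
    if not history:
        raise ValueError("no timing history available")
    return max(
        history,
        key=lambda i: (completion_probability(history[i], limit_ms), i),
    )
```

With this tie-break, two models that have both always met the 20ms limit resolve to the more complex one, which matches the goal of keeping output quality as high as the real-time limit allows.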
[0124] In some cases, the decoder complexity switching engine 550 can determine a decoder complexity configuration and/or can select a particular decoder model instance of the plurality of decoder model instances 510-1, 510-2, …, 510-N that increases the probability or likelihood of real-time completion of current frame decoding for the input bitstream 502 within the real-time processing requirement (e.g., within the 20ms processing time requirement for 20ms frames of audio data represented within the input bitstream 502). For example, the decoder complexity switching engine 550 can determine a particular decoder model instance and corresponding decoder complexity level (e.g., decoder complexity configuration) that increases the probability of real-time completion of the current frame decoding over that of a currently or previously configured decoder model instance 510-1, 510-2, …, 510-N. In some cases, the increased probability of real-time completion of current frame decoding may be relative to the probability of real-time completion associated with using a default decoder model instance included in the plurality of decoder model instances 510-1, 510-2, …, 510-N. [0125] In one illustrative example, the decoder complexity switching engine 550 can determine the decoder complexity configuration and/or can select from among the plurality of decoder model instances 510-1, 510-2, …, 510-N based on analyzing one or both of device load information obtained from the device resource monitor 520 and decoder processing time measurements obtained from the decoder profiling engine 530.
[0126] For example, the device resource monitor 520 can determine (e.g., monitor and output to the decoder complexity switching engine 550) device load information corresponding to a UE or other computing device used to implement the neural decoder (e.g., the selected one of the neural decoder instances 510-1, 510-2, …, 510-N) and the dynamic decoder complexity switching system 500 of FIG. 5. [0127] In one illustrative example, the device load information can be obtained and/or determined by the device resource monitor 520, which may be configured to monitor the hardware resource utilization of a UE or other computing device used to implement the neural network-based decoder. For example, the device load information can include a current central processing unit (CPU) usage or utilization percentage and/or can include a current usage or utilization percentage for various other processors included in the device. The processor utilization information may be instantaneous utilization information corresponding to the current state of the CPU or other processor. In some cases, the processor utilization information can be an average utilization percentage over a configured time window. For example, the device load information can include or otherwise be indicative of an average CPU usage over the previous 5ms, 10ms, 15ms, etc. [0128] In some examples, the device load information determined, monitored, and/or obtained by the device resource monitor 520 can include a memory (e.g., random-access memory (RAM), processor cache, video RAM (VRAM), etc.) usage of the UE or other computing device used to implement the neural network-based decoder. The memory usage information can be indicative of an instantaneous memory percentage in use, an average memory percentage used over a configured time window (e.g., average memory usage over the previous 5ms, 10ms, 15ms, etc.), and/or a combination thereof.
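The distinction between instantaneous utilization and utilization averaged over a configured time window can be illustrated with a small sketch. The class `LoadWindow` and its methods are hypothetical names; the sketch assumes samples arrive with non-decreasing timestamps and that utilization is expressed as a percentage.

```python
from collections import deque
from typing import Deque, Tuple


class LoadWindow:
    """Keeps recent utilization samples and averages them over a time window."""

    def __init__(self, window_ms: float) -> None:
        if window_ms <= 0:
            raise ValueError("window_ms must be positive")
        self._window_ms = window_ms
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, timestamp_ms: float, utilization_pct: float) -> None:
        """Record a sample; timestamps are assumed to be non-decreasing."""
        self._samples.append((timestamp_ms, utilization_pct))
        # Drop samples older than the window, measured from the newest sample.
        while self._samples[0][0] < timestamp_ms - self._window_ms:
            self._samples.popleft()

    def instantaneous(self) -> float:
        """Most recent utilization sample."""
        if not self._samples:
            raise ValueError("no samples recorded")
        return self._samples[-1][1]

    def average(self) -> float:
        """Mean utilization over the samples still inside the window."""
        if not self._samples:
            raise ValueError("no samples recorded")
        return sum(u for _, u in self._samples) / len(self._samples)
```

The same structure could hold CPU, DSP, or memory utilization; a 5ms, 10ms, or 15ms window would be chosen through `window_ms`.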
[0129] In some cases, the device resource monitor 520 can obtain CPU, processor, and/or memory usage information from a resource statistics engine associated with the UE or other computing device and/or an operating system or firmware thereof. In some aspects, the CPU, processor, and/or memory usage information can include or comprise CPU, processor, and/or memory (respectively) resource statistics information representing an instantaneous utilization, an aggregate utilization, an average utilization over one or more configured time windows, etc. [0130] In some cases, the device load information obtained by the device resource monitor 520 can be monitoring information corresponding to one or more hardware components of a UE or other computing device used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5. In some aspects, the device resource monitor 520 can be configured to obtain CPU resource statistics, metrics, usage information, history information, performance information, etc., based on transmitting one or more respective queries to corresponding CPU resources and/or corresponding communication interfaces thereof (e.g., included in and/or implemented by the UE or other computing device associated with the neural decoder and the dynamic decoder complexity switching system 500). [0131] In some aspects, the device resource monitor 520 can determine resource load information corresponding to one or more DSP engines or DSP cores of a UE or other computing device used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5. In some examples, the device load information can include the DSP load and/or DSP resource monitoring information obtained responsive to one or more DSP resource queries. For example, the device resource monitor 520 can transmit one or more DSP core queries to a DSP core and/or a DSP core communication interface.
In some aspects, the DSP resource queries can be transmitted from the device resource monitor 520 to a DSP OS or other DSP client associated with the DSP or DSP core(s). In some examples, the DSP core queries can correspond to DSP core load information. For instance, the device resource monitor 520 can query a DSP core frequency and/or a DSP load from the DSP OS. In some aspects, the DSP load information may correspond to MCPS (millions of clocks per second) information, MPPS (millions of packets per second) information, and/or CPP information obtained from the DSP OS and/or client. [0132] In another example, the device resource monitor 520 can query DDR memory bandwidth and/or real-time DDR memory clock information from a memory subsystem of the UE or other computing device used to implement the neural decoder and/or dynamic decoder complexity switching system 500. The memory subsystem may be associated with and/or accessed through a corresponding client and/or through a dedicated communication interface or bus corresponding to the memory subsystem. In some cases, the memory subsystem can provide, responsive to the memory load query, information indicative of DDR memory utilization, DDR memory bandwidth, etc. [0133] In another example, the device resource monitor 520 may determine the device load information based on querying bus active clock frequency information from a DSP OS (e.g., a client interface to the DSP OS, etc.). The bus active clock frequency information can be associated with a bus clock of the UE or other computing device.
In another illustrative example, the device resource monitor 520 can be configured to aggregate some (or all) of the resource information obtained from a CPU, NPU, DSP core, DDR bandwidth or memory subsystem, bus clock, etc., and may transmit the aggregated information to the decoder complexity switching engine 550 as the instantaneous and/or time window-averaged device load information. [0134] In some examples, the device load information can be obtained from a device resource monitor or device resource manager that is included in and/or implemented by the UE or other computing device that implements the neural network-based decoder. In some cases, the device load information can be obtained from a performance monitoring or profiling library included in and/or implemented by an operating system of the device that implements the neural decoder. In some aspects, the device resource monitor 520 can obtain and/or determine device load information based on inspecting a /proc/stat special file maintained by the operating system or kernel of the UE or computing device used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5. In some cases, the device resource monitor 520 can be configured to utilize a perf_events system library to obtain and/or determine the device load information. [0135] The decoder profiling engine 530 can be used to obtain (e.g., monitor and output to the decoder complexity switching engine 550) decoding time information corresponding to the total elapsed decoding time for completion of decoding processing for one or more previously decoded frames processed by the neural decoder (e.g., previously decoded frames processed by respective ones of the plurality of decoder model instances 510-1, 510-2, …, 510-N). 
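For the /proc/stat approach mentioned above, a minimal Python sketch is given below. It assumes the Linux layout of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, followed by guest fields) and treats idle plus iowait as idle time; the function names are hypothetical, and the approach is one common way to derive utilization rather than the method of the described system.

```python
from typing import Tuple


def parse_cpu_line(stat_text: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields used: user, nice, system, idle, iowait, irq, softirq, steal.
            # Guest fields are skipped because guest time is already counted in user time.
            ticks = [int(v) for v in fields[1:9]]
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
            return idle, sum(ticks)
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before: str, after: str) -> float:
    """Busy fraction (0.0 to 1.0) between two snapshots of /proc/stat contents."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0  # no ticks elapsed between the snapshots
    return 1.0 - (idle1 - idle0) / delta_total
```

On a Linux device, the two snapshots would be taken by reading `/proc/stat` twice with a short interval in between; the functions take the file contents as strings so the arithmetic can be checked without access to the file.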
[0136] The decoder profiling engine 530 may obtain decoding time measurement information for the previously decoded frames of the input bitstream 502, and/or can be used to obtain decoder implementation profiling information corresponding to one or more of the plurality of decoder model instances 510-1, 510-2, …, 510-N. For example, the decoder implementation may include (e.g., and the decoder profiling engine 530 may utilize) one or more system calls configured to obtain a decoder start time (e.g., decoder start wall time) for one or more previously decoded frames of the bitstream 502. In some cases, one or more system calls can be configured to obtain a decoder end time (e.g., decoder end wall time) for the one or more previously decoded frames of the bitstream 502. The obtained decoder end time(s) may correspond to respective decoder start times. [0137] Based on the respective decoder start time and decoder end time for a respective one of the previously decoded frames of the bitstream 502, the decoder profiling engine 530 can determine a total elapsed processing time (e.g., a total elapsed decoding time) of the neural decoder in performing decoding processing for the respective one of the previously decoded frames. In some aspects, the decoder profiling information obtained by the decoder profiling engine 530 and provided to the decoder complexity switching engine 550 can be indicative of a total decoder processing time elapsed (e.g., wall time elapsed to decode previous frame(s)) to complete respective decoding of each previously decoded frame of a plurality of previously decoded frames.
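The start-time and end-time measurement described above can be sketched as a thin wrapper around a decode call. The function name `timed_decode` and the injectable `clock` parameter are assumptions of this sketch; Python's `time.perf_counter` is used as a stand-in for the system calls mentioned in the text.

```python
import time
from typing import Callable, List, Tuple


def timed_decode(
    decoder: Callable[[bytes], List[float]],
    frame: bytes,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[float], float]:
    """Decode one frame and return (samples, elapsed_ms).

    `clock` must return seconds; it is injectable so the measurement
    can be exercised with a deterministic clock.
    """
    start = clock()           # decoder start time
    samples = decoder(frame)  # decoding work being measured
    end = clock()             # decoder end time
    return samples, (end - start) * 1000.0
```

The elapsed value returned for each frame is the kind of per-frame measurement that a profiling component could pass on for comparison against the 20ms limit.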
[0138] As noted above, the decoder complexity switching engine 550 can be used to provide dynamic decoder complexity scaling based on configuring the switch 545 to switch between coupling the input bitstream 502 (e.g., the input to the neural decoder for decoding or signal synthesis) to a particular (e.g., selected) one of the plurality of different decoder instances with different complexities (e.g., decoder 510-1 with a first synthesis complexity, decoder 510-2 with a second synthesis complexity, …, decoder 510-N with an Nth synthesis complexity, etc.). [0139] The complexity scaling performed and/or configured or controlled by the decoder complexity switching engine 550 can be based on device load information obtained from the device resource monitor 520, can be based on decoder measurement or profiling information from the decoder profiling engine 530 (e.g., including a decoder start time, decoder end time, decoder total elapsed processing time, etc., for one or more previously decoded frames, etc.), and/or can be based on both device load information from the device resource monitor and decoder measurement information from the decoder profiling engine 530. [0140] In some aspects, device load information and/or decoding time measurement information can be used (e.g., by the decoder complexity switching engine 550) to dynamically reduce the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively high current device load state. For example, neural decoders (e.g., such as the neural decoder model instances 510-1, 510-2, …, 510-N) may be more computationally intensive than non-neural decoders to implement, including on resource-constrained devices such as a UE or other computing device that may be used to implement the neural decoder and the dynamic decoder complexity switching system 500 of FIG. 5.
[0141] A relatively high device load may be associated with the device performing multiple tasks simultaneously or in parallel. For example, in addition to receiving the bitstream 502 to be processed and decoded by one of the neural decoder instances, the device may be utilizing a portion of its computational resources and/or computational capacity for additional tasks such as a user browsing or consuming other media content in parallel to a voice call, audio call, video call, etc., that is associated with one or more speech frames within the bitstream 502 to be decoded by the neural decoder. In another example, the device may be used to perform a video call, and the bitstream 502 includes speech frame features (or other encoded representations of speech frames or speech signals) corresponding to the ongoing video call. The relatively high device load may be associated with the device simultaneously performing (e.g., performing in parallel) video decoding for the frames of video data of the video call, and performing audio decoding (e.g., speech or voice decoding) for the frames of audio data (e.g., frames of speech or voice data) that are also included in the video call and/or are associated with the frames of video data. [0142] In some aspects, when the device load is relatively low and/or is below one or more thresholds, a respective neural decoder model may be associated with a frame processing completion time that is within the configured real-time requirement or real- time limit associated with the frame type. 
For example, where the device load is relatively low, each of the neural decoder model instances 510-1, 510-2, …, 510-N may be associated with a speech frame processing completion time that is within the real-time requirement for decoding speech frames (e.g., each model may be able to decode a received speech frame in an amount of time less than or equal to the length of each speech frame and/or the periodicity between the decoder receiving consecutive speech frames in the bitstream 502). For instance, when the device load is relatively low, each of the neural decoder model instances 510-1, 510-2, …, 510-N may be able to complete speech frame processing for a respective speech frame within a 20ms real-time processing requirement that is equal to the length of each speech frame and/or is equal to the periodicity between consecutive speech frames included in the input bitstream 502 to the neural decoder. [0143] The neural decoder model instances 510-1, 510-2, …, 510-N may be associated with different complexities, and may each correspond to a different amount of device load or processing power for completing processing of a respective speech frame. When the device load is relatively high, one or more of the neural decoder model instances 510-1, 510-2, …, 510-N may be unable to complete processing (e.g., decoding) of a respective frame of the input bitstream 502 within the corresponding real-time processing requirement or limit for that frame. For example, a relatively high device load may correspond to fewer processing resources being made available to implement and run the neural decoder model instance, which can cause an increased (e.g., longer) frame processing time.
[0144] In one illustrative example, the decoder complexity switching engine 550 can be used to dynamically increase the neural network-based decoder or synthesizer engine complexity based on a determination of a relatively low current device load state, based on the decoder complexity switching engine 550 analyzing the device load information from the device resource monitor 520. In some aspects, the decoder complexity switching engine 550 can be used to dynamically decrease the neural network-based decoder complexity based on a determination of a relatively high current device load state. Switching to a lower complexity neural decoder model instance can be performed to ensure real-time processing of the incoming frames of audio data (e.g., within bitstream 502) is still achieved. For example, a lower complexity neural decoder model instance may be used to complete audio frame decoding within the 20ms real-time requirement when the device load is relatively high and the neural decoder is able to use relatively few processing resources of the device. When the device load is relatively low, a higher complexity neural decoder model instance may be used to complete audio frame decoding within the 20ms real-time requirement with greater or improved accuracy or quality, based on the higher complexity neural decoder model instance being able to utilize relatively larger amounts of processing resources of the device when the overall device load is relatively low. [0145] In some aspects, the device load information can be used to dynamically reduce or increase the decoder complexity so that the decoder remains within the configured time limit for processing (e.g., decoding, synthesizing, etc.) each respective frame.
For example, the device load information can be used by the decoder complexity switching engine 550 to configure or control the switch 545 to implement a particular selection of a decoder model instance and complexity out of the plurality of different decoder model instances and corresponding complexities available at the device (e.g., out of the plurality of decoder model instances 510-1, 510-2, …, 510-N, etc.), to thereby dynamically reduce or increase the decoder complexity so that the decoder remains within a 20ms real-time processing requirement associated with a stream of 20ms speech frames provided to the decoder in the input bitstream 502. [0146] In some cases, the decoder complexity switching engine 550 can select a lower complexity neural decoder model instance 510-1, 510-2, …, 510-N based on a determination that the current device load is relatively high and relatively fewer processing resources will be available to process the next frame of audio data using one of the neural decoder model instances 510-1, 510-2, …, 510-N. In some examples, the decoder complexity switching engine 550 can select a lower complexity neural decoder model instance 510-1, 510-2, …, 510-N based on a determination that the previously decoded frame (and/or multiple previously decoded frames) exceeded the real-time processing requirement or limit (e.g., one or more previous frames took longer than the 20ms speech frame real-time processing limit to complete processing and decoding). [0147] In some cases, switching to a lower complexity neural decoder model instance may be associated with an output quality degradation of the reconstructed or synthesized signal generated by the neural decoder. 
A graceful quality drop may be preferable to the severe degradation and/or output signal discontinuity that may otherwise be caused by not switching to the lower complexity neural decoder model instance and violating (e.g., exceeding) the real-time decoding time limit by continuing to use the relatively higher complexity neural decoder model instance.
[0148] In some examples, the decoder complexity switching engine 550 can select a higher complexity neural decoder model instance from among the neural decoder model instances 510-1, 510-2, …, 510-N based on a determination that the current device load is relatively low and relatively greater processing resources will be available to process the next frame of audio data. In some examples, the decoder complexity switching engine 550 can select a higher complexity neural decoder model instance from among the neural decoder model instances 510-1, 510-2, …, 510-N based on a determination that the previously decoded frame (and/or multiple previously decoded frames) completed decoding processing before the real-time processing requirement or limit (e.g., one or more previous frames took less than the 20ms speech frame real-time processing limit to complete processing and decoding). In some cases, the decoder complexity switching engine 550 can increase the neural decoder model complexity based on a determination that one or more previous frames completed decoding processing early by an amount greater than or equal to a configured threshold. For example, the decoder complexity switching engine 550 can be configured to increase the neural decoder complexity based on one or more previous frames completing processing more than 10% ahead of the real-time processing requirement limit time (e.g., more than 2ms ahead of the 20ms real-time processing requirement limit time for 20ms speech or audio frames, etc.).
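As a non-limiting illustration of the selection behavior described above, the following Python sketch steps the decoder complexity down when the previous frame missed the 20ms limit or the device load is high, and steps it up when the previous frame finished more than 10% early and the load is low. The function name `select_complexity`, the load thresholds `HIGH_LOAD` and `LOW_LOAD`, and the single-step policy are assumptions made for illustration only; the description does not prescribe them.

```python
FRAME_LIMIT_MS = 20.0  # real-time limit for 20 ms frames (from the description)
EARLY_MARGIN = 0.10    # "more than 10% ahead" example threshold (from the description)
HIGH_LOAD = 0.80       # hypothetical load threshold, not from the description
LOW_LOAD = 0.40        # hypothetical load threshold, not from the description


def select_complexity(current: int, num_levels: int,
                      device_load: float, prev_frame_ms: float) -> int:
    """Return the model-instance index (0 = lowest complexity) for the next frame."""
    missed_deadline = prev_frame_ms > FRAME_LIMIT_MS
    finished_early = prev_frame_ms < FRAME_LIMIT_MS * (1.0 - EARLY_MARGIN)
    if missed_deadline or device_load >= HIGH_LOAD:
        # Scale down by one level, never below the lowest complexity.
        return max(current - 1, 0)
    if finished_early and device_load <= LOW_LOAD:
        # Scale up by one level, never above the highest complexity.
        return min(current + 1, num_levels - 1)
    # Otherwise keep the same complexity for the next frame.
    return current
```

For instance, with four complexity levels and level 2 currently active, a previous frame time of 17 ms at 20% load would select level 3 under these assumed thresholds, while a previous frame time of 21 ms would select level 1.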
[0149] In some aspects, the decoder complexity switching engine 550 can monitor and/or analyze in real-time the device load information obtained by device resource monitor 520 and/or the previous frame processing time information obtained from the decoder profiling engine 530. Based on the analysis, the decoder complexity switching engine 550 can be configured to generate an output indicative of a complexity scaling adjustment corresponding to increased decoder complexity (e.g., a selection of a decoder model instance 510-1, 510-2, …, 510-N with a greater complexity than the decoder used to process the previous frame or one or more previous frames), a scaling adjustment corresponding to decreased decoder complexity (e.g., a selection of a decoder model instance 510-1, 510-2, …, 510-N with a lesser complexity than the decoder used to process the previous frame or previous one or more frames), and/or a scaling adjustment corresponding to the same decoder complexity. In some aspects, maintaining the same decoder complexity for the current frame as the previous frame (or maintaining the same decoder complexity for the next frame as the current frame, etc.) can correspond to utilizing the same neural decoder model instance 510-1, 510-2, …, 510-N. In some cases, the plurality of neural decoder model instances 510-1, 510-2, …, 510-N may include multiple neural decoder model instances that are different from one another but have the same complexity, and maintaining the same decoder complexity can be associated with using the switch 545 to switch between different neural decoder model instances within the same level of complexity.
[0150] In some examples, a complexity scaling indication can be generated and/or determined by the decoder complexity switching engine 550, and may be used to configure the switch 545 coupled between the input bitstream 502 received at the neural decoder system and the plurality of different neural decoder model instances 510-1, 510-2, …, 510-N, where each decoder model instance of the plurality of decoder model instances is associated with a respective synthesis complexity. In some examples, each decoder model instance of the plurality of decoder model instances 510-1, 510-2, …, 510-N is associated with a different respective synthesis complexity. For example, a first subset of the plurality of decoder model instances may be associated with a relatively high synthesis complexity, a second subset of the plurality of decoder model instances may be associated with a relatively low synthesis complexity, etc.
[0151] FIG. 6A is a diagram illustrating a plurality of example neural decoder architectures 600 each associated with a respective complexity, in accordance with some examples. FIG. 6B is a diagram illustrating additional example neural decoder architectures 650 each associated with a respective complexity, in accordance with some examples.
[0152] In one illustrative example, the example neural decoder architectures 600 of FIG. 6A and 650 of FIG. 6B may be used to implement one or more neural decoder model instances of the plurality of neural decoder model instances 510-1, 510-2, …, 510-N of FIG. 5. For example, one or more of the neural decoder model instances 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, and/or 610-8 of FIGS. 6A and 6B may be used to implement one or more neural decoder model instances of the plurality of neural decoder model instances 510-1, 510-2, …, 510-N of FIG. 5.
[0153] As noted previously, the systems and techniques can implement dynamic decoder complexity scaling to increase or decrease a neural decoder complexity while remaining within a real-time processing limit associated with processing (e.g., decoding) one or more input frames received at the decoder. The neural decoder complexity scaling can be implemented based on a selection of a particular neural decoder model instance with a particular complexity, where the selection is made from a plurality of neural decoder model instances associated with different respective complexities.
[0154] In some cases, the different decoder model instances can implement different complexities (e.g., synthesis complexity, decoding complexity, processing or processing time complexity, etc.) based on the respective decoder model instances utilizing different neural network and/or machine learning architectures. For example, a first neural network architecture may be used to implement a first level of complexity, a second neural network architecture may be used to implement a second level of complexity, etc. In one illustrative example, the neural decoder model instance 610-1 and the neural decoder model instance 610-2 of FIG. 6A may be associated with different respective complexities, based on the first neural decoder 610-1 having a different neural network architecture than that of the second neural decoder 610-2. In some cases, the first neural decoder architecture 610-1 may be associated with a higher complexity than the second neural decoder architecture 610-2. In some examples, the first neural decoder architecture 610-1 may be associated with a lesser complexity than the second neural decoder architecture 610-2. In some aspects, the first neural decoder architecture 610-1 and the second neural decoder architecture 610-2 may be different from one another, but may be associated with the same complexity.
[0155] In another example, the different decoder model instances can implement different complexity based on the respective decoder model instances utilizing different neural network or machine learning weights or parameters. For example, a first set of neural network weights can be used to parameterize a neural network architecture to a first level of complexity, a second set of neural network weights can be used to parameterize the same or a different neural network architecture to a second level of complexity, etc. In one illustrative example, the third neural decoder model instance 610-3 of FIG. 6A utilizes a set of weights w_i and the fourth neural decoder model instance 610-4 of FIG. 6A utilizes a set of weights w_k that is different from the set of weights w_i. The third neural decoder 610-3 and fourth neural decoder 610-4 may utilize the same neural network architecture, parameterized by the respective different sets of weight values w_i and w_k. In some cases, the neural decoder architecture 610-3 may be associated with a higher complexity than the neural decoder architecture 610-4. In some examples, the neural decoder architecture 610-3 may be associated with a lesser complexity than the neural decoder architecture 610-4.
[0156] In some aspects, the different complexity decoder model instances may share one or more neural network layers and can scale in complexity based on combining the shared neural network layers with one or more complexity level-specific subsets of corresponding neural network layers. For example, a first complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the first complexity level. A second complexity level can be implemented based on combining the shared neural network layers with a subset of neural network layers corresponding to the second complexity level, etc.
[0157] In one illustrative example, the neural decoder model instance 610-5 of FIG. 6A represents a first configuration using a combination of a set of shared neural network layers and a subset of complexity-specific neural network layers “NN Layers A.” Additional subsets of complexity-specific neural network layers “NN Layers B”, …, “NN Layers N” may be included in the neural decoder and associated with the shared NN layers, but are not activated or used for the Complexity A configuration.
[0158] The neural decoder model instance 610-6 of FIG. 6A can be the same as or similar to the neural decoder model instance 610-5, with a second configuration of neural network layers implemented to combine the same set of shared neural network layers with the second subset of complexity-specific neural network layers (“NN Layers B”). In the example configuration corresponding to neural decoder model instance 610-6, the first NN Layers A complexity-specific subset, …, and the Nth NN Layers N complexity-specific subset are deactivated or not used, while the second NN Layers B complexity-specific subset is selected and configured as the active subset of complexity-specific neural network layers.
[0159] In some aspects, one or more lower complexity neural decoder model instances can be implemented using respective subsets of a plurality of neural network layers included in a highest complexity neural decoder model instance associated with the UE or other computing device. For example, one or more neural network layers and/or subsets of neural network layers of the highest complexity decoder model instance can be deactivated and bypassed to implement lower complexity level decoder model instances. In some examples, lower complexity decoders can be implemented by bypassing one or more neural network layers, subsets of neural network layers, components, processing steps, etc., of the highest complexity decoder model instance at the device.
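As a non-limiting illustration of combining shared layers with a selectable complexity-specific subset, the following Python sketch models each layer as a plain function and runs the shared layers followed by only the active subset. The class name `ScalableDecoder`, the helper `scale`, and the toy multiply-by-constant layers are hypothetical stand-ins for neural network layers and are not taken from the description.

```python
from typing import Callable, Dict, List, Sequence

Layer = Callable[[List[float]], List[float]]


def scale(factor: float) -> Layer:
    """Toy stand-in for a neural network layer: multiplies every feature by factor."""
    return lambda x: [factor * v for v in x]


class ScalableDecoder:
    """Shared layers plus one active complexity-specific subset per decode call."""

    def __init__(self, shared: Sequence[Layer],
                 specific: Dict[str, Sequence[Layer]]) -> None:
        self.shared = list(shared)
        self.specific = {name: list(layers) for name, layers in specific.items()}

    def decode(self, features: List[float], level: str) -> List[float]:
        x = features
        for layer in self.shared:            # shared layers always run
            x = layer(x)
        for layer in self.specific[level]:   # only the selected subset runs
            x = layer(x)
        return x


decoder = ScalableDecoder(
    shared=[scale(2.0)],
    specific={"A": [scale(3.0), scale(5.0)],  # hypothetical higher complexity subset
              "B": [scale(3.0)]},             # hypothetical lower complexity subset
)
```

In this sketch the subsets that are not selected are simply never invoked, which corresponds to the deactivated or bypassed subsets described above.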
[0160] In one illustrative example, the neural decoder 610-7 of FIG. 6B may be a highest complexity decoder model instance available at a UE or other computing device implementing the systems and techniques described herein. For example, the highest complexity neural decoder 610-7 can include one or more processing blocks and/or groups or subsets of neural network layers, where respective processing blocks and/or subsets or groups of neural network layers may be deactivated and bypassed to reduce the decoder complexity and implement a reduced complexity decoder model instance. For instance, the neural decoder 610-7 includes a feature dequantization engine 642, a feature upsampling engine 644, and a neural network synthesis engine 646. In the highest complexity implementation for the neural decoder 610-7, the decoding processing pipeline can include each of the feature dequantization engine 642, the feature upsampling engine 644, and the neural network synthesis engine 646.
[0161] The neural decoder model instance 610-8 can be lower complexity than the higher complexity neural decoder 610-7, and may be implemented based on deactivating one or more of the processing blocks and/or subsets of neural network layers that are activated in the higher complexity neural decoder 610-7. For example, the complexity of the neural decoder 610-7 can be reduced based on deactivating, within the lower complexity model instance 610-8, the neural network layers corresponding to the feature upsampling engine 644.
[0162] FIG. 7 is a diagram illustrating an example of a dynamic decoder complexity scaling system 700 that can be used to implement a dynamic decoder complexity adjustment based on determining the currently elapsed execution time at one or more checkpoints during processing of a current frame (e.g., based on determining the elapsed time at one or more points in time prior to completion of the processing of the current frame).
In one illustrative example, the dynamic decoder complexity adjustment can be based on determining the currently elapsed execution time (e.g., from the beginning of processing for the current frame) at one or more checkpoints corresponding to different layers and/or subsets or groups of layers of a plurality of layers included in the neural decoder model.
[0163] For example, a neural decoder model of FIG. 7 may include a plurality of neural network layers, including a first subset of decoder NN layers 715-1, a second subset of decoder NN layers 715-2, …, and an Nth subset of decoder NN layers 715-N. In some aspects, the subsets of decoder NN layers 715-1, 715-2, …, 715-N may be included in a respective neural decoder model instance of a plurality of neural decoder model instances, such as the plurality of neural decoder model instances 510-1, 510-2, …, 510-N of FIG. 5 and/or the neural decoder model instances 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, and/or 610-8 of FIGS. 6A and 6B, etc.
[0164] The systems and techniques can be configured to monitor the execution time of processing and/or decoding of the current frame with increased granularity during the current frame execution. For example, after the current frame data is processed by the first subset 715-1 of NN layers, the elapsed execution time can be determined at block 730-1. In some cases, the elapsed execution time determined at block 730-1 may correspond to the elapsed time between beginning processing of the current frame data by the neural decoder (e.g., beginning processing with the first subset 715-1 of NN decoder layers) and completing processing by the first subset 715-1 of NN decoder layers.
[0165] At block 740-1, the systems and techniques can determine a remaining time budget based on the elapsed execution time determined at block 730-1.
For example, the remaining time budget can be based on the difference between the configured real-time requirement or real-time processing limit for decoding each frame by the neural decoder (e.g., 20ms limit for a 20ms frame duration, etc.), and the elapsed execution time at block 730-1. In some aspects, the remaining time budget can be indicative of a remaining amount of time before the current frame execution (e.g., current frame decoding processing) equals or exceeds the configured real-time processing limit duration.
[0166] In one illustrative example, the systems and techniques can compare the remaining time budget to an estimated processing time associated with processing the current frame data by each of the remaining NN decoder layer subsets 715-2, …, 715-N. In some aspects, the estimated processing time can include an estimated processing time for each respective subset of the remaining NN decoder layer subsets 715-2, …, 715-N. In some examples, the estimated processing time can include an aggregate processing time that is estimated as a single value for the one or more remaining NN decoder layer subsets 715-2, …, 715-N that have not yet begun or completed execution to process the current frame data.
[0167] In some aspects, the determination at block 740-1 can compare the remaining time budget to one or more configured thresholds. For example, block 740-1 may determine if the remaining time budget is greater than 0 (e.g., the real-time processing requirement limit has not yet been reached or exceeded). Based on the remaining time budget being less than or equal to 0, subsequent NN decoder layer subsets may be skipped, to expedite processing of the current frame data and minimize the duration by which the real-time processing limit is exceeded for the current frame.
[0168] At block 740-1, the remaining time budget can be used to determine whether the neural decoder should continue with processing of the current frame data using the next subset of NN decoder layers in the sequential order, or if the neural decoder should skip one or more subsequent subsets of NN decoder layers and/or exit early. For example, the remaining time budget can be used to determine whether the neural decoder should continue to process the current frame data using the sequential order of each subset of NN decoder layers associated with the processing pipeline that includes all of the NN decoder layer subsets 715-1, 715-2, …, 715-N, or if insufficient time remains to perform the full processing.
[0169] For example, processing can proceed from the subset 1 decoder NN layers 715-1 to the subset 2 decoder NN layers 715-2 based on a determination at block 740-1 that the estimated time remaining is less than or equal to the remaining time budget (e.g., processing can be completed by all decoder NN layer subsets within the remaining time budget and within the real-time processing limit for the current frame).
[0170] In some examples, block 740-1 may correspond to a determination that the estimated time remaining is greater than the remaining time budget (e.g., the remaining time budget is insufficient to perform full processing using each subset of NN decoder layers while remaining within and not exceeding the real-time processing requirement time limit of 20ms). In some aspects, one or more subsequent NN decoder layer subsets 715-2, …, 715-N can be skipped based on the remaining time budget being less than the estimated time to complete processing. For example, the second subset 715-2 of NN decoder layers can be skipped, and processing can proceed to the final (e.g., output layer) subset N of decoder NN layers 715-N.
In some examples, all remaining NN decoder layer subsets may be skipped based on the remaining time budget being less than the estimated time to complete processing. In some aspects, the next NN decoder layer subset can be skipped, and the elapsed execution time can be compared again to an updated remaining time budget after the next NN decoder layer subset processing checkpoint is reached.
[0171] In some cases, the elapsed execution time can be determined after each NN decoder layer subset. For example, block 730-2 can be the same as or similar to block 730-1, where block 730-2 is associated with determining the elapsed execution time for processing of the current frame data after finishing processing by the second NN decoder layer subset 715-2. In some examples, block 740-2 can be the same as or similar to block 740-1, where block 740-2 determines an updated remaining time budget corresponding to a decoder processing checkpoint that is after the first NN decoder layer subset 715-1 and the second NN decoder layer subset 715-2, and before the remaining NN decoder layer subsets 715-3, …, 715-N.
[0172] In some aspects, the elapsed execution time determinations of block 730-1 and/or 730-2 (e.g., the elapsed execution time during processing for a current frame) can be performed after completing processing for each subset of NN decoder layers 715-1, 715-2, …, etc., prior to the final (e.g., output) subset of NN decoder layers 715-N.
[0173] In some cases, the elapsed execution time determinations of block 730-1 and/or 730-2 (e.g., the elapsed execution time during processing for a current frame) can be performed after completing processing for each subset of NN decoder layers that is immediately prior to a skippable subset of NN decoder layers, where the determined elapsed execution time is used to determine whether or not the skippable subset of NN decoder layers will be skipped or will be used to process the current frame data.
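As a non-limiting illustration of the checkpoint-based flow described above, the following Python sketch measures the elapsed execution time before each skippable subset of layers, computes the remaining time budget as the real-time limit minus the elapsed time, and skips the subset when the aggregate estimate for the remaining subsets exceeds that budget. The function name `decode_with_checkpoints`, its parameters, the per-subset estimates `est_ms`, and the injectable `clock` are assumptions made for illustration; the description does not specify how the estimates are obtained.

```python
import time
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[List[float]], List[float]]


def decode_with_checkpoints(
    frame: List[float],
    stages: Sequence[Stage],        # NN layer subsets in sequential order
    est_ms: Sequence[float],        # hypothetical per-subset time estimates
    skippable: Sequence[bool],      # whether each subset may be bypassed
    limit_ms: float = 20.0,         # real-time limit for a 20 ms frame
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[float], List[int]]:
    """Run the subsets in order, skipping skippable ones when time runs short."""
    start = clock()
    x = frame
    executed: List[int] = []
    for i, stage in enumerate(stages):
        if skippable[i]:
            # Checkpoint: elapsed time since the frame began processing.
            elapsed_ms = (clock() - start) * 1000.0
            budget_ms = limit_ms - elapsed_ms
            # Aggregate estimate for this subset and all subsets after it.
            remaining_est_ms = sum(est_ms[i:])
            if remaining_est_ms > budget_ms:
                continue  # skip this subset; re-check at the next checkpoint
        x = stage(x)
        executed.append(i)
    return x, executed
```

In this sketch the first and final subsets would typically be marked as not skippable, so that an output is always produced even when every intermediate subset is bypassed.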
[0174] In one illustrative example, the elapsed execution time determinations of block 730-1 and/or 730-2 (e.g., the elapsed execution time during processing for a current frame) can be performed using the decoder profiling engine 530 of FIG. 5. For example, the respective elapsed execution time measurements corresponding to particular intermediate NN decoder layers or intermediate subsets of NN decoder layers between the first (e.g., input) subset 715-1 and the last (e.g., output) subset 715-N of decoder NN layers can be obtained by the decoder profiling engine 530 and/or can be included in decoder measurement time information or decoder profile information obtained by or associated with the decoder profiling engine 530 of FIG. 5. In some aspects, the determination of whether to skip one or more subsequent NN decoder layers or subsets of NN decoder layers can be associated with and/or performed by the decoder complexity switching engine 550 of FIG. 5.
[0175] In one illustrative example, the systems and techniques described herein may be self-contained and/or implemented by a single UE or other mobile computing device associated with an audio call and/or audio frame processing (e.g., a voice-only call, a video call including video frames and audio frames, etc.). The UE or mobile computing device may be unable to plan for or control the device load associated with the device load information and device resource monitor 520 of FIG. 5. For example, the device load, CPU and/or other processor utilization, memory utilization, etc., may be factors that a voice call application executing the neural decoder and dynamic decoder complexity scaling system 500 on the UE cannot control or schedule.
In some aspects, the dynamic decoder complexity scaling system 500 can be used to react in real-time to the observed device load information, CPU clock frequency or utilization, etc., that exist for the UE at the time a current frame begins processing in the neural decoder and/or at the time a current frame is scheduled to begin processing in the neural decoder.
[0176] In some aspects, the complexity scaling for the neural decoder and the selection of a particular neural decoder model instance (e.g., from the plurality of different neural decoder model instances 510-1, 510-2, …, 510-N associated with different respective complexities) can be performed for each frame of a plurality of frames received by the neural decoder in the bitstream 502. For example, based on device load information indicative of increased CPU load, the decoder complexity may be scaled down to a lower complexity neural decoder model instance for one or more frames until the device load decreases. Based on the device load decreasing, the neural decoder complexity can be scaled up to a higher complexity neural decoder model instance in response.
[0177] In some aspects, the systems and techniques described herein can be used to provide dynamic complexity adjustment for one or more of a decoder and/or an encoder included in an audio codec system, such as the audio codec system 800 of FIG. 8, the audio codec system 900 of FIG. 9, the audio codec system 1000 of FIG. 10, etc. In some examples, the decoder complexity switching engine 550 of FIG. 5 can be included in a decoder 810 of the audio codec system 800 of FIG. 8, a decoder 910 of the audio codec system 900 of FIG. 9, a decoder 1010 of the audio codec system 1000 of FIG. 10, etc.
In some cases, one or more of the device resource monitor 520 of FIG. 5 and/or the decoder profiling engine 530 can be included in the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG. 9, the decoder 1010 of the audio codec system 1000 of FIG. 10, etc.
[0178] In some aspects, one or more of the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG. 9, the decoder 1010 of the audio codec system 1000 of FIG. 10, etc., can be included in the plurality of decoders with different synthesis complexity 510-1, 510-2, …, 510-N of FIG. 5.
[0179] In some aspects, one or more of the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG. 9, the decoder 1010 of the audio codec system 1000 of FIG. 10, etc., can be included in the plurality of decoders with different synthesis complexity 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, 610-8 of FIGS. 6A-6B.
[0180] FIG. 8 is a block diagram illustrating an example of an audio codec system 800 that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer. In some aspects, the audio codec system 800 can also be referred to as a voice coding system (e.g., vocoder), a signal synthesis system, a voice coding signal synthesis system, etc. In one illustrative example, the audio codec system 800 can be a generative audio codec (e.g., a generative voice codec) including one or more feedback recurrent autoencoders (FRAEs) that may be used to generate encoded audio data, and one or more neural synthesizers that may be used to generate reconstructed audio data based on the encoded audio data.
[0181] For example, the audio codec system 800 (e.g., a generative voice codec) can include an encoder 805 (e.g., a transmitter) and a decoder 810 (e.g., a receiver).
The encoder 805 can be used to generate encoded features that are transmitted to the decoder 810 over a channel 840. For example, the encoder 805 can receive audio 815 and generate encoded features corresponding to the audio 815. The encoded features can be generated using a feedback recurrent autoencoder (FRAE) 825 included in the encoder 805. In some examples, the encoder 805 can include one or more FRAEs, where each FRAE of the one or more FRAEs is configured to generate a corresponding one or more encoded features associated with the audio 815. In some aspects, the audio 815 may include and/or comprise a speech signal, a voice signal, etc.
[0182] The decoder 810 can receive encoded features (e.g., from the encoder 805, over the channel 840) and generate reconstructed audio 855 based at least in part on the encoded features received from the encoder 805. The reconstructed audio 855 may include and/or comprise a reconstructed speech signal, a reconstructed voice signal, etc. The reconstructed audio 855 can be generated using a neural network-based speech synthesizer 850 (e.g., also referred to as a neural speech synthesizer and/or a neural synthesizer) included in the decoder 810. The reconstructed audio 855 generated using the neural speech synthesizer 850 may also be referred to as synthesized speech. In some examples, the decoder 810 can include an FRAE decoder 845, which can be used to decode the encoded features received from the encoder 805. In some aspects, the FRAE decoder 845 can be the same as or similar to a decoder implemented by the FRAE 825 included in the encoder 805. In some examples, the decoder 810 can include one or more FRAE decoders 845, which can correspond to one or more FRAEs 825 included in the encoder 805. For example, the number of FRAE decoders 845 included in the decoder 810 can be equal to the number of FRAEs 825 included in the encoder 805.
[0183] In some examples, the audio 815 includes speech. The audio 815 may be an example of the speech signal 251 of FIG. 2B, and/or the speech signal 303 of FIG. 3, etc. The encoder 805 of the generative voice codec system 800 can be configured to generate one or more encoded representations of the input audio 815. For example, the one or more encoded representations can include spectral envelope information associated with the audio 815, and/or can include pitch information associated with the audio 815, etc. In some aspects, the encoder 805 can extract spectral envelope features from the audio 815 using spectral envelope feature extraction 820, and can encode the extracted spectral envelope features using the FRAE 825. The encoded spectral features z can be transmitted from the FRAE 825 to the decoder 810, using the channel 840. In some examples, the encoder 805 can extract pitch information from the audio 815 using pitch extraction engine 830, and can perform quantization 835 to generate quantized pitch information associated with the audio 815. The quantized pitch information can be transmitted from the pitch quantizer 835 to the decoder 810, using the channel 840.
[0184] In some aspects, the encoder 805 can be configured to extract spectral features (e.g., spectral envelope features) from the audio 815 using the spectral envelope feature extraction engine 820. In one illustrative example, the spectral envelope feature extraction engine 820 can perform a cepstrum computation to determine cepstrum information and/or one or more cepstral coefficients corresponding to the audio 815. In some cases, the spectral features extracted from the audio 815 using the spectral envelope feature extraction 820 can include mel-frequency cepstral coefficients (MFCC), such as MFCC-24 features (e.g., 24-dimensional MFCC features).
In some examples, the spectral envelope feature extraction 820 applies one or more mel-scaled filter(s) (e.g., a mel-scaled filterbank), a logarithmic compression, and/or a discrete cosine transform (DCT) to the audio 815 (e.g., and/or to a magnitude spectrum associated with the audio 815). The encoder 805 processes the extracted spectral features using a feedback recurrent autoencoder (FRAE) 825 to generate encoded features z. The FRAE 825 includes a decoder and an encoder. The FRAE 825 can implement feedback of state information h between the decoder and the encoder included in the FRAE 825. For example, the state information h can be determined or obtained at the decoder of the FRAE 825, and feedback of the state information h can be performed for the decoder of the FRAE 825 (e.g., the state information h is fed back to the decoder of the FRAE 825) and for the encoder of the FRAE 825 (e.g., the state information h is fed back from the decoder of the FRAE 825 to the encoder of the FRAE 825).
[0185] In some examples, the spectral envelope feature extraction engine 820 can be configured to generate and/or determine (e.g., extract) one or more types of spectral envelope features. For example, the extracted spectral envelope features may include cepstrum information and/or cepstral coefficients. In some cases, cepstral liftering can be performed based on DCT truncation to exclude or remove pitch information and only capture spectral envelope information in the extracted cepstrum or cepstral coefficients. In some examples, the extracted spectral envelope features can include companded (e.g., log) filterbank energies associated with the input audio 815. For example, companded filterbank energies can be determined based on applying an inverse DCT (e.g., IDCT) to a liftered cepstrum.
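As a non-limiting illustration of liftering by DCT truncation, the following Python sketch computes a cepstrum as the orthonormal DCT of a vector of log filterbank energies, zeroes all but the first `keep` coefficients, and applies the inverse DCT to recover smoothed companded (log) energies. The function names `dct2`, `idct2`, and `lifter_log_energies`, the orthonormal DCT-II convention, and the choice of input are assumptions made for illustration and are not specified by the description.

```python
import math
from typing import List


def _scale(k: int, n: int) -> float:
    """Orthonormal DCT-II scaling factor for coefficient index k."""
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct2(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of x."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                           for i in range(n))
        for k in range(n)
    ]


def idct2(c: List[float]) -> List[float]:
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n))
        for i in range(n)
    ]


def lifter_log_energies(log_energies: List[float], keep: int) -> List[float]:
    """Keep the first `keep` cepstral coefficients and return smoothed log energies."""
    cepstrum = dct2(log_energies)
    liftered = cepstrum[:keep] + [0.0] * (len(cepstrum) - keep)
    return idct2(liftered)
```

Applying `math.exp` to each returned value would give the uncompanded filterbank energies. Keeping every coefficient reproduces the input, while keeping only the first coefficient collapses the result to the mean log energy.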
In some cases, the extracted spectral envelope features can include filterbank energies determined based on uncompanding (e.g., exp) the companded filterbank energies. In some examples, the extracted spectral envelope features can include a full resolution spectrum (e.g., DFT domain with companded (e.g., log, linear amplitude, etc.) information, etc.). The full resolution spectrum can be smoothed to the envelope of the spectrum, for example based on interpolating the companded or uncompanded filterbank energies. In some aspects, the extracted spectral envelope features can include one or more linear prediction (LP) coefficients, for example determined based on processing one or more speech frames of the input audio 815 using the Levinson-Durbin algorithm (e.g., autocorrelation technique), and/or using a covariance technique. In some examples, the extracted spectral envelope features can include one or more of line spectral frequency (LSF) information and/or line spectral pair (LSP) information. In some examples, the spectral envelope feature extraction engine 820 can be configured to extract a first set of one or more spectral features comprising cepstrum information and/or cepstral coefficients, and to extract a second set of one or more spectral features comprising LP coefficients, LSF information, and/or LSP information. For example, the second set of one or more spectral features can include LP coefficients (LPCs) estimated by the spectral envelope feature extraction engine 820 based on the Levinson-Durbin algorithm or other speech autocorrelation techniques. In some examples, the spectral envelope feature extraction engine 820 can estimate one or more LP coefficients from spectral envelope features (e.g., the first set of one or more spectral features comprising cepstrum information and/or cepstral coefficients, etc.), using a neural network-based technique or a DSP-based technique.
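As a non-limiting illustration of the autocorrelation technique, the following Python sketch computes the autocorrelation of a frame and runs the Levinson-Durbin recursion to obtain LP coefficients. The function names and the convention that the returned list holds the prediction-error filter coefficients (with a leading 1) are assumptions made for illustration; windowing, lag weighting, and other conditioning steps commonly used in speech codecs are omitted.

```python
from typing import List, Tuple


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """Return r[0..max_lag], where r[lag] = sum over n of x[n] * x[n - lag]."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], float]:
    """Solve for the error filter A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    Returns (a, err), where a[0] == 1.0 and err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input (e.g., all-zero frame); stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                       # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err
```

For example, the autocorrelation sequence 1, 0.5, 0.25 of a first-order process yields an error filter with a[1] = -0.5 and a[2] = 0, which corresponds to predicting each sample as 0.5 times the previous sample.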
In some aspects, LSP coefficients can be determined by the spectral envelope feature extraction engine 820 based on autocorrelation of the speech signal 815. In some examples, the spectral envelope feature extraction engine 820 can determine one or more LSPs, and can convert the one or more LSPs to a corresponding one or more LSFs. [0186] In some aspects, the encoded features or encoded feature information transmitted from the generative voice codec system encoder 805 can comprise a latent representation z between the encoder of the FRAE 825 and the decoder of the FRAE 825. The encoder 805 can be configured to pass (e.g., transmit) the encoded features z through a channel 840 to the decoder 810. [0187] The encoder 805 also processes the audio 815 using a pitch extraction engine 830 to extract pitch information from the audio 815. In some aspects, the audio 815 can be processed in parallel by the spectral envelope feature extraction engine 820 and the pitch extraction engine 830. In some cases, the encoder 805 can include a denoiser 818 that processes the input audio 815 and provides a de-noised audio to the spectral envelope feature extraction engine 820 and/or to the pitch extraction engine 830. [0188] Based on the audio 815, the pitch extraction engine 830 can generate one or more types of pitch information. For example, the pitch extraction engine 830 can output pitch information in the frequency-domain (e.g., f0 pitch information of the audio 815, in units of Hertz (Hz)) and/or can output pitch information in the time-domain (e.g., pitch lag in samples, with or without a fractional component, and/or pitch lag in milliseconds). In some cases, the pitch extraction engine 830 can be used to generate a pitch estimation indicative of a pitch lag (or pitch delay) from the audio 815, and/or to identify a pitch correlation from the audio 815.
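As a rough illustration of how a pitch lag, a pitch correlation, and the corresponding f0 relate to each other, the sketch below searches for the lag that maximizes the normalized autocorrelation of a frame. It is only one simple estimator chosen for the example; the search range, the tie-breaking tolerance, and the function name `estimate_pitch` are assumptions, and the disclosure does not specify how the pitch extraction engine 830 is implemented.

```python
import math

def estimate_pitch(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return (pitch_lag_in_samples, pitch_correlation, f0_in_hz) for one frame.

    Integer lags only; the f0 range is a hypothetical choice for speech.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_corr = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        a = frame[lag:]
        b = frame[:len(frame) - lag]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        corr = num / den if den > 0.0 else 0.0
        # Small tolerance so the shortest lag wins over its integer multiples.
        if corr > best_corr + 1e-9:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr, sample_rate / best_lag
```

For a sampling rate of 8000 Hz, a lag of 40 samples corresponds to 5 ms in the time domain and to an f0 of 200 Hz in the frequency domain.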
In some aspects, the pitch information generated by the pitch extraction engine 830 can include a pitch lag and a pitch correlation. [0189] The encoder 805 can be configured to process the pitch information of the audio 815 (e.g., pitch, pitch lag, and/or pitch correlation, determined using the pitch extraction engine 830) using a quantizer 835 Q() to generate a quantized pitch signal. The encoder 805 passes the quantized pitch signal through the channel 840 to the decoder 810. In some examples, the quantizer 835 Q() may also be referred to as a pitch quantizer. In some aspects, the quantizer 835 Q() can perform vector quantization (VQ) to generate the quantized pitch signal for transmission to the decoder 810 over the channel 840. In some examples, the quantizer 835 Q() can perform single-stage VQ and/or can perform multi-stage VQ (e.g., MSVQ). In some cases, the quantizer 835 Q() can be implemented using one or more FRAEs. In some aspects, the quantizer 835 Q() can be configured to implement forward error correction (FEC) for the quantized pitch signal that is transmitted to the decoder 810 over the channel 840. For example, the quantizer 835 Q() can implement multiple description coding (MDC) for the transmission of the quantized pitch signal over the channel 840, can implement full-redundancy FEC for the transmission of the quantized pitch signal over the channel 840, etc. [0190] The generative voice codec system decoder 810 can be configured to receive the encoded features z from the FRAE 825 included in the generative voice codec system encoder 805. For example, the decoder 810 can receive the encoded features z via the channel 840. The decoder 810 decodes the encoded features z using a FRAE 845 to generate decoded features. In some examples, the FRAE 845 of the decoder 810 includes only a decoder, without an encoder. In some examples, the FRAE 845 of the decoder 810 can include an encoder.
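The multi-stage VQ option mentioned for the quantizer 835 Q() can be illustrated with a toy example in which each stage quantizes the residual left by the previous stage, and dequantization is a codebook lookup per stage. The codebooks in `STAGES`, the two-dimensional (pitch lag, pitch correlation) vector, and the unweighted squared-error distance are all hypothetical; a practical quantizer would typically use trained codebooks and a suitably weighted distance.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def msvq_encode(stages, vec):
    """Quantize vec stage by stage; each stage codes the previous stage's residual."""
    indices, residual = [], list(vec)
    for codebook in stages:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def msvq_decode(stages, indices):
    """Dequantize by codebook lookup: sum the selected entry of every stage."""
    out = [0.0] * len(stages[0][0])
    for codebook, i in zip(stages, indices):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical two-stage codebooks over (pitch lag in samples, pitch correlation).
STAGES = [
    [[40.0, 0.2], [40.0, 0.8], [80.0, 0.2], [80.0, 0.8], [120.0, 0.2], [120.0, 0.8]],
    [[-10.0, 0.0], [0.0, 0.0], [10.0, 0.0], [0.0, -0.1], [0.0, 0.1]],
]
```

Only the list of indices would need to be sent over the channel 840; a single-stage VQ corresponds to `STAGES` containing one codebook.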
In some examples, the decoded features generated by the FRAE 845 include mel-frequency cepstral coefficients (MFCC), such as MFCC-24 features (e.g., 24-dimensional MFCC features). [0191] The decoded features determined using the FRAE decoder 845 can be provided to a neural speech synthesizer 850, which can be configured to generate a reconstructed audio 855 (e.g., synthesized speech) based at least in part on the decoded spectral envelope features from the FRAE decoder 845. In one illustrative example, the neural speech synthesizer 850 receives as input the decoded spectral envelope features (e.g., from the FRAE decoder 845) and the received pitch encoding information (e.g., the quantized pitch signal received by the decoder 810 over the channel 840 and from the pitch quantizer 835 of the encoder 805). In some aspects, the neural speech synthesizer 850 can generate the reconstructed audio 855 (e.g., synthesized speech) based on processing the decoded spectral envelope features along with a pitch signal (e.g., the quantized pitch signal or a reconstructed variant thereof). [0192] For example, the decoder 810 can receive the quantized pitch signal (e.g., indicative of the pitch, the pitch lag, and/or the pitch correlation of the audio 815) from the encoder 805 via the channel 840. In some examples, the decoder 810 passes the quantized pitch signal to the neural speech synthesizer 850, and the neural speech synthesizer 850 processes the decoded spectral envelope features along with the quantized pitch signal to generate reconstructed audio 855 (e.g., synthesized speech). [0193] In some cases, the decoder 810 can receive the quantized pitch signal (e.g., indicative of the pitch, the pitch lag, and/or the pitch correlation of the audio 815) from the encoder 805 via the channel 840, and can perform dequantization of the quantized pitch signal to obtain a reconstructed pitch signal.
The reconstructed pitch signal can be passed to the neural speech synthesizer 850, and used in combination with the decoded spectral envelope features from the FRAE decoder 845 to generate the reconstructed audio 855 (e.g., synthesized speech). In some examples, the decoder 810 can include a reconstruction engine that is separate from the neural speech synthesizer 850. The reconstruction engine processes the quantized pitch signal to reconstruct the pitch signal (e.g., including the pitch, the pitch lag, and/or the pitch correlation) before the neural speech synthesizer 850 uses the reconstructed pitch signal (e.g., the pitch, the pitch lag, and/or the pitch correlation) to generate the reconstructed audio 855. [0194] In some examples, the output of the neural speech synthesizer 850 can be provided to one or more linear predictive coding (LPC) layers 852 included in the decoder 810. The LPC 852 can be used to perform linear prediction analysis and/or linear prediction synthesis, based on the output of the neural speech synthesizer 850. For example, the LPC 852 can perform linear prediction based on the output of the neural speech synthesizer 850, to generate the reconstructed audio 855 (e.g., synthesized speech). In some aspects, the decoder 810 does not include the LPC 852, and the neural speech synthesizer 850 can be trained and/or configured to generate the reconstructed audio 855 directly (e.g., the output of the neural speech synthesizer 850 can be the reconstructed audio 855). [0195] In some aspects, the LPC layers 852 can be used to implement a linear prediction (LP) synthesis filter, based on: $s[n] = \sum_{k=1}^{p} a_k \, s[n-k] + x[n]$.
Here, x[n] represents the input signal to the LP filter (e.g., the input signal to the LPC layers 852), s[n] represents the output signal (e.g., the output of the LPC layers 852, which can be the reconstructed audio 855), a_k represents the linear prediction coefficients associated with implementing the LP synthesis filter, and p is a value corresponding to the LP filter order. For example, the LP filter order p can have a value of 16 or 12, etc., among various other LP filter order values. [0196] In some aspects, the LP synthesis filter associated with the LPC layers 852 can be implemented in the time domain or in the frequency domain. For example, in the time domain, the LP synthesis filter can be implemented based on a difference equation. In some cases, the LP synthesis filter can be implemented in the time domain by convolving the input signal x[n] with the LP filter impulse response or an approximation of the LP filter impulse response. For example, the LP filter can be associated with an infinite impulse response (IIR), which can be approximated using a finite segment of the IIR (e.g., such as the first N samples of the IIR for a configured integer value of N, etc.). The LP synthesis filter associated with the LPC layers 852 can be implemented in the frequency domain based on multiplying the FFT of the input signal x[n] with the frequency response of the LP filter, and then determining the IFFT of the result to convert the output of the frequency domain LP synthesis filter into a time domain signal corresponding to the reconstructed audio 855. [0197] In some examples, the LP coefficients a_k can be estimated from the decoded spectral envelope features obtained using the FRAE decoder 845.
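The two time-domain implementations described above, the difference equation and convolution with a truncated impulse response, can be compared with a short sketch. The function names are illustrative and zero initial filter state is assumed.

```python
def lp_synthesis(x, a):
    """Difference equation s[n] = sum_{k=1..p} a[k-1] * s[n-k] + x[n],
    with zero initial state; a[k-1] holds the coefficient a_k."""
    p = len(a)
    s = []
    for n, xn in enumerate(x):
        pred = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        s.append(pred + xn)
    return s

def truncated_impulse_response(a, length):
    """First `length` samples of the infinite impulse response of the filter."""
    return lp_synthesis([1.0] + [0.0] * (length - 1), a)

def convolve(x, h):
    """Direct convolution of x with h, truncated to len(x) output samples."""
    return [sum(h[j] * x[n - j] for j in range(min(n, len(h) - 1) + 1))
            for n in range(len(x))]
```

When the truncated impulse response is at least as long as the input, `convolve(x, h)` matches `lp_synthesis(x, a)` up to rounding; with a shorter segment (the first N samples), the result is an approximation whose error depends on how quickly the impulse response decays.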
In some cases, the spectral envelope features can be LP coefficients (e.g., determined by the spectral envelope feature extraction engine 820 based on using the Levinson-Durbin algorithm or other autocorrelation technique, or using a covariance technique, to process the input speech frames of the audio 815), and the decoded LP coefficients on the decoder side can be used for the LP filter implemented by the LPC layers 852. [0198] In some examples, the extracted spectral envelope features may be LSFs or LSPs, and the decoded LSFs or LSPs obtained using the FRAE decoder 845 can be converted to the corresponding LP coefficients a_k. In some cases, the LP coefficients a_k of the LPC layers 852 can be estimated from the spectral envelope features using a neural network-based and/or a DSP-based technique. For example, the extracted spectral envelope features may comprise a cepstrum, and a DSP-based technique can be used to convert the decoded cepstrum into filterbank energies, and subsequently interpolate the filterbank energies to estimate an FFT square magnitude at all FFT bins (e.g., including bins beyond the centers of filterbank filters). After estimating the FFT square magnitude based on the interpolated filterbank energies, the DSP-based technique can include applying an IFFT to obtain an estimate of the autocorrelation sequence, and utilizing the Levinson-Durbin algorithm on the estimated autocorrelation sequence to obtain the LP coefficients a_k. [0199] In some examples, the spectral envelope feature extraction 820 can be performed based at least in part on spectral feature learning. For example, spectral feature learning can be performed by one or more machine learning models configured and/or trained to generate as output the one or more extracted spectral envelope features.
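The last step of the DSP-based technique of paragraph [0198], going from an autocorrelation sequence to LP coefficients, can be sketched with the standard Levinson-Durbin recursion. The function name and return convention are choices made for this example; the autocorrelation values r[0..order] are assumed to be already available (e.g., from the IFFT of the estimated FFT square magnitude).

```python
def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r[0..order].

    Returns (a, err), where a[k-1] is the predictor coefficient a_k in
    s[n] ~ sum_k a_k * s[n-k], and err is the final prediction error energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / err
        prev = a[:]
        a[i - 1] = k
        # Update the lower-order coefficients.
        for j in range(i - 1):
            a[j] = prev[j] - k * prev[i - 2 - j]
        err *= (1.0 - k * k)
    return a, err
```

As a check, the autocorrelation sequence [1.0, 0.75, 0.475] is consistent with a second-order model having a_1 = 0.9 and a_2 = -0.2, and the recursion recovers those values.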
In some cases, a cepstrum or cepstrum information may be an optional input to a machine learning spectral feature learning model used to implement the spectral envelope feature extraction 820. In some cases, the spectral envelope feature extraction 820 can be implemented using one or more machine learning spectral feature learning models or engines, which can be jointly trained with the neural speech synthesizer 850, an NHV used to implement the neural speech synthesizer (e.g., such as the NHV-based neural speech synthesizer 950 of FIG. 9), an LPC network used to implement the neural speech synthesizer (e.g., such as the LPC network-based neural speech synthesizer 1050 of FIG. 10), etc. [0200] In some cases, joint training of the spectral envelope feature extraction 820 and the neural speech synthesizer 850 can be performed based on a short-time Fourier transform (STFT) loss and/or STFT loss function. In some aspects, joint training of the spectral envelope feature extraction 820 and the neural speech synthesizer 850 can be performed based on an adversarial and/or generative adversarial network (GAN) loss. For example, joint training can be performed using a multi-resolution STFT loss, based on calculating STFT amplitude spectrograms from ground truth speech information and the synthesized speech information 855. The multi-resolution STFT loss can then be determined as a sum of mean absolute error and mean absolute error in the log domain. The STFT amplitude spectrograms can be calculated at different window lengths to capture representations of the error and/or loss at different time and/or frequency resolutions. In some aspects, joint training based on adversarial or GAN loss can be used to learn temporal fine structures in speech signals. An adversarial or GAN loss can be used to match the distribution of real speech and the synthesized speech 855.
For example, the neural speech synthesizer 850 (and/or an NHV-based neural speech synthesizer) can be used as the generator network. A separate discriminator network can be used for training based on the adversarial loss, with both the NHV and the discriminator jointly trained using adversarial training techniques. In some examples, the discriminator network can be implemented as a classifier configured to classify whether an input audio represents real speech or synthesized speech. The generator network can attempt to fool the discriminator network into classifying synthesized speech as real speech. Over the course of training, the generator network improves and begins producing (e.g., generating or synthesizing) speech that is very similar to the real speech. In some cases, the discriminator network can be implemented using WaveNet. [0201] The neural speech synthesizer 850 can be implemented using one or more trained neural networks. For example, the one or more trained neural networks can be trained to perform speech synthesis and/or signal synthesis. In some examples, the neural speech synthesizer 850 can be implemented using or based on an LPCNet machine learning architecture, a WaveNet machine learning architecture, a WaveRNN machine learning architecture, etc. In one illustrative example, the neural speech synthesizer 850 can be provided as a neural homomorphic vocoder (NHV). [0202] For example, FIG. 9 illustrates an example of an audio codec system 900 (e.g., generative voice codec) that includes a decoder 910 configured to implement an NHV-based neural speech synthesizer. In some examples, an NHV is a type of neural vocoder that can synthesize speech with source-filter models controlled by one or more neural networks.
An NHV-based neural vocoder may include one or more neural networks in a source-filter model that can synthesize speech based on filtering impulse trains and noise with linear time-varying (LTV) filters, with the one or more neural networks used to control the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features. Traditional or non-neural vocoders may operate based on decomposing speech into various parameters such as pitch, timbre, rhythm, etc., which can subsequently be manipulated and resynthesized to generate a desired output audio or voice signal. Neural vocoders (e.g., such as NHVs, etc.) can apply transformations and manipulations to speech signals directly within a learned feature space of the neural network. The learned feature space used by neural vocoders may capture more complex relationships and characteristics of speech than non-neural network-based signal processing and/or vocoder techniques. For example, neural vocoders can be trained to learn a mapping between raw speech waveforms and the spectral or cepstral representations of the speech. An NHV system can apply one or more homomorphic processing techniques within the learned space corresponding to the mapping. [0203] As noted above, FIG. 9 is a block diagram illustrating an example of an audio codec system 900 (e.g., a generative voice codec, a voice coding signal synthesis system, etc.) that includes a decoder 910 that can be used to generate reconstructed audio (e.g., synthesized speech) using an NHV-based neural speech synthesizer 950, in accordance with some examples. In some aspects, the decoder 910 may also be referred to as an NHV synthesizer decoder. In some examples, the audio codec system 900 of FIG. 9 can be the same as or similar to the audio codec system 800 of FIG. 8. In some cases, the channel 940 of FIG. 9 can be the same as or similar to the channel 840 of FIG.
8, and may be associated with an encoder that is the same as or similar to the encoder 805 of FIG. 8, etc. [0204] In one illustrative example, the NHV synthesizer decoder 910 can include the FRAE decoder 945, which may be the same as or similar to the FRAE decoder 845 of FIG. 8, and may include a pitch de-quantization engine 938 (e.g., de Q()). The pitch de-quantization engine 938 can be used to process a received pitch encoding obtained by the decoder 910 over the channel 940. For example, the received pitch encoding information can be quantized pitch information or a quantized pitch signal transmitted to the decoder 910 over the channel 940 by a corresponding encoder (e.g., an encoder associated with the decoder 910, which may be the same as or similar to the encoder 805 of FIG. 8, etc.). In some aspects, the pitch dequantization engine 938 can generate reconstructed pitch information based on performing a codebook lookup to dequantize the received pitch encoding obtained from the channel 940 (e.g., to dequantize the quantized pitch encoding received by the decoder 910 over the channel 940). [0205] The dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 938 can include pitch information in the frequency-domain (e.g., f0 pitch frequency information in Hz), pitch information in the time-domain (e.g., pitch lag or pitch delay information in samples, milliseconds, etc.), and/or pitch correlation information. In some aspects, the dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 938 can include information indicative of a voiced or unvoiced (V/UV) classification. [0206] In one illustrative example, the neural speech synthesizer 950 can be implemented as a neural homomorphic vocoder (NHV) speech synthesizer and/or an NHV-based neural speech synthesizer.
For example, the neural speech synthesizer 950 can include a neural filter estimator 952 configured to generate respective filter specification or filter configuration information to parameterize one or more linear time-varying (LTV) filters of the NHV speech synthesizer 950. In some aspects, the neural filter estimator 952 can be trained to generate respective filter coefficients for a noise linear time-varying filter (LTVF) 975 and to generate respective filter coefficients for a harmonic LTVF 965. [0207] The neural network model of the neural filter estimator 952 can include any neural network architecture that can be trained to model the filter coefficients for the noise LTVF 975 and the harmonic LTVF 965 (e.g., and/or that can be trained to model the filter coefficients for one or more additional filters implemented by the NHV speech synthesizer 950, in either the time-domain, the frequency-domain, or combinations thereof). Examples of neural network architectures that may be included in the neural network filter estimator 952 can include a generative neural network (e.g., a generative-adversarial network (GAN)), convolutional neural networks (CNN), an autoencoder, and/or other type(s) of neural networks, etc. [0208] In some examples, the neural filter estimator 952 can be configured to receive a stream of decoded spectral envelope features from the FRAE decoder 945. Based on the decoded spectral envelope features, the neural filter estimator 952 can generate corresponding filter characterization parameters for each filter of one or more filters included in the NHV speech synthesizer 950. The respective filter characterization parameters can be output from the neural filter estimator 952 and used to parameterize and/or configure corresponding learned filters (e.g., learned linear filters, etc.) for each respective set of filter characterization parameters.
For example, the neural filter estimator 952 can generate a first set of filter characterization parameters corresponding to the noise LTVF 975, can generate a second set of filter characterization parameters corresponding to the harmonic LTVF 965, etc. The one or more filters included in the NHV speech synthesizer 950 (e.g., the noise LTVF 975, the harmonic LTVF 965, etc.) can be specified in various different forms. For example, the noise LTVF 975, the harmonic LTVF 965, and/or various other filters that may be included in the NHV speech synthesizer 950 can be specified or characterized (e.g., using the filter characterization parameters determined by the neural filter estimator 952) as cepstrums, as frequency responses, as time-domain impulse responses, as difference equation coefficients, etc. [0209] In some aspects, the neural filter estimator 952 can be configured to receive as input the decoded spectral envelope features from the FRAE decoder 945, and may additionally receive as input at least a portion of the dequantized pitch information generated by the pitch dequantization engine 938. For example, the neural filter estimator 952 may receive as input the decoded spectral envelope features and pitch dequantization information (e.g., such as f0 pitch information in the frequency domain, pitch lag or pitch delay in the time domain (e.g., in units of samples or milliseconds, etc.), etc.). In some examples, the pitch dequantization information provided as input to the neural filter estimator 952 may include voiced/unvoiced (V/UV) classification information indicating whether the underlying audio represented in the encoded information received by the decoder 910 over the channel 940 corresponds to voiced or unvoiced sounds, speech, etc.
[0210] In examples where the neural filter estimator 952 receives spectral envelope features and dequantized pitch information as inputs, the neural filter estimator 952 can generate the corresponding filter characterization parameters for each NHV filter (e.g., noise LTVF 975, harmonic LTVF 965, etc.) based on the decoded spectral envelope features and the dequantized pitch information. [0211] The noise LTVF 975 and the harmonic LTVF 965 can be implemented as time domain filters or frequency domain filters. In some examples, the noise LTVF 975 and the harmonic LTVF 965 can be implemented in the time domain or the frequency domain, independent of whether the neural filter estimator 952 is configured to generate the corresponding filter characterization parameters in the time domain or the frequency domain. For example, in some aspects, the neural filter estimator 952 can output impulse response-based filter characterization parameters, and the operation of noise LTVF 975 and/or harmonic LTVF 965 can be implemented in the time domain as convolutions with the impulse response. Using the same impulse response-based filter characterization parameters, the operation of noise LTVF 975 and/or harmonic LTVF 965 may be implemented in the frequency domain based on converting the impulse response to frequency response (e.g., using an FFT transform) and multiplying with the FFT of the input signal, and subsequently converting back to the time domain using an IFFT transform. [0212] In one illustrative example, the NHV speech synthesizer 950 can include a pulse train generator 960. The dequantized pitch information (e.g., generated using the pitch dequantization engine 938) can be provided to the pulse train generator 960.
In some aspects, the pulse train generator 960 can generate a pulse train based at least in part on the pitch frequency (e.g., f0) and/or pitch lag or pitch delay information obtained from the pitch dequantization engine 938 of the decoder 910. In some examples, the pulse train generator 960 can be implemented as a cosine sum pulse generator, which can be configured to process the dequantized pitch information (e.g., obtained from the pitch dequantization engine 938) to generate a pulse train p[n]. The pulse train may also be referred to as an impulse train. For example, the pulse train generator 960 can be a differentiable cosine sum pulse generator, and/or can be a non-differentiable cosine sum pulse generator (e.g., among various other pulse generators). [0213] The pulse train generated by the pulse train generator 960 can be processed by the harmonic LTVF 965, using the corresponding harmonic filter characterization parameters determined by the neural filter estimator 952 for the harmonic LTVF 965. In some aspects, the harmonic LTVF 965 can generate a harmonic output based on processing the pulse train from the pulse train generator 960. [0214] In some examples, the pulse train generator 960 may receive an additional input from the pitch dequantization engine 938, indicative of a voiced or unvoiced (e.g., V/UV) classification. For example, based on receiving an unvoiced (UV) indication or classification from the pitch dequantization engine 938, the pulse train generator 960 can be configured to generate a 0 output or a noise output that is provided to the harmonic LTVF 965 instead of the pulse train p[n] (e.g., instead of the pulse train p[n] provided from the pulse train generator 960 to the harmonic LTVF 965 in response to a voiced (V) indication or classification from the pitch dequantization engine 938). [0215] The NHV speech synthesizer 950 can include a noise generator 970 that is configured to generate a noise signal (e.g., white noise, etc.)
for processing by the noise LTVF 975. For example, the noise LTVF 975 can be parameterized based on the respective noise filter characterization parameters generated by the neural filter estimator 952 for the noise LTVF 975, and can subsequently be used to process the noise signal generated by the noise generator 970. Based on processing the noise signal from the noise generator 970, the noise LTVF 975 can generate a noise-filtered output. [0216] The NHV speech synthesizer 950 can include a combination function 980 to combine the harmonic output (e.g., from the harmonic LTVF 965) and the noise-filtered output (e.g., from the noise LTVF 975) to generate the reconstructed audio 955. In some examples, the reconstructed audio 955 includes synthesized speech. In some aspects, the reconstructed audio 955 can be the same as or similar to the reconstructed audio 855 of FIG. 8. In some examples, the combined noise LTVF 975 filter output and harmonic LTVF 965 filter output (e.g., from the combination operation 980) can comprise the reconstructed audio 955 (e.g., synthesized speech). In some aspects, the output of the combination operation 980 can be provided to an LPC 954, which may be the same as or similar to the LPC 852 of FIG. 8. For example, the LPC 954 can perform linear predictive coding on the LTVF filter output from the NHV speech synthesizer 950, and may generate as output the reconstructed audio 955 (e.g., synthesized speech). In some examples, a linear time invariant (LTI) post-filter can be included between the output of the combination operation 980 and the input to the LPC 954. [0217] FIG. 10 is a block diagram illustrating an example of a decoder 1010 of a voice coding signal synthesis system that can be used to generate reconstructed audio (e.g., synthesized speech) using a neural speech synthesizer comprising a linear predictive coding (LPC) network.
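Before turning to the LPC network of FIG. 10 in detail, the NHV signal path of FIG. 9 (pulse train generator 960, noise generator 970, the two LTVFs, and the combination function 980) can be sketched as follows. This is a simplified illustration under several assumptions: the cosine-sum pulse is taken as a sum of the harmonics of f0 below the Nyquist frequency, the per-frame impulse responses stand in for the output of the neural filter estimator 952, Gaussian noise is used as the white noise source, and all function names are hypothetical.

```python
import math
import random

def cosine_sum_pulse_train(f0_per_sample, sample_rate):
    """Sum of cosines at harmonics of f0 below Nyquist; 0 output when unvoiced (f0 <= 0)."""
    out, phase = [], 0.0
    for f0 in f0_per_sample:
        if f0 <= 0.0:
            out.append(0.0)
            continue
        phase += 2.0 * math.pi * f0 / sample_rate
        k_max = math.ceil(sample_rate / (2.0 * f0)) - 1  # harmonics strictly below Nyquist
        out.append(sum(math.cos(k * phase) for k in range(1, k_max + 1)))
    return out

def ltv_filter(x, impulse_responses, frame_size):
    """Time-domain LTV filter: each sample is spread by its frame's impulse response."""
    out = [0.0] * len(x)
    for n, xn in enumerate(x):
        h = impulse_responses[n // frame_size]
        for j, hj in enumerate(h):
            if n + j < len(x):
                out[n + j] += xn * hj
    return out

def nhv_synthesize(f0_per_sample, harmonic_irs, noise_irs, frame_size, sample_rate, seed=0):
    """Harmonic branch plus noise branch, combined by summation."""
    rng = random.Random(seed)
    pulses = cosine_sum_pulse_train(f0_per_sample, sample_rate)
    noise = [rng.gauss(0.0, 1.0) for _ in f0_per_sample]
    harmonic = ltv_filter(pulses, harmonic_irs, frame_size)
    filtered_noise = ltv_filter(noise, noise_irs, frame_size)
    return [h + v for h, v in zip(harmonic, filtered_noise)]
```

The time-domain convolution in `ltv_filter` could equally be replaced by per-frame FFT multiplication followed by an IFFT, as described in paragraph [0211].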
The decoder 1010 can be included in an audio codec system (e.g., generative voice codec system) 1000 that can be used to generate reconstructed audio (e.g., synthesized speech) 1055. [0218] In some aspects, the generative voice codec system 1000 of FIG. 10 can be the same as or similar to the generative voice codec system 800 of FIG. 8, 900 of FIG. 9, etc. In some examples, the neural speech synthesizer 1050 can receive a first input comprising decoded spectral envelope features (e.g., determined by the FRAE decoder 1045). The neural speech synthesizer 1050 can receive a second input comprising a received pitch encoding (e.g., received pitch information), that is the same as or similar to the received pitch encoding provided to the neural speech synthesizer 850 of FIG. 8, etc. For example, the decoder 1010 may include a pitch de-quantization engine 1038 (e.g., de Q()). The pitch de-quantization engine 1038 can be used to process a received pitch encoding obtained by the decoder 1010 over the channel 1040. For example, the received pitch encoding information can be quantized pitch information or a quantized pitch signal transmitted to the decoder 1010 over the channel 1040 by a corresponding encoder (e.g., an encoder associated with the decoder 1010, which may be the same as or similar to the encoder 805 of FIG. 8, etc.). In some aspects, the pitch dequantization engine 1038 can generate reconstructed pitch information based on performing a codebook lookup to dequantize the received pitch encoding obtained from the channel 1040 (e.g., to dequantize the quantized pitch encoding received by the decoder 1010 over the channel 1040).
The dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 1038 can include pitch information in the frequency-domain (e.g., f0 pitch frequency information in Hz), pitch information in the time-domain (e.g., pitch lag or pitch delay information in samples, milliseconds, etc.), and/or pitch correlation information. In some aspects, the dequantized (e.g., reconstructed) pitch information determined by the pitch dequantization engine 1038 can include information indicative of a voiced or unvoiced (V/UV) classification. [0219] In one illustrative example, the neural speech synthesizer 1050 can be implemented as a linear predictive coding (LPC) network. For example, the LPC network-based neural speech synthesizer 1050 can be used to implement a linear prediction (LP) synthesis filter, based on $s[n] = \sum_{k=1}^{p} a_k \, s[n-k] + x[n]$. Here, x[n] represents the input signal to the LP filter, s[n] represents the output signal, a_k represents the linear prediction coefficients associated with implementing the LP synthesis filter, and p is a value corresponding to the LP filter order. For example, the LP filter order p can have a value of 16 or 12, etc., among various other LP filter order values. [0220] For example, the LPC-based neural speech synthesizer 1050 can include a frame rate network 1052 configured to process inputs comprising the decoded features obtained using the FRAE decoder 1045 and the decoded pitch information obtained using the pitch dequantization engine 1038. An LPC estimation engine 1062 can process the decoded features from the FRAE decoder 1045 to determine one or more estimated LP coefficients. In some aspects, the LPC estimation engine 1062 can generate estimated LP coefficients a_k, based on an input comprising the decoded features determined using the FRAE decoder 1045.
[0221] In some examples, the LP coefficients estimated using the LPC estimation engine 1062 can be provided to an LP prediction engine 1064, configured to generate as output a prediction p[n], where $p[n] = \sum_{k=1}^{p} a_k \, s[n-k]$. In some aspects, the LP prediction engine 1064 can generate the LP prediction p[n] based on a first input comprising the LP coefficients a_k from the LPC estimation engine 1062, and a feedback input s[n-1]. [0222] The LP prediction p[n] can be provided as input to a sample rate network 1072 and a downstream combination or summation operation 1080. The sample rate network 1072 can receive additional inputs comprising the output of the frame rate network 1052, the feedback s[n-1] from the previous step n-1, and an intermediate feedback value e[n-1] from the same previous step n-1. [0223] The output of the sample rate network 1072 can be the probability distribution P(e[n]), which is a probability distribution for e[n]. A sampling engine 1074 can perform sampling from the probability distribution P(e[n]) to obtain a realization or representation of e[n]. [0224] The representation of e[n] determined by the sampling engine 1074 can be combined with the LP prediction p[n] (e.g., determined by the LP prediction engine 1064), using the combination or summation operation 1080 to thereby generate as output the signal s[n]. In some aspects, the output signal s[n] can be the same as the reconstructed audio signal 1055 of the LPC network-based neural speech synthesizer 1050 and/or decoder 1010. [0225] The representation of e[n] determined by the sampling engine 1074 can additionally be provided to a first feedback calculation 1078, which generates as output e[n-1] provided as an additional input to the sample rate network 1072.
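The per-sample loop of paragraphs [0221] to [0225] can be sketched as below. The sample rate network 1072 is a trained neural network in the description; here it is replaced by a fixed toy distribution purely so that the data flow (prediction, sampling of e[n], summation, and the s[n-1] and e[n-1] feedback paths) can be followed. `LEVELS`, `toy_sample_rate_network`, and `synthesize` are hypothetical names, and an LP order of at least 1 is assumed.

```python
import random

LEVELS = [-0.1, 0.0, 0.1]  # hypothetical discrete values that e[n] can take

def lp_prediction(a, history):
    """p[n] = sum_{k=1..p} a_k * s[n-k]; history[-k] holds s[n-k]."""
    return sum(a[k - 1] * history[-k] for k in range(1, len(a) + 1))

def toy_sample_rate_network(frame_cond, p_n, s_prev, e_prev):
    """Stand-in for the sample rate network 1072: returns P(e[n]) over LEVELS.

    A trained network would condition on all four inputs; this toy ignores them.
    """
    return [0.25, 0.5, 0.25]

def synthesize(a, frame_cond, num_samples, seed=0):
    rng = random.Random(seed)
    history = [0.0] * len(a)  # s[n-p], ..., s[n-1], zero initial state
    e_prev = 0.0
    out = []
    for _ in range(num_samples):
        p_n = lp_prediction(a, history)                       # LP prediction engine
        probs = toy_sample_rate_network(frame_cond, p_n, history[-1], e_prev)
        e_n = rng.choices(LEVELS, weights=probs, k=1)[0]      # sampling engine
        s_n = p_n + e_n                                       # summation operation
        out.append(s_n)
        history = history[1:] + [s_n]                         # feedback of s[n-1]
        e_prev = e_n                                          # feedback of e[n-1]
    return out
```

Because the LP prediction carries the spectral envelope, the network only has to model the excitation-like term e[n], which is the usual motivation for this split in LPCNet-style synthesizers.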
[0226] The output signal s[n] of the combination or summation operation 1080 can be output as the reconstructed audio 1055 and may additionally be provided to a second feedback calculation 1079, which generates the representation s[n-1] based on the input s[n]. The representation s[n-1] can be provided as a feedback input to the LP prediction engine 1064 and to the sample rate network 1072. [0227] As noted previously, in some aspects, the systems and techniques described herein can be used to provide dynamic complexity adjustment for one or more of a decoder and/or an encoder included in an audio codec system, such as the audio codec system 800 of FIG.8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG.10, etc. [0228] In one illustrative example, the systems and techniques can be used to provide dynamic decoder complexity adjustment for one or more encoders and/or decoders included in the audio codec system 800 of FIG.8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG. 10, etc. For example, the systems and techniques can be used to provide dynamic decoder complexity adjustment, based on the decoder complexity switching engine 550 of FIG.5 being included in the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG. 9, the decoder 1010 of the audio codec system 1000 of FIG. 10, etc. In some cases, one or more of the device resource monitor 520 of FIG. 5 and/or the decoder profiling engine 530 can be included in the decoder 810 of the audio codec system 800 of FIG. 8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc.
[0229] In some aspects, the systems and techniques can be used to provide dynamic encoder complexity adjustment for an encoder included in an audio codec system, such as the audio codec system 800 of FIG.8, the audio codec system 900 of FIG.9, the audio codec system 1000 of FIG.10, etc. For example, the decoder complexity switching engine 550 of FIG. 5 (or an encoder complexity switching engine the same as or similar to the decoder complexity switching engine 550 of FIG. 5) can be used to provide dynamic encoder complexity switching for one or more of the encoder 805 of FIG.8, the encoder 905 of FIG.9, the encoder 1005 of FIG.10, etc. [0230] In some aspects, one or more of the decoder 810 of the audio codec system 800 of FIG.8, the decoder 910 of the audio codec system 900 of FIG.9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc., can be included in the plurality of decoders with different synthesis complexity 510-1, 510-2, …, 510-N of FIG. 5. In some aspects, one or more of the decoder 810 of the audio codec system 800 of FIG.8, the decoder 910 of the audio codec system 900 of FIG. 9, the decoder 1010 of the audio codec system 1000 of FIG.10, etc., can be included in the plurality of decoders with different synthesis complexity 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, 610-8 of FIGS.6A-6B. [0231] In one illustrative example, the systems and techniques for decoder complexity switching can be used to adjust the complexity of the neural speech synthesizer 850 included in the decoder 810 of the audio codec system 800 of FIG. 8. For example, the decoder 810 of FIG.8 can include the decoder complexity switching engine 550 of FIG. 5, and can use the decoder complexity switching engine 550 to switch between different decoder models or decoder instances for the neural speech synthesizer 850, where each decoder model or decoder instance is associated with a different respective complexity.
In some examples, the different decoder models or decoder instances with different complexity 510-1, 510-2, …, 510-N of FIG.5 can correspond to different decoder models or decoder instances configured to implement the neural speech synthesizer 850 of FIG. 8. For example, the first decoder model 510-1 can be a first neural speech synthesizer 850 model with a first complexity level, the second decoder model 510-2 can be a second neural speech synthesizer 850 model with a second complexity level, …, and the Nth decoder model 510-N can be an Nth neural speech synthesizer 850 model with an Nth complexity level. [0232] In some cases, the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 810 and/or the neural speech synthesizer 850 of FIG.8, based on using the decoder complexity switching engine 550 of FIG.5 to add or remove layers from a neural network model used for the neural speech synthesizer 850, and/or to switch between different models for the neural speech synthesizer 850. [0233] In some aspects, the decoder 810 of FIG.8 can include one or more (or all) of the components of the dynamic decoder complexity switching system 500 of FIG. 5. In some cases, the decoder 810 of FIG.8 can include and/or implement the dynamic decoder complexity switching system 500 of FIG. 5, to perform complexity switching for the neural speech synthesizer 850 of FIG. 8. For example, the decoder 810 of FIG. 8 can include the device resource monitor 520 of FIG. 5 and can perform dynamic decoder complexity switching for the neural speech synthesizer 850 based at least in part on the device resource monitoring information obtained from the device resource monitor 520. In some cases, the decoder 810 of FIG.8 can include the decoder profiling engine 530 of FIG. 
5, and can perform dynamic decoder complexity switching for the neural speech synthesizer 850 based at least in part on the decoder profiling information obtained from the decoder profiling engine 530. [0234] In some aspects, the systems and techniques for decoder complexity switching can be used to adjust the complexity of the NHV-based neural speech synthesizer 950 of FIG. 9. In one illustrative example, the systems and techniques for decoder complexity switching can be used to adjust the complexity of the neural filter estimator 952 included in the NHV-based neural speech synthesizer 950 of FIG. 9. For example, the decoder complexity switching engine 550 of FIG.5 can be included in the decoder 910 of FIG.9, and used to switch between different decoder models or decoder instances for the neural filter estimator 952 of the NHV-based neural speech synthesizer 950 of FIG.9. [0235] Each decoder model or decoder instance used to perform complexity switching for the neural filter estimator 952 can be associated with a different respective complexity. In some examples, the different decoder models or decoder instances with different complexity 510-1, 510-2, …, 510-N of FIG.5 can correspond to different decoder models or decoder instances configured to implement the neural filter estimator 952 of FIG.9. [0236] For example, the first decoder model 510-1 of FIG.5 can be a first neural filter estimator 952 model with a first complexity level, the second decoder model 510-2 of FIG.5 can be a second neural filter estimator 952 model with a second complexity level, …, and the Nth decoder model 510-N of FIG.5 can be an Nth neural filter estimator 952 model with an Nth complexity level.
[0237] In some cases, the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 910 and/or the NHV-based neural speech synthesizer 950 of FIG.9 and/or the neural filter estimator 952 of FIG.9, based on using the decoder complexity switching engine 550 of FIG. 5 to add or remove layers from a neural network model used for the neural filter estimator 952, and/or to switch between different models for the neural filter estimator 952. [0238] In some aspects, the decoder 910 of FIG. 9 can include one or more (or all) of the components of the dynamic decoder complexity switching system 500 of FIG. 5. In some cases, the decoder 910 of FIG.9 can include and/or implement the dynamic decoder complexity switching system 500 of FIG. 5, to perform complexity switching for the NHV-based neural speech synthesizer 950 and/or the neural filter estimator 952 of FIG. 9. For example, the decoder 910 of FIG. 9 can include the device resource monitor 520 of FIG. 5 and can perform dynamic decoder complexity switching for the neural filter estimator 952 and/or the neural speech synthesizer 950, based at least in part on the device resource monitoring information obtained from the device resource monitor 520. In some cases, the decoder 910 of FIG.9 can include the decoder profiling engine 530 of FIG.5, and can perform dynamic decoder complexity switching for the neural filter estimator 952 and/or the neural speech synthesizer 950, based at least in part on the decoder profiling information obtained from the decoder profiling engine 530. [0239] In some examples, the systems and techniques for decoder complexity switching can be used to adjust the complexity of the LPC-based neural speech synthesizer 1050 of FIG.10.
[0240] In one illustrative example, the systems and techniques for decoder complexity switching can be used to adjust the complexity of the frame rate network 1052 of the LPC-based neural speech synthesizer 1050 of FIG.10. In some aspects, the systems and techniques for decoder complexity switching can be used to adjust the complexity of the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG. 10. In some aspects, the systems and techniques for decoder complexity switching can be used to adjust the complexity of both the frame rate network 1052 and the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050 of FIG.10. [0241] For example, the decoder complexity switching engine 550 of FIG. 5 can be included in the decoder 1010 of FIG. 10, and used to switch between different decoder models or decoder instances for the frame rate network 1052 and/or the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG. 10. In some aspects, the decoder 1010 can include one instance of the decoder complexity switching engine 550 of FIG.5, and may use the same instance of the decoder complexity switching engine 550 to perform complexity switching for both the frame rate network 1052 and the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050 of FIG.10. [0242] In some examples, the decoder 1010 of FIG. 10 can include and use a first instance of the decoder complexity switching engine 550 of FIG.5 to perform complexity switching for the frame rate network 1052 of the LPC-based neural speech synthesizer 1050, and can include and use a second instance of the decoder complexity switching engine 550 of FIG.5 to perform complexity switching for the sample rate network 1072 of the LPC-based neural speech synthesizer 1050. 
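The two arrangements described above (one switching engine instance shared by the frame rate network and the sample rate network, or a separate instance for each) can be contrasted with a small sketch. The class `SwitchingEngine`, its load thresholds, and the model names are hypothetical; the disclosure does not specify how an engine maps load to a complexity level.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class SwitchingEngine:
    """Hypothetical engine mapping a device load percentage to a complexity level."""
    load_thresholds: Sequence[float]  # each threshold reached lowers the level by one

    def select_level(self, load_percent: float, num_levels: int) -> int:
        # Level num_levels - 1 is the most complex model; level 0 is the least complex.
        drops = sum(1 for t in self.load_thresholds if load_percent >= t)
        return max(0, num_levels - 1 - drops)


# Model lists ordered from least to most complex (names are placeholders).
frame_rate_models = ["frame_rate_small", "frame_rate_large"]
sample_rate_models = ["sample_rate_small", "sample_rate_medium", "sample_rate_large"]

# Arrangement 1: a single engine instance serves both networks.
shared_engine = SwitchingEngine(load_thresholds=[50.0, 80.0])
# Arrangement 2: each network has its own engine instance with its own thresholds.
frame_engine = SwitchingEngine(load_thresholds=[90.0])
sample_engine = SwitchingEngine(load_thresholds=[40.0, 70.0])


def choose(load_percent: float, use_shared: bool) -> Dict[str, str]:
    fr_engine = shared_engine if use_shared else frame_engine
    sr_engine = shared_engine if use_shared else sample_engine
    fr_level = fr_engine.select_level(load_percent, len(frame_rate_models))
    sr_level = sr_engine.select_level(load_percent, len(sample_rate_models))
    return {"frame_rate": frame_rate_models[fr_level],
            "sample_rate": sample_rate_models[sr_level]}
```

Separate instances allow, for example, the sample rate network (which runs once per sample) to be scaled down earlier than the frame rate network (which runs once per frame); that motivation is an assumption of this sketch.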
[0243] Each decoder model or decoder instance used to perform complexity switching for the frame rate network 1052 and/or the sample rate network 1072 can be associated with a different respective complexity. In some examples, the different decoder models or decoder instances with different complexity 510-1, 510-2, …, 510-N of FIG. 5 can correspond to different decoder models or decoder instances configured to implement the frame rate network 1052 of FIG. 10. In some cases, the different decoder models or decoder instances with different complexity 510-1, 510-2, …, 510-N of FIG. 5 can correspond to different decoder models or decoder instances configured to implement the sample rate network 1072 of FIG.10. [0244] For example, the first decoder model 510-1 of FIG. 5 can be a first frame rate network 1052 model with a first complexity level, the second decoder model 510-2 of FIG. 5 can be a second frame rate network 1052 model with a second complexity level, …, and the Nth decoder model 510-N of FIG. 5 can be an Nth frame rate network 1052 model with an Nth complexity level. In another example, the first decoder model 510-1 of FIG.5 can be a first sample rate network 1072 model with a first complexity level, the second decoder model 510-2 of FIG.5 can be a second sample rate network 1072 model with a second complexity level, …, and the Nth decoder model 510-N of FIG. 5 can be an Nth sample rate network 1072 model with an Nth complexity level. [0245] In some cases, the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 1010 and/or the LPC-based neural speech synthesizer 1050 of FIG. 10 (e.g., one or more, or both, of the frame rate network 1052 and/or the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050), based on using the decoder complexity switching engine 550 of FIG.
5 to add or remove layers from a neural network model used for the frame rate network 1052, and/or to switch between different models for the frame rate network 1052. [0246] In another example, the systems and techniques can be used to implement dynamic decoder complexity switching for the decoder 1010 and/or the LPC-based neural speech synthesizer 1050 of FIG.10 (e.g., one or more, or both, of the frame rate network 1052 and/or the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050), based on using the decoder complexity switching engine 550 of FIG. 5 to add or remove layers from a neural network model used for the sample rate network 1072, and/or to switch between different models for the sample rate network 1072. [0247] In some aspects, the decoder 1010 of FIG. 10 can include one or more (or all) of the components of the dynamic decoder complexity switching system 500 of FIG. 5. In some cases, the decoder 1010 of FIG. 10 can include and/or implement the dynamic decoder complexity switching system 500 of FIG.5, to perform complexity switching for the LPC-based neural speech synthesizer 1050 and one or more (or both) of the frame rate network 1052 and/or the sample rate network 1072 included in the LPC-based neural speech synthesizer 1050. For example, the decoder 1010 of FIG.10 can include the device resource monitor 520 of FIG.5 and can perform dynamic decoder complexity switching for the frame rate network 1052 and/or the sample rate network 1072, based at least in part on the device resource monitoring information obtained from the device resource monitor 520. In some cases, the decoder 1010 of FIG.10 can include the decoder profiling engine 530 of FIG. 5, and can perform dynamic decoder complexity switching for the frame rate network 1052 and/or the sample rate network 1072 of FIG. 10, based at least in part on the decoder profiling information obtained from the decoder profiling engine 530.
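The option of adding or removing layers from a network model, mentioned above for the frame rate network 1052 and the sample rate network 1072, might be sketched as a network with a fixed input layer, a fixed output layer, and a variable number of active middle layers. The class `ScalableNetwork` and the toy layers are hypothetical, and the sketch does not address how such a network would be trained to tolerate missing layers.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


class ScalableNetwork:
    """Hypothetical network whose optional middle layers can be enabled or disabled."""

    def __init__(self, input_layer: Layer, optional_layers: List[Layer],
                 output_layer: Layer) -> None:
        self.input_layer = input_layer
        self.optional_layers = optional_layers
        self.output_layer = output_layer
        self.active = len(optional_layers)  # start at full complexity

    def remove_layer(self) -> None:
        # Lower complexity by deactivating the last active optional layer.
        self.active = max(0, self.active - 1)

    def add_layer(self) -> None:
        # Raise complexity by reactivating one optional layer.
        self.active = min(len(self.optional_layers), self.active + 1)

    def __call__(self, x: List[float]) -> List[float]:
        x = self.input_layer(x)
        for layer in self.optional_layers[: self.active]:
            x = layer(x)
        return self.output_layer(x)


def identity(x: List[float]) -> List[float]:
    return list(x)


def add_one(x: List[float]) -> List[float]:
    return [v + 1.0 for v in x]


net = ScalableNetwork(identity, [add_one, add_one, add_one], identity)
full_output = net([0.0])      # all three optional layers run
net.remove_layer()
reduced_output = net([0.0])   # only two optional layers run
```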
[0248] In some aspects, the systems and techniques can be configured to use the decoder complexity switching engine 550 of FIG.5 to switch between one or more neural network models associated with the NHV-based neural speech synthesizer 950 of FIG.9, and one or more neural network models associated with the LPC-based neural speech synthesizer 1050 of FIG.10. For example, the plurality of different decoder models with different respective complexity levels of FIG. 5 (e.g., the decoder models 510-1, 510-2, …, 510-N each associated with a respective complexity level) can include a first subset of different neural network models each with a different respective complexity for implementing the neural filter estimator 952 of the NHV-based neural speech synthesizer 950 of FIG.9, can include a second subset of different neural network models each with a different respective complexity for implementing the frame rate network 1052 of the LPC-based neural speech synthesizer 1050 of FIG. 10, and/or can include a third subset of different neural network models each with a different respective complexity for implementing the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG.10, etc. In one illustrative example, the NHV-based synthesizer 950 of FIG. 9 can be associated with a lower complexity than the LPC-based synthesizer 1050 of FIG. 10 (e.g., the LPC-based synthesizer 1050 of FIG. 10 can be higher complexity than the NHV-based synthesizer 950 of FIG. 9). The systems and techniques can use the decoder complexity switching engine 550 to switch between a relatively high complexity decoder configuration corresponding to the LPC-based synthesizer 1050 of FIG. 10 (e.g., used when the device resource monitor 520 is indicative of relatively low device load information or state) and a relatively lower complexity decoder configuration corresponding to the NHV-based synthesizer 950 of FIG.
9 (e.g., used when the device resource monitor 520 is indicative of relatively high device load information or state). [0249] FIG. 11 is a flowchart diagram illustrating an example of a process 1100 for processing one or more audio samples. In some examples, the process 1100 can be performed by a computing device or apparatus or a component or system (e.g., one or more chipsets, one or more processors such as one or more CPUs, DSPs, NPUs, NSPs, microcontrollers, ASICs, FPGAs, programmable logic devices, discrete gates or transistor logic components, discrete hardware components, etc., any combination thereof, and/or other component or system) of the computing device or apparatus. The operations of the process 1100 may be implemented as software components that are executed and run on one or more processors (e.g., processor 1210 of FIG. 12 or other processor(s)). In some examples, the process 1100 can be performed by an audio coding system or component thereof, including any of the audio coding systems and/or components thereof of FIGS. 1-10. In some aspects, the process 1100 can be performed by a UE, smartphone, mobile computing device, user computing device, etc. The process 1100 may be performed by an apparatus that may be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device such as a virtual reality (VR) device or augmented reality (AR) device, a vehicle or component or system of a vehicle, or other type of computing device. [0250] At block 1102, the apparatus (or component thereof) can obtain an audio frame included in a sequence of one or more audio frames, wherein the sequence is associated with a configured processing time completion limit.
[0251] For example, the audio frame can be included in the bitstream 502 of FIG.5. In some cases, the configured processing time completion limit can be the same as or similar to the configured processing time completion limit associated with the remaining time budget and the real-time processing limit comparisons 740-1, 740-2, …, etc., of FIG.7. [0252] In some cases, the configured processing time completion limit is equal to a periodicity between consecutive frames included in the sequence of the one or more audio frames. In some examples, the configured processing time completion limit is equal to a duration of each audio frame included in the sequence of the one or more audio frames. For example, the duration of each audio frame can be 20 milliseconds. In some cases, the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames. [0253] At block 1104, the apparatus (or component thereof) can obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus. [0254] For example, the monitoring information can be obtained from one or more of the device resource monitor 520 of FIG.5 and/or the decoder profiling engine 530 of FIG. 5. In some cases, the monitoring information can be obtained from both the device resource monitor 520 of FIG.5 and the decoder profiling engine 530 of FIG.5. In some cases, monitoring information associated with one or more previously decoded frames can be obtained using the decoder profiling engine 530 of FIG. 5, and the monitoring information associated with a resource utilization of the apparatus can be obtained from the device resource monitor 520 of FIG.5.
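To make the relationship between the time limit and the monitoring information concrete, the following sketch combines a device load reading with timing of a previously decoded frame and compares the timing against a 20 millisecond limit (the example frame duration given above). The `MonitoringInfo` container, its field names, and the `headroom_ms` helper are hypothetical illustrations, not structures defined in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

FRAME_DURATION_MS = 20.0  # example frame duration; used here as the time limit


@dataclass
class MonitoringInfo:
    """Hypothetical container for the two monitoring sources."""
    device_load_percent: Optional[float] = None  # e.g., from a device resource monitor
    decode_start_ms: Optional[float] = None      # e.g., from a decoder profiler
    decode_end_ms: Optional[float] = None

    def elapsed_ms(self) -> Optional[float]:
        # Decoding time of the previous frame, if both timestamps are available.
        if self.decode_start_ms is None or self.decode_end_ms is None:
            return None
        return self.decode_end_ms - self.decode_start_ms


def headroom_ms(info: MonitoringInfo, limit_ms: float = FRAME_DURATION_MS) -> Optional[float]:
    """Time left over by the previous frame; a negative value means the limit was exceeded."""
    elapsed = info.elapsed_ms()
    return None if elapsed is None else limit_ms - elapsed


on_time = MonitoringInfo(device_load_percent=35.0, decode_start_ms=100.0, decode_end_ms=112.5)
too_slow = MonitoringInfo(decode_start_ms=0.0, decode_end_ms=25.0)
load_only = MonitoringInfo(device_load_percent=35.0)
```

Either source may be absent, which is why every field is optional in this sketch.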
[0255] In some cases, the previously decoded frame is decoded using a neural network-based decoder model of the plurality of neural network-based decoder models different from the selected neural network-based decoder model. [0256] In some examples, to obtain the monitoring information, the apparatus (or component thereof) can be configured to obtain device load information indicative of the resource utilization of the apparatus, where the device load information is obtained from a resource monitor of the apparatus. For example, the device load information can be obtained from the device resource monitor 520 of FIG.5. [0257] In some cases, to obtain the monitoring information the apparatus (or component thereof) can be configured to obtain one or more decoding time measurements corresponding to the previously decoded frame. For example, the one or more decoding time measurements can be obtained using the decoder profiling engine 530 of FIG.5. [0258] In some examples, to obtain the monitoring information, the apparatus (or component thereof) can be configured to obtain device load information indicative of the resource utilization of the apparatus, where the device load information is obtained from a resource monitor of the apparatus, and to obtain one or more decoding time measurements corresponding to the previously decoded frame, where the one or more decoding time measurements are obtained from a decoder system of the apparatus. [0259] In some examples, the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame. For example, the one or more decoding time measurements can include one or more of a decoder start time associated with decoding the previously decoded frame, a decoder end time associated with decoding the previously decoded frame, or a decoder elapsed time associated with decoding the previously decoded frame.
In some cases, the monitoring information is indicative of a respective one or more decoding time measurements corresponding to each previously decoded frame of multiple previously decoded frames included in the sequence. [0260] In some examples, the monitoring information is indicative of a current utilization percentage associated with one or more processors of the apparatus. In some cases, the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the apparatus. In some examples, the monitoring information is indicative of a memory usage associated with the apparatus. In some cases, the resource utilization of the apparatus is an instantaneous resource utilization. In some examples, the resource utilization of the apparatus is an average resource utilization over a configured length of time. [0261] At block 1106, the apparatus (or component thereof) can determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity. [0262] For example, the decoder complexity switching engine 550 of FIG. 5 can be used to determine the selected neural network-based decoder model. In some cases, the selected neural network-based decoder model is selected from the plurality of neural network-based decoder models 510-1, 510-2, …, 510-N of FIG.5.
In some cases, the plurality of neural network-based decoder models can each have a respective complexity, and can be the same as or similar to one or more of the decoder models each with respective complexity 610-1, 610-2, 610-3, 610-4, 610-5, 610-6, 610-7, 610-8 of FIGS.6A-6B. [0263] In some examples, the selected neural network-based decoder model is a selected neural speech synthesizer model, for example corresponding to the neural speech synthesizer 850 of FIG.8, 950 of FIG.9, and/or 1050 of FIG.10, etc. In some cases, the selected neural network-based decoder model is selected from a plurality of neural speech synthesizer neural network models. [0264] In some cases, the selected neural network-based decoder model is a selected neural homomorphic vocoder (NHV) model for neural speech synthesis, and the selected neural network-based decoder model is selected from a plurality of NHV-based neural network models for neural speech synthesis. For example, the selected NHV model for neural speech synthesis can correspond to the NHV-based neural speech synthesizer 950 of FIG. 9. In some cases, the selected neural network-based decoder model is a selected neural network model corresponding to a neural filter estimator of a neural homomorphic vocoder (NHV)-based neural speech synthesizer. For example, the selected neural network-based decoder model can be a selected neural network model corresponding to the neural filter estimator 952 of FIG.9. [0265] In some examples, the selected neural network-based decoder model is a selected linear predictive coding (LPC) model for neural speech synthesis, and the selected neural network-based decoder model is selected from a plurality of LPC-based neural network models for neural speech synthesis. For example, the selected neural network-based decoder model can correspond to the LPC-based neural speech synthesizer 1050 of FIG. 10.
In some cases, the selected neural network-based decoder model is a selected neural network model corresponding to a frame rate network of a linear predictive coding (LPC)-based neural speech synthesizer. For example, the selected neural network-based decoder model can correspond to the frame rate network 1052 of the LPC-based neural speech synthesizer 1050 of FIG.10. In some examples, the selected neural network-based decoder model is a selected neural network model corresponding to a sample rate network of a linear predictive coding (LPC)-based neural speech synthesizer. For example, the selected neural network-based decoder model can correspond to the sample rate network 1072 of the LPC-based neural speech synthesizer 1050 of FIG.10. [0266] In some cases, the corresponding complexity is an increased decoder complexity, and the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the apparatus, or a processing time of the previously decoded frame being less than the configured processing time completion limit. [0267] In some examples, the corresponding complexity is a decreased decoder complexity, and the selected neural network-based decoder model is selected based on one or more of an indication of an increase in the resource utilization of the apparatus, or a processing time of the previously decoded frame exceeding the configured processing time completion limit. [0268] At block 1108, the apparatus (or component thereof) can process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame.
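The increase and decrease conditions described above can be illustrated with a simple one-step selection rule. This is a sketch under assumptions: models are assumed to be ordered by increasing complexity, the complexity moves by one level at a time, and the `step_up_fraction` margin (which avoids stepping up whenever a frame merely meets the limit) is a hypothetical tuning parameter not found in the disclosure.

```python
def select_model_index(current_index: int, num_models: int,
                       prev_load: float, curr_load: float,
                       last_decode_ms: float, limit_ms: float,
                       step_up_fraction: float = 0.5) -> int:
    """Return the index of the decoder model to use for the next frame.

    Index 0 is assumed to be the least complex model and num_models - 1 the most complex.
    """
    # Decrease complexity: the previous frame exceeded the limit, or load increased.
    if last_decode_ms > limit_ms or curr_load > prev_load:
        return max(0, current_index - 1)
    # Increase complexity: load decreased, or the previous frame finished well
    # within the limit (the margin is an assumption of this sketch).
    if curr_load < prev_load or last_decode_ms < step_up_fraction * limit_ms:
        return min(num_models - 1, current_index + 1)
    # Otherwise keep the current model.
    return current_index


# Example: four models, 20 ms limit, load rose from 50% to 70%, so step down from 2 to 1.
next_index = select_model_index(2, 4, prev_load=50.0, curr_load=70.0,
                                last_decode_ms=10.0, limit_ms=20.0)
```

A practical engine would likely smooth the load signal (for example, using the average utilization over a configured length of time mentioned earlier) before comparing consecutive readings.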
[0269] For example, to process the audio frame using the selected neural network-based decoder model, the apparatus (or component thereof) can be configured to switch from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, where the first neural network-based decoder model is included in the plurality of neural network-based decoder models. For example, switching between the first neural network-based decoder model and the selected neural network-based decoder model can be performed using the switch 545 of FIG. 5, and based on decoder complexity switching information determined by the decoder complexity switching engine 550 of FIG.5. [0270] In some cases, the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model. In some examples, the corresponding complexity of the selected neural network-based decoder model is less than a first complexity associated with the first neural network-based decoder model. [0271] In some examples, to process the audio frame using the selected neural network-based decoder model, the apparatus (or component thereof) can be configured to determine an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model, and to determine a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time. The apparatus (or component thereof) can be configured to skip processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold.
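The layer-skipping behavior described above (measure elapsed time after an intermediate layer, compute the remaining time against the limit, and skip the remaining intermediate layers when the remaining time falls below a threshold) might be sketched as follows. The function name, the injectable `clock` argument, and the toy layers are assumptions made for illustration; the example uses a fake clock so that the result is reproducible.

```python
import time
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def decode_with_layer_skipping(frame: List[float], input_layer: Layer,
                               intermediate_layers: Sequence[Layer],
                               output_layer: Layer, limit_s: float,
                               threshold_s: float,
                               clock: Callable[[], float] = time.perf_counter
                               ) -> Tuple[List[float], int]:
    """Run the network, skipping remaining intermediate layers when time runs short.

    Returns the decoded output and the number of intermediate layers executed.
    """
    start = clock()
    x = input_layer(frame)
    executed = 0
    for layer in intermediate_layers:
        x = layer(x)
        executed += 1
        elapsed = clock() - start      # elapsed execution time after this layer
        remaining = limit_s - elapsed  # remaining time before the limit
        if remaining < threshold_s:
            break                      # skip the remaining intermediate layers
    return output_layer(x), executed


def add_one(x: List[float]) -> List[float]:
    return [v + 1.0 for v in x]


# Fake clock readings in seconds: start, then after layers 1, 2, and 3.
fake_times = iter([0.0, 0.004, 0.009, 0.015])
decoded, layers_run = decode_with_layer_skipping(
    [0.0], lambda x: x, [add_one] * 4, lambda x: x,
    limit_s=0.020, threshold_s=0.008, clock=lambda: next(fake_times))
```

In this run the fourth intermediate layer is skipped because only about 5 ms remain after the third layer, which is below the 8 ms threshold chosen for the example.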
[0272] At block 1110, the apparatus (or component thereof) can output the decoded audio frame within the configured processing time completion limit. In some examples, the apparatus comprises an audio decoder or a voice decoder of a signal synthesis system. In some cases, the apparatus further comprises a microphone configured to obtain the one or more audio frames. In some examples, the apparatus further comprises one or more speakers configured to output the decoded audio frame within the configured processing time completion limit. [0273] In some cases, an apparatus used to implement the process 1100 can include a microphone configured to obtain the one or more audio samples. In some cases, the apparatus further comprises one or more microphones configured to capture the one or more audio samples for speech synthesis. [0274] In some cases, the processes described herein (e.g., the process 1100 and/or any other process described herein) may be performed by a computing device or apparatus. In one example, the process 1100 and/or other technique or process described herein can be performed by a computing system having an architecture according to any of FIGS. 1-12B. In another example, the process 1100 and/or other technique or process described herein can be performed by the computing system 1200 shown in FIG. 12. For instance, a computing device with the computing device architecture of the computing system 1200 shown in FIG.12 can implement the operations of the process 1100, and/or can implement one or more of the components and/or operations described herein with respect to any of FIGS.1-10.
[0001] In some cases, the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of processes described herein. In some examples, the computing device may include a display, one or more network interfaces configured to communicate and/or receive the data, any combination thereof, and/or other component(s). The one or more network interfaces may be configured to communicate and/or receive wired and/or wireless data, including data according to the 3G, 4G, 5G, and/or other cellular standard, data according to the WiFi (802.11x) standards, data according to the Bluetooth™ standard, data according to the Internet Protocol (IP) standard, and/or other types of data. [0275] The components of the computing device may be implemented in circuitry. For example, the components may include and/or may be implemented using electronic circuits or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein. [0276] The process 1100 is illustrated as a logical flow diagram, the operation of which represents a sequence of operations that may be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations.
Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement the processes. [0277] Additionally, the process 1100 and/or other process described herein, may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof. As noted above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory. [0278] FIG. 12 is a diagram illustrating an example of a system for implementing certain aspects of the present technology. In particular, FIG. 12 illustrates an example of computing system 1200, which may be for example any computing device making up an internal computing system, a remote computing system, a camera, or any component thereof in which the components of the system are in communication with each other using connection 1205. Connection 1205 may be a physical connection using a bus, or a direct connection into processor 1210, such as in a chipset architecture. Connection 1205 may also be a virtual connection, networked connection, or logical connection.
[0279] In some aspects, computing system 1200 is a distributed system in which the functions described in this disclosure may be distributed within a datacenter, multiple data centers, a peer network, etc. In some aspects, one or more of the described system components represents many such components each performing some or all of the function for which the component is described. In some aspects, the components may be physical or virtual devices. [0280] Example system 1200 includes at least one processing unit (CPU or processor) 1210 and connection 1205 that communicatively couples various system components including system memory 1215, such as read-only memory (ROM) 1220 and random access memory (RAM) 1225 to processor 1210. Computing system 1200 may include a cache 1215 of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 1210. [0281] Processor 1210 may include any general-purpose processor and a hardware service or software service, such as services 1232, 1234, and 1236 stored in storage device 1230, configured to control processor 1210 as well as a special-purpose processor where software instructions are incorporated into the actual processor design. Processor 1210 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric. [0282] To enable user interaction, computing system 1200 includes an input device 1245, which may represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, etc. Computing system 1200 may also include output device 1235, which may be one or more of a number of output mechanisms.
In some instances, multimodal systems may enable a user to provide multiple types of input/output to communicate with computing system 1200. [0283] Computing system 1200 may include communications interface 1240, which may generally govern and manage the user input and system output. The communication interface may perform or facilitate receipt and/or transmission of wired or wireless communications using wired and/or wireless transceivers, including those making use of an audio jack/plug, a microphone jack/plug, a universal serial bus (USB) port/plug, an Apple™ Lightning™ port/plug, an Ethernet port/plug, a fiber optic port/plug, a proprietary wired port/plug, 3G, 4G, 5G and/or other cellular data network wireless signal transfer, a Bluetooth™ wireless signal transfer, a Bluetooth™ low energy (BLE) wireless signal transfer, an IBEACON™ wireless signal transfer, a radio-frequency identification (RFID) wireless signal transfer, near-field communications (NFC) wireless signal transfer, dedicated short range communication (DSRC) wireless signal transfer, 802.11 Wi-Fi wireless signal transfer, wireless local area network (WLAN) signal transfer, Visible Light Communication (VLC), Worldwide Interoperability for Microwave Access (WiMAX), Infrared (IR) communication wireless signal transfer, Public Switched Telephone Network (PSTN) signal transfer, Integrated Services Digital Network (ISDN) signal transfer, ad-hoc network signal transfer, radio wave signal transfer, microwave signal transfer, infrared signal transfer, visible light signal transfer, ultraviolet light signal transfer, wireless signal transfer along the electromagnetic spectrum, or some combination thereof.
The communications interface 1240 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers that are used to determine a location of the computing system 1200 based on receipt of one or more signals from one or more satellites associated with one or more GNSS systems. GNSS systems include, but are not limited to, the US-based Global Positioning System (GPS), the Russia-based Global Navigation Satellite System (GLONASS), the China-based BeiDou Navigation Satellite System (BDS), and the Europe-based Galileo GNSS. There is no restriction on operating on any particular hardware arrangement, and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed. [0284] Storage device 1230 may be a non-volatile and/or non-transitory and/or computer-readable memory device and may be a hard disk or other types of computer readable media which may store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, a floppy disk, a flexible disk, a hard disk, magnetic tape, a magnetic strip/stripe, any other magnetic storage medium, flash memory, memristor memory, any other solid-state memory, a compact disc read only memory (CD-ROM) optical disc, a rewritable compact disc (CD) optical disc, digital video disk (DVD) optical disc, a Blu-ray disc (BDD) optical disc, a holographic optical disk, another optical medium, a secure digital (SD) card, a micro secure digital (microSD) card, a Memory Stick® card, a smartcard chip, an EMV chip, a subscriber identity module (SIM) card, a mini/micro/nano/pico SIM card, another integrated circuit (IC) chip/card, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable
read-only memory (EEPROM), flash EPROM (FLASHEPROM), cache memory (e.g., Level 1 (L1) cache, Level 2 (L2) cache, Level 3 (L3) cache, Level 4 (L4) cache, Level 5 (L5) cache, or other (L#) cache), resistive random-access memory (RRAM/ReRAM), phase change memory (PCM), spin transfer torque RAM (STT-RAM), another memory chip or cartridge, and/or a combination thereof. [0285] The storage device 1230 may include software services, servers, services, etc., that when the code that defines such software is executed by the processor 1210, it causes the system to perform a function. In some aspects, a hardware service that performs a particular function may include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 1210, connection 1205, output device 1235, etc., to carry out the function. The term “computer-readable medium” includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data may be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like. [0286] Specific details are provided in the description above to provide a thorough understanding of the aspects and examples provided herein, but those skilled in the art will recognize that the application is not limited thereto. Thus, while illustrative aspects of the application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art. Various features and aspects of the above-described application may be used individually or jointly. Further, aspects may be utilized in any number of environments and applications beyond those described herein without departing from the broader scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. For the purposes of illustration, methods were described in a particular order. It should be appreciated that in alternate aspects, the methods may be performed in a different order than that described. [0287] For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. 
For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the aspects in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects. [0288] Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. [0289] Individual aspects may be described above as a process or method which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function. [0290] Processes and methods according to the above-described examples may be implemented using computer-executable instructions that are stored or otherwise available from computer-readable media. Such instructions may include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used may be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on. [0291] In some aspects the computer-readable storage devices, mediums, and memories may include a cable or wireless signal containing a bitstream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se. [0292] Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques.
For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof, in some cases depending in part on the particular application, in part on the desired design, in part on the corresponding technology, etc. [0293] The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed using hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium. A processor(s) may perform the necessary tasks. Examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also may be embodied in peripherals or add-in cards. Such functionality may also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example. [0294] The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure. [0295] The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. 
Such techniques may be implemented in any of a variety of devices such as general purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods, algorithms, and/or operations described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. The computer-readable medium may comprise memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that may be accessed, read, and/or executed by a computer, such as propagated signals or waves. [0296] The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such a processor may be configured to perform any of the techniques described in this disclosure.
A general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein. [0297] One of ordinary skill will appreciate that the less than (“<”) and greater than (“>”) symbols or terminology used herein may be replaced with less than or equal to (“≤”) and greater than or equal to (“≥”) symbols, respectively, without departing from the scope of this description. [0298] Where components are described as being “configured to” perform certain operations, such configuration may be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., microprocessors, or other suitable electronic circuits) to perform the operation, or any combination thereof. [0299] The phrase “coupled to” or “communicatively coupled to” refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly. [0300] Claim language or other language reciting “at least one of” a set and/or “one or more” of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim.
For example, claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B. In another example, claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, A and B and C, or any duplicate information or data (e.g., A and A, B and B, C and C, A and A and B, and so on), or any other ordering, duplication, or combination of A, B, and C. The language “at least one of” a set and/or “one or more” of a set does not limit the set to the items listed in the set. For example, claim language reciting “at least one of A and B” or “at least one of A or B” may mean A, B, or A and B, and may additionally include items not listed in the set of A and B. The phrases “at least one” and “one or more” are used interchangeably herein. [0301] Claim language or other language reciting “at least one processor configured to,” “at least one processor being configured to,” “one or more processors configured to,” “one or more processors being configured to,” or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s). For example, claim language reciting “at least one processor configured to: X, Y, and Z” means a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z; or that a group of multiple processors work together to perform operations X, Y, and Z. In another example, claim language reciting “at least one processor configured to: X, Y, and Z” can mean that any single processor may only perform at least a subset of operations X, Y, and Z.
[0302] Where reference is made to one or more elements performing functions (e.g., steps of a method), one element may perform all functions, or more than one element may collectively perform the functions. When more than one element collectively performs the functions, each function need not be performed by each of those elements (e.g., different functions may be performed by different elements) and/or each function need not be performed in whole by only one element (e.g., different elements may perform different sub-functions of a function). Similarly, where reference is made to one or more elements configured to cause another element (e.g., an apparatus) to perform functions, one element may be configured to cause the other element to perform all functions, or more than one element may collectively be configured to cause the other element to perform the functions. [0303] Where reference is made to an entity (e.g., any entity or device described herein) performing functions or being configured to perform functions (e.g., steps of a method), the entity may be configured to cause one or more elements (individually or collectively) to perform the functions. The one or more components of the entity may include at least one memory, at least one processor, at least one communication interface, another component configured to perform one or more (or all) of the functions, and/or any combination thereof. Where reference is made to the entity performing functions, the entity may be configured to cause one component to perform all functions, or to cause more than one component to collectively perform the functions.
When the entity is configured to cause more than one component to collectively perform the functions, each function need not be performed by each of those components (e.g., different functions may be performed by different components) and/or each function need not be performed in whole by only one component (e.g., different components may perform different sub-functions of a function). [0304] Illustrative aspects of the disclosure include: [0305] Aspect 1. An apparatus configured to process one or more audio frames, the apparatus comprising: one or more memories configured to store the one or more audio frames; and one or more processors coupled to the one or more memories, the one or more processors being configured to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit. [0306] Aspect 2.
The apparatus of Aspect 1, wherein, to process the audio frame using the selected neural network-based decoder model, the one or more processors are configured to: switch from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, wherein the first neural network-based decoder model is included in the plurality of neural network-based decoder models. [0307] Aspect 3. The apparatus of Aspect 2, wherein: the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model. [0308] Aspect 4. The apparatus of any of Aspects 2 to 3, wherein: the corresponding complexity of the selected neural network-based decoder model is lesser than a first complexity associated with the first neural network-based decoder model. [0309] Aspect 5. The apparatus of any of Aspects 1 to 4, wherein: the corresponding complexity is an increased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the apparatus, or a processing time of the previously decoded frame being less than the configured processing time completion limit. [0310] Aspect 6. The apparatus of any of Aspects 1 to 5, wherein: the corresponding complexity is a decreased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of an increase in the resource utilization of the apparatus, or a processing time of the previously decoded frame exceeding the configured processing time completion limit. [0311] Aspect 7.
The apparatus of any of Aspects 1 to 6, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain device load information indicative of the resource utilization of the apparatus, wherein the device load information is obtained from a resource monitor of the apparatus. [0312] Aspect 8. The apparatus of any of Aspects 1 to 7, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain one or more decoding time measurements corresponding to the previously decoded frame. [0313] Aspect 9. The apparatus of any of Aspects 1 to 8, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain device load information indicative of the resource utilization of the apparatus, wherein the device load information is obtained from a resource monitor of the apparatus; and obtain one or more decoding time measurements corresponding to the previously decoded frame, wherein the one or more decoding time measurements are obtained from a decoder system of the apparatus. [0314] Aspect 10. The apparatus of any of Aspects 1 to 9, wherein the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame. [0315] Aspect 11. The apparatus of Aspect 10, wherein the one or more decoding time measurements include one or more of a decoder start time associated with decoding the previously decoded frame, a decoder end time associated with decoding the previously decoded frame, or a decoder elapsed time associated with decoding the previously decoded frame. [0316] Aspect 12. The apparatus of any of Aspects 1 to 11, wherein the monitoring information is indicative of a respective one or more decoding time measurements corresponding to each previously decoded frame of multiple previously decoded frames included in the sequence. [0317] Aspect 13. 
The apparatus of any of Aspects 1 to 12, wherein the previously decoded frame is decoded using a neural network-based decoder model of the plurality of neural network-based decoder models different from the selected neural network-based decoder model. [0318] Aspect 14. The apparatus of any of Aspects 1 to 13, wherein the monitoring information is indicative of a current utilization percentage associated with the one or more processors. [0319] Aspect 15. The apparatus of any of Aspects 1 to 14, wherein the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the apparatus. [0320] Aspect 16. The apparatus of any of Aspects 1 to 15, wherein the monitoring information is indicative of a memory usage associated with the apparatus. [0321] Aspect 17. The apparatus of any of Aspects 1 to 16, wherein the resource utilization of the apparatus is an instantaneous resource utilization. [0322] Aspect 18. The apparatus of any of Aspects 1 to 17, wherein the resource utilization of the apparatus is an average resource utilization over a configured length of time. [0323] Aspect 19. The apparatus of any of Aspects 1 to 18, wherein, to process the audio frame using the selected neural network-based decoder model, the one or more processors are configured to: determine an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model; determine a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time; and skip processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold.
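The layer-skipping behavior recited in Aspect 19 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: layers are modeled as plain functions on a single number instead of tensors, the names `decode_with_layer_skipping` and `FakeClock` are hypothetical, and the 5 millisecond threshold is an arbitrary example value. The clock is passed in as a parameter so that the timing behavior can be reproduced deterministically.

```python
import time
from typing import Callable, List, Tuple

Layer = Callable[[float], float]


def now_ms() -> float:
    """Default clock: monotonic time in milliseconds."""
    return time.perf_counter() * 1000.0


def decode_with_layer_skipping(
    frame: float,
    input_layer: Layer,
    intermediate_layers: List[Layer],
    output_layer: Layer,
    limit_ms: float = 20.0,      # processing time completion limit
    threshold_ms: float = 5.0,   # hypothetical threshold
    clock: Callable[[], float] = now_ms,
) -> Tuple[float, int]:
    """Decode one frame; return (output, intermediate layers executed)."""
    start = clock()
    hidden = input_layer(frame)
    executed = 0
    for layer in intermediate_layers:
        hidden = layer(hidden)
        executed += 1
        # Elapsed execution time after this intermediate layer.
        elapsed = clock() - start
        # Remaining time = completion limit minus elapsed time.
        remaining = limit_ms - elapsed
        # Too little time left: skip the remaining intermediate layers.
        if remaining < threshold_ms:
            break
    # The output layer always runs so that a frame is still produced.
    return output_layer(hidden), executed


class FakeClock:
    """Deterministic clock for demonstration: advances step_ms per call."""

    def __init__(self, step_ms: float) -> None:
        self.now = -step_ms
        self.step = step_ms

    def __call__(self) -> float:
        self.now += self.step
        return self.now
```

With five intermediate layers and a simulated cost of 6 milliseconds per layer, the remaining time drops to 2 milliseconds after the third layer, so the last two layers are skipped; with a cost of 1 millisecond per layer all five layers run.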
[0324] Aspect 20. The apparatus of any of Aspects 1 to 19, wherein the configured processing time completion limit is equal to a periodicity between consecutive frames included in the sequence of the one or more audio frames. [0325] Aspect 21. The apparatus of any of Aspects 1 to 20, wherein the configured processing time completion limit is equal to a duration of each audio frame included in the sequence of the one or more audio frames, and wherein the duration of each audio frame is 20 milliseconds. [0326] Aspect 22. The apparatus of any of Aspects 1 to 21, wherein the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames. [0327] Aspect 23. The apparatus of any of Aspects 1 to 22, wherein the apparatus comprises an audio decoder or a voice decoder of a signal synthesis system. [0328] Aspect 24. The apparatus of any of Aspects 1 to 23, further comprising a microphone configured to obtain the one or more audio frames. [0329] Aspect 25. The apparatus of any of Aspects 1 to 24, further comprising one or more speakers configured to output the decoded audio frame within the configured processing time completion limit. [0330] Aspect 26. The apparatus of any of Aspects 1 to 25, wherein: the selected neural network-based decoder model is a selected neural speech synthesizer model; and the selected neural network-based decoder model is selected from a plurality of neural speech synthesizer neural network models. [0331] Aspect 27. The apparatus of any of Aspects 1 to 26, wherein: the selected neural network-based decoder model is a selected neural homomorphic vocoder (NHV) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of NHV-based neural network models for neural speech synthesis. [0332] Aspect 28.
The apparatus of any of Aspects 1 to 27, wherein: the selected neural network-based decoder model is a selected neural network model corresponding to a neural filter estimator of a neural homomorphic vocoder (NHV)-based neural speech synthesizer. [0333] Aspect 29. The apparatus of any of Aspects 1 to 28, wherein: the selected neural network-based decoder model is a selected linear predictive coding (LPC) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of LPC-based neural network models for neural speech synthesis. Qualcomm Docket No.2400178WO [0334] Aspect 30. The apparatus of any of Aspects 1 to 29, wherein: the selected neural network-based decoder model is a selected neural network model corresponding to a frame rate network of a linear predictive coding (LPC)-based neural speech synthesizer. [0335] Aspect 31. The apparatus of any of Aspects 1 to 30,wherein the selected neural network-based decoder model is a selected neural network model corresponding to a sample rate network of a linear predictive coding (LPC)-based neural speech synthesizer. [0336] Aspect 32. 
A method for processing one or more audio frames, the method comprising: obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of a decoder; determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models each having a respective complexity; processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and outputting the decoded audio frame within the configured processing time completion limit. [0337] Aspect 33. The method of Aspect 32, wherein processing the audio frame using the selected neural network-based decoder model comprises: switching from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, wherein the first neural network-based decoder model is included in the plurality of neural network-based decoder models. [0338] Aspect 34. The method of Aspect 33, wherein: the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model. [0339] Aspect 35. The apparatus of any of Aspects 33 to 34, wherein: the corresponding complexity of the selected neural network-based decoder model is lesser than a first complexity associated with the first neural network-based decoder model. Qualcomm Docket No.2400178WO [0340] Aspect 36. 
The method of any of Aspects 32 to 35, wherein: the corresponding complexity is an increased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the decoder, or a processing time of the previously decoded frame being less than the configured processing time completion limit. [0341] Aspect 37. The method of any of Aspects 32 to 36, wherein: the corresponding complexity is a decreased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of an increase in the resource utilization of the decoder, or a processing time of the previously decoded frame exceeding the configured processing time completion limit. [0342] Aspect 38. The method of any of Aspects 32 to 37, wherein obtaining the monitoring information comprises: obtaining device load information indicative of the resource utilization of the decoder, wherein the device load information is obtained from a resource monitor of the decoder. [0343] Aspect 39. The method of any of Aspects 32 to 38, wherein obtaining the monitoring information comprises: obtaining one or more decoding time measurements corresponding to the previously decoded frame. [0344] Aspect 40. The method of any of Aspects 32 to 39, wherein obtaining the monitoring information comprises: obtaining device load information indicative of the resource utilization of the decoder, wherein the device load information is obtained from a resource monitor of the decoder; and obtaining one or more decoding time measurements corresponding to the previously decoded frame.. [0345] Aspect 41. The method of any of Aspects 32 to 40, wherein the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame. [0346] Aspect 42. 
The apparatus of Aspect 41, wherein the one or more decoding time measurements include one or more of a decoder start time associated with decoding the previously decoded frame, a decoder end time associated with decoding the previously decoded frame, or a decoder elapsed time associated with decoding the previously decoded frame. Qualcomm Docket No.2400178WO [0347] Aspect 43. The method of any of Aspects 32 to 42, wherein the monitoring information is indicative of a respective one or more decoding time measurements corresponding to each previously decoded frame of multiple previously decoded frames included in the sequence. [0348] Aspect 44. The method of any of Aspects 32 to 43, wherein the previously decoded frame is decoded using a neural network-based decoder model of the plurality of neural network-based decoder models different from the selected neural network-based decoder model. [0349] Aspect 45. The method of any of Aspects 32 to 44, wherein the monitoring information is indicative of a current utilization percentage associated with the one or more processors. [0350] Aspect 46. The method of any of Aspects 32 to 45, wherein the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the decoder. [0351] Aspect 47. The method of any of Aspects 32 to 46, wherein the monitoring information is indicative of a memory usage associated with the decoder. [0352] Aspect 48. The method of any of Aspects 32 to 47, wherein the resource utilization of the decoder is an instantaneous resource utilization. [0353] Aspect 48. The method of any of Aspects 32 to 47, wherein the resource utilization of the decoder is an average resource utilization over a configured length of time. [0354] Aspect 50. 
The method of any of Aspects 32 to 49, wherein processing the audio frame using the selected neural network-based decoder model comprises: determining an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model; determining a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time; and skipping processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold. Qualcomm Docket No.2400178WO [0355] Aspect 51. The method of any of Aspects 32 to 50, wherein the configured processing time completion limit is equal to a periodicity between consecutive frames included in the sequence of the one or more audio frames. [0356] Aspect 52. The method of any of Aspects 32 to 51, wherein the configured processing time completion limit is equal to a duration of each audio frame included in the sequence of the one or more audio frames, and wherein the duration of each audio frame is 20 milliseconds. [0357] Aspect 53. The method of any of Aspects 32 to 52, wherein the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames. [0358] Aspect 54. The method of any of Aspects 32 to 53, wherein: the selected neural network-based decoder model is a selected neural speech synthesizer model; and the selected neural network-based decoder model is selected from a plurality of neural speech synthesizer neural network models. [0359] Aspect 55. 
The method of any of Aspects 32 to 54, wherein: the selected neural network-based decoder model is a selected neural homomorphic vocoder (NHV) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of NHV-based neural network models for neural speech synthesis. [0360] Aspect 56. The method of any of Aspects 32 to 55, wherein: the selected neural network-based decoder model is a selected neural network model corresponding to a neural filter estimator of a neural homomorphic vocoder (NHV)-based neural speech synthesizer. [0361] Aspect 57. The method of any of Aspects 32 to 56, wherein: the selected neural network-based decoder model is a selected linear predictive coding (LPC) model for neural speech synthesis; and the selected neural network-based decoder model is selected from a plurality of LPC-based neural network models for neural speech synthesis. [0362] Aspect 58. The method of any of Aspects 32 to 57, wherein: the selected neural network-based decoder model is a selected neural network model corresponding to a frame rate network of a linear predictive coding (LPC)-based neural speech synthesizer. Qualcomm Docket No.2400178WO [0363] Aspect 59. The method of any of Aspects 32 to 58,wherein the selected neural network-based decoder model is a selected neural network model corresponding to a sample rate network of a linear predictive coding (LPC)-based neural speech synthesizer. [0364] Aspect 60. 
A non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit. [0365] Aspect 61. A method for processing one or more audio samples, comprising performing operations according to any of Aspects 1 to 31. [0366] Aspect 62. A non-transitory computer-readable storage medium comprising instructions stored thereon which, when executed by at least one processor, causes the at least one processor to perform operations according to any of Aspects 1 to 59. [0367] Aspect 63. An apparatus for processing one or more audio samples, the apparatus comprising one or more means for performing operations according to any of Aspects 1 to 60.
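The per-frame selection recited in Aspects 32, 36 and 37 can be illustrated with a minimal Python sketch. It is not taken from the application: the names `DecoderModel` and `ComplexitySelector`, the load thresholds, the headroom factor and the averaging window are all hypothetical. The aspects allow stepping complexity up on either a load decrease or a fast previous frame; this sketch conservatively requires both, and it averages load over a short window in the manner of Aspect 18.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderModel:
    name: str
    complexity: int  # hypothetical relative cost; higher means more compute per frame


class ComplexitySelector:
    """Choose one of several decoder models per frame from load and timing feedback."""

    def __init__(self, models, frame_limit_ms=20.0, high_load=0.85,
                 low_load=0.50, headroom=0.5, window=5):
        # All thresholds here are illustrative assumptions, not values from the source.
        self.models = sorted(models, key=lambda m: m.complexity)
        self.frame_limit_ms = frame_limit_ms  # configured processing time completion limit
        self.high_load = high_load
        self.low_load = low_load
        self.headroom = headroom
        self.loads = deque(maxlen=window)     # recent utilization samples in [0, 1]
        self.index = len(self.models) - 1     # start with the most complex model

    def select(self, load, prev_decode_ms):
        """Return the model for the next frame given current load and last decode time."""
        self.loads.append(load)
        avg_load = sum(self.loads) / len(self.loads)
        if prev_decode_ms > self.frame_limit_ms or avg_load > self.high_load:
            # Previous frame overran the limit or the device is busy: step down.
            self.index = max(self.index - 1, 0)
        elif (prev_decode_ms < self.headroom * self.frame_limit_ms
              and avg_load < self.low_load):
            # Ample time margin and low load: step up.
            self.index = min(self.index + 1, len(self.models) - 1)
        return self.models[self.index]
```

Stepping one level at a time, rather than jumping directly to the cheapest or most expensive model, is a design choice of this sketch intended to limit audible quality changes between consecutive frames.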

Claims

What is claimed is:
1. An apparatus configured to process one or more audio frames, the apparatus comprising: one or more memories configured to store the one or more audio frames; and one or more processors coupled to the one or more memories, the one or more processors being configured to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.
2. The apparatus of claim 1, wherein, to process the audio frame using the selected neural network-based decoder model, the one or more processors are configured to: switch from a first neural network-based decoder model used to process the previously decoded frame to the selected neural network-based decoder model, wherein the first neural network-based decoder model is included in the plurality of neural network-based decoder models.
3. The apparatus of claim 2, wherein: the corresponding complexity of the selected neural network-based decoder model is greater than a first complexity associated with the first neural network-based decoder model.
4. The apparatus of claim 2, wherein: the corresponding complexity of the selected neural network-based decoder model is lesser than a first complexity associated with the first neural network-based decoder model.
5. The apparatus of claim 1, wherein: the corresponding complexity is an increased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of a decrease in the resource utilization of the apparatus, or a processing time of the previously decoded frame being less than the configured processing time completion limit.
6. The apparatus of claim 1, wherein: the corresponding complexity is a decreased decoder complexity; and the selected neural network-based decoder model is selected based on one or more of an indication of an increase in the resource utilization of the apparatus, or a processing time of the previously decoded frame exceeding the configured processing time completion limit.
7. The apparatus of claim 1, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain device load information indicative of the resource utilization of the apparatus, wherein the device load information is obtained from a resource monitor of the apparatus.
8. The apparatus of claim 1, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain one or more decoding time measurements corresponding to the previously decoded frame.
9. The apparatus of claim 1, wherein, to obtain the monitoring information, the one or more processors are configured to: obtain device load information indicative of the resource utilization of the apparatus, wherein the device load information is obtained from a resource monitor of the apparatus; and obtain one or more decoding time measurements corresponding to the previously decoded frame, wherein the one or more decoding time measurements are obtained from a decoder system of the apparatus.
10. The apparatus of claim 1, wherein: the monitoring information is indicative of one or more decoding time measurements corresponding to the previously decoded frame; and the one or more decoding time measurements include one or more of a decoder start time associated with decoding the previously decoded frame, a decoder end time associated with decoding the previously decoded frame, or a decoder elapsed time associated with decoding the previously decoded frame.
11. The apparatus of claim 1, wherein the previously decoded frame is decoded using a neural network-based decoder model of the plurality of neural network-based decoder models different from the selected neural network-based decoder model.
12. The apparatus of claim 1, wherein the monitoring information includes device load information indicative of a current utilization percentage associated with a central processing unit (CPU) or a digital signal processor (DSP) included in the apparatus.
13. The apparatus of claim 1, wherein, to process the audio frame using the selected neural network-based decoder model, the one or more processors are configured to: determine an elapsed execution time corresponding to processing of the audio frame by a particular intermediate layer of one or more intermediate layers included between input and output layers of the selected neural network-based decoder model; determine a remaining time corresponding to a difference between the configured processing time completion limit and the elapsed execution time; and skip processing of the audio frame by a remaining subset of the one or more intermediate layers, based on the remaining time being less than a threshold.
14. The apparatus of claim 1, wherein the configured processing time completion limit is equal to a periodicity between consecutive frames included in the sequence of the one or more audio frames.
15. The apparatus of claim 1, wherein the configured processing time completion limit is equal to a duration of each audio frame included in the sequence of the one or more audio frames, and wherein the duration of each audio frame is 20 milliseconds.
16. The apparatus of claim 1, wherein the configured processing time completion limit comprises a real-time processing limit to decode each audio frame included in the sequence of the one or more audio frames.
17. The apparatus of claim 1, wherein the apparatus comprises an audio decoder or a voice decoder of a signal synthesis system.
18. The apparatus of claim 1, further comprising a microphone configured to obtain the one or more audio frames.
19. A method for processing one or more audio frames, the method comprising: obtaining an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtaining monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of a decoder; determining, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models each having a respective complexity; processing the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and outputting the decoded audio frame within the configured processing time completion limit.
20. A non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain an audio frame included in a sequence of the one or more audio frames, wherein the sequence is associated with a configured processing time completion limit; obtain monitoring information associated with one or more of a previously decoded frame included in the sequence or a resource utilization of the apparatus; determine, based on the monitoring information and the configured processing time completion limit, a selected neural network-based decoder model having a corresponding complexity to decode the audio frame within the configured processing time completion limit, wherein the selected neural network-based decoder model is selected from a plurality of neural network-based decoder models associated with the apparatus and each having a respective complexity; process the audio frame using the selected neural network-based decoder model to thereby generate a decoded audio frame; and output the decoded audio frame within the configured processing time completion limit.
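The layer-skipping step of claim 13 can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the function name `decode_with_deadline`, the 2 ms threshold, and the injectable `clock` parameter are hypothetical, and the sketch assumes every intermediate layer preserves the shape of its input so that the output layer can accept the result of any prefix of the intermediate layers.

```python
import time


def decode_with_deadline(features, input_layer, intermediate_layers, output_layer,
                         limit_s=0.020, threshold_s=0.002, clock=time.perf_counter):
    """Run a decoder, skipping trailing intermediate layers when time runs short.

    Returns (decoded_frame, number_of_skipped_layers).
    """
    start = clock()
    x = input_layer(features)
    skipped = 0
    for i, layer in enumerate(intermediate_layers):
        x = layer(x)
        elapsed = clock() - start      # elapsed execution time after this layer
        remaining = limit_s - elapsed  # time left before the completion limit
        if remaining < threshold_s:
            # Too little time left: skip the remaining intermediate layers.
            skipped = len(intermediate_layers) - (i + 1)
            break
    return output_layer(x), skipped
```

Passing the clock as a parameter keeps the timing logic testable with a deterministic time source; in a deployment the default monotonic clock would be used.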
PCT/US2025/028254 2024-05-14 2025-05-07 Signal synthesis using dynamic decoder complexity scaling based on device load Pending WO2025240193A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GR20240100343 2024-05-14

Publications (1)

Publication Number Publication Date
WO2025240193A1 (en) 2025-11-20

Family

ID=96094551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/028254 Pending WO2025240193A1 (en) 2024-05-14 2025-05-07 Signal synthesis using dynamic decoder complexity scaling based on device load

Country Status (2)

Country Link
TW (1) TW202544795A (en)
WO (1) WO2025240193A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9613624B1 (en) * 2014-06-25 2017-04-04 Amazon Technologies, Inc. Dynamic pruning in speech recognition
US20220182495A1 (en) * 2020-12-08 2022-06-09 T-Mobile Usa, Inc. Machine learning-based audio codec switching
US11526734B2 (en) 2019-09-25 2022-12-13 Qualcomm Incorporated Method and apparatus for recurrent auto-encoding

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Enhancing into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders", ARXIV:2102.06610V1, 12 February 2021 (2021-02-12)
HAN YIZENG ET AL: "Dynamic Neural Networks: A Survey", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE COMPUTER SOCIETY, USA, vol. 44, no. 11, 5 October 2021 (2021-10-05), pages 7436 - 7456, XP011922057, ISSN: 0162-8828, [retrieved on 20211006], DOI: 10.1109/TPAMI.2021.3117837 *
TORTOSA RUBEN ET AL: "Optimal codec selection algorithm for audio streaming", 2014 IEEE GLOBECOM WORKSHOPS (GC WKSHPS), IEEE, 8 December 2014 (2014-12-08), pages 237 - 242, XP032747896, [retrieved on 20150318], DOI: 10.1109/GLOCOMW.2014.7063437 *
YI LUO ET AL: "Gull: A Generative Multifunctional Audio Codec", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 April 2024 (2024-04-07), XP091721877 *
ZHANG: "Autoencoder and its various variants", 2018 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS

Also Published As

Publication number Publication date
TW202544795A (en) 2025-11-16

Similar Documents

Publication Publication Date Title
EP4029016B1 (en) Artificial intelligence based audio coding
EP4416726B1 (en) Audio coding using machine learning based linear filters and non-linear neural sources
RU2636685C2 (en) Decision on presence/absence of vocalization for speech processing
US10141001B2 (en) Systems, methods, apparatus, and computer-readable media for adaptive formant sharpening in linear prediction coding
EP4416722B1 (en) Audio coding using combination of machine learning based time-varying filter and linear predictive coding filter
WO2025240193A1 (en) Signal synthesis using dynamic decoder complexity scaling based on device load
WO2025240194A1 (en) Signal synthesis using temporal upsampling of input features
WO2025240227A1 (en) Generative audio codec for signal synthesis based on joint coding of spectral envelope features and pitch information
WO2025240195A1 (en) Generative audio codec for signal synthesis based on spectral envelope features and pitch information
US9236058B2 (en) Systems and methods for quantizing and dequantizing phase information
WO2025240231A1 (en) Generative audio codec for signal synthesis based on groupwise joint coding of spectral envelope features and pitch information
WO2025240232A1 (en) Decoder silence generation without coded silence description
WO2025240230A1 (en) Systems and methods for pitch estimation
TW202605806A (en) Generative audio codec for signal synthesis based on spectral envelope features and pitch information
WO2025240228A1 (en) Systems and methods for delay reduction using recurrent layers
WO2025240233A1 (en) Systems and methods for differentiable pulse generation
WO2025240197A1 (en) Systems and methods for differentiable pulse generation
WO2025240229A1 (en) Systems and methods for spectral feature learning for audio
TW202601634A (en) Improved pitch content in neural speech synthesis
WO2025240222A1 (en) Audio decoding with added noise

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 25732573; Country of ref document: EP; Kind code of ref document: A1)