Audio Descriptors

Audio files and the audio channel of video files can be described with temporal, spectral, cepstral, perceptual, and specific audio descriptors. The most common audio descriptors are the following:

  • Temporal audio descriptors. The energy envelope descriptor represents the root mean square of the mean energy of the audio signal, which is suitable for silence detection. The zero crossing rate descriptor represents the number of times the signal amplitude undergoes a change of sign, which is used to differentiate periodic signals from noisy signals, for example to determine whether the audio content is speech or music. The temporal waveform moments descriptor represents characteristics of the waveform shape, including temporal centroid, width, asymmetry, and flatness. The amplitude modulation descriptor describes the tremolo of a sustained sound (in the frequency range 4–8 Hz) or the graininess or roughness of a sound (in the range 10–40 Hz). The autocorrelation coefficient descriptor represents the spectral distribution of the audio signal over time, which is suitable for musical instrument recognition.
  • Spectral audio descriptors. The spectral moments descriptor corresponds to core spectral shape characteristics, such as spectral centroid, spectral width, spectral asymmetry, and spectral flatness, which are useful for determining sound brightness and music genre, and for categorizing music by mood. The spectral decrease descriptor describes the average rate of spectral decrease with frequency. The spectral roll-off descriptor represents the frequency below which a predefined percentage (usually 85–99%) of the total spectral energy is present, which is suitable for music genre classification. The spectral flux descriptor represents the dynamic variation of spectral information, computed either as the normalized correlation between consecutive amplitude spectra or as the derivative of the amplitude spectrum. The spectral irregularity descriptor describes the amplitude difference between adjacent harmonics, which is suitable for the precise characterization of the spectrum, such as for describing individual frequency components of a sound. The formant parameter descriptors represent the spectral peaks of the sound spectrum of the voice, and are suitable for phoneme and vowel identification.
  • Cepstral audio descriptors. Cepstral features are used for speech and speaker recognition and music modeling. The most common cepstral descriptors are the mel-frequency cepstral coefficient descriptors, which are based on the mel scale, an approximation of the perceived pitch of pure tones, and are calculated by applying a discrete cosine transform to the logarithm of the energy in predefined (mel-spaced) frequency bands.
  • Perceptual audio descriptors. The loudness descriptor represents the impression of sound intensity. The sharpness descriptor, which is the perceptual counterpart of the spectral centroid, is typically estimated using a weighted centroid of the specific loudness. The perceptual spread descriptor characterizes the timbral width of sounds, and is calculated as the relative difference between the largest specific loudness and the total loudness.
  • Specific audio descriptors. The odd-even harmonic energy ratio descriptor represents the energy proportion carried by odd and even harmonics. The descriptors of octave band signal intensities represent the power distribution of the different harmonics of music. The attack duration descriptor represents how quickly a sound reaches full volume after it is activated, and is used for sound identification. The harmonic-noise ratio descriptor represents the ratio between the energy of the harmonic component and the noise component, and enables the estimation of the amount of noise in the sound. The fundamental frequency descriptor, also known as the pitch descriptor, represents the inverse of the period of a periodic sound.
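
The two simplest temporal descriptors in the list, the energy envelope and the zero crossing rate, can be sketched in a few lines. The following is a minimal illustration using only the Python standard library. The function names, the non-overlapping framing, the 8 kHz sample rate, and the synthetic test signals are assumptions made for the example, not part of any descriptor standard.

```python
import math
import random


def frame_rms(frame):
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def energy_envelope(signal, frame_len):
    """RMS value of each non-overlapping frame of the signal."""
    return [
        frame_rms(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Hypothetical test signals: one second each at an assumed 8 kHz sample rate.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
rng = random.Random(0)  # fixed seed so the noise is reproducible
noise = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]
silence = [0.0] * SAMPLE_RATE

# Silence detection: frames whose RMS falls below an (assumed) threshold.
envelope = energy_envelope(tone + silence, frame_len=400)
silent_frames = [i for i, e in enumerate(envelope) if e < 0.01]

print("silent frames:", silent_frames)
print("ZCR tone: ", zero_crossing_rate(tone))
print("ZCR noise:", zero_crossing_rate(noise))
```

The periodic tone should yield a much lower zero crossing rate than the noise, which is the property that makes this descriptor useful for separating periodic from noisy content. The silence threshold of 0.01 is arbitrary and would need tuning for real recordings.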
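
Several of the spectral descriptors above reduce to short formulas over the amplitude spectrum of a frame. The sketch below, again limited to the Python standard library, computes the spectral centroid, the spectral roll-off, and one variant of spectral flux. It assumes a naive O(N²) discrete Fourier transform for self-containment (a real implementation would normally use an FFT library), an 85% roll-off fraction, and a flux defined as one minus the normalized correlation of consecutive spectra. All function names and test frequencies are illustrative.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(spectrum, sample_rate, frame_len):
    """Amplitude-weighted mean frequency of the spectrum, in Hz."""
    total = sum(spectrum)
    if total == 0:
        return 0.0
    bin_hz = sample_rate / frame_len
    return sum(k * bin_hz * a for k, a in enumerate(spectrum)) / total


def spectral_rolloff(spectrum, sample_rate, frame_len, fraction=0.85):
    """Lowest bin frequency (Hz) below which `fraction` of the energy lies."""
    energy = [a * a for a in spectrum]
    threshold = fraction * sum(energy)
    cumulative = 0.0
    for k, e in enumerate(energy):
        cumulative += e
        if cumulative >= threshold:
            return k * sample_rate / frame_len
    return (len(spectrum) - 1) * sample_rate / frame_len


def spectral_flux(previous, current):
    """One minus the normalized correlation of two amplitude spectra."""
    dot = sum(p * c for p, c in zip(previous, current))
    norm = math.sqrt(sum(p * p for p in previous) * sum(c * c for c in current))
    return 0.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical frames: pure tones that fall exactly on DFT bins.
SAMPLE_RATE = 8000
FRAME_LEN = 256


def tone_frame(freq_hz):
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(FRAME_LEN)]


low = amplitude_spectrum(tone_frame(500.0))
high = amplitude_spectrum(tone_frame(2000.0))

print("centroid (low tone): ", spectral_centroid(low, SAMPLE_RATE, FRAME_LEN))
print("roll-off (high tone):", spectral_rolloff(high, SAMPLE_RATE, FRAME_LEN))
print("flux low -> low:     ", spectral_flux(low, low))
print("flux low -> high:    ", spectral_flux(low, high))
```

For these single-tone frames the centroid and roll-off should both sit at the tone frequency, while the flux should be near zero for identical spectra and near one when the spectral content changes completely. No windowing is applied here because the test tones are bin-aligned; real signals would typically be windowed first.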
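
Because the fundamental frequency is the inverse of the period, it can be estimated by finding the period of a frame. One common approach, which the text above does not prescribe, is to pick the lag that maximizes the autocorrelation of the frame. The following standard-library sketch assumes a 50–500 Hz search range, an 8 kHz sample rate, and a clean synthetic tone; the function names are hypothetical.

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of the frame at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) as sample_rate / best lag.

    The search range [f_min, f_max] is an assumption of this sketch.
    """
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: autocorrelation(frame, lag),
    )
    return sample_rate / best_lag


# Hypothetical test frame: 100 ms of a 200 Hz tone at an assumed 8 kHz rate.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

f0 = estimate_f0(frame, SAMPLE_RATE)
print("estimated fundamental frequency (Hz):", f0)
```

The resolution of this estimate is limited to integer lags, so frequencies whose period is not a whole number of samples are only approximated. Practical pitch trackers usually add interpolation and safeguards against octave errors.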