Auditory-inspired Approaches to Audio Representation and Analysis for Machine Hearing

The study and design of machines that can analyze an auditory scene and organize sound into parts that are perceptually meaningful to humans is referred to as machine hearing. Such machines are expected to distinguish between different sound categories (e.g., speech, music, background noise), focus on a sound source of interest coming from a certain direction or surrounded by many competing sources (the famous cocktail party problem), and suppress unimportant sounds (e.g., the hum of an air conditioner or the background music in a restaurant). The tasks performed by current hearing machines are typically handled by algorithms that are developed separately and independently from one another. A few examples of such tasks are audio source separation (e.g., separating the singing voice in a song from the background music), speech recognition in noisy or multi-speaker environments, and environmental sound detection and classification (e.g., recognizing a dog barking or traffic noise).

A common feature of all these sound processing algorithms is that their performance, and the difficulty of combining them with other algorithms, is heavily affected by the audio representation they receive as input. If the input representation has fundamental limits, the algorithm may not be able to extract the information required for the task, no matter how intelligent it is. Moreover, if the information required for a particular auditory task is deeply buried in a representation, the algorithm performing this task will inevitably grow very complex and task-dependent in its feature extraction stage. Combining a set of single-task algorithms into a multi-task auditory scene analysis system would be both non-trivial and computationally inefficient if each algorithm applies its own complex, task-specific feature extraction stage to a low-level representation shared by all algorithms.

Audio source separation refers to the task of estimating n individual sound sources given an m-channel recording of a complex auditory scene. Supervised source separation methods, specifically those using deep neural networks, have become very popular over the past decade due to their success in a variety of denoising and source separation tasks, including speech enhancement, speech separation, and music separation. A major challenge faced by supervised masking-based separation approaches is that they typically require a large dataset of isolated sound sources in order to generate the target time-frequency masks used in model training. Obtaining the isolated sources that compose an audio mixture may be expensive or require complicated recording setups, and in some scenarios it may not even be possible to record sounds in isolation (e.g., a bird song in a forest).

Parsing the auditory scene into meaningful components and focusing on the most informative sound sources are tasks that biological audio-processing systems have evolved to perform efficiently. The mammalian auditory system, for instance, has been shown to extract the information required for the analysis of complex auditory scenes very effectively; an interesting example is the existence of neurons in the primary auditory cortex of mammals that respond to a variety of spectro-temporal modulation patterns. Moreover, natural audio-processing systems do not require isolated sources in order to learn to analyze auditory scenes.
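The dependence of supervised masking-based training on isolated recordings can be made concrete with a short sketch. The Python example below illustrates the general recipe rather than any specific system discussed in this work: it builds an ideal ratio mask from two synthetic stand-in "sources" and applies it to their mixture. The point is that the training target itself cannot be formed without access to the sources in isolation.

```python
import numpy as np
from scipy.signal import stft, istft

# Synthetic stand-ins for isolated sources (supervised training assumes
# real isolated recordings like these are available).
sr = 16000
t = np.arange(2 * sr) / sr
source_a = 0.5 * np.sin(2 * np.pi * 440 * t)      # e.g., a sustained tone
source_b = 0.1 * np.random.randn(len(t))          # e.g., background noise
mixture = source_a + source_b

# Time-frequency representations of the sources and of the mixture.
_, _, A = stft(source_a, fs=sr, nperseg=512)
_, _, B = stft(source_b, fs=sr, nperseg=512)
_, _, X = stft(mixture, fs=sr, nperseg=512)

# Ideal ratio mask for source A: the training target a supervised model
# learns to predict from the mixture alone. It cannot be computed without
# the isolated sources A and B.
eps = 1e-8
irm_a = np.abs(A) / (np.abs(A) + np.abs(B) + eps)

# Applying the mask to the mixture spectrogram yields an estimate of source A.
_, est_a = istft(irm_a * X, fs=sr, nperseg=512)
```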
Humans hardly ever hear sounds in perfect isolation, yet they still learn to identify different types of sounds and to focus on them when necessary. Given this ability of natural auditory systems to extract the characteristics of individual sources from everyday complex auditory scenes, one can argue that knowing which sounds are present in a mixture recording could be sufficient information for training a separation system.

In this dissertation, I propose methods for audio signal representation and for training deep models that are inspired by biological auditory systems, addressing two major challenges in the field of audio source separation: i) separating sources with a high level of energy overlap in both the time and frequency domains, and ii) training deep models for source separation when ground-truth isolated sources are not available. I develop a biologically inspired audio representation that explicitly encodes spectro-temporal modulation patterns, and hence disentangles audio sources that overlap in time and frequency in a way that is practically usable for source separation and sound object recognition. I further propose a novel approach to training an audio separation system in the absence of strongly labeled auditory scenes, in which an audio classification system guides the separation training.
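To illustrate the second idea at a high level, the sketch below shows one way a frozen, pretrained audio classifier could stand in for missing isolated sources during separator training. It is a simplification under assumed interfaces (a separator that outputs one estimate per sound class and a classifier that maps waveforms to class logits), not the exact training procedure developed in this dissertation.

```python
import torch
import torch.nn.functional as F

def classifier_guided_step(separator, classifier, mixture, class_labels):
    """One weakly supervised training step (illustrative sketch).

    mixture:      (batch, time) waveforms of complex auditory scenes
    class_labels: (batch, n_classes) multi-hot float labels indicating which
                  sound classes are present in each mixture (weak labels only)
    separator:    model mapping a mixture to (batch, n_classes, time) estimates
    classifier:   frozen, pretrained model mapping waveforms to class logits
    """
    estimates = separator(mixture)                    # (batch, n_classes, time)
    b, c, t = estimates.shape

    # 1) Each estimate should be recognized by the classifier as its own class
    #    when that class is present in the mixture, and as silence otherwise.
    logits = classifier(estimates.reshape(b * c, t))  # (b * c, n_classes)
    targets = torch.eye(c, device=mixture.device).repeat(b, 1)
    targets = targets * class_labels.reshape(b * c, 1)
    cls_loss = F.binary_cross_entropy_with_logits(logits, targets)

    # 2) The estimates should sum back to the observed mixture, the only
    #    waveform-level supervision available without isolated sources.
    recon_loss = F.mse_loss(estimates.sum(dim=1), mixture)

    return cls_loss + recon_loss
```

In this sketch the classifier supplies the semantic supervision (each estimate should sound like its target class), while mixture consistency supplies the only signal-level constraint, mirroring the idea that knowing which sounds are present can replace isolated ground-truth recordings.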
