
How to make a machine listen to sound like a human? Part 2

Neural networks (NNs) are very good at extracting abstract representations of data and are therefore well suited to detecting perceptual properties in sound. To build a system for this purpose, let us first look at how sound is represented in the human hearing organ, which can serve as inspiration for how to represent sound for processing with neural networks.

Cochlear representation
Human hearing begins at the outer ear, which consists first of the pinna. The pinna acts as a form of spectral preprocessing, in which the incoming sound is modified depending on its direction relative to the listener. The sound then travels through the opening in the pinna into the ear canal, which further modifies the spectral characteristics of the incoming sound by resonating in a way that amplifies frequencies in the range of ~1-6 kHz [1].
Illustration of the human auditory system
When the sound waves reach the end of the ear canal, they excite the eardrum, to which the ossicles (the smallest bones in the human body) are attached. These bones transmit pressure from the ear canal to the fluid-filled cochlea of the inner ear [1]. The cochlea is of great interest when deciding how to represent sound for neural networks, because it is the organ responsible for converting acoustic vibrations into neural activity.
It is a coiled tube that is separated along its length by two membranes, Reissner's membrane and the basilar membrane. Along the length of the cochlea there is a row of about 3,500 inner hair cells [1]. When pressure enters the cochlea, its two membranes are pushed down. The basilar membrane is narrow and stiff at its base but wide and loose at its apex, so that each place along its length responds most strongly to a particular frequency.
In simple terms, the basilar membrane can be thought of as a continuous array of band-pass filters along its length, which separates sounds into their spectral components.
Illustration of the human cochlea
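To make the place-to-frequency idea concrete, here is a minimal sketch of a position-to-frequency map for the basilar membrane. It is not taken from the text above: it assumes the Greenwood function with the constants commonly quoted for the human cochlea (165.4, 2.1 and 0.88), and the function names are made up for illustration.

```python
# Assumption: Greenwood place-frequency map with constants commonly quoted
# for the human cochlea. Position x is a fraction of the membrane length,
# with 0 at the apex (low frequencies) and 1 at the base (high frequencies).
A = 165.4      # scaling constant in Hz
ALPHA = 2.1    # exponent slope for x expressed as a fraction of length
K = 0.88       # low-frequency correction term


def place_to_frequency(x):
    """Characteristic frequency in Hz at relative position x in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return A * (10.0 ** (ALPHA * x) - K)


def hair_cell_frequencies(n_cells=3500):
    """Characteristic frequencies for n_cells evenly spaced places, apex to base."""
    return [place_to_frequency(i / (n_cells - 1)) for i in range(n_cells)]


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"x = {x:.2f}: {place_to_frequency(x):9.1f} Hz")
```

Equal steps along the membrane correspond to roughly equal frequency ratios over most of its length, which is the exponential behaviour that motivates a logarithmic frequency axis.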
This is the most fundamental mechanism by which humans convert sound pressure into neural activity. It is therefore reasonable to assume that a spectral representation of sound is advantageous when building models of sound perception with artificial intelligence. Because the frequency response along the basilar membrane varies exponentially, a logarithmic frequency representation is probably the most efficient. Such a representation can be generated using a gammatone filter bank. These filters are commonly used to model spectral filtering in the auditory system because they approximate the impulse responses of human auditory filters, which are derived from measured responses of auditory nerve fibers to white noise stimuli, known as "revcor" functions.
Simplified comparison of human spectral transduction and digital spectral transduction
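As a rough illustration of the gammatone filter bank idea, the sketch below builds a few filters in plain Python and measures how much energy each channel passes. It rests on assumptions not stated in the text: the Glasberg and Moore ERB approximation, a fourth-order filter with bandwidth factor 1.019, and slow direct convolution instead of an optimized implementation. All function names are illustrative.

```python
import math


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def erb_spaced_frequencies(f_low, f_high, n):
    """n centre frequencies spaced evenly on the ERB-rate scale."""
    def hz_to_erb_rate(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n)]


def gammatone_impulse_response(fc, fs, duration=0.05, order=4):
    """Sampled t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), scaled to unit gain at fc."""
    b = 1.019 * erb(fc)  # assumed bandwidth factor for a 4th-order filter
    ir = []
    for i in range(round(duration * fs)):
        t = i / fs
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        ir.append(envelope * math.cos(2.0 * math.pi * fc * t))
    # Evaluate the frequency response at fc and normalise by its magnitude.
    re = sum(h * math.cos(2.0 * math.pi * fc * i / fs) for i, h in enumerate(ir))
    im = sum(h * math.sin(2.0 * math.pi * fc * i / fs) for i, h in enumerate(ir))
    gain = math.hypot(re, im)
    return [h / gain for h in ir]


def filter_signal(signal, ir):
    """Causal FIR filtering by direct convolution; output has the input's length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(n + 1, len(ir))):
            acc += ir[k] * signal[n - k]
        out.append(acc)
    return out


def channel_energies(signal, fs, centre_freqs):
    """Mean squared output of each gammatone channel."""
    energies = []
    for fc in centre_freqs:
        y = filter_signal(signal, gammatone_impulse_response(fc, fs))
        energies.append(sum(v * v for v in y) / len(y))
    return energies


if __name__ == "__main__":
    fs = 8000
    centres = erb_spaced_frequencies(200.0, 3000.0, 8)
    tone_hz = centres[3]  # test tone placed exactly on one channel
    tone = [math.sin(2.0 * math.pi * tone_hz * i / fs) for i in range(480)]
    for fc, e in zip(centres, channel_energies(tone, fs, centres)):
        print(f"{fc:7.1f} Hz  {10.0 * math.log10(e + 1e-12):7.1f} dB")
```

Running the example on a pure tone shows the energy concentrated in the channel whose centre frequency matches the tone, which is the band-pass behaviour described above. Stacking such channel energies over successive time windows gives a spectrogram-like representation.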
Since the cochlea has about 3,500 inner hair cells and humans can detect gaps in sound as short as about 2 to 5 ms, a spectral decomposition using 3,500 gammatone filters over 2 ms windows would seem to be the best parameters for giving a machine a human-like spectral representation. In real-world scenarios, however, I believe that a coarser spectral decomposition still achieves desirable results in most analysis and processing tasks, while being computationally more feasible.
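A quick back-of-the-envelope calculation shows why the full resolution is expensive. The sketch below counts how many spectral values one second of audio produces, assuming non-overlapping windows. The reduced setting of 64 filters and 10 ms windows is a hypothetical example chosen only for comparison, not a recommendation from the text.

```python
def values_per_second(n_filters, window_ms):
    """Spectral values produced per second of audio with non-overlapping windows."""
    frames_per_second = 1000.0 / window_ms
    return n_filters * frames_per_second


if __name__ == "__main__":
    full = values_per_second(3500, 2.0)    # human-like resolution from the text
    reduced = values_per_second(64, 10.0)  # hypothetical coarser setting
    print(f"full:    {full:,.0f} values per second")
    print(f"reduced: {reduced:,.0f} values per second")
    print(f"ratio:   {full / reduced:.1f}x")
```

With these numbers the full resolution yields 1,750,000 values per second against 6,400 for the coarser setting, a factor of roughly 273.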
Various software libraries for auditory analysis are available online. A notable example is Jason Heeris' Gammatone Filterbank Toolkit, which provides adjustable filters as well as tools for spectral analysis of sound signals using gammatone filters.
Neural coding
As neural activity moves from the cochlea to the auditory nerve and ascending auditory pathways, several processes take place in brainstem nuclei before it reaches the auditory cortex.
These processes build a neural code that acts as an interface between stimulus and perception. Much about the specific inner workings of these nuclei is still speculative or unknown, so I will only cover how they work at a high level.



