By Han Han and Vincent Lostanlen
New York University
ABSTRACT
Disentangling and recovering physical attributes, such as
shape and material, from a few waveform examples is
a challenging inverse problem in audio signal processing, with numerous applications in musical acoustics as
well as structural engineering. We propose to address
this problem via a combination of time–frequency analysis and supervised machine learning. We start by synthesizing a dataset of sounds using the functional transformation method. Then, we represent each percussive sound in
terms of its time-invariant scattering transform coefficients
and formulate the parametric estimation of the resonator
as multidimensional regression with a deep convolutional
neural network. We interpolate scattering coefficients over
the surface of the drum as a surrogate for potentially missing data, and study the response of the neural network to
interpolated samples. Lastly, we resynthesize drum sounds
from scattering coefficients, therefore paving the way towards a deep generative model of drum sounds whose latent variables are physically interpretable.
INTRODUCTION
Throughout musical traditions, drums come in all shapes and sizes. Such diversity in manufacturing results in a wide range of perceptual attributes: bright, warm, mellow, and so forth. Yet, current approaches to drum music transcription, which are based on one-versus-all classification, fail to capture the multiple factors of variability underlying the timbre perception of percussive sounds. Instead, they regard each item in the drum kit as a separate category and rarely account for the effect of playing technique. Therefore, in the context of music information retrieval (MIR), the goal of broadening and refining the vocabulary of percussive sound recognition systems requires moving away from discrete taxonomies.
In a different context, prior literature on musical acoustics has managed to simulate the response of a drum from the knowledge of its shape and material. Among studies on the physical modeling of musical instruments, the functional transformation method (FTM) and the finite difference method (FDM) play a central role. They rely on partial differential equations (PDEs) to describe the structural and material constraints imposed by the resonator. The coefficients governing these equations may be varied continuously. Thus, PDE-based models for drum sound synthesis offer a fine level of expressive control while guaranteeing physical plausibility and interpretability.
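As an illustration, a damped stiff membrane is commonly modeled by a fourth-order PDE of the following generic form (the notation below is a textbook sketch, not a verbatim transcription of our model):

    \frac{\partial^2 y}{\partial t^2} + d_1 \frac{\partial y}{\partial t} - d_3 \frac{\partial}{\partial t} \nabla^2 y + S^4 \nabla^4 y - c^2 \nabla^2 y = f(x, t),

where y(x, t) is the transverse displacement of the membrane, c controls the pitch, S the stiffness (hence the inharmonicity), d_1 the frequency-independent damping, d_3 the frequency-dependent damping, and f the excitation force. The FTM solves such a PDE in closed form by applying a Laplace transform in time and a Sturm–Liouville transform in space.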
Figure 1: Drums of various shapes and materials. Left to
right: mbejn, 19th century, Fang people of Gabon; ceramic
drum, 1st century, Nasca people of Peru; darabukka, 19th
century, Syria; tympanum of a Pejeng-type drum, Bronze
age, Indonesia (Sumba); pakhavaj, 19th century, North India; ipu hula, 19th century, Hawai’i; frame drum, 19th
century, Native American people of Dakota; Union army
drum, ca. 1864, Pennsylvania. All images are in the public
domain and can be accessed at: www.metmuseum.org
From a musical standpoint, a major appeal behind physical models lies in auditory perception: all other things being equal, larger drums tend to sound lower, stiffer drums
tend to sound brighter, and so forth. Yet, a major drawback
of PDE-based modeling for drum sound synthesis is that
all shape and material parameters must be known ahead of
time. If, on the contrary, these parameters are unknown,
adjusting the synthesizer to match a predefined audio sample entails a tedious and unscalable process of multidimensional trial and error. This is unlike other methods for audio synthesis, such as digital waveguide synthesis or modal synthesis.
Here, we strive to resolve the tradeoff between control and flexibility in drum sound synthesis. To this end, we formulate the identification of percussive sounds as an inverse problem, thus combining insights from physical modeling and statistical machine learning. Our main contribution is wav2shape, i.e., a machine listening system which takes a drum stroke recording as input and retrieves the shape parameters which produced it.
The methodological novelty of wav2shape lies in its hybrid architecture, combining feature engineering and feature learning: indeed, it composes a 1-D scattering transform and a deep convolutional network to learn the task of shape regression in a supervised way. The advantage of choosing scattering coefficients over conventional audio descriptors, such as MFCCs and the CQT, for characterizing nonstationary sounds has been discussed in previous works.
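As a minimal sketch of this feature extraction step, one may compute time-invariant scattering coefficients with the Kymatio library as follows (the values of T, J, and Q below are illustrative assumptions, not the settings of our experiments):

    import numpy as np
    from kymatio.numpy import Scattering1D

    T = 2**15                                     # length of the recording, in samples (assumed)
    scattering = Scattering1D(J=8, shape=T, Q=1)  # J octaves, Q wavelets per octave
    x = np.random.randn(T).astype(np.float32)     # placeholder for a real drum stroke waveform
    Sx = scattering(x)                            # array of shape (scattering paths, time frames)
    log_Sx = np.log1p(Sx)                         # log-compression before the convnet (assumed variant)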
The subtitle of this paper is a deliberate reference to a famous mathematical paper entitled "Can One Hear the Shape of a Drum?", which asks whether any two isospectral planar domains are necessarily isometric. Since its publication, this question has been answered affirmatively in the important particular cases of circular and rectangular domains; but negatively in the general case, with the construction of nonconvex counterexamples.
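For reference, the question concerns the spectrum of the Dirichlet Laplacian on a planar domain \Omega:

    -\Delta u_n = \lambda_n u_n \text{ on } \Omega, \qquad u_n = 0 \text{ on } \partial\Omega.

Two domains are isospectral if they share the same eigenvalue sequence (\lambda_n); "hearing the shape" asks whether isospectrality implies isometry.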
wav2shape focuses on representing rectangular and circular membranes, which are by far the most common in music. In return, while the original question is restricted to the recovery of the domain under forced oscillations, wav2shape also expresses the effects of stiffness and damping, both frequency-dependent and frequency-independent. These effects are crucial for modeling the
response of the drum membrane to a localized impulse,
e.g. induced by the player’s hand, a stick, or a mallet.
Our main finding is that, after training, wav2shape is
able to generalize to previously unseen shapes. In an additional experiment, we interpolate the value of scattering coefficients over the 2-D surface of the drum and verify that the convnet in wav2shape generalizes to interpolated drum stroke locations. Lastly, we invert the scattering
transform operator, thus laying the foundations for turning
wav2shape into a deep generative model without explicit
knowledge of the partial differential equation (PDE) underlying the vibration of the membrane.
DEEP CONVOLUTIONAL NETWORK: WAV2SHAPE
In order to learn a nonlinear mapping between the waveform and the set of physical parameters, we train a convolutional neural network, dubbed wav2shape ("wave to shape"). Comprising four 1-D convolutional layers and two fully connected dense layers, wav2shape is configured as follows (a minimal code sketch is given after the list):
• layer 1: The input feature matrix passes through a
batch normalization layer, then 16 convolutional filters with a receptive field of 8 temporal samples.
The convolution is followed by a rectified linear unit
(ReLU) and average pooling over 4 temporal samples.
• layers 2, 3, and 4: same as layer 1, except that the batch normalization happens after the convolution. The average pooling filter in layer 4 has a receptive field of 2 temporal samples, due to constraints in the time dimension. Layer 4 is then followed by a "flattening" operation.
• layer 5: 64 hidden units, followed by a ReLU activation function.
• layer 6: 5 hidden units, followed by a linear activation function.
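The following PyTorch sketch mirrors the configuration above. The number of input channels (scattering paths), the number of filters in layers 2 to 4, and the "same" padding are assumptions that the list does not specify:

    import torch
    import torch.nn as nn

    class Wav2Shape(nn.Module):
        """Minimal sketch of the wav2shape convnet (assumptions flagged inline)."""

        def __init__(self, in_channels=50):  # in_channels = number of scattering paths (assumed)
            super().__init__()
            # Layer 1: batch normalization before the convolution.
            self.layer1 = nn.Sequential(
                nn.BatchNorm1d(in_channels),
                nn.Conv1d(in_channels, 16, kernel_size=8, padding="same"),
                nn.ReLU(),
                nn.AvgPool1d(4),
            )
            # Layers 2-4: batch normalization after the convolution.
            def block(pool):
                return nn.Sequential(
                    nn.Conv1d(16, 16, kernel_size=8, padding="same"),  # 16 filters (assumed)
                    nn.BatchNorm1d(16),
                    nn.ReLU(),
                    nn.AvgPool1d(pool),
                )
            self.layer2 = block(4)
            self.layer3 = block(4)
            self.layer4 = block(2)  # smaller pooling: few time frames remain
            self.flatten = nn.Flatten()
            self.dense = nn.LazyLinear(64)  # layer 5: 64 hidden units, ReLU applied in forward
            self.head = nn.Linear(64, 5)    # layer 6: 5 outputs, linear activation

        def forward(self, x):
            # x has shape (batch, scattering paths, time frames).
            x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
            return self.head(torch.relu(self.dense(self.flatten(x))))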
CONCLUSIONS
In this paper, we have introduced wav2shape: a convolutional neural network which disentangles and retrieves physical parameters from waveforms of percussive sounds.
First, we have presented a 2-D physical model of a rectangular membrane, based on a fourth-order partial differential equation (PDE) in time and space. We have solved the
PDE in closed form by means of the functional transformation method (FTM), and included a freely downloadable
VST plugin which synthesizes drum sounds in real time.
Then, we have computed second-order scattering coefficients of these sounds and designed wav2shape as a convolutional neural network (CNN) operating on the logarithm
of these coefficients. We have trained wav2shape in a supervised fashion in order to regress the parameters underlying the PDE, such as pitch, sustain, and inharmonicity.
From an experimental standpoint, we have found that
wav2shape is capable of generalizing beyond its training
set and predicting the shape of previously unseen sounds. The network's robustness in shape regression confirms that the scattering transform linearizes the dependency of the signal upon the position of the drum stroke. Indeed, when applied to linearly interpolated scattering coefficients, the wav2shape neural network continues to produce an interpretable outcome. Lastly, we have used reverse-mode automatic differentiation in the Kymatio library to synthesize
drum sounds directly from scattering coefficients, without
explicitly solving a partial differential equation.
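A minimal sketch of this inversion, assuming the differentiable torch frontend of Kymatio (signal length, J, Q, learning rate, and iteration count are illustrative):

    import torch
    from kymatio.torch import Scattering1D

    T = 2**14
    scattering = Scattering1D(J=8, shape=T, Q=1)
    x_target = torch.randn(T)                    # placeholder for a real drum sound
    S_target = scattering(x_target).detach()

    # Refine a random waveform by gradient descent so that its
    # scattering coefficients match those of the target.
    x = torch.randn(T, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=0.1)
    for step in range(500):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(scattering(x), S_target)
        loss.backward()
        optimizer.step()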
Although the results of wav2shape are promising, we
acknowledge that it suffers from some practical limitations, which hamper its usability in computer music creation. First, physical parameters such as inharmonicity
D and aspect ratio α are not recovered as accurately as
pitch ω or sustain τ . Secondly, wav2shape is only capable of retrieving the shape vector θ if the rectangular drum
is stroked exactly at its center: it would be beneficial, albeit challenging, to generalize the approach to any stroke
location u0. Thirdly, we have trained wav2shape on a relatively large training set of over 82k audio samples. The
acquisition of these samples was only made possible by
simulating the response of the membrane. The prospect of extending autonomous systems from such a simulated environment towards a real environment is a topic of ongoing research in reinforcement learning, known as sim2real. Yet, the field of deep learning for musical acoustics
predominantly relies on supervised learning techniques instead of reinforcement learning. In this context, we believe
that future research is needed to strengthen the interoperability between physical modeling and data-driven modeling of musical sounds.