By Han Han and Vincent Lostanlen
New York University
ABSTRACT
Disentangling and recovering physical attributes, such as
shape and material, from a few waveform examples is
a challenging inverse problem in audio signal processing, with numerous applications in musical acoustics as
well as structural engineering. We propose to address
this problem via a combination of time–frequency analysis and supervised machine learning. We start by synthesizing a dataset of sounds using the functional transformation method. Then, we represent each percussive sound in
terms of its time-invariant scattering transform coefficients
and formulate the parametric estimation of the resonator
as multidimensional regression with a deep convolutional
neural network. We interpolate scattering coefficients over
the surface of the drum as a surrogate for potentially missing data, and study the response of the neural network to
interpolated samples. Lastly, we resynthesize drum sounds
from scattering coefficients, therefore paving the way towards a deep generative model of drum sounds whose latent variables are physically interpretable.
INTRODUCTION
Throughout musical traditions, drums come in all shapes and sizes. Such diversity in manufacturing results in a wide range of perceptual attributes: bright, warm, mellow, and so forth. Yet, current approaches to drum music transcription, which are based on one-versus-all classification, fail to capture the multiple factors of variability underlying the timbre perception of percussive sounds. Instead, they regard each item in the drum kit as a separate category and rarely account for the effect of playing technique. Therefore, in the context of music information retrieval (MIR), the goal of broadening and refining the vocabulary of percussive sound recognition systems requires moving away from discrete taxonomies.
In a different context, prior literature on musical acoustics has managed to simulate the response of a drum from the knowledge of its shape and material. Among studies on the physical modeling of musical instruments, the functional transformation method (FTM) and the finite difference method (FDM) play a central role. They rely on partial differential equations (PDEs) to describe the structural and material constraints imposed by the resonator. The coefficients governing these equations may be varied continuously. Thus, PDE-based models for drum sound synthesis offer a fine level of expressive control while guaranteeing physical plausibility and interpretability.
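As an illustration, a damped stiff membrane is commonly modeled by a fourth-order PDE of the following generic form (the notation below is a textbook sketch, not a verbatim transcription of our model):

    \frac{\partial^2 y}{\partial t^2} + d_1 \frac{\partial y}{\partial t} - d_3 \frac{\partial}{\partial t} \nabla^2 y + S^4 \nabla^4 y - c^2 \nabla^2 y = f(x, t),

where y(x, t) is the transverse displacement of the membrane, c controls the pitch, S the stiffness (hence the inharmonicity), d_1 the frequency-independent damping, d_3 the frequency-dependent damping, and f the excitation force. The FTM solves such a PDE in closed form by applying a Laplace transform in time and a Sturm–Liouville transform in space.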
Figure 1: Drums of various shapes and materials. Left to
right: mbejn, 19th century, Fang people of Gabon; ceramic
drum, 1st century, Nasca people of Peru; darabukka, 19th
century, Syria; tympanum of a Pejeng-type drum, Bronze
age, Indonesia (Sumba); pakhavaj, 19th century, North India; ipu hula, 19th century, Hawai’i; frame drum, 19th
century, Native American people of Dakota; Union army
drum, ca. 1864, Pennsylvania. All images are in the public
domain and can be accessed at: www.metmuseum.org
From a musical standpoint, a major appeal behind physical models lies in auditory perception: all other things being equal, larger drums tend to sound lower, stiffer drums
tend to sound brighter, and so forth. Yet, a major drawback
of PDE-based modeling for drum sound synthesis is that
all shape and material parameters must be known ahead of
time. If, on the contrary, these parameters are unknown,
adjusting the synthesizer to match a predefined audio sample entails a tedious and unscalable process of multidimensional trial and error. This is unlike other methods for audio synthesis, such as digital waveguide synthesis or modal synthesis.
Here, we strive to resolve the tradeoff between control and flexibility in drum sound synthesis. To this end, we formulate the identification of percussive sounds as an inverse problem, thus combining insights from physical modeling and statistical machine learning. Our main contribution is wav2shape, i.e., a machine listening system which takes a drum stroke recording as input and retrieves the shape parameters which produced it.
The methodological novelty of wav2shape lies in its hybrid architecture, combining feature engineering and feature learning: indeed, it composes a 1-D scattering transform and a deep convolutional network to learn the task of shape regression in a supervised way. The advantage of choosing scattering coefficients over conventional audio descriptors, such as MFCCs and the CQT, for characterizing nonstationary sounds has been discussed in previous works.
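As a minimal sketch of this feature extraction step, one may compute time-invariant scattering coefficients with the Kymatio library as follows (the values of T, J, and Q below are illustrative assumptions, not the settings of our experiments):

    import numpy as np
    from kymatio.numpy import Scattering1D

    T = 2**15                                     # length of the recording, in samples (assumed)
    scattering = Scattering1D(J=8, shape=T, Q=1)  # J octaves, Q wavelets per octave
    x = np.random.randn(T).astype(np.float32)     # placeholder for a real drum stroke waveform
    Sx = scattering(x)                            # array of shape (scattering paths, time frames)
    log_Sx = np.log1p(Sx)                         # log-compression before the convnet (assumed variant)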
The subtitle of this paper is a deliberate reference to a famous mathematical paper entitled "Can One Hear the Shape of a Drum?", which asks whether any two isospectral planar domains are necessarily isometric. Since its publication, this question has been answered affirmatively in the important particular cases of circular and rectangular domains; but negatively in the general case, with the construction of nonconvex counterexamples.
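For reference, the question concerns the spectrum of the Dirichlet Laplacian on a planar domain \Omega:

    -\Delta u_n = \lambda_n u_n \text{ on } \Omega, \qquad u_n = 0 \text{ on } \partial\Omega.

Two domains are isospectral if they share the same eigenvalue sequence (\lambda_n); "hearing the shape" asks whether isospectrality implies isometry.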
wav2shape focuses on representing rectangular and circular membranes, which are by far the most common in music. In return, while the original question is restricted to the recovery of the domain under forced oscillations, wav2shape also expresses the effects of stiffness and damping, both frequency-dependent and frequency-independent. These effects are crucial for modeling the
response of the drum membrane to a localized impulse,
e.g. induced by the player’s hand, a stick, or a mallet.
Our main finding is that, after training, wav2shape is
able to generalize to previously unseen shapes. In an additional experiment, we interpolate the value of scattering coefficients over the 2-D surface of the drum and verify that the convnet in wav2shape generalizes to interpolated drum stroke locations. Lastly, we invert the scattering
transform operator, thus laying the foundations for turning
wav2shape into a deep generative model without explicit
knowledge of the partial differential equation (PDE) underlying the vibration of the membrane.
DEEP CONVOLUTIONAL NETWORK: WAV2SHAPE
In order to learn a nonlinear mapping between the waveform and the set of physical parameters, we train a convolutional neural network, dubbed wav2shape ("wave to shape"). Comprising four 1-D convolutional layers and two fully connected dense layers, wav2shape is configured as follows (a minimal code sketch is given after the list):
• layer 1: The input feature matrix passes through a
batch normalization layer, then 16 convolutional filters with a receptive field of 8 temporal samples.
The convolution is followed by a rectified linear unit
(ReLU) and average pooling over 4 temporal samples.
• layers 2, 3, and 4: same as layer 1, except that the batch normalization happens after the convolution. The average pooling filter in layer 4 has a receptive field of 2 temporal samples, due to constraints in the time dimension. Layer 4 is then followed by a "flattening" operation.
• layer 5: 64 hidden units, followed by a ReLU activation function.
• layer 6: 5 hidden units, followed by a linear activation function.
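The following PyTorch sketch mirrors the configuration above. The number of input channels (scattering paths), the number of filters in layers 2 to 4, and the "same" padding are assumptions that the list does not specify:

    import torch
    import torch.nn as nn

    class Wav2Shape(nn.Module):
        """Minimal sketch of the wav2shape convnet (assumptions flagged inline)."""

        def __init__(self, in_channels=50):  # in_channels = number of scattering paths (assumed)
            super().__init__()
            # Layer 1: batch normalization before the convolution.
            self.layer1 = nn.Sequential(
                nn.BatchNorm1d(in_channels),
                nn.Conv1d(in_channels, 16, kernel_size=8, padding="same"),
                nn.ReLU(),
                nn.AvgPool1d(4),
            )
            # Layers 2-4: batch normalization after the convolution.
            def block(pool):
                return nn.Sequential(
                    nn.Conv1d(16, 16, kernel_size=8, padding="same"),  # 16 filters (assumed)
                    nn.BatchNorm1d(16),
                    nn.ReLU(),
                    nn.AvgPool1d(pool),
                )
            self.layer2 = block(4)
            self.layer3 = block(4)
            self.layer4 = block(2)  # smaller pooling: few time frames remain
            self.flatten = nn.Flatten()
            self.dense = nn.LazyLinear(64)  # layer 5: 64 hidden units, ReLU applied in forward
            self.head = nn.Linear(64, 5)    # layer 6: 5 outputs, linear activation

        def forward(self, x):
            # x has shape (batch, scattering paths, time frames).
            x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
            return self.head(torch.relu(self.dense(self.flatten(x))))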
CONCLUSIONS
In this paper, we have introduced wav2shape: a convolutional neural network which disentangles and retrieves physical parameters from waveforms of percussive sounds.
First, we have presented a 2-D physical model of a rectangular membrane, based on a fourth-order partial differential equation (PDE) in time and space. We have solved the
PDE in closed form by means of the functional transformation method (FTM), and included a freely downloadable
VST plugin which synthesizes drum sounds in real time.
Then, we have computed second-order scattering coefficients of these sounds and designed wav2shape as a convolutional neural network (CNN) operating on the logarithm
of these coefficients. We have trained wav2shape in a supervised fashion in order to regress the parameters underlying the PDE, such as pitch, sustain, and inharmonicity.
From an experimental standpoint, we have found that
wav2shape is capable of generalizing beyond its training
set and predicting the shape of previously unseen sounds. The network's robustness in shape regression confirms that the scattering transform linearizes the dependency of the signal upon the position of the drum stroke. Indeed, when applied to linearly interpolated scattering coefficients, the wav2shape neural network continues to produce an interpretable outcome. Lastly, we have used reverse-mode automatic differentiation in the Kymatio library to synthesize
drum sounds directly from scattering coefficients, without
explicitly solving a partial differential equation.
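A minimal sketch of this inversion, assuming the differentiable torch frontend of Kymatio (signal length, J, Q, learning rate, and iteration count are illustrative):

    import torch
    from kymatio.torch import Scattering1D

    T = 2**14
    scattering = Scattering1D(J=8, shape=T, Q=1)
    x_target = torch.randn(T)                    # placeholder for a real drum sound
    S_target = scattering(x_target).detach()

    # Refine a random waveform by gradient descent so that its
    # scattering coefficients match those of the target.
    x = torch.randn(T, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=0.1)
    for step in range(500):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(scattering(x), S_target)
        loss.backward()
        optimizer.step()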
Although the results of wav2shape are promising, we
acknowledge that it suffers from some practical limitations, which hamper its usability in computer music creation. First, physical parameters such as inharmonicity
D and aspect ratio α are not recovered as accurately as
pitch ω or sustain τ . Secondly, wav2shape is only capable of retrieving the shape vector θ if the rectangular drum
is stroked exactly at its center: it would be beneficial, albeit challenging, to generalize the approach to any stroke
location u0. Thirdly, we have trained wav2shape on a relatively large training set of over 82k audio samples. The
acquisition of these samples was only made possible by
simulating the response of the membrane. The prospect of extending autonomous systems from such a simulated environment towards a real environment is a topic of ongoing research in reinforcement learning, known as sim2real. Yet, the field of deep learning for musical acoustics
predominantly relies on supervised learning techniques instead of reinforcement learning. In this context, we believe
that future research is needed to strengthen the interoperability between physical modeling and data-driven modeling of musical sounds.