
WAV2SHAPE: HEARING THE SHAPE OF A DRUM MACHINE

By Han Han and Vincent Lostanlen
New York University



ABSTRACT 
Disentangling and recovering physical attributes, such as shape and material, from a few waveform examples is a challenging inverse problem in audio signal processing, with numerous applications in musical acoustics as well as structural engineering. We propose to address this problem via a combination of time–frequency analysis and supervised machine learning. We start by synthesizing a dataset of sounds using the functional transformation method. Then, we represent each percussive sound in terms of its time-invariant scattering transform coefficients and formulate the parametric estimation of the resonator as multidimensional regression with a deep convolutional neural network. We interpolate scattering coefficients over the surface of the drum as a surrogate for potentially missing data, and study the response of the neural network to interpolated samples. Lastly, we resynthesize drum sounds from scattering coefficients, therefore paving the way towards a deep generative model of drum sounds whose latent variables are physically interpretable.


INTRODUCTION

Throughout musical traditions, drums come in all shapes and sizes. Such diversity in manufacturing results in a wide range of perceptual attributes: bright, warm, mellow, and so forth. Yet, current approaches to drum music transcription, which are based on one-versus-all classification, fail to capture the multiple factors of variability underlying the timbre perception of percussive sounds. Instead, they regard each item in the drum kit as a separate category, and rarely account for the effect of playing technique. Therefore, in the context of music information retrieval (MIR), the goal of broadening and refining the vocabulary of percussive sound recognition systems requires moving away from discrete taxonomies.
In a different context, prior literature on musical acoustics has managed to simulate the response of a drum from the knowledge of its shape and material. Among studies on physical modeling of musical instruments, the functional transformation method (FTM) and the finite difference method (FDM) play a central role. They rely on partial differential equations (PDEs) to describe the structural and material constraints imposed by the resonator. The coefficients governing these equations may be varied continuously. Thus, PDE-based models for drum sound synthesis offer a fine level of expressive control while guaranteeing physical plausibility and interpretability.
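As a concrete illustration of the physical plausibility these models afford, the modal frequencies of an ideal rectangular membrane with fixed edges admit a closed form. The NumPy sketch below computes them under simplifying assumptions: the wave speed and side lengths are illustrative values, and the stiffness and damping terms of the actual fourth-order PDE are omitted.

```python
import numpy as np

def membrane_modes(c, Lx, Ly, n_modes=4):
    """Modal frequencies (Hz) of an ideal rectangular membrane with
    fixed edges: f_mn = (c / 2) * sqrt((m / Lx)^2 + (n / Ly)^2),
    where c is the transverse wave speed (m/s) and Lx, Ly the side
    lengths (m). Stiffness and damping terms are omitted here."""
    return {(m, n): 0.5 * c * np.sqrt((m / Lx) ** 2 + (n / Ly) ** 2)
            for m in range(1, n_modes + 1)
            for n in range(1, n_modes + 1)}

modes = membrane_modes(c=200.0, Lx=0.5, Ly=0.4)
# the partials are not harmonically related: varying the aspect ratio
# Lx / Ly changes their ratios continuously, which is the kind of
# fine-grained control over timbre described in the text
```

Note that the overtone ratios are generally irrational, which is one physical origin of the perceived inharmonicity of drum sounds.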







Figure 1: Drums of various shapes and materials. Left to right: mbejn, 19th century, Fang people of Gabon; ceramic drum, 1st century, Nasca people of Peru; darabukka, 19th century, Syria; tympanum of a Pejeng-type drum, Bronze age, Indonesia (Sumba); pakhavaj, 19th century, North India; ipu hula, 19th century, Hawai’i; frame drum, 19th century, Native American people of Dakota; Union army drum, ca. 1864, Pennsylvania. All images are in the public domain and can be accessed at: www.metmuseum.org

From a musical standpoint, a major appeal behind physical models lies in auditory perception: all other things being equal, larger drums tend to sound lower, stiffer drums tend to sound brighter, and so forth. Yet, a major drawback of PDE-based modeling for drum sound synthesis is that all shape and material parameters must be known ahead of time. If, on the contrary, these parameters are unknown, adjusting the synthesizer to match a predefined audio sample incurs a process of multidimensional trial and error, which is tedious and unscalable. This is unlike other methods for audio synthesis, such as digital waveguide or modal synthesis.


Here, we strive towards resolving the tradeoff between control and flexibility in drum sound synthesis. To this end, we formulate the identification of percussive sounds as an inverse problem, thus combining insights from physical modeling and statistical machine learning. Our main contribution is wav2shape, i.e., a machine listening system which takes a drum stroke recording as input and retrieves the shape parameters which produced it. The methodological novelty of wav2shape lies in its hybrid architecture, combining feature engineering and feature learning: indeed, it composes a 1-D scattering transform and a deep convolutional network to learn the task of shape regression in a supervised way. The advantage of scattering coefficients over conventional audio descriptors, such as MFCCs and the CQT, in characterizing nonstationary sounds has been discussed in previous work.
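To make the front end concrete, the NumPy sketch below computes a toy time-invariant descriptor in the spirit of a first-order scattering transform: one-sided Gabor band-pass filtering, complex modulus, time averaging, and log compression. It is a didactic stand-in only; wav2shape itself uses a full second-order scattering transform (e.g., as implemented in Kymatio), and the filterbank parameters here are arbitrary.

```python
import numpy as np

def toy_scattering1d(x, sr=22050, n_filters=8, q=1.0):
    """Toy first-order scattering-like descriptor: one-sided Gabor
    band-pass filtering, complex modulus, time averaging, and log
    compression. Didactic stand-in for a true scattering transform."""
    N = len(x)
    X = np.fft.fft(x)
    freqs = np.fft.fftfreq(N, d=1.0 / sr)
    centers = sr / 4.0 * 2.0 ** (-np.arange(n_filters))  # geometric spacing
    coeffs = []
    for fc in centers:
        sigma = fc / (2.0 * q)                      # bandwidth scales with fc
        g = np.exp(-0.5 * ((freqs - fc) / sigma) ** 2)   # one-sided Gabor
        band = np.fft.ifft(X * g)                   # band-pass filtering
        coeffs.append(np.abs(band).mean())          # modulus + time average
    return np.log1p(np.array(coeffs))               # log compression

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t) * np.exp(-5.0 * t)  # decaying tone
S1 = toy_scattering1d(x)                              # 8 log-coefficients
```

The time averaging is what makes the representation invariant to small temporal shifts, a property the full scattering transform refines with second-order coefficients.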

The subtitle of this paper is a deliberate reference to a famous mathematical paper titled "Can One Hear the Shape of a Drum?", which asks whether any two isospectral planar domains are necessarily isometric. Since its publication, this question has been answered affirmatively in the important particular cases of circular and rectangular domains, but negatively in the general case, with the construction of nonconvex counterexamples.

wav2shape focuses on representing rectangular and circular membranes, which are by far the most common in music. In return, while the original problem is restricted to the recovery of the domain under forced oscillations, wav2shape also expresses the effects of stiffness and damping, both frequency-dependent and frequency-independent. These effects are crucial for modeling the response of the drum membrane to a localized impulse, e.g. one induced by the player's hand, a stick, or a mallet. Our main finding is that, after training, wav2shape is able to generalize to previously unseen shapes. In an additional experiment, we interpolate the value of scattering coefficients over the 2-D surface of the drum and verify that the convnet in wav2shape generalizes to interpolated drum stroke locations. Lastly, we invert the scattering transform operator, thus laying the foundations for turning wav2shape into a deep generative model without explicit knowledge of the partial differential equation (PDE) underlying the vibration of the membrane.
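The interpolation experiment can be sketched as bilinear interpolation of coefficient vectors over the 2-D drum surface. The corner values below are hypothetical; in practice, each vector would be the scattering transform of a stroke recorded at that position.

```python
import numpy as np

def bilinear_coeffs(S_grid, u, v):
    """Bilinearly interpolate scattering coefficient vectors over the
    2-D drum surface. S_grid[i][j] holds the coefficient vector at one
    of the four corner stroke positions; (u, v) in [0, 1]^2 is the
    queried stroke location. A surrogate for missing measurements."""
    top = (1.0 - u) * S_grid[0][0] + u * S_grid[0][1]
    bottom = (1.0 - u) * S_grid[1][0] + u * S_grid[1][1]
    return (1.0 - v) * top + v * bottom

# hypothetical coefficient vectors measured at the four corners
S_grid = [[np.array([0.8, 0.3, 0.1]), np.array([0.2, 0.5, 0.4])],
          [np.array([0.6, 0.2, 0.3]), np.array([0.1, 0.7, 0.2])]]
S_center = bilinear_coeffs(S_grid, 0.5, 0.5)  # interpolated stroke location
```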

DEEP CONVOLUTIONAL NETWORK: WAV2SHAPE

In order to learn a nonlinear mapping between the waveform and the set of physical parameters, we train a convolutional neural network, dubbed wav2shape ("wave to shape"). Comprising four 1-D convolutional layers and two fully connected dense layers, wav2shape is configured as follows:

• layer 1: The input feature matrix passes through a batch normalization layer, then 16 convolutional filters with a receptive field of 8 temporal samples. The convolution is followed by a rectified linear unit (ReLU) and average pooling over 4 temporal samples. 

• layer 2, 3, and 4: same as layer 1, except that the batch normalization happens after the convolution. The average pooling filter in layer 4 has a receptive field of 2 temporal samples, owing to the reduced size of the time dimension at that depth. After that, layer 4 is followed by a "flattening" operation.

• layer 5: 64 hidden units, followed by a ReLU activation function. 

• layer 6: 5 hidden units, followed by a linear activation function. 
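The six layers above can be sketched in PyTorch as follows. The number of input channels (scattering paths), the input length, and the filter counts of layers 2-4 are assumptions; the text only fixes the kernel size (8), the pooling sizes, and layer 1's 16 filters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Wav2Shape(nn.Module):
    """Sketch of the convnet described above; channel counts beyond
    layer 1 and the input dimensions are illustrative assumptions."""
    def __init__(self, in_channels=64, time_steps=128):
        super().__init__()
        self.bn_in = nn.BatchNorm1d(in_channels)      # layer 1: BN before conv
        self.conv1 = nn.Conv1d(in_channels, 16, 8, padding=4)
        self.conv2 = nn.Conv1d(16, 16, 8, padding=4)  # layers 2-4: BN after conv
        self.bn2 = nn.BatchNorm1d(16)
        self.conv3 = nn.Conv1d(16, 16, 8, padding=4)
        self.bn3 = nn.BatchNorm1d(16)
        self.conv4 = nn.Conv1d(16, 16, 8, padding=4)
        self.bn4 = nn.BatchNorm1d(16)
        with torch.no_grad():                         # infer flattened size
            feat = self._conv_stack(torch.zeros(1, in_channels, time_steps)).shape[1]
        self.fc1 = nn.Linear(feat, 64)                # layer 5: 64 hidden units
        self.fc2 = nn.Linear(64, 5)                   # layer 6: 5 shape parameters

    def _conv_stack(self, x):
        x = F.avg_pool1d(F.relu(self.conv1(self.bn_in(x))), 4)
        x = F.avg_pool1d(F.relu(self.bn2(self.conv2(x))), 4)
        x = F.avg_pool1d(F.relu(self.bn3(self.conv3(x))), 4)
        x = F.avg_pool1d(F.relu(self.bn4(self.conv4(x))), 2)  # pool 2 in layer 4
        return x.flatten(1)                           # "flattening" operation

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(self._conv_stack(x))))  # linear output

model = Wav2Shape()
theta = model(torch.randn(2, 64, 128))  # batch of 2 -> two 5-D shape vectors
```

The final layer has a linear activation because shape regression is unbounded, unlike classification.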





CONCLUSIONS 

We have introduced wav2shape, a convolutional neural network which disentangles and retrieves physical parameters from the waveforms of percussive sounds. First, we have presented a 2-D physical model of a rectangular membrane, based on a fourth-order partial differential equation (PDE) in time and space. We have solved the PDE in closed form by means of the functional transformation method (FTM), and included a freely downloadable VST plugin which synthesizes drum sounds in real time. Then, we have computed second-order scattering coefficients of these sounds and designed wav2shape as a convolutional neural network (CNN) operating on the logarithm of these coefficients. We have trained wav2shape in a supervised fashion in order to regress the parameters underlying the PDE, such as pitch, sustain, and inharmonicity.

From an experimental standpoint, we have found that wav2shape is capable of generalizing beyond its training set and predicting the shape of previously unseen sounds. The network's robustness in shape regression confirmed that the scattering transform has the ability to linearize the dependency of the signal upon the position of the drum stroke. Indeed, when applied to linearly interpolated scattering coefficients, the wav2shape neural network continues to produce an interpretable outcome. Lastly, we have used reverse-mode automatic differentiation in the Kymatio library to synthesize drum sounds directly from scattering coefficients, without explicitly solving a partial differential equation.
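The inversion step can be illustrated by gradient descent on a differentiable feature map. The sketch below substitutes a toy log-spectrum for the scattering transform (Kymatio's differentiable scattering would take its place in the actual pipeline) and optimizes a waveform so that its features match those of a target sound.

```python
import torch
import torch.nn.functional as F

def toy_features(x):
    """Differentiable stand-in for time-invariant scattering
    coefficients: log of a coarsely averaged magnitude spectrum."""
    mag = torch.abs(torch.fft.rfft(x))
    return torch.log1p(F.avg_pool1d(mag.unsqueeze(0), 16).squeeze(0))

torch.manual_seed(0)
t = torch.arange(2048) / 2048.0
target = torch.sin(2 * torch.pi * 220.0 * t) * torch.exp(-4.0 * t)
S_target = toy_features(target).detach()

x = (0.1 * torch.randn(2048)).requires_grad_(True)  # start from noise
opt = torch.optim.Adam([x], lr=0.05)
losses = []
for step in range(200):
    opt.zero_grad()
    loss = torch.mean((toy_features(x) - S_target) ** 2)
    loss.backward()                                  # reverse-mode autodiff
    opt.step()
    losses.append(loss.item())
# the loss decreases as x converges toward a signal sharing the
# target's time-averaged features
```

Because time-averaged features discard phase, the recovered waveform is one of many signals matching the target features, which is precisely why such inversions behave like generative models.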

Although the results of wav2shape are promising, we acknowledge that it suffers from some practical limitations, which hamper its usability in computer music creation. First, physical parameters such as inharmonicity D and aspect ratio α are not recovered as accurately as pitch ω or sustain τ . Secondly, wav2shape is only capable of retrieving the shape vector θ if the rectangular drum is stroked exactly at its center: it would be beneficial, albeit challenging, to generalize the approach to any stroke location u0. Thirdly, we have trained wav2shape on a relatively large training set of over 82k audio samples. The acquisition of these samples was only made possible by simulating the response of the membrane. The prospect of extending autonomous systems from such a simulated environment towards a real environment is a topic of ongoing research in reinforcement learning, known as sim2real. Yet, the field of deep learning for musical acoustics predominantly relies on supervised learning techniques instead of reinforcement learning. In this context, we believe that future research is needed to strengthen the interoperability between physical modeling and data-driven modeling of musical sounds.
