
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

- By Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi






One of the goals of Magenta is to use machine learning to develop new avenues of human expression. And so today we are proud to announce NSynth (Neural Synthesizer), a novel approach to music synthesis designed to aid the creative process.

Unlike a traditional synthesizer which generates audio from hand-designed components like oscillators and wavetables, NSynth uses deep neural networks to generate sounds at the level of individual samples. Learning directly from data, NSynth provides artists with intuitive control over timbre and dynamics and the ability to explore new sounds that would be difficult or impossible to produce with a hand-tuned synthesizer.

The acoustic qualities of the learned instrument depend on both the model used and the available training data, so we are delighted to release improvements to both:

- A dataset of musical notes an order of magnitude larger than other publicly available corpora.
- A novel WaveNet-style autoencoder model that learns codes that meaningfully represent the space of instrument sounds.


The NSynth Dataset
We wanted to develop a creative tool for musicians and also provide a new challenge for the machine learning community to galvanize research in generative models for music. To satisfy both of these objectives, we built the NSynth dataset, a large collection of annotated musical notes sampled from individual instruments across a range of pitches and velocities. With ~300k notes from ~1000 instruments, it is an order of magnitude larger than comparable public datasets. You can download it here.
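To make the dataset concrete, here is a minimal sketch of reading notes from TFRecord files with TensorFlow. The feature names ("pitch", "velocity", "audio") are assumptions based on the released dataset's documented layout; check the official README before relying on them.

```python
import tensorflow as tf

def parse_note(serialized_example):
    """Parse one serialized NSynth note into a dict of tensors."""
    features = {
        "pitch": tf.io.FixedLenFeature([], tf.int64),     # MIDI pitch (assumed name)
        "velocity": tf.io.FixedLenFeature([], tf.int64),  # MIDI velocity (assumed name)
        "audio": tf.io.VarLenFeature(tf.float32),         # raw waveform (assumed name)
    }
    parsed = tf.io.parse_single_example(serialized_example, features)
    parsed["audio"] = tf.sparse.to_dense(parsed["audio"])
    return parsed

dataset = tf.data.TFRecordDataset("nsynth-train.tfrecord").map(parse_note)
for note in dataset.take(1):
    print(int(note["pitch"]), note["audio"].shape)
```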

A motivation behind the NSynth dataset is that it lets us explicitly factorize the generation of music into notes and other musical qualities. We could factorize those qualities further, but for simplicity we stop here, which gives:

P(audio) = P(audio ∣ note) · P(note)

The goal is to model P(audio∣note) (known as timbre) and it is assumed that P(note) comes from a higher-level “language model” of music, such as the note sequence RNNs we’ve previously described. While not perfect, this factorization is grounded in how instruments work and is surprisingly effective. Indeed, much modern music production employs such a factorization, using MIDI for note sequences and software synthesizers for timbre. Of course, this works better for some instruments (e.g., piano and electronic synthesizer) than for others (e.g., guitar and saxophone) where note-to-note timbre dependencies are more pronounced.
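As a toy illustration of this factorization, the sketch below first samples a note sequence from a language model and then renders each note with a timbre model. Both interfaces (note_model.sample_sequence, timbre_model.synthesize) are hypothetical stand-ins, not part of any release.

```python
import numpy as np

# Hypothetical interfaces: a note-sequence model samples P(note), and a
# timbre model renders P(audio | note). Neither name comes from the release.
def sample_song(note_model, timbre_model):
    notes = note_model.sample_sequence()                  # notes ~ P(note)
    clips = [timbre_model.synthesize(n) for n in notes]   # audio ~ P(audio | note)
    return np.concatenate(clips)                          # full waveform
```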

The NSynth dataset was inspired by the image recognition datasets that have been core to recent progress in deep learning. Similar to how many image datasets focus on a single object per example, the NSynth dataset homes in on single notes. We encourage the broader community to use it as a benchmark and entry point into audio machine learning. We hope it serves as a building block for future datasets, and we envision a high-quality multi-note dataset for tasks like generation and transcription that involve learning complex language-like dependencies.

Learning Temporal Embeddings
WaveNet is an expressive model for temporal sequences such as speech and music. As a deep autoregressive network of dilated convolutions, it models sound one sample at a time, similar to a nonlinear infinite impulse response filter. Since the context of this filter is currently limited to several thousand samples (about half a second), long-term structure requires a guiding external signal. Prior work demonstrated this in the case of text-to-speech and used previously learned linguistic embeddings to create impressive results.
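The "several thousand samples" figure follows from the geometry of stacked dilated convolutions. Below is a rough sketch of the receptive-field arithmetic; the block and layer counts are illustrative, not the paper's exact configuration.

```python
# Receptive field of stacked dilated convolutions with doubling dilation
# rates, as in WaveNet-style architectures.
def receptive_field(num_blocks=3, layers_per_block=10, filter_size=2):
    dilations = [2 ** i for i in range(layers_per_block)] * num_blocks
    return sum((filter_size - 1) * d for d in dilations) + 1

n = receptive_field()
print(n, "samples ≈", round(n / 16000, 2), "s at 16 kHz")  # a few thousand samples
```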

In this work, we removed the need for conditioning on external features by employing a WaveNet-style autoencoder to learn its own temporal embeddings.



The temporal encoder looks very much like a WaveNet and has the same dilation block structure. However, its convolutions are not causal, so it sees the entire context of the input chunk. After thirty layers of computation, a final average pooling creates a temporal embedding of 16 dimensions for every 512 samples. The embedding can therefore be thought of as a 32x compression of the original data.
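The shape arithmetic is easy to verify. Here is a minimal sketch in which average pooling stands in for the full thirty-layer encoder; the 64000-sample (4-second) note length is an assumption taken from the dataset.

```python
import numpy as np

samples, hop, emb_dim = 64000, 512, 16       # 4 s of 16 kHz audio (assumed length)
hidden = np.random.randn(samples, emb_dim)   # placeholder encoder activations
embedding = hidden.reshape(-1, hop, emb_dim).mean(axis=1)  # average pool per 512 samples
print(embedding.shape)                       # (125, 16): one 16-d vector per 32 ms
# Compression: 512 samples in -> 16 floats out per step, i.e. 32x.
```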

We condition the vanilla WaveNet decoder with this embedding by upsampling it to the original time resolution, applying a 1x1 convolution, and finally adding this result as a bias to each of the decoder’s thirty layers. Note that this conditioning is not external as it’s learned by the model. Since the embeddings bias the autoregressive system, we can imagine it acting as a driving function for a nonlinear oscillator. This interpretation is corroborated by the fact that the magnitude contours of the embeddings mimic those of the audio itself.
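A minimal sketch of that conditioning path, with nearest-neighbor repetition as a simplified upsampler and random weights as placeholders:

```python
import numpy as np

emb = np.random.randn(125, 16)           # (timesteps, emb_dim) from the encoder
upsampled = np.repeat(emb, 512, axis=0)  # back to 64000 samples (simplified upsampling)
w_1x1 = np.random.randn(16, 128)         # a 1x1 convolution is a per-step matmul
bias = upsampled @ w_1x1                 # (64000, decoder_channels)
# In the model, this result is added as a bias to the activations of each
# of the decoder's thirty layers.
```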

Audio examples: "Bass Original" (the source recording) and "Bass WaveNet" (the network's output) are available in the original post.

The figure in the original post shows audio and reconstructions for three different instruments. These are CQT spectrograms with magnitude represented by intensity and instantaneous frequency by color. Frequency is on the vertical axis and time is on the horizontal axis. For the embeddings, the different colors represent the 16 dimensions at 125 timesteps (32 ms per step). There is a slight built-in distortion due to the compression of the 8-bit mu-law encoding; it is a minor effect for many samples but more pronounced for lower frequencies.
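For reference, 8-bit mu-law companding looks like the following; quantizing to 256 levels is where the slight distortion comes from. This is the standard formulation, not code from the release.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map audio in [-1, 1] to 256 integer levels (8-bit mu-law)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(q, mu=255):
    """Invert the companding back to audio in [-1, 1]."""
    y = 2 * (q.astype(np.float32) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```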

While the WaveNet autoencoder adds more harmonics to the original timbre, it follows the fundamental frequency up and down two octaves. Since it has never seen a transition between two notes, its best approximation is a smooth glissando between them.

Release++
Besides the music examples and the dataset, we are also releasing the code for both the WaveNet autoencoder that powers NSynth and our best baseline spectral autoencoder model. In addition, we are releasing the trained weights as a TensorFlow checkpoint and a script to save embeddings from your own WAV files. You can find all the code at our repository, and the checkpoint tarball can be downloaded here.
