Audio deep learning is a broad field that applies deep learning techniques to classify and generate audio. Recent research has focused primarily on speech synthesis; however, there are many other useful applications. Our work investigates musical transposition: a system that automatically remaps the pitch, and its corresponding harmonics, of a given sound to a new frequency. A fully developed automatic transposition system has many potential applications, ranging from music production to medical uses in hearing devices.
This is not a simple task: generating audio with neural networks is challenging because both the local and global structure of the signal must be maintained. In addition, a single second of audio contains upwards of 16,000 samples, which makes the signal extremely information-dense. Current models generate audio one sample at a time and take about nine minutes to generate a single second of audio, which is prohibitively slow for real-time use cases. This motivates us to design a system that works in near-real time while minimizing noise and interference.
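To put these numbers in perspective, a back-of-the-envelope calculation (assuming a 16 kHz sample rate and the roughly nine-minute generation time quoted above; these figures are illustrative, not measurements from a specific model) shows the per-sample cost of sample-by-sample generation:

```python
# Back-of-the-envelope cost of sample-by-sample (autoregressive) generation.
# Assumes a 16 kHz sample rate and the ~9 min per second figure quoted above.
SAMPLE_RATE = 16_000           # samples per second of audio
GEN_TIME_PER_SECOND = 9 * 60   # wall-clock seconds to synthesize 1 s of audio

ms_per_sample = GEN_TIME_PER_SECOND / SAMPLE_RATE * 1000
print(f"{ms_per_sample:.1f} ms per sample")              # ~33.8 ms per sample
print(f"{GEN_TIME_PER_SECOND}x slower than real time")   # 540x slower
```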
We take batches of raw audio as input to a neural network designed to learn encodings of pitch and timbre (tone color). These encodings are then used to generate audio that has been shifted up in pitch while preserving the perceived timbre of the music. For our model, we use a stack of recurrent neural networks (specifically, BiLSTMs). This captures some of the temporal dependencies inherent to audio while keeping the network simple enough for near-real-time audio generation. We then trained the model on 40,000 pure tones spanning the spectrum of human hearing to learn to double the frequency of each tone, which raises each pitch by one octave.
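The sketch below illustrates this setup in PyTorch. It is a minimal example only: the frame length, hidden size, number of layers, and frequency range are assumptions for illustration, not the values used in our actual implementation. The `tone_pair` helper builds a training pair (a pure tone and the same tone one octave up), and `TranspositionNet` is a stacked BiLSTM that maps each frame of raw samples to pitch-shifted samples.

```python
import numpy as np
import torch
import torch.nn as nn

def tone_pair(freq_hz, sr=16_000, frame_len=512):
    """Training pair: a pure tone and the same tone one octave up (2x frequency)."""
    t = np.arange(frame_len) / sr
    x = np.sin(2 * np.pi * freq_hz * t).astype(np.float32)
    y = np.sin(2 * np.pi * 2 * freq_hz * t).astype(np.float32)
    return torch.from_numpy(x), torch.from_numpy(y)

class TranspositionNet(nn.Module):
    """Stacked BiLSTM mapping frames of raw audio to pitch-shifted frames.

    Layer sizes and depth here are illustrative assumptions."""
    def __init__(self, hidden=256, layers=2):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=1, hidden_size=hidden,
                              num_layers=layers, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)  # one output sample per timestep

    def forward(self, x):                    # x: (batch, frame_len)
        h, _ = self.bilstm(x.unsqueeze(-1))  # (batch, frame_len, 2*hidden)
        return self.out(h).squeeze(-1)       # (batch, frame_len)

# Minimal training step on a batch of random pure tones.
model = TranspositionNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Keep input frequencies below sr/4 so the doubled tone stays under Nyquist.
freqs = np.random.uniform(20, 4_000, size=32)
xs, ys = zip(*(tone_pair(f) for f in freqs))
x, y = torch.stack(xs), torch.stack(ys)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the whole frame is produced in a single forward pass, rather than one sample at a time, this style of model avoids the autoregressive bottleneck described above.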
We found that our model generalized well and remapped audio pitch in near-real time. Not only did it generalize from monophonic pure tones to polyphonic “complex” tones (sounds containing more than one pitch simultaneously), it also learned to preserve the timbre of the sound it was transposing. Because the model is comparatively simple, its output contains a noticeable amount of noise. However, we are continuing to develop ways to improve the audio quality of the network's output, as well as post-processing noise-reduction techniques. This work provides a starting point for future research in automatic transposition using deep learning.
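As one illustrative example of such post-processing (a common technique, not necessarily the one we will ultimately adopt), simple spectral gating can suppress broadband noise by zeroing out quiet frequency bins in each short-time frame:

```python
import numpy as np

def spectral_gate(audio, frame=512, hop=256, threshold_db=-40.0):
    """Simple spectral gating: zero STFT bins quieter than a threshold.

    Illustrative sketch only; frame sizes and threshold are arbitrary
    choices, and the project may use a different technique entirely."""
    window = np.hanning(frame)
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))
    for start in range(0, len(audio) - frame, hop):
        seg = audio[start:start + frame] * window
        spec = np.fft.rfft(seg)
        mag_db = 20 * np.log10(np.abs(spec) + 1e-10)
        spec[mag_db < mag_db.max() + threshold_db] = 0  # gate quiet bins
        out[start:start + frame] += np.fft.irfft(spec) * window
        norm[start:start + frame] += window ** 2
    return out / np.maximum(norm, 1e-10)  # overlap-add normalization
```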
This project was completed under the direction of Dr. Patrick Donnelly.