Introduction
Transcription is the act of converting something into a symbolic format. For musical transcription, that format is MIDI. MIDI (Musical Instrument Digital Interface) files keep track of every feature of a note: the time it started, the time it ended, its pitch, and how loud it was played, known as its velocity. This project uses a neural network to attempt to transcribe audio recordings of solo piano performances into MIDI files.
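To make that note-level information concrete, here is a small sketch of how those features can be inspected with the pretty_midi library. This is purely illustrative and not necessarily the tooling used in this project; the file path is hypothetical.

```python
import pretty_midi

# Load a MIDI file (illustrative path).
midi = pretty_midi.PrettyMIDI("performance.mid")

# Each note carries exactly the features described above:
# start time, end time, pitch, and velocity (loudness).
for instrument in midi.instruments:
    for note in instrument.notes[:5]:
        print(f"pitch={note.pitch:3d}  velocity={note.velocity:3d}  "
              f"start={note.start:.3f}s  end={note.end:.3f}s")
```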
Data collection
The data for this project comes from the MAESTRO dataset. This dataset contains 1282 piano solos with audio and MIDI counterparts aligned to within roughly 3 milliseconds. The recordings were collected at the International Piano-e-Competition between 2004 and 2018. The MIDI was captured by Yamaha Disklavier pianos, which record precise MIDI information as they are played.
Preprocessing
To make the data suitable for my neural network, I transformed the MIDI files into time series in the form of two-dimensional arrays. Each column represents a MIDI pitch, and each row represents 64 milliseconds of the song. Each entry holds a value between zero and one representing the velocity at which the note is played. The audio files were also downsampled from 44.1 kHz to 16 kHz.
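A minimal sketch of this preprocessing, assuming the pretty_midi and librosa libraries (the write-up doesn't name the exact tools used), could look like this:

```python
import pretty_midi
import librosa

FRAME_MS = 64                     # one row of the time series per 64 ms
fs = 1000.0 / FRAME_MS            # ~15.6 frames per second

# MIDI -> time series: rows are 64 ms frames, columns are MIDI pitches,
# entries are note velocities scaled into [0, 1].
midi = pretty_midi.PrettyMIDI("performance.mid")   # illustrative path
roll = midi.get_piano_roll(fs=fs).T / 127.0        # shape: (frames, 128)

# Audio: resample from 44.1 kHz down to 16 kHz while loading.
audio, sr = librosa.load("performance.wav", sr=16000)
```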
Methods
The neural network takes in an approximately 10-second audio file and listens to it 64 milliseconds at a time. In the training step, it also takes in the corresponding row from the time-series MIDI file, and it learns to associate the sound in that 64-millisecond window with that row of values. If its prediction is close to the true row, the model knows it is doing a good job; if it is different, the model looks at what the differences are and adjusts itself so that next time it can get closer. It does this 156 times for that 10-second chunk, and at that point it has learned that specific section of a specific song. It keeps learning, and after a few songs, it checks itself with a validation step. This step passes in an audio file without the MIDI counterpart and asks the network to create what it thinks the MIDI file should look like. That prediction is then compared against the true MIDI counterpart, not to update the model, but to measure how well it performs on music it has not learned from. Once the model is fully trained, it is given a song to evaluate. It takes in the audio and outputs a MIDI file, this time without checking against the expected output, because it doesn't have access to that information.
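As a sketch of how the training and validation steps differ, here is a simplified PyTorch version. The toy fully connected model, the 1024-sample frame size (64 ms of 16 kHz audio), and the 128-value output row are assumptions made for illustration, not the project's actual architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in model: maps one 64 ms audio frame (1024 samples at 16 kHz)
# to one row of the MIDI time series (here, 128 pitch velocities).
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                      nn.Linear(512, 128), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

def train_step(audio_frames, midi_rows):
    """One pass over a ~10 s chunk: 156 audio frames against 156 MIDI rows."""
    optimizer.zero_grad()
    predicted = model(audio_frames)        # what the network thinks the rows are
    loss = loss_fn(predicted, midi_rows)   # how far it is from the true rows
    loss.backward()                        # look at what the differences are...
    optimizer.step()                       # ...and adjust to get closer next time
    return loss.item()

@torch.no_grad()
def validation_step(audio_frames, midi_rows):
    """Predict MIDI for held-out audio and measure, but don't learn from, the error."""
    return loss_fn(model(audio_frames), midi_rows).item()
```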
Results
From this neural network, I was able to obtain a MIDI file that, although not the same as the audio file, showed real similarities to the general shape of the audio. When velocity was included in the training, the neural network produced a MIDI file that got quieter when the audio got quieter and louder when the audio got louder. The MIDI also generally has higher notes where the audio has higher notes, and likewise for the lower notes. The main difference between the true MIDI and the predicted MIDI is in the precise timing and duration of the notes. (For an example of the network outputs, you may skip to 8:25 in the video below.)
Future Work
Currently, I am working on improving the performance of the neural network. I want to be able to evaluate the songs with objective measures rather than just subjective ones, but for that to be worth the time and effort it would take, the predicted output needs to be more similar to the expected output. Once performance improves and the network reliably transcribes solo piano performances, I plan to move on to multi-instrument recordings. These will pose their own challenges, as no Disklavier equivalent exists for instruments like the clarinet, but MIDI recordings of other instruments do exist.