I think it's really neat, but from the results it looks like it could take more time to fix the output than to use a manual approach (if really accurate results are required).
In fairness to the author, he is still at high school: https://matthew-bird.com/about.html
Amazing work for that age.
This is a partly solved problem right now. Some tracks and signal types can be unmixed more easily than others; it depends on what the sources are and how much post-processing (reverb, side-chaining, heavy brick-wall limiting and so on) has been applied.
I'd agree with the "partly". I have yet to find one that either isolates an instrument as a separate file or removes one from the rest of the mix without negatively impacting the sound. The common issues I hear are similar to early low-bitrate internet compression artifacts. The new "AI" versions are really bad at this, but even the ones available before the AI craze were still susceptible.
That is: frequencies from one instrument will virtually always overlap with another one (including vocals), especially considering harmonics.
Any kind of separation will require some pretty sophisticated "reconstruction" it seems to me, because the operation is inherently destructive. And then the problem becomes one of how faithful the "reproduction" is.
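To make the overlap concrete, here's a tiny numpy sketch of my own (not from the article): the 3rd harmonic of A3 (220 Hz) lands almost exactly on the fundamental of E5 (~659 Hz), so a plain magnitude spectrum can't tell you which instrument put the energy in that bin.

    # Toy example (mine, not the article's code): two notes whose partials collide.
    import numpy as np

    sr = 44100                      # sample rate in Hz
    t = np.arange(sr) / sr          # one second of audio

    # "Instrument A": A3 with a few harmonics at 220, 440, 660, 880 Hz
    a3 = sum(0.5 / k * np.sin(2 * np.pi * 220 * k * t) for k in range(1, 5))
    # "Instrument B": E5 fundamental only, ~659.26 Hz
    e5 = 0.5 * np.sin(2 * np.pi * 659.26 * t)

    spectrum = np.abs(np.fft.rfft(a3 + e5))
    freqs = np.fft.rfftfreq(sr, 1 / sr)

    # The bin near 660 Hz contains energy from *both* sources.
    for f in (220, 440, 660, 880):
        idx = np.argmin(np.abs(freqs - f))
        print(f"{f:>4} Hz bin magnitude: {spectrum[idx]:.1f}")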
This feels pretty similar to the inpainting/outpainting stuff being done in generative image editing (a la Photoshop) nowadays, but I don't think anywhere near the investment is being made in this field.
Very interested to hear anyone with expertise weigh in!
1) used PixBim AI to extract "stems" (drums, bass, piano, all guitars, vocals). Obviously a lossless source like FLAC works better than MP3 here (an open-source, scriptable alternative is sketched after this comment)
2) imported the stems into Pro Tools.
3) from there, I will usually re-record the bass, guitars, pianos and vocals myself. Occasionally the drums as well.
This is a pretty good way I've found to record covers of tracks at home: I can re-use the original drums if I want to, keep the tempo of the original track intact, etc., and obviously embellish/replace/modify/simplify the parts that I re-record.
It's a bit like drawing with tracing paper: you're creating a copy to the best of your ability, but you have a guide underneath to help you with placement.
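For anyone who wants to script step 1 rather than use PixBim, here's a rough sketch using the open-source Demucs separator instead (my substitution, not the tool used above). It assumes `pip install demucs` has put the `demucs` command on your PATH, and the model folder name ("htdemucs") depends on which Demucs version you have.

    # Rough sketch: stem extraction with the open-source Demucs CLI instead of
    # PixBim (a substitution on my part, not the workflow described above).
    import subprocess
    from pathlib import Path

    track = Path("my_song.flac")    # lossless source, as recommended above

    # Demucs writes drums/bass/vocals/other stems under separated/<model>/<track>/
    subprocess.run(["demucs", "-o", "separated", str(track)], check=True)

    for stem in sorted((Path("separated") / "htdemucs" / track.stem).glob("*.wav")):
        print("stem ready to import into the DAW:", stem)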
There's a bunch of other stuff that happens during and after summing which makes it much harder to reliably 100% reverse that process.
I was sincere when I said:
> I'm really not sure how one solves this with the current approaches.
I was hoping someone would come along and say it is, in fact, possible. :)
Audio Decomposition [Blind Source Separation]
> Open source separation of music into constituent instruments.
> The title is a bit confusing as open-source separation of ... reads like source separation, which this is not.
There have been several successful models for multi-track music transcription - see Google's MT3 project (https://research.google/pubs/mt3-multi-task-multitrack-music...). In the case of piano transcription, accuracy is nearly flawless at this point, even for very low-quality audio:
https://github.com/EleutherAI/aria-amt
Full disclaimer: I am the author of the above repo.
Scores are interpreted by musicians to create a performance, and MIDI is a capture of (some of) the data about that performance. Music engraving is full of implicit and explicit cultural rules, and getting it _right_ has parallels with handwritten kanji script in terms of both the importance of correctness to the reader, and the amount of traps for the unwary or uncultured.
All of which can be taken to mean "classical musicians are incredibly picky and anal about this stuff", or, "well-formed music notation conveys all sorts of useful contextual information beyond simply 'what note to play when'".
This is absolutely not easy, though, given all the cultural context. Things like picking up a "legato" or "cantabile" marking and choosing an accent vs a dagger or a marcato mark are going to be very difficult no matter what.
https://replicate.com/turian/multi-task-music-transcription
I ported their Colab to Replicate so I could use it more easily.
The MIDI output is... puzzling?
I've tried feeding it even simple stems and found the output unusable for some tracks, i.e. the MIDI output and audio were not well aligned and there were timing issues. On other audio it seemed to work fine.
Luckily for me, audio-to-seq approaches do work very well for piano, which turns out to be an amazing way of getting expressive MIDI data for training generative models.
It's just pitch bend?
I think trying to transcribe as MIDI is just a fundamentally flawed approach that has too many (well known) pitfalls to be useful.
A trained human can listen to a piece and transcribe it in seconds, but programming it as MIDI could take minutes/hours. If you're not trying to replicate how humans learn by ear, you're probably approaching this wrong.
https://magenta.tensorflow.org/datasets/maestro
Most current research involves refining deep learning based approaches to this task. When I worked on this problem earlier this year, I was interested in adding robustness to these models by training a sort of musical awareness into them. You can see a good example of it in this tweet:
https://hitnmix.com/ripx-daw-pro/
It can even export the separated tracks as MIDI files. It still has some problems, but it works very well. Stem separation is now standard in music software, and almost every DAW provides it.
I find moises (https://moises.ai/) to be easy to use for the tasks I need to do. It allows transposing or time scaling the entire song. It does stem separation and has a simple interface for muting and changing the volume on a per-track basis. It auto-detects the beat and chords.
I'm not affiliated, just a happy nearly-daily user for learning and practicing songs. I boost the original bass part and put everything else at < 10% volume to hear the bass part clearly (which often shows how bad online transcriptions are, even paid ones). Once I know the part, I mute the bass and play along with the original song as if I were the bass player.
I wonder why pricing information is so hard to find these days. I'd like to at least get a rough idea of what it costs.
It's an up-and-coming feature that nearly every DAW should have, but most don't yet.
Ableton Live - No
Bitwig - No
Cubase - No
FL - Yes
Logic - Yes
Pro Tools - No
Reason - No
Reaper - No
Studio One - Yes
Mixcraft - Yes
Maschine3 - Yes
https://github.com/samim23/polymath
Polymath is effective at isolating and extracting individual instrument tracks from MP3s. It works very well.
Trumpets produce a rich harmonic series with strong overtones, meaning their Fourier transform would show prominent peaks at integer multiples of the fundamental frequency. Instruments like flutes produce purer tones, but brass instruments typically have stronger higher harmonics, which would mean a richer set of partials in the matrix equation shown in the article.
So this script uses bandpass filtering and cross-correlation of attack/release envelopes to identify note timing. Given that brass instruments can exhibit non-linear behavior where the harmonic content changes significantly with playing intensity (think of the brightness difference between pp and ff passages), I'm not sure how this algorithm would handle intensity-dependent timbral variations. I'd consider adding intensity-dependent Fourier templates for each instrument to improve accuracy.
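For what it's worth, here's a toy sketch of the envelope cross-correlation idea as I understand it (my reading of the description, not the script's actual code). The intensity-dependent variant would just keep one attack template and one Fourier template per dynamic level and take whichever correlates best.

    # Toy sketch of bandpass + envelope cross-correlation for note timing;
    # my reading of the approach described above, not the script's actual code.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, correlate

    def band_envelope(audio, sr, lo, hi, smooth=512):
        """Bandpass around one note's fundamental, return a smoothed amplitude envelope."""
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        env = np.abs(sosfiltfilt(sos, audio))
        return np.convolve(env, np.ones(smooth) / smooth, mode="same")

    def note_onset(audio, sr, f0, attack_template):
        """Cross-correlate the band-limited envelope with an instrument's attack
        template (e.g. one per dynamic level) and return the best onset in seconds."""
        env = band_envelope(audio, sr, 0.9 * f0, 1.1 * f0)
        corr = correlate(env, attack_template, mode="valid")
        return np.argmax(corr) / sr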
In these situations (experimental music) source separation will produce completely unpredictable results that may or may not be useful for musical rebalancing.
Edit: to clarify, source separation in audio research means separating the mixed audio into separate clips, one per source.
While there are extensive libraries of fan-made song mappings, it's never enough, and there are very few mapped songs in languages other than English or Spanish (if you or your friends prefer your native language). Doing the entire mapping manually is time-consuming, not to mention that I am almost tone-deaf myself, which would make it even more difficult. I have been wondering for a long time what software I could use to make this process easier to automate. This seems like a great tool for capturing vocal timings and notes from original songs.
I have it on my bucket list to create a Singstar playlist in my native language and host a singing party with friends.
Does anyone have suggestions for other similar tools?
Sounds like the text file needs vocals and pitches along with timestamps. AI is getting to the point where its creation could be automated.
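If it helps, here's a hedged sketch of dumping detected vocal notes into an UltraStar-style .txt. The tag names, the ": start length pitch syllable" note lines, and the pitch reference are from memory, so double-check them against whatever your karaoke software expects.

    # Hedged sketch: writing detected vocal notes to an UltraStar-style .txt.
    # Tag names, field order and the pitch reference are assumptions from memory;
    # verify against the karaoke software you actually use.
    def write_ultrastar(path, title, artist, bpm, gap_ms, notes):
        """notes: list of (start_beat, length_beats, midi_pitch, syllable)."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"#TITLE:{title}\n#ARTIST:{artist}\n")
            f.write(f"#BPM:{bpm}\n#GAP:{gap_ms}\n")
            for start, length, pitch, syllable in notes:
                # ':' marks a normal note; pitch here is assumed to be relative
                # to middle C (MIDI note number minus 60).
                f.write(f": {start} {length} {pitch - 60} {syllable}\n")
            f.write("E\n")  # end-of-song marker

    write_ultrastar("song.txt", "Example", "Someone", 120, 500,
                    [(0, 4, 64, "Hel"), (4, 4, 66, "lo")])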
For myself: Adding a link I just found for reading further.
https://www.reddit.com/r/karaoke/comments/x61kzy/modern_equi...