I think it's really neat, but from the results it looks like it could take more time to fix the output than to use a manual approach (if really accurate results are required).
In fairness to the author, he is still at high school: https://matthew-bird.com/about.html
Amazing work for that age.
This is a partly solved problem right now. Some tracks and signal types can be unmixed more easily than others; it depends on what the sources are and how much post-processing (reverb, side-chaining, heavy brick-wall limiting and so on) has been applied.
I'd agree with the "partly". I have yet to find one that either isolates an instrument as a separate file or removes one from the rest of the mix without negatively impacting the sound. The common issues I hear are similar to early low-bitrate internet compression artifacts. The new "AI" versions are really bad at this, but even the ones available before the AI craze were still susceptible.
That is: frequencies from one instrument will virtually always overlap with another one (including vocals), especially considering harmonics.
Any kind of separation will require some pretty sophisticated "reconstruction" it seems to me, because the operation is inherently destructive. And then the problem becomes one of how faithful the "reproduction" is.
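To make the overlap concrete, here's a tiny numpy sketch of my own (not from the article): the 3rd harmonic of A3 (220 Hz) lands almost exactly on the fundamental of E5 (~659 Hz), so a plain magnitude spectrum can't tell you which instrument put the energy in that bin.

    # Toy example (mine, not the article's code): two notes whose partials collide.
    import numpy as np

    sr = 44100                      # sample rate in Hz
    t = np.arange(sr) / sr          # one second of audio

    # "Instrument A": A3 with a few harmonics at 220, 440, 660, 880 Hz
    a3 = sum(0.5 / k * np.sin(2 * np.pi * 220 * k * t) for k in range(1, 5))
    # "Instrument B": E5 fundamental only, ~659.26 Hz
    e5 = 0.5 * np.sin(2 * np.pi * 659.26 * t)

    spectrum = np.abs(np.fft.rfft(a3 + e5))
    freqs = np.fft.rfftfreq(sr, 1 / sr)

    # The bin near 660 Hz contains energy from *both* sources.
    for f in (220, 440, 660, 880):
        idx = np.argmin(np.abs(freqs - f))
        print(f"{f:>4} Hz bin magnitude: {spectrum[idx]:.1f}")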
This feels pretty similar to the inpainting/outpainting stuff being done in generative image editing (a la Photoshop) nowadays, but I don't think anywhere near the investment is being made in this field.
Very interested to hear anyone with expertise weigh in!
1) used PixBim AI to extract "stems" (drums, bass, piano, all guitars, vocals). Obviously a lossless source like FLAC works better than MP3 here (an open-source, scriptable alternative is sketched after this comment)
2) imported the stems into Pro Tools.
3) from there, I will usually re-record the bass, guitars, pianos and vocals myself. Occasionally the drums as well.
This is a pretty good way I've found to record covers of tracks at home: I can re-use the original drums if I want to, keep the tempo of the original track intact, etc., and obviously embellish/replace/modify/simplify the parts that I re-record.
It's a bit like drawing with tracing paper: you're creating a copy to the best of your ability, but you have a guide underneath to help you with placement.
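For anyone who wants to script step 1 rather than use PixBim, here's a rough sketch using the open-source Demucs separator instead (my substitution, not the tool used above). It assumes `pip install demucs` has put the `demucs` command on your PATH, and the model folder name ("htdemucs") depends on which Demucs version you have.

    # Rough sketch: stem extraction with the open-source Demucs CLI instead of
    # PixBim (a substitution on my part, not the workflow described above).
    import subprocess
    from pathlib import Path

    track = Path("my_song.flac")    # lossless source, as recommended above

    # Demucs writes drums/bass/vocals/other stems under separated/<model>/<track>/
    subprocess.run(["demucs", "-o", "separated", str(track)], check=True)

    for stem in sorted((Path("separated") / "htdemucs" / track.stem).glob("*.wav")):
        print("stem ready to import into the DAW:", stem)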
There's a bunch of other stuff that happens during and after summing which makes it much harder to reliably 100% reverse that process.
I was sincere when I said:
> I'm really not sure how one solves this with the current approaches.
I was hoping someone would come along and say it is, in fact, possible. :)
Audio Decomposition [Blind Source Separation]
> Open source separation of music into constituent instruments.
> The title is a bit confusing as open-source separation of ... reads like source separation, which this is not.
There have been several successful models for multi-track music transcription - see Google's MT3 project (https://research.google/pubs/mt3-multi-task-multitrack-music...). In the case of piano transcription, accuracy is nearly flawless at this point, even for very low-quality audio:
https://github.com/EleutherAI/aria-amt
Full disclaimer: I am the author of the above repo.
Scores are interpreted by musicians to create a performance, and MIDI is a capture of (some of) the data about that performance. Music engraving is full of implicit and explicit cultural rules, and getting it _right_ has parallels with handwritten kanji script in terms of both the importance of correctness to the reader, and the amount of traps for the unwary or uncultured.
All of which can be taken to mean "classical musicians are incredibly picky and anal about this stuff", or, "well-formed music notation conveys all sorts of useful contextual information beyond simply 'what note to play when'".
This is absolutely not easy, though, given all the cultural context. Things like picking up a "legato" or "cantabile" marking and choosing an accent vs a dagger or a marcato mark are going to be very difficult no matter what.
https://replicate.com/turian/multi-task-music-transcription
I ported their Colab to Replicate so I could use it more easily.
The MIDI output is... puzzling?
I've tried feeding it even simple stems and found the output unusable for some tracks, i.e. the MIDI output and audio were not well aligned and there were timing issues. On other audio it seemed to work fine.
Luckily for me, audio-to-seq approaches do work very well for piano, which turns out to be an amazing way of getting expressive MIDI data for training generative models.
It's just pitch bend?
I think trying to transcribe as MIDI is just a fundamentally flawed approach that has too many (well known) pitfalls to be useful.
A trained human can listen to a piece and transcribe it in seconds, but programming it as MIDI could take minutes/hours. If you're not trying to replicate how humans learn by ear, you're probably approaching this wrong.
https://magenta.tensorflow.org/datasets/maestro
Most current research involves refining deep learning based approaches to this task. When I worked on this problem earlier this year, I was interested in adding robustness to these models by training a sort of musical awareness into them. You can see a good example of it in this tweet:
https://hitnmix.com/ripx-daw-pro/
It can even export the separated tracks as MIDI files. It still has some problems, but it works very well. Stem separation is now standard in music software, and almost every DAW provides it.
I find moises (https://moises.ai/) to be easy to use for the tasks I need to do. It allows transposing or time scaling the entire song. It does stem separation and has a simple interface for muting and changing the volume on a per-track basis. It auto-detects the beat and chords.
I'm not affiliated, just a happy nearly-daily user for learning and practicing songs. I boost the original bass part and put everything else at < 10% volume to hear the bass part clearly (which often shows how bad online transcriptions are, even paid ones). Once I know the part, I mute the bass and play along with the original song as if I were the bass player.
I wonder why pricing information is so hard to find these days. I'd like to at least get a rough idea of what it costs.
It's an up-and-coming feature that nearly every DAW should have, but most don't yet.
Ableton Live - No
Bitwig - No
Cubase - No
FL - Yes
Logic - Yes
Pro Tools - No
Reason - No
Reaper - No
Studio One - Yes
Mixcraft - Yes
Maschine3 - Yes
https://github.com/samim23/polymath
Polymath is effective at isolating and extracting individual instrument tracks from MP3s. It works very well.
Trumpets produce a rich harmonic series with strong overtones, meaning their Fourier transform would show prominent peaks at integer multiples of the fundamental frequency. Instruments like flutes produce purer tones, but brass instruments typically have stronger higher harmonics, which would mean a richer set of partials in the matrix equation shown in the article.
So this script uses bandpass filtering and cross-correlation of attack/release envelopes to identify note timing. Given that brass instruments can exhibit non-linear behavior where the harmonic content changes significantly with playing intensity (think of the brightness difference between pp and ff passages), I'm not sure how this algorithm would handle intensity-dependent timbral variations. I'd consider adding intensity-dependent Fourier templates for each instrument to improve accuracy.
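For what it's worth, here's a toy sketch of the envelope cross-correlation idea as I understand it (my reading of the description, not the script's actual code). The intensity-dependent variant would just keep one attack template and one Fourier template per dynamic level and take whichever correlates best.

    # Toy sketch of bandpass + envelope cross-correlation for note timing;
    # my reading of the approach described above, not the script's actual code.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, correlate

    def band_envelope(audio, sr, lo, hi, smooth=512):
        """Bandpass around one note's fundamental, return a smoothed amplitude envelope."""
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        env = np.abs(sosfiltfilt(sos, audio))
        return np.convolve(env, np.ones(smooth) / smooth, mode="same")

    def note_onset(audio, sr, f0, attack_template):
        """Cross-correlate the band-limited envelope with an instrument's attack
        template (e.g. one per dynamic level) and return the best onset in seconds."""
        env = band_envelope(audio, sr, 0.9 * f0, 1.1 * f0)
        corr = correlate(env, attack_template, mode="valid")
        return np.argmax(corr) / sr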
In these situations (experimental music) source separation will produce completely unpredictable results that may or may not be useful for musical rebalancing.
Edit: to clarify, source separation in audio research means separating the mixed audio into separate clips, one per source.
While there are extensive libraries of fan-made song mappings, it's never enough, and there are very few mapped songs in languages other than English or Spanish (if you or your friends prefer your native language). Doing the entire mapping manually is time-consuming, not to mention that I am almost tone-deaf myself, which would make it even more difficult. I have been wondering for a long time what software I could use to make this process easier to automate. This seems like a great tool for capturing vocal timings and notes from original songs.
I have it on my bucket list to create a Singstar playlist in my native language and host a singing party with friends.
Does anyone have suggestions for other similar tools?
Sounds like the text file needs vocals and pitches along with timestamps. AI is getting to the point where its creation could be automated.
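If it helps, here's a hedged sketch of dumping detected vocal notes into an UltraStar-style .txt. The tag names, the ": start length pitch syllable" note lines, and the pitch reference are from memory, so double-check them against whatever your karaoke software expects.

    # Hedged sketch: writing detected vocal notes to an UltraStar-style .txt.
    # Tag names, field order and the pitch reference are assumptions from memory;
    # verify against the karaoke software you actually use.
    def write_ultrastar(path, title, artist, bpm, gap_ms, notes):
        """notes: list of (start_beat, length_beats, midi_pitch, syllable)."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"#TITLE:{title}\n#ARTIST:{artist}\n")
            f.write(f"#BPM:{bpm}\n#GAP:{gap_ms}\n")
            for start, length, pitch, syllable in notes:
                # ':' marks a normal note; pitch here is assumed to be relative
                # to middle C (MIDI note number minus 60).
                f.write(f": {start} {length} {pitch - 60} {syllable}\n")
            f.write("E\n")  # end-of-song marker

    write_ultrastar("song.txt", "Example", "Someone", 120, 500,
                    [(0, 4, 64, "Hel"), (4, 4, 66, "lo")])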
For myself: Adding a link I just found for reading further.
https://www.reddit.com/r/karaoke/comments/x61kzy/modern_equi...