There's no reason to lock yourself into an Intel-only solution. Just use DeepFilterNet. The results on audio from my noisy server room were insanely good: almost no voice dropout, with 100% fan noise removal.
https://github.com/Rikorose/DeepFilterNet
EDIT: Even more interesting, it looks like OpenVino is just DeepFilterNet glued to Whisper.cpp and tied to Intel hardware.
https://github.com/intel/openvino-plugins-ai-audacity/tree/m...
> OpenVino is just DeepFilterNet glued to Whisper.cpp and tied to Intel hardware.
Well, no.
When you want to run a model on a truly wide set of devices, you end up sort of wedged into one of ONNX, OpenVINO, TensorFlow Lite, or a few other frameworks.
They're all FOSS, and they're software libraries.
YMMV on which is best, of course, but broadly and widely: where are your users, mostly? Desktop? OpenVINO. Web? TensorFlow. Mobile and desktop? ONNX. This isn't entirely accurate; e.g., I reach for ONNX every time because that is what I'm familiar with. All of them make an effort to reach every platform; e.g., OpenVINO supports ARM, and not in a trivial manner.
That all being said, TL;DR:
It is "not even wrong", in the Pauli sense, to imply OpenVINO is Intel-only, and to describe OpenVINO as "just glu[ing a model to inference code]".
You're describing 3 different components (a hardware acceleration library, an inference library, and a model) and suggesting the hardware-accelerated inference library just glues together a model-specific inference library and a model. The matryoshka doll is inverted: whisper.cpp uses OpenVINO to accelerate its model-specific inference code.
https://docs.openvino.ai/2024/about-openvino/release-notes-o...
https://www.nvidia.com/en-us/geforce/broadcasting/broadcast-...
https://docs.openvino.ai/2024/about-openvino/release-notes-o...
Go look at CPU benchmarks on Phoronix; AMD Ryzen CPUs regularly trounce Intel CPUs using OpenVINO inference.
I found a very old audio cassette from my childhood of me and some other kids talking while a song plays in the background. I tried subtracting the song using Audacity, but for that to work the reference song and the recording must align "perfectly", which is very, very hard. It's not just the timing (which I found can be a problem with cassettes); the loudness and frequency distribution must also match perfectly.
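To make the alignment problem concrete, here is a rough plain-Python sketch of reference subtraction. It assumes the simplest possible case: the song on the tape is the reference shifted by a constant delay and scaled by a single gain. The function names (`best_lag`, `subtract_reference`) and the synthetic `song`/`tape` data are made up for illustration; this is not how Audacity or SmartSubtract work internally.

```python
import random

def best_lag(recording, reference, max_lag):
    """Find the shift (in samples) at which reference best lines up with recording."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation score for this candidate shift.
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(recording):
                score += recording[j] * r
        if score > best_score:
            best, best_score = lag, score
    return best

def subtract_reference(recording, reference, max_lag):
    """Align reference to recording, match its loudness, and subtract it."""
    lag = best_lag(recording, reference, max_lag)
    # Reference shifted onto the recording's time axis (zero-padded outside its range).
    shifted = [
        reference[j - lag] if 0 <= j - lag < len(reference) else 0.0
        for j in range(len(recording))
    ]
    # Least-squares gain: the single scale factor that minimizes the residual energy.
    num = sum(x * s for x, s in zip(recording, shifted))
    den = sum(s * s for s in shifted)
    gain = num / den if den else 0.0
    residual = [x - gain * s for x, s in zip(recording, shifted)]
    return lag, gain, residual

# Synthetic stand-ins: a noise-like "song", and a "tape" that is the same song
# delayed by 7 samples at half volume (no voices, no tape artifacts).
rng = random.Random(0)
song = [rng.uniform(-1, 1) for _ in range(400)]
tape = [0.0] * 7 + [0.5 * s for s in song] + [0.0] * 13

lag, gain, residual = subtract_reference(tape, song, max_lag=20)
```

This only cancels cleanly because the synthetic tape really is a delayed, scaled copy. A real cassette adds speed drift (so the delay changes over time) and a different frequency response (so one gain is not enough), which is exactly why the manual approach falls apart.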
I found Smartsubtract https://oxfordwaveresearch.com/products/smartsubtract/ which seems to do exactly this, but it's not available for download.
Is there any (AI even?) tool that might do that? I tried an online AI tool which claimed it could extract voices, but it returned silence. I want to try OpenVINO, but I'm not sure it will be useful for faint spoken words in a noisy environment with a song.
The next question on the problem would be "Give at least three reasons why this doesn't perfectly remove the reference sound," of course.
Other things to do would be to fix any tape warble or flutter, normalize the volume, and do simple things like high-pass filtering out anything below 75 Hz (as most voices don't make audible volume at those frequencies, especially children's voices).
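For a feel of what the 75 Hz high-pass does, here is a minimal plain-Python sketch using a first-order (RC-style) filter. That filter choice is an assumption for illustration: it rolls off gently, and an audio editor's high-pass is typically much steeper. The helper names (`high_pass`, `tone`, `rms`) and the test tones are hypothetical.

```python
import math

def high_pass(samples, sample_rate, cutoff_hz=75.0):
    """First-order high-pass: attenuates content below cutoff_hz, passes content above it."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for x in samples:
        # Each output follows the change in the input, decayed by alpha.
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

def tone(freq_hz, sample_rate, seconds=1.0):
    """Generate a sine tone as a list of floats."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

sample_rate = 44100
rumble = tone(20.0, sample_rate)    # stand-in for low-frequency tape/room rumble
voice = tone(1000.0, sample_rate)   # stand-in for energy in the speech range

# Fraction of each tone's level that survives the filter.
rumble_ratio = rms(high_pass(rumble, sample_rate)) / rms(rumble)
voice_ratio = rms(high_pass(voice, sample_rate)) / rms(voice)
```

The rumble tone comes out much quieter than it went in, while the 1 kHz tone is nearly untouched, which is the point of doing this before any finer cleanup.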
Then I would get a spectrum analyzer plugin and see if there are any spots that are clearly music vs children speaking and zap them out.
(Audition is pretty good software for this. You might still be able to find the download and mass unlock serial key that Adobe released for version 3.0 somewhere on the internet. Of course, this is only for people who bought it as a perpetual license back in the day and need to activate it now that the activation servers have gone offline, so no pirating it!)
I'm not gonna say it will be perfect, but you might do well enough to be able to hear what everyone is saying without it sounding too bizarre.
Also, from looking at the frequency spectrum and removing sounds there, I learned that music and speech both contain a bunch of different frequencies. To completely eliminate the music, I would have to remove all of them, which is not easy to see or trace manually.
From what you are saying, Audition, being an Adobe tool, should be sufficient.