Also, can it do grounding like CogVLM?
either way, great job!
The hope is to get more multimodal models out soon. I'd like to see if we can get Pixtral and Qwen2.5-VL in relatively soon.
Is there any more specific info available about who (llama.cpp or Ollama) removed what, where? As far as I can see, the server is still part of llama.cpp.
And more generally: Is this the moment when Ollama and Llama part ways?
"what are the coordinates of the bounding box for the rubber duck in the image [img]" >>> "[10,50,200,300]"
Ran the 11B yesterday and it worked great.
Still, it seems to understand what's in the images in general (cones, spheres, and cubes), and the fact that it runs on my MacBook at all is basically amazing.
I think the faux 3D of CLEVR images is too much for the model. It's interesting, because much smaller pre-transformer specialist models were very good at CLEVR.
How does this address the security concern of filenames being detected and read when not wanted?