Also, can it do grounding like CogVLM?
Either way, great job!
The hope is to get more multimodal models out soon. I'd like to see whether we can get Pixtral and Qwen2.5-VL in before long.
How does this address the security concern of filenames being detected and read when not wanted?