After announcing Gemma 2 at I/O 2024 in May, Google is today introducing PaliGemma 2 as its latest open vision-language model (VLM).
The first version of PaliGemma launched in May for use cases like captioning images and short videos, understanding text in images, object detection, object segmentation, and “visual question answering.”
PaliGemma 2 now touts “long captioning,” with the ability to generate “detailed, contextually relevant captions for images, going beyond simple object identification to describe actions, emotions, and the overall narrative of the scene.” It comes in 3B, 10B, and 28B parameter sizes, as well as 224px, 448px, and 896px resolutions.
There’s also “accurate optical character recognition and understanding the structure and content of tables in documents.” Google has found PaliGemma 2 to deliver leading performance in chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation.
Google says PaliGemma 2 is designed to be a “drop-in replacement” for those using the original model. Developers should see “immediate performance gains on most tasks without major code changes.” Another touted benefit is how easy it is to fine-tune for your specific tasks.
Pre-trained models and code for PaliGemma 2 are available today on Kaggle, Hugging Face, and Ollama.
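For developers pulling the weights from Hugging Face, a short captioning run with the transformers library might look like the sketch below. The checkpoint id and the “caption en” task prefix are assumptions carried over from PaliGemma conventions, so check the model cards on the hub for the exact names.

```python
# A minimal sketch, assuming the Hugging Face checkpoint is named
# "google/paligemma2-3b-pt-224" (verify the exact id on the hub) and
# that a recent transformers release with PaliGemma support is installed.
import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from transformers.image_utils import load_image

model_id = "google/paligemma2-3b-pt-224"  # assumed checkpoint id
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Any RGB image works; this URL is a stand-in example.
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
)

# "caption en" is the PaliGemma-style task prefix asking for an English caption.
inputs = processor(text="caption en", images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and print only the generated caption.
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
```

If the drop-in claim holds, code already written against the original PaliGemma should only need the checkpoint id swapped.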