Meta has released its first open-source models capable of processing both images and text, two months after the release of its last big AI model. The Llama 3.2 family includes small and medium-sized vision variants at 11 billion and 90 billion parameters, as well as more lightweight text-only models at 1 billion and 3 billion parameters that fit onto select mobile and edge devices.

The models will help developers create more advanced AI applications, such as AR apps with a real-time understanding of video, visual search engines that sort images based on content, or document analysis tools that can summarise large portions of text.

Of the Llama 3.2 variants, the 11-billion and 90-billion parameter models are vision models: they can understand charts and graphs, caption images, and locate objects from natural language prompts. The bigger model can also pinpoint details within an image to generate captions.
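As a rough illustration of how a developer might exercise the captioning ability described above, here is a minimal sketch assuming the vision models are available through the Hugging Face transformers library (v4.45 or later) under the model ID shown; the image URL and prompt are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Assumed Hugging Face model ID for the 11B vision-instruct variant.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image; any chart or photo would do.
url = "https://example.com/sales_chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# Chat-style prompt pairing the image with a natural language question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what this chart shows in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```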

The lightweight models, meanwhile, are text-only and meant to work on phones with Qualcomm, MediaTek, and other Arm hardware. They are designed for tasks such as summarising recent messages and sending calendar invites for meetings, and give developers a base for building personalised agentic apps, as in the sketch below.
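The message-summarisation use case could look something like the following sketch, again assuming the text-only 1B instruct variant is loaded through the Hugging Face transformers pipeline; the model ID and the sample messages are placeholders for illustration only.

```python
import torch
from transformers import pipeline

# Assumed Hugging Face model ID for the lightweight 1B instruct variant.
model_id = "meta-llama/Llama-3.2-1B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hypothetical recent messages a device might condense locally.
recent_messages = [
    "Alex: Can we move Friday's standup to 10 am?",
    "Priya: Works for me, I'll update the invite.",
    "Alex: Great, see you then.",
]

messages = [
    {"role": "system", "content": "Summarise the conversation in one sentence."},
    {"role": "user", "content": "\n".join(recent_messages)},
]

outputs = pipe(messages, max_new_tokens=60)
# The pipeline returns the full chat; the last entry is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
```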

Although Meta is playing catch-up with multimodal models from rival AI companies, Llama 3.2 is comparable with Anthropic’s Claude 3 Haiku and OpenAI’s GPT-4o mini at image recognition and visual understanding tasks, Meta said.

The company said Llama 3.2 outperformed Gemma and Phi 3.5-mini at certain tasks, such as prompt rewriting, instruction following, and summarisation.

Published - September 26, 2024 02:50 pm IST