Following the recent showdown between OpenAI's and Google's latest AI offerings, Meta's AI researchers appear ready to join the contest with a multimodal model of their own.

Multimodal AI models are evolved versions of large language models: they can process various forms of media, such as text, images, sound recordings, and videos.

For example, you can now open your camera and ask OpenAI's latest GPT-4 AI model to write a description of your surroundings.

Chameleon: Meta's Early-Fusion Approach to Multimodal AI

Facebook-parent Meta is looking to launch a similar tool with its own multimodal model, Chameleon. According to Meta's Chameleon team, the model is a series of 'early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence.'

An improvement over an earlier technique called late fusion, Chameleon does not need to process each data type as a separate entity. Its early-fusion architecture promises to go beyond the limitations of the late-fusion approach, which handles each modality in its own pipeline before combining the results.

TechXplore explains that the team developed a system that seamlessly integrates different data, such as images, text, and code, by converting them into a common set of tokens.

This approach, similar to how large language models process words, allows for advanced computing techniques to be applied to mixed input data.

Using a unified vocabulary, the system can efficiently handle and transform various data types together, enhancing the overall processing and understanding of complex information.
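To make the idea concrete, here is a minimal sketch of what early fusion looks like in practice: both words and image content are mapped into one shared token vocabulary, so a single model can consume them as a single sequence. This is not Meta's actual code; the tokenizer functions, vocabulary sizes, and offsets below are illustrative assumptions.

```python
# Minimal sketch of early fusion: text and image content are mapped into
# one shared token vocabulary and interleaved into a single sequence.
# All names, sizes, and the toy "image tokenizer" are illustrative
# assumptions, not Meta's actual Chameleon implementation.

import numpy as np

TEXT_VOCAB_SIZE = 50_000      # assumed size of the text sub-vocabulary
IMAGE_CODEBOOK_SIZE = 8_192   # assumed number of discrete image codes
# Image tokens are offset so they occupy their own range of the shared vocabulary.
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE

def tokenize_text(text: str) -> list[int]:
    """Toy text tokenizer: hash each word into the text token range."""
    return [hash(word) % TEXT_VOCAB_SIZE for word in text.split()]

def tokenize_image(image: np.ndarray, num_tokens: int = 16) -> list[int]:
    """Toy image tokenizer: quantize patch averages into discrete codes,
    standing in for a learned vector-quantized image tokenizer."""
    patches = np.array_split(image.flatten(), num_tokens)
    codes = [int(p.mean() * (IMAGE_CODEBOOK_SIZE - 1)) for p in patches]
    return [IMAGE_TOKEN_OFFSET + c for c in codes]

# Build one mixed-modal sequence: a downstream transformer sees only token IDs,
# so text and image content flow through the same model with no separate branches.
image = np.random.rand(64, 64)
sequence = (
    tokenize_text("Describe the following picture")
    + tokenize_image(image)
    + tokenize_text("A photo of")
)
print(sequence[:10], "... total tokens:", len(sequence))
```

Because everything arrives as token IDs from the same vocabulary, the model can, in principle, generate images and text in any interleaved order, which is the behavior the Chameleon team describes.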


(Photo: KIRILL KUDRYAVTSEV/AFP via Getty Images) The Meta logo displayed on a tablet screen in Moscow on November 11, 2021.

Meta's Chameleon Outshines Larger Models in Multimodal AI Tasks

Unlike Google's Gemini, Chameleon is an end-to-end model, meaning it handles the entire process, from input to output, within a single network.

The researchers introduced novel training techniques to enable Chameleon to work with diverse token types. This involved a two-stage learning process and a massive dataset of roughly 4.4 trillion tokens spanning text, image-text pairs, and interleaved text-image data.

The team trained versions of the system with 7 billion and 34 billion parameters, using more than 5 million hours on high-speed GPUs. In comparison, OpenAI's GPT-4 reportedly has 1 trillion parameters.

In a paper posted to the arXiv preprint server, the team shared the promising results the model achieved during testing.

The outcome is a multimodal model that exhibits impressive versatility, achieving state-of-the-art performance in image captioning tasks. According to the researchers, this model not only surpasses Llama-2 in text-only tasks but also holds its own against models like Mixtral 8x7B and Gemini-Pro. Additionally, it performs sophisticated image generation, all within a single, unified framework.

They also state that, on certain tests, Chameleon matches or even outperforms much larger models such as Gemini Pro and GPT-4.

Stay posted here at Tech Times.


Tech Times Writer John Lopez

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.