Apple has revealed its latest development in artificial intelligence (AI), introducing the MM1 family of multimodal large language models (LLMs) capable of interpreting both image and text data.

According to Tech Xplore, the unveiling reflects Apple's ongoing effort to strengthen its AI capabilities. The MM1 models apply multimodal AI to tasks such as image captioning, visual question answering, and query learning.


What Is a Multimodal Model?

A multimodal model is an AI model capable of processing and interpreting data from multiple modalities or sources. These modalities can include text, images, audio, video, or any other form of data.

Multimodal models integrate information from different modalities to gain a more comprehensive understanding of the input data, enabling them to perform various tasks such as image captioning, visual question answering, and more.

They are instrumental in tasks requiring understanding and processing information from diverse sources simultaneously, leading to more context-aware and accurate interpretations than single-mode AI systems.
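
To make the idea concrete, here is a minimal, hypothetical sketch in Python (using PyTorch) of how a multimodal model can fuse two modalities: image features are projected into the same embedding space as text tokens so a single transformer can attend over both. The class name, dimensions, and layer counts are illustrative assumptions, not Apple's MM1 architecture.

```python
import torch
import torch.nn as nn

class MinimalMultimodalModel(nn.Module):
    """Toy fusion model: projects image features into the text
    embedding space so one transformer attends over both modalities.
    Hypothetical sizes; not Apple's MM1 architecture."""

    def __init__(self, vocab_size=32000, d_model=512, image_feat_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Connector: maps pre-extracted image features to "visual tokens"
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, text_ids):
        # image_feats: (batch, num_patches, image_feat_dim)
        # text_ids:    (batch, seq_len) integer token ids
        visual_tokens = self.image_proj(image_feats)         # (B, P, d_model)
        text_tokens = self.text_embed(text_ids)              # (B, T, d_model)
        fused = torch.cat([visual_tokens, text_tokens], 1)   # one shared sequence
        hidden = self.backbone(fused)
        return self.lm_head(hidden)  # token logits over the fused sequence

model = MinimalMultimodalModel()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```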

Apple Develops MM1: A Multimodal LLM

With parameters numbering up to 30 billion, these multimodal models are engineered to process and analyze a variety of data inputs, including images, text, and documents containing both. 

By integrating different data modalities, the MM1 models aim to achieve a more comprehensive understanding of complex information, potentially leading to more accurate interpretations.

The researchers highlighted one noteworthy feature: MM1's capacity for in-context learning, which allows the model to draw on examples and context supplied in the prompt itself, without retraining, and to carry that context across multiple interactions. This capability enhances the model's adaptability and responsiveness, allowing it to deliver more relevant responses to user queries.
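
As an illustration of in-context learning, the hypothetical snippet below assembles a few-shot prompt in which earlier (image, question, answer) examples condition the answer for a new image, with no weight updates. The <image:...> placeholder tokens and file names are invented for the example.

```python
# Hypothetical few-shot prompt for a multimodal model: prior worked
# examples teach the task inside the prompt -- the essence of
# in-context learning.
few_shot_examples = [
    {"image": "apples.jpg",  "question": "How many fruits?", "answer": "3"},
    {"image": "oranges.jpg", "question": "How many fruits?", "answer": "5"},
]
query = {"image": "pears.jpg", "question": "How many fruits?"}

prompt_parts = []
for ex in few_shot_examples:
    prompt_parts.append(f"<image:{ex['image']}> Q: {ex['question']} A: {ex['answer']}")
prompt_parts.append(f"<image:{query['image']}> Q: {query['question']} A:")
prompt = "\n".join(prompt_parts)
print(prompt)
```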

Additionally, the MM1 models demonstrate capabilities such as object counting, object identification, and common-sense reasoning, enabling them to offer insights based on image content. This versatility makes the MM1 models suitable for various applications, from image analysis to natural language understanding.

Read Also: [RUMOR] Next-Gen iPad Air From China to be Shipped in 2024: Is it Coming With M2 Chip?

The Family of MM1 Models

In the study's abstract, researchers provide insights into the architecture and design choices that have contributed to the MM1 models' reported success. 

They emphasize the importance of leveraging diverse pre-training data sources, including image-caption pairs, interleaved image-text data, and text-only documents, to achieve competitive results across various benchmarks.
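
As a toy sketch of what such a mixture can look like in practice, the snippet below draws training examples from the three source types according to fixed weights. The weights are illustrative assumptions, not the ratios reported in the MM1 paper.

```python
import random

# Hypothetical pre-training mixture over the three data source types
# named in the paper; the sampling weights here are made up.
sources = {
    "image_caption_pairs":    0.45,
    "interleaved_image_text": 0.45,
    "text_only":              0.10,
}

def sample_source(rng=random):
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

batch = [sample_source() for _ in range(8)]
print(batch)
```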

Furthermore, the researchers underscore the impact of the image encoder and resolution on model performance, highlighting the significance of these components in multimodal AI systems.

By refining and scaling this approach, the research team developed a family of multimodal models that excel in pre-training metrics and demonstrate competitive performance across a range of benchmarks.

"By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks," the researchers said.

"Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting," they added. 

The research team's findings were published on the preprint server arXiv.

Related Article: Apple Agrees to Pay $490 Million to Settle Lawsuit on Misleading Shareholders About Its Business in China


ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.