Native Multimodal Intelligence: From Language Models to Omni-Modality

Executive Summary

This briefing document synthesizes key insights from a technical presentation by Victoria Lynn (Thinking Machines Lab) regarding the evolution of native multimodal intelligence. The core thesis is that while Large Language Models (LLMs) have achieved breakthroughs via next-token prediction over symbolic information, true artificial intelligence requires the seamless integration of multimodal signals—images, audio, and video—to interact with the physical and digital worlds.

The document explores the shift toward “native” multimodal models that tokenize all information types into a unified transformer architecture. Key highlights include the distinction between discrete tokenization (Chameleon) and hybrid diffusion-autoregressive models (Transfusion), and the introduction of the Mixture-of-Transformers (MoT) architecture. A critical finding discussed is the “transfer gap”: improved understanding enhances generation, but there is so far little evidence that training for generation improves a model’s understanding capabilities. This suggests that language remains a unique, highly compressed abstraction of human reasoning, one that sensory data (like video frames) has yet to match in learning efficiency.


  1. The Paradigm of Native Multimodality

Native multimodal intelligence moves beyond using language models as simple controllers for external tools. Instead, it aims to build systems where multiple modalities are processed within a single, unified architecture.

Tokenization Across the Board

The state-of-the-art approach treats all incoming information as tokens, regardless of the original modality:

  • Text: Processed via standard methods like Byte Pair Encoding (BPE).

  • Images: Undergo “patchification,” where images are divided into standardized units (e.g., 16x16 pixels). These patches are vectorized, sequentialized, and treated as tokens.

  • Audio: Waveforms are transformed and tokenized into representations that a transformer can process.

  • Video: Treated as a sequence of image frames; patches from all frames are concatenated into a single sequence of tokens.
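
The image and video bullets above can be sketched in a few lines. This is a minimal illustration, not the pipeline of any particular model: the helper names `patchify` and `video_tokens` are hypothetical, the image is a plain nested list of grayscale values, and a real system would additionally project each flattened patch into the transformer's embedding space.

```python
def patchify(image, patch=16):
    """Split an H x W image (a list of rows) into flattened patches, in raster order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch block into a single vector ("vectorize").
            vec = [image[r][c]
                   for r in range(top, top + patch)
                   for c in range(left, left + patch)]
            tokens.append(vec)
    return tokens


def video_tokens(frames, patch=16):
    """Concatenate the patch tokens of every frame into one long sequence."""
    seq = []
    for frame in frames:
        seq.extend(patchify(frame, patch))
    return seq


# A hypothetical 32x48 grayscale image yields (32/16) * (48/16) = 6 patches,
# each holding 16 * 16 = 256 pixel values.
image = [[(r * 48 + c) % 256 for c in range(48)] for r in range(32)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 6 256
```

Because video tokens are simply the per-frame patches laid end to end, sequence length grows linearly with the number of frames, which is one reason video is expensive to model this way.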

Categories of Multimodal Models

  1. Multimodal Input / Text Output: Models like Gemini, Qwen, and Kimi. These excel at understanding multimodal context but communicate their reasoning or answers exclusively through text.

  2. Omni Models: Models like GPT-4o and Chameleon. These are “omni” because they handle multimodal information as both input and output, capable of generating images and audio directly.


  2. Architectural Frameworks for Omni Models

The research landscape is currently divided between two primary philosophies for generating non-text modalities.

The Chameleon Family (Discrete Tokenization)

Chameleon operates on the hypothesis that every modality can be converted into discrete tokens.

  • Mechanism: Uses Vector Quantized Variational Autoencoders (VQ-VAE) to map image patches to a learned vector codebook. This converts an image into a sequence of discrete indices.

  • Training: Employs a standard cross-entropy language modeling objective on interleaved text and image sequences.

  • Limitations: Information loss occurs during discretization, leading to a performance gap in understanding compared to continuous encoders (like SigLIP). It is also token-inefficient, requiring vast amounts of data to sample well-formed images.
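
The codebook lookup at the heart of this approach can be sketched as a nearest-neighbor search. This is a toy illustration under stated assumptions, not Chameleon's tokenizer: the function names are hypothetical, the codebook is hand-written rather than learned, and vectors are 2-dimensional for readability.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(patch_vectors, codebook):
    """Map each continuous patch vector to a discrete token id."""
    return [nearest_code(vec, codebook) for vec in patch_vectors]


# Toy codebook with 4 entries; real codebooks are learned and far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 0.1], [0.6, 0.8]]
ids = quantize(patches, codebook)
print(ids)  # [0, 1, 3]

# Decoding returns codebook entries, not the original vectors.
reconstructed = [codebook[i] for i in ids]
```

The last line shows where the information loss comes from: `[0.6, 0.8]` comes back as `[1.0, 1.0]`, and the difference is unrecoverable once only the index is kept.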

The Transfusion Approach (Hybrid Diffusion)

Transfusion attempts to overcome the limitations of discrete tokens by combining two different mathematical objectives.

  • Mechanism: It integrates auto-regressive next-token prediction for text with a diffusion objective for images.

  • Architecture: Text tokens use causal attention, while image segments use bidirectional attention for better representation.

  • Performance: Produces higher-quality images with better token efficiency than discrete-token models. However, a “dilemma” remains: the representations that work best for image generation (VAE latents) are often not the ones that work best for image understanding.
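
The mixed attention pattern described above (causal for text, bidirectional within an image) can be sketched as a boolean mask. This is an illustrative reconstruction, not code from the Transfusion authors: the function name and the labeling scheme (`"text"` for text tokens, a shared label such as `"img0"` for all tokens of one image) are assumptions made for the example.

```python
def transfusion_style_mask(modalities):
    """Build an attention mask from per-token labels.

    mask[i][j] is True when token i may attend to token j. Every token sees
    its past (causal); tokens of the same image also see each other.
    """
    n = len(modalities)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_image = modalities[i] != "text" and modalities[i] == modalities[j]
            mask[i][j] = causal or same_image
    return mask


labels = ["text", "img0", "img0", "img0", "text"]
mask = transfusion_style_mask(labels)
for row in mask:
    print("".join("1" if ok else "0" for ok in row))
# 10000
# 11110
# 11110
# 11110
# 11111
```

The three middle rows are identical: every patch of the image attends to every other patch of that image, while the text tokens before and after it keep the usual lower-triangular pattern.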


  3. Innovation: Mixture-of-Transformers (MoT)

To address the differing information densities of various modalities, the Mixture-of-Transformers architecture introduces modality-specific parameters within the transformer backbone.

Structural Components

Instead of a single set of parameters, MoT employs independent sets of transformer parameters (attention Q/K/V projections and feed-forward layers) for each modality:

  • Deterministic Routing: Tokens are routed to specific parameters based on their modality (e.g., a text token activates text parameters; an image token activates image parameters).

  • Joint Attention: Despite separate parameters, all tokens undergo a joint attention process to allow for cross-modality information transfer.

  • Asynchronous Training: This architecture allows developers to “freeze” a strong pre-existing text model and add new modalities (like speech or image generation) by training only the new, modality-specific parameters.
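
The three components above can be sketched together in one simplified layer. This is a toy under stated assumptions rather than the MoT implementation: `mot_layer` and the `params` layout are hypothetical, the feed-forward block is reduced to a single matrix, and causal masking, residual connections, normalization, and nonlinearities are all omitted.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mot_layer(tokens, params):
    """tokens: list of (modality, vector) pairs.
    params[modality] holds that modality's own "q", "k", "v", "ffn" matrices."""
    # Deterministic routing: each token is projected with its modality's weights.
    q = [matvec(params[m]["q"], x) for m, x in tokens]
    k = [matvec(params[m]["k"], x) for m, x in tokens]
    v = [matvec(params[m]["v"], x) for m, x in tokens]

    outputs = []
    for i, (modality, _) in enumerate(tokens):
        # Joint attention: token i scores against every token, whatever its modality.
        scores = [sum(a * b for a, b in zip(q[i], k_j)) for k_j in k]
        weights = softmax(scores)
        mixed = [sum(w * v_j[d] for w, v_j in zip(weights, v))
                 for d in range(len(v[0]))]
        # The attended vector goes through the modality-specific feed-forward weights.
        outputs.append((modality, matvec(params[modality]["ffn"], mixed)))
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
params = {
    "text": {"q": identity, "k": identity, "v": identity, "ffn": identity},
    "image": {"q": identity, "k": identity, "v": identity, "ffn": double},
}
tokens = [("text", [1.0, 0.0]), ("image", [0.0, 1.0])]
out = mot_layer(tokens, params)

# "Freezing" the text model: only the other modalities' parameters get updated.
frozen = {"text"}
trainable = sorted(m for m in params if m not in frozen)
print(trainable)  # ['image']
```

Keeping the parameters in a per-modality dictionary makes the freezing story straightforward: adding a new modality means adding a new key and training only the weights under it, while attention still mixes information across all tokens.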

Performance Benefits

| Feature | MoT Impact |
|---|---|
| Non-Text Generation | Substantially improves quality for image and speech generation. |
| Text Performance | Maintains the high performance of dense text-only baselines. |
| Training Stability | Enhances overall stability during mixed-modal training. |
| Scaling | Allows customized scaling (e.g., more experts for text, fewer for images). |


  4. The Understanding-Generation Transfer Gap

A significant observation in recent research is the asymmetrical relationship between understanding and generation in non-text modalities.

  • Understanding → Generation: There is a strong positive transfer. Better understanding and reasoning capabilities allow a model to plan and generate more detailed, accurate images with fewer hallucinations.

  • Generation → Understanding: There is little evidence that training a model for image or video generation improves its understanding performance.

The “Language as Abstraction” Hypothesis

The talk highlights a “puzzling” phenomenon: next-token prediction in language leads to emergent intelligence, but next-frame prediction in video does not yet produce similar results. Hypothesized reasons include:

  • Compression: Language is a highly compressed abstraction of human reasoning and action.

  • Passive vs. Subjective: Images and video are “passive observations” of the sensory world, whereas language is a “subjective interpretation.”

  • Redundancy: Next-frame prediction in video is often computationally redundant due to high similarity between consecutive frames.


  5. Future Directions and Embodied AI

While current models excel at digital information processing (e.g., understanding PDFs, infographics, and code), they remain far from achieving “physical world intelligence.”

  • Embodied AI: MoT-style architectures are being adopted in robotics to predict “action vectors” as a separate modality, allowing the model’s linguistic knowledge to assist in physical tasks.

  • Spatial Reasoning: Current “patchify and encode” paradigms work well for 2D data like infographics but struggle with spatial-temporal understanding required for real-world navigation.

  • Unification: A major open research question is whether a single representation can eventually capture enough information for perception, reasoning, and generation simultaneously, similar to human cognitive hierarchies.

“Language modeling alone is not enough... the goal is to build AI systems that are not only processing symbolic knowledge but are also able to seamlessly handle multimodal information.” — Victoria Lynn
