Take Home Message: Janus is a simple, unified, and extensible model for multimodal understanding and generation that decouples the visual encoding used for understanding from that used for generation, mitigating potential conflicts between the two tasks; it can also be extended to additional input modalities in the future. Janus-Pro builds on this foundation by optimizing the training strategy (increasing the number of training steps, adjusting data ratios, etc.), adding more data (including synthetic data), and scaling up the model (to 7 billion parameters), which advances the model's multimodal understanding and text-to-image instruction-following capabilities.

Code address

Janus Pro address

Janus-Pro is an upgraded version of the earlier Janus, featuring (1) an optimized training strategy, (2) expanded training data, and (3) a larger model size. With these improvements, Janus-Pro makes significant advances in multimodal understanding and text-to-image instruction following, while also improving the stability of text-to-image generation. Before unpacking Janus-Pro, let's review Janus.

Reviewing Janus

Its predecessor, Janus, is an autoregressive framework for unified multimodal understanding and generation whose key idea is to decouple the visual encoding used by the two tasks. For multimodal understanding, prior designs typically follow LLaVA, using a visual encoder as a bridge that lets a large language model understand images. For generation, most approaches are based on diffusion models, and some on autoregressive methods. Some works attempt to unify multimodal understanding and generation in a single Transformer, typically using a single visual encoder to process the inputs of both tasks.

However, the two tasks require different representations. In multimodal understanding, the visual encoder aims to extract high-level semantic information (e.g., object categories or visual attributes), and the output involves not only extracting information from the image but also complex semantic reasoning, so the encoder focuses mainly on high-level semantic representations. The generation task is mainly concerned with producing local details and maintaining global consistency in the image, and therefore requires low-level encodings of spatial structure and texture detail. Forcing both tasks to share representations in the same space leads to conflicts.

Janus contains two independent visual encoding paths, one for multimodal understanding and one for generation, which brings two benefits: 1) it mitigates the conflicts that stem from the different representation granularities the two tasks require, and 2) it is flexible and extensible: once decoupled, each task can adopt the state-of-the-art encoding techniques of its own domain, and in the future inputs such as point clouds, EEG signals, or audio could be encoded separately and processed by the same unified Transformer.

For text understanding, text is converted into discrete IDs using the LLM's built-in tokenizer;

For multimodal understanding, high-level semantic features are extracted from images using a SigLIP encoder (author's note: Cosmos also uses SigLIP encoders in its Guardrails section), and the extracted features are mapped into the LLM's text feature space using an adaptor (a 2-layer MLP);

The long side is resized to 384 pixels and the short side is padded to 384 pixels with RGB(127, 127, 127);

For visual generation, the image is converted into discrete IDs using a VQ tokenizer, and each ID is mapped into the LLM's text feature space using an adaptor (a 2-layer MLP);

The short side is resized to 384 pixels and the long side is cropped to 384 pixels;
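
As a rough illustration of the two resizing schemes above, here is a minimal sketch using PIL; the interpolation mode and whether the padding/cropping is centered are assumptions, not details spelled out in the paper:

```python
from PIL import Image

def preprocess_for_understanding(img: Image.Image, size: int = 384) -> Image.Image:
    """Understanding path: resize the long side to `size`, pad the short side with RGB(127, 127, 127)."""
    w, h = img.size
    scale = size / max(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    canvas = Image.new("RGB", (size, size), (127, 127, 127))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))  # centered padding (assumption)
    return canvas

def preprocess_for_generation(img: Image.Image, size: int = 384) -> Image.Image:
    """Generation path: resize the short side to `size`, crop the long side to `size`."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left, top = (img.width - size) // 2, (img.height - size) // 2  # center crop (assumption)
    return img.crop((left, top, left + size, top + size))
```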

Overall training was performed using 16 nodes, each containing 8 Nvidia A100 GPUs;

For both visual generation and multimodal understanding tasks, the image feature sequence and the text feature sequence are concatenated as input to the LLM (DeepSeek-LLM 1.3B in the paper);

The built-in prediction head of the LLM is utilized for text predictions in both the pure text understanding and multimodal understanding tasks, while a randomly initialized prediction head is used for image predictions in the visual generation task. The entire model adheres to an autoregressive framework without the need for specially designed attention masks.
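
Putting the pieces together, the data flow can be sketched roughly as follows. This is a schematic PyTorch-style sketch, not DeepSeek's released code; the module and argument names are stand-ins for the components described above, and the LLM is assumed to accept input embeddings directly and return hidden states:

```python
import torch
import torch.nn as nn

class JanusSketch(nn.Module):
    """Schematic of Janus's decoupled visual encoding: two independent encoders feed
    one autoregressive LLM through separate 2-layer MLP adaptors."""

    def __init__(self, llm, und_encoder, d_und, d_gen, d_model, image_vocab):
        super().__init__()
        self.llm = llm                  # e.g. DeepSeek-LLM 1.3B backbone (text head built in, not shown)
        self.und_encoder = und_encoder  # SigLIP encoder: high-level semantic features for understanding
        # understanding adaptor: maps SigLIP features into the LLM's text feature space
        self.und_adaptor = nn.Sequential(nn.Linear(d_und, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        # generation adaptor: VQ-token IDs -> codebook embedding -> 2-layer MLP into the same space
        self.gen_adaptor = nn.Sequential(nn.Embedding(image_vocab, d_gen),
                                         nn.Linear(d_gen, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.image_head = nn.Linear(d_model, image_vocab)  # randomly initialized head for image-token prediction

    def forward(self, text_embeds, image=None, image_ids=None, task="understanding"):
        if task == "understanding":
            vis = self.und_adaptor(self.und_encoder(image))    # (B, N, d_model) semantic features
        else:
            vis = self.gen_adaptor(image_ids)                  # discrete VQ IDs from the VQ tokenizer
        # image and text sequences concatenated (segment ordering simplified here)
        hidden = self.llm(torch.cat([vis, text_embeds], dim=1))
        return hidden if task == "understanding" else self.image_head(hidden)
```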

Janus training is divided into 3 phases:

Phase 1

Train the adaptors and the image head to build connections between linguistic and visual elements in the embedding space, enabling the LLM to understand entities in images and to acquire initial visual generation capabilities;

For multimodal understanding, use 1.25 million image-text caption pairs from ShareGPT4V in the format: <image><text>;

For visual generation, use 1.2 million samples from ImageNet-1k in the format: <category name><image>;

Phase 2

Unified pre-training on a multimodal corpus to learn both multimodal understanding and generation. Plain text data, multimodal understanding data, and visual generation data are all used in this phase. Visual generation training starts with simple ImageNet-1k data, followed by generic text-to-image data to strengthen the model's open-domain visual generation;

Plain text data: DeepSeek-LLM pre-trained corpus;

Interleaved image-text data: WikiHow and WIT datasets;

Image caption data: images from multiple sources, with some images re-captioned using open-source multimodal models; the data is formatted as Q&A pairs, e.g. <image>Describe the image in detail.<caption>;

Table and chart data: the corresponding table and chart data from DeepSeek-VL, in the format <question><answer>;

Visual generation data: image-caption pairs from multiple datasets plus 2 million in-house samples;

During training, with 25% probability only the first sentence of the caption is used;

ImageNet samples appear only in the first 120K training steps; images from the other datasets appear in the subsequent 60K steps;

Phase 3

Supervised fine-tuning, where the pre-trained model is fine-tuned with instruction-tuning data to strengthen its instruction-following and dialogue abilities. All parameters are fine-tuned except the generation encoder. System and user prompts are masked, and supervision is applied only to the answers. To ensure that Janus is proficient in both multimodal understanding and generation, the model is not fine-tuned separately for specific tasks; instead, a mix of text-only dialogue data, multimodal understanding data, and visual generation data is used to ensure versatility across scenarios;

Text understanding: uses data from specific sources;

Multimodal understanding: uses data from multiple sources for instruction tuning;

Visual generation: uses a subset of image-text pairs from some of the Phase 2 datasets, plus 4 million in-house samples;

The data format is: User:<Input Message> \n Assistant: <Response>;
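
For illustration, one SFT sample in that format could be assembled along the following lines (a minimal, hypothetical helper; the exact whitespace and special tokens used by Janus are assumptions):

```python
from typing import Optional

def build_sft_sample(user_message: str, response: Optional[str] = None) -> str:
    r"""Assemble a sample in the 'User:<Input Message> \n Assistant: <Response>' format described above."""
    sample = f"User:{user_message} \n Assistant:"
    if response is not None:
        sample += f" {response}"
    return sample

# A multimodal understanding turn, where <image> marks where image features are spliced in:
print(build_sft_sample("<image>Describe the image in detail.", "A golden retriever lying on a wooden deck."))
```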

Training Objectives

Janus is an autoregressive model trained with a cross-entropy loss. For plain-text understanding and multimodal understanding tasks, the loss is computed over the text sequence; for visual generation tasks, the loss is computed only over the image sequence. To keep the design simple, no task-specific loss weights are assigned.
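
Concretely, the same next-token cross-entropy can be applied with a per-task mask selecting which positions contribute to the loss. Below is a minimal sketch (tensor shapes and the masking convention are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def masked_autoregressive_loss(logits: torch.Tensor, targets: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the positions selected by loss_mask.

    For text and multimodal understanding, loss_mask covers the text (answer) positions;
    for visual generation, it covers only the image-token positions.
    Shapes: logits (B, T, V), targets (B, T), loss_mask (B, T) boolean."""
    logits, targets, mask = logits[:, :-1], targets[:, 1:], loss_mask[:, 1:]  # position t predicts token t+1
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none")
    return (per_token * mask.reshape(-1)).sum() / mask.sum().clamp(min=1)
```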

Inference

Inference follows next-token prediction: for plain-text and multimodal understanding, tokens are sampled sequentially from the predicted distribution. For image generation, classifier-free guidance (CFG) is used.
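
A minimal sketch of classifier-free guidance at the image-token level follows; the guidance scale and the way the unconditional branch is formed (e.g. by dropping the text condition) are assumptions:

```python
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, scale: float = 5.0) -> torch.Tensor:
    """Classifier-free guidance: push the conditional prediction away from the unconditional one,
    logit_cfg = uncond + scale * (cond - uncond), before sampling the next image token."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# At each autoregressive step the model is run twice, with and without the text condition;
# the combined logits from cfg_logits() are then used to sample the next image token.
```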

Possible extensions

For multimodal understanding, 1) a stronger visual encoder could be chosen, and 2) dynamic high-resolution techniques could be used;

For visual generation, 1) a more fine-grained encoder could be chosen, 2) loss functions specifically designed for visual generation could be used, and 3) causal attention could be combined with parallel decoding methods;

More modalities: inputs such as 3D point clouds, haptics, and EEG could be integrated;

Janus-Pro Upgrade

With limited training data and a relatively small model capacity (1B), Janus falls short in some respects, such as poor image generation from short prompts and inconsistent text-to-image quality. The architecture of Janus-Pro is identical to that of Janus.

Main Improvements

Training Strategy

Stage 1: Increase the number of training steps and fully train on ImageNet;

Stage 2: No longer use ImageNet, directly use regular text-to-image data for training;

Stage 3: Modify the dataset ratios during fine-tuning, changing the ratio of multimodal data, plain-text data, and text-to-image data from 7:3:10 to 5:1:4;
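
As a rough illustration of what the ratio change means in practice, a hypothetical data-mixture sampler might look like this (Janus-Pro's actual data loader is not described at this level of detail):

```python
import random

def sample_task(ratios=(5, 1, 4)):
    """Pick the source of the next fine-tuning sample according to the
    multimodal : plain-text : text-to-image ratio (5:1:4 in Janus-Pro, 7:3:10 in Janus)."""
    tasks = ["multimodal_understanding", "plain_text", "text_to_image"]
    return random.choices(tasks, weights=ratios, k=1)[0]
```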

Data Scale

Multimodal understanding

Stage 2: Add about 90 million samples, including YFCC for image captioning and Docmatix for table, chart, and document understanding;

Stage 3: Add additional datasets from DeepSeek-VL2, such as MEME understanding data;

Visual generation: real-world data can be of poor quality, leading to unstable text-to-image generation and aesthetically poor outputs. Janus-Pro therefore adds 72 million synthetic aesthetic samples and uses a 1:1 ratio of real to synthetic data during unified pre-training (Stage 2);

Model Scale

Scale the model up to 7 billion parameters;

Experimental details

The experimental details of Janus-Pro are essentially the same as those of Janus; the larger model simply uses more cluster nodes (32 instead of 16).

Janus-Pro training hyperparameters

Limitations

For multimodal understanding, the input resolution is limited to 384×384, which hurts performance on fine-grained visual tasks. For text-to-image generation, the low resolution leads to a lack of detail in the generated results.
