Explosion! DeepSeek’s Chinese New Year gift: a detailed explanation of the multimodal model Janus-Pro
DeepSeek’s latest Janus-Pro model directly connects the “left and right brains” of multimodal AI!
This dual-capability powerhouse, which handles image-text understanding and image generation at the same time, is rewriting the industry’s rules with its self-developed framework.
This is not a simple stacking of features: by decoupling the visual encoding paths, the model achieves genuine “one mind, two uses”.
Traditional multimodal models are like writing and drawing with the same hand, while Janus-Pro equips the AI with two separate neural systems!
Framework revolution: solving the long-standing problem of multimodality
Janus-Pro’s boldest innovation is splitting visual encoding into two independent channels.
It is like giving the AI an eye for understanding and a hand for creation, so the model no longer struggles when switching between “describe this picture” and “text-to-image”.
Its greatest breakthrough lies in its brand new unified architecture design. This architecture consists of three core components:
DeepSeek-LLM (autoregressive Transformer): the core language model
SigLIP-L @ 384×384: encodes images for understanding
VQ tokenizer based on LlamaGen: discretizes images for generation
By decoupling the visual encoding into independent paths while maintaining a unified Transformer architecture, Janus-Pro ingeniously solves the role conflict of previous models in the visual encoder.
@reach_vb points out the key breakthrough in the architecture:
The model is built on DeepSeek-LLM-1.5b/7b, uses SigLIP-L to process 384×384 image inputs, and decouples the encoding process through task-specific paths
This design allows the model to seamlessly switch between multimodal tasks while maintaining a single Transformer architecture.
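To make the decoupling concrete, here is a minimal conceptual sketch in PyTorch. All class and method names are illustrative, not DeepSeek’s actual code: two independent visual paths, a SigLIP-style understanding adaptor and a discrete VQ token path, feed the same autoregressive Transformer backbone.

```python
import torch
import torch.nn as nn

class JanusStyleModel(nn.Module):
    """Conceptual sketch of Janus-Pro-style decoupled visual encoding.

    Two independent visual paths share one Transformer backbone:
      - an understanding path (SigLIP-style continuous features) for image -> text tasks
      - a discrete VQ token path (LlamaGen-style codebook) for text -> image tasks
    Names, shapes, and layer counts are illustrative only.
    """

    def __init__(self, hidden_dim=2048, vocab_size=102400, image_codebook=16384):
        super().__init__()
        # Shared backbone standing in for the core language model (DeepSeek-LLM in the paper).
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=16, batch_first=True),
            num_layers=4,
        )
        # Path 1: project continuous understanding features into the LLM space.
        self.understanding_adaptor = nn.Linear(1024, hidden_dim)
        # Path 2: embed discrete image tokens drawn from a VQ codebook.
        self.image_token_embed = nn.Embedding(image_codebook, hidden_dim)
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        # Separate output heads for text tokens and image tokens.
        self.text_head = nn.Linear(hidden_dim, vocab_size)
        self.image_head = nn.Linear(hidden_dim, image_codebook)

    def understand(self, siglip_features, text_ids):
        # Image understanding: project visual features, concatenate with text, predict text.
        img = self.understanding_adaptor(siglip_features)
        txt = self.text_embed(text_ids)
        h = self.transformer(torch.cat([img, txt], dim=1))
        return self.text_head(h[:, -text_ids.shape[1]:])

    def generate_image_tokens(self, text_ids, image_token_ids):
        # Image generation: condition on text, predict the next discrete VQ image token.
        txt = self.text_embed(text_ids)
        img = self.image_token_embed(image_token_ids)
        h = self.transformer(torch.cat([txt, img], dim=1))
        return self.image_head(h[:, -1])
```

The point of the sketch is the routing, not the layers: understanding never touches the image codebook, and generation never touches the continuous encoder, yet both tasks flow through the same backbone.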
Training strategy: a three-stage path to success
The DeepSeek team adopted a carefully designed three-stage training process:
Stage 1: Train new parameters on the ImageNet dataset to establish conceptual connections between visual and linguistic elements
Stage 2: Introduce a mixed multimodal dataset for full-parameter training
Stage 3: Improve instruction following and dialogue capabilities through supervised fine-tuning
Innovative adjustments have also been made to the data ratio:
Image understanding task: 50% (a significant increase)
Image generation task: 40%
Text task: 10%
@iScienceLuvr points out the secret of training:
The proportion of text tasks was deliberately reduced during the third stage of fine-tuning
This forces the model to focus its computing power on cross-modal conversion
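As a rough illustration of how such a staged data mix could be expressed, here is a hedged sketch of a training schedule. The stage names and the Stage-3 ratio come from the description above; the field names, the Stage-2 placeholder weights, and the sampling helper are hypothetical.

```python
import random

# Illustrative sketch of the three-stage schedule and the Stage-3 data mix described above.
# Field names and structure are hypothetical, not DeepSeek's actual training configuration.
TRAINING_STAGES = [
    {
        "name": "stage_1_adapter_training",
        "trainable": ["understanding_adaptor", "image_head"],  # newly added parameters only
        "data_mix": {"imagenet": 1.0},
    },
    {
        "name": "stage_2_unified_pretraining",
        "trainable": "all",  # full-parameter training on the mixed multimodal corpus
        "data_mix": {"multimodal_understanding": 1.0, "text_to_image": 1.0, "text_only": 1.0},  # placeholder weights
    },
    {
        "name": "stage_3_supervised_finetuning",
        "trainable": "all",
        # Ratio reported above: understanding 50%, generation 40%, pure text 10%.
        "data_mix": {"multimodal_understanding": 0.5, "text_to_image": 0.4, "text_only": 0.1},
    },
]

def sample_task(stage: dict, rng: random.Random) -> str:
    """Pick the task for the next batch according to the stage's mixing weights."""
    names, weights = zip(*stage["data_mix"].items())
    return rng.choices(names, weights=weights, k=1)[0]

# Example: roughly one in ten Stage-3 batches comes from pure-text data.
rng = random.Random(0)
print(sample_task(TRAINING_STAGES[2], rng))
```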
Performance master
This “all-rounder” monster is killing it on both core tracks: understanding and generation!
Official tests show that Janus-Pro not only beats previous unified models but can go head-to-head with specialized models: matching LLaVA on comprehension tasks and outperforming DALL-E 3 in generation quality!
A GenEval score of 0.80 puts SD3-Medium to shame
A DPG-Bench score of 84.19 brings its visual creation quality close to that of a professional designer
Behind this is a training recipe of roughly 72 million synthetic images and three training stages (adapter training → unified pre-training → supervised fine-tuning), which has turned the model into a genuine “multimodal master”.
@dr_cintas posted a comparison of actual measurements:
Running a 4-bit quantized version on an iPhone, the inference speed is nearly 60 tokens/s
The license plate text in the generated 384×384 thumbnail is actually legible
In the multimodal understanding benchmark test, Janus-Pro-7B showed amazing strength:
POPE: 87.4%
MME-PT: 1567.1
MMBench: 79.2
SEED: 72.1
MMMU: 41.0
MM-Vet: 50.0
In terms of image generation, the model achieved a GenEval score of 0.8 and a DPG-Bench score of 84.19, surpassing many mainstream models including DALL-E 3 and SD3-Medium.
MIT open source: feel free to play!
DeepSeek has turned the tables this time – the 7B/1B dual version is fully open source, and the MIT license allows commercial modifications!
The weights can be downloaded from Hugging Face right away, and even the 1B lightweight version can run locally on an iPhone.
Developer @angrypenguinPNG gave a live demonstration:
Enter “future city night scene” and a cyberpunk street view appears in seconds
Zoom in to examine the details of the scene, and the model can accurately describe the gradient of the neon lights
Practical value: lowering the barrier to entry
To meet the needs of different scenarios, DeepSeek provides two versions:
Janus-Pro-7B: the full version, with powerful performance
Janus-Pro-1B: a lightweight version that can be run directly in the browser
Both versions have been open-sourced on the Hugging Face platform and released under the MIT license, so developers can freely use and modify them.
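To get started, here is a minimal sketch of pulling the open-source weights from the Hub with huggingface_hub; the repo IDs below are the published model names, and actual inference typically also requires the processing code from DeepSeek’s official Janus repository.

```python
from huggingface_hub import snapshot_download

# Both open-source checkpoints are published under the MIT license on Hugging Face.
# Downloading the weights is enough to inspect or fine-tune them; running inference
# typically also needs the processing code from DeepSeek's official Janus repository.
for repo_id in ("deepseek-ai/Janus-Pro-7B", "deepseek-ai/Janus-Pro-1B"):
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_dir}")
```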
DeepSeek’s comprehensive breakthrough
Now the most exciting question is: when understanding and generation no longer require two separate models, will the existing AI application architecture be collectively disrupted?
Those who are still struggling with single-modal applications should consider developing collaborative applications for the left and right brains.
After all, a model that can simultaneously play with both text and graphics is the true embodiment of multimodality.
It is worth noting that the release of Janus-Pro is just one of a series of recent major breakthroughs by DeepSeek:
Perplexity has integrated the DeepSeek R1 model for deep web search
DeepSeek R1 distilled version achieves a local inference speed of 60 tokens/s on the iPhone
DeepSeek AI Assistant has jumped to the top of the App Store free list
The DeepSeek R1 model demonstrated extremely fast inference performance on the Groq platform
These achievements demonstrate the comprehensive strength of DeepSeek in the field of AI, and the groundbreaking progress of Janus-Pro has opened up new directions for the development of multimodal AI.
Janus-Pro related links and documents
Project address:
Model downloads:
Quick experience:
Use Janus-Pro online for free, with no deployment required
Reference documentation:
Finally, a closing thought: the name Sam Altman chose for his company, the vision he has been selling, and the path he has been contemplating all seem to be passing to this curiosity-driven Chinese company, which will carry on the deep exploration of the boundaries of intelligence!