MiniGPT-4: Under the Hood

May 4, 2023

OpenAI recently released GPT-4, its latest large language model (LLM). It's more creative, accurate, and reliable than its predecessor. Crucially, it's also multimodal, which means it can work with multiple types ("modes") of data. In GPT-4's case, it can take an image and output text describing what's happening, answer questions, and even reason about the image.

GPT-4 demonstrating image understanding

Unfortunately, OpenAI's paper on GPT-4 shares no details on the model's architecture, so a group of researchers (Zhu et al.) built a multimodal LLM called MiniGPT-4 and open-sourced it. While not as powerful as GPT-4, MiniGPT-4 still shows incredible capabilities and offers insights into how a multimodal LLM can work.

MiniGPT-4 identifying amusing aspects within images

In this article, we'll dive into MiniGPT-4’s architecture and how it was trained. This will give us an idea of where its multimodal power comes from. We’ll also uncover surprising ways we can leverage pre-trained unimodal models and compose them into more powerful multimodal models.

Note: this article assumes you’re familiar with the transformer architecture and its mechanisms including self-attention and cross-attention. If not, please see the references section for primers.


MiniGPT-4's creators theorized that GPT-4's impressive multimodal generative capabilities come from combining a visual encoder with a more advanced LLM.

To test this theory, they took the visual component from a BLIP-2 model and aligned it with the open-source Vicuna LLM.

Depending on your background, that last sentence may have been a lot to unpack. We'll examine each component individually, then compose them to get a clear understanding of MiniGPT-4.

Note: for this article, we use terms such as "multimodal", "language-image", and "vision-language" interchangeably to conform with various source papers.

BLIP-2: Bootstrapping Language-Image Pre-training

A separate group of researchers (Li et al.) at Salesforce started from an observation: training vision-language models from scratch is prohibitively expensive. Since there are already open-source pre-trained large language models and pre-trained image encoders, why can't we combine them to create vision-language models that can take images and output text about them?

It's an intuitive idea that should work, right? The challenge is that the LLM and image encoder are both unimodal models. This means:

  • LLMs know a lot about language but weren't exposed to any visual data during pre-training.
  • Image encoders know a lot about images but weren't exposed to any language data during pre-training.

The researchers needed to align the two modalities (make image representations and language representations comparable to each other).

Some alignment mechanism was needed to bridge the vision and language modalities

To make it even more challenging, the researchers wanted to freeze each model, that is, refrain from updating its parameters. Freezing reduces computation cost and avoids catastrophic forgetting (a model forgetting its previously learned information).

What they came up with is BLIP-2: a strategy to efficiently pre-train vision-language models by aligning frozen unimodal models.

Bridging the modality gap

For alignment, the Salesforce researchers proposed a Querying Transformer, or Q-Former, a lightweight transformer that learns to extract the most useful visual features from a frozen image encoder, and feed them to an LLM to generate the desired text.

Note: For BLIP-2, the researchers used a Vision Transformer (ViT) for the image encoder (there's a link to learn more about ViTs in the references section at the end of this article).

The Q-Former consists of two transformer submodules:

  1. An image transformer that connects to the image encoder for visual feature extraction. The image encoder consumes an image and generates a collection of embeddings, each representing a different part of the image. The image transformer then learns to extract the most useful information from this collection.
  2. A text transformer that acts as a text encoder (to take in text) and a text decoder (to generate text).

Critically, these two transformers share the same self-attention layers, allowing them to interact and exchange information about their modalities.

For input, the image transformer takes a fixed set of 32 learnable query embeddings. These queries interact with the text transformer through the shared self-attention layers, and with the frozen image encoder's output through cross-attention layers. Note how an LLM is not yet in the picture.

The Q-Former is pre-trained in two stages.

The first stage is the representation learning stage, where the Q-Former is connected to a frozen image encoder and pre-trained using image-text pairs (see the BLIP-2 paper for dataset details). The goal at this stage is to train the Q-Former so that the queries learn to extract visual representations that are most informative of the text.

To do this, the Salesforce researchers jointly optimized three pre-training objectives. Each objective uses a different masking strategy to control how the image and text transformers interact.

Let's break them down.

Image-Text Contrastive Learning (ITC): the objective is to align image and text representations to maximize their mutual information. Put simply, the Q-Former should learn to generate query representations that are closely related to the text representation.

To achieve this, ITC contrasts the similarity between a positive image-text pair (an image and its corresponding text) against the similarities of negative pairs (an image and non-corresponding text).

Since the image transformer outputs multiple query embeddings Z, and the text transformer outputs a single embedding t (the output embedding of its [CLS] token), the similarity between each query embedding in Z and t is calculated, and the highest value is taken as the image-text similarity.

A unimodal self-attention mask prevents the queries and text from directly communicating.
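The similarity rule described above can be sketched in a few lines of plain Python. This is a toy illustration, not the BLIP-2 implementation: the use of cosine similarity, the temperature value, the in-batch loss shown in only the image-to-text direction, and the function names `image_text_similarity` and `itc_loss` are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def image_text_similarity(Z, t):
    """Compare every query embedding in Z with the text embedding t
    and keep the highest similarity."""
    return max(cosine(z, t) for z in Z)

def itc_loss(batch_Z, batch_t, temperature=0.07):
    """Image-to-text contrastive loss over a batch (assumed form).
    The text at index i is the positive for image i; every other
    text in the batch acts as a negative."""
    total = 0.0
    for i, Z in enumerate(batch_Z):
        logits = [image_text_similarity(Z, t) / temperature for t in batch_t]
        # Numerically stable log-sum-exp, then cross-entropy with target i.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum - logits[i]
    return total / len(batch_Z)

# Toy 2-dimensional embeddings: two queries per image, one vector per text.
Z_cat = [[1.0, 0.0], [0.9, 0.1]]
Z_dog = [[0.0, 1.0], [0.2, 0.8]]
t_cat = [1.0, 0.0]
t_dog = [0.0, 1.0]

matched_loss = itc_loss([Z_cat, Z_dog], [t_cat, t_dog])
mismatched_loss = itc_loss([Z_cat, Z_dog], [t_dog, t_cat])
print(matched_loss < mismatched_loss)
```

Taking the maximum means only one query has to line up with the text for the pair to score highly, which leaves the other queries free to capture different aspects of the image.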

ITC encourages the Q-Former to learn query representations that are closely related to the text.

Image-grounded Text Generation (ITG): the objective is to train the Q-Former to generate text based on input images. Since the frozen image encoder and the text transformer can't interact directly, the queries must first extract the information needed to generate the text, then pass it to the text tokens through the shared self-attention layers.

For example, if we have an image of a cat sitting on a couch as input, the ITG objective will be to generate text such as "a cat sitting on a couch". To do this, the Q-Former must learn queries that extract visual features such as the cat, the couch, and their positions.

The masking strategy used here is a multimodal causal self-attention mask. This mask allows queries to attend to each other but not attend to the text, whereas the text can attend to all queries and its previous text tokens (i.e. like an autoregressive decoder).
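To make the masking strategies concrete, here is a small sketch that builds them as boolean matrices, where `mask[i][j]` is `True` when position `i` may attend to position `j`. The layout (queries first, then text tokens), the function name `build_mask`, and the `kind` labels are assumptions for illustration. The fully bidirectional case, used by the ITM objective, is included for completeness.

```python
def build_mask(n_queries, n_text, kind):
    """Build a self-attention mask over [queries..., text tokens...].

    kind:
      "unimodal"      - queries and text cannot see each other (ITC)
      "causal"        - multimodal causal mask (ITG)
      "bidirectional" - everything sees everything (ITM)
    """
    n = n_queries + n_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_query = i < n_queries
            j_is_query = j < n_queries
            if kind == "unimodal":
                # Attend only within the same modality.
                allowed = i_is_query == j_is_query
            elif kind == "causal":
                if i_is_query:
                    # Queries see each other but never the text.
                    allowed = j_is_query
                else:
                    # Text sees all queries plus itself and earlier text.
                    allowed = j_is_query or j <= i
            elif kind == "bidirectional":
                allowed = True
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            mask[i][j] = allowed
    return mask

# Print the ITG mask for 2 queries and 3 text tokens (1 = may attend).
for row in build_mask(2, 3, "causal"):
    print(" ".join("1" if allowed else "0" for allowed in row))
```

In the printed matrix, the top two rows (queries) stop at the query columns, while the bottom three rows (text) form the familiar lower-triangular causal pattern to the right of the query columns.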

To make the text transformer act like a decoder, the [CLS] token is replaced with a [DEC] token to signal a text generation task.

Image-Text Matching (ITM): the objective is to improve the fine-grained alignment between image and text representations. The Q-Former learns to classify image-text pairs as either positive (the image and text match) or negative (they don't match).

A bidirectional self-attention mask allows all queries and texts to attend to each other, which enables the output queries to capture multimodal information.

Each output query embedding is fed into a two-class linear classifier to obtain a logit (a score indicating the likelihood of a match). These logits are averaged across all queries to calculate a final matching score for the pair.
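The scoring step can be sketched as follows. The weights, the toy 2-dimensional query outputs, and the helper names `itm_logits` and `match_probability` are made up for illustration; the softmax at the end is one assumed way to turn the averaged logits into a probability.

```python
import math

def itm_logits(query_outputs, weight, bias):
    """Apply one shared two-class linear classifier to every output
    query embedding, then average the logits across queries.

    weight has shape 2 x d (row 0 = "no match", row 1 = "match"),
    bias has length 2."""
    per_query = [
        [sum(w * x for w, x in zip(row, q)) + b for row, b in zip(weight, bias)]
        for q in query_outputs
    ]
    n = len(per_query)
    return [sum(logits[c] for logits in per_query) / n for c in range(2)]

def match_probability(logits):
    """Softmax over the two averaged logits; returns P(match)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    return exps[1] / sum(exps)

# Three toy query outputs and a hand-picked classifier.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[-1.0, 0.0], [1.0, 0.5]]
bias = [0.0, 0.0]

logits = itm_logits(queries, weight, bias)
print(logits, match_probability(logits))
```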

So those are the three objectives of the Q-Former's first pre-training stage. To summarize:

  1. Image-Text Contrastive Learning (ITC) focuses the Q-Former on learning high-level alignment between image and text representations.
  2. Image-grounded Text Generation (ITG) trains the Q-Former to reconstruct text descriptions based on input images.
  3. Image-Text Matching (ITM) further trains the Q-Former to understand the relationships between image features and text descriptions at a more granular level.

In addition, because the Q-Former's query embeddings are much smaller (32 query embeddings × 768 dimensions) than the outputs of the frozen image encoder, they act as an information bottleneck that forces the queries to extract the most relevant visual features.
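A quick back-of-the-envelope calculation shows how tight this bottleneck is. The query shape comes from the text above; the image-feature shape of 257 × 1024 is the example the BLIP-2 paper gives for a ViT-L/14 encoder and is treated here as an assumption, since other encoders produce different sizes.

```python
QUERY_SHAPE = (32, 768)           # 32 queries x 768 dimensions
IMAGE_FEATURE_SHAPE = (257, 1024) # assumed: ViT-L/14 output (tokens x dims)

def n_values(shape):
    """Number of scalar values in a 2-D array of the given shape."""
    rows, cols = shape
    return rows * cols

query_values = n_values(QUERY_SHAPE)
image_values = n_values(IMAGE_FEATURE_SHAPE)
ratio = image_values / query_values

print(query_values, image_values, round(ratio, 1))
```

Under these assumptions the queries carry roughly a tenth as many values as the image encoder's output, so they cannot simply copy everything through.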

This combination of objectives and information bottleneck promotes a robust understanding of the relationships between image and text modalities, ultimately leading to improved performance on downstream vision-language tasks.

After this first stage, the Q-Former then undergoes the generative pre-training stage. This is where an LLM comes in.

In this second stage, the goal is to use the LLM's generative capabilities to further improve the Q-Former's ability to align visual and text information.

To do this, the Q-Former's output query embeddings are linearly projected into the same dimension as the LLM's text embeddings. These query embeddings are then prepended to the input text embeddings, effectively functioning as "soft visual prompts".
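The projection-and-prepend step can be sketched with plain lists. The dimensions below are deliberately tiny and hypothetical (a real Q-Former outputs 768-dimensional vectors and an LLM's embedding size is in the thousands), the weights are random rather than learned, and `linear` and `build_llm_input` are illustrative names.

```python
import random

def linear(x, weight, bias):
    """y = W x + b for a single vector; weight has shape out_dim x in_dim."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def build_llm_input(query_outputs, text_embeddings, weight, bias):
    """Project each query output to the LLM's embedding size, then
    prepend the projected queries to the text embeddings."""
    visual_prompt = [linear(q, weight, bias) for q in query_outputs]
    return visual_prompt + text_embeddings

# Toy sizes (assumptions): 3 queries of dim 4, LLM embedding dim 6, 2 text tokens.
rng = random.Random(0)
Q_DIM, LLM_DIM, N_QUERIES, N_TEXT = 4, 6, 3, 2

query_outputs = [[rng.uniform(-1, 1) for _ in range(Q_DIM)] for _ in range(N_QUERIES)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(N_TEXT)]
weight = [[rng.uniform(-1, 1) for _ in range(Q_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

llm_input = build_llm_input(query_outputs, text_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))
```

From the LLM's point of view, the result is just a longer sequence of embeddings: the first few positions happen to describe an image instead of words.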

Stage two of Q-Former pre-training (decoder-only LLM shown here)

Intuitively, the first stage trained the Q-Former to generate language-informed visual representations. These distilled visual representations (remember the information bottleneck that removes irrelevant features) condition the LLM’s output and reduce any burden on the LLM to learn vision-language alignment.

The researchers experimented with two types of LLMs: decoder-only LLMs (e.g. OPT), which generate text conditioned on the Q-Former output, and encoder-decoder LLMs (e.g. FlanT5), where part of the training text is appended to the Q-Former output as a prefix and the model generates the rest.
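The difference between the two setups is easiest to see in how a training example is assembled. The sketch below uses strings as stand-ins for embeddings and tokens; the function names, the dictionary keys, and the split point are hypothetical, and the prefix/suffix split reflects our reading of the BLIP-2 paper's prefix language modeling setup.

```python
def decoder_only_example(visual_prompt, text_tokens):
    """Decoder-only LLM: condition on the visual prompt,
    learn to generate the whole text."""
    return {"input": list(visual_prompt), "target": list(text_tokens)}

def encoder_decoder_example(visual_prompt, text_tokens, split_at):
    """Encoder-decoder LLM: the encoder sees the visual prompt plus
    a text prefix; the decoder learns to generate the remaining suffix."""
    prefix = text_tokens[:split_at]
    suffix = text_tokens[split_at:]
    return {"encoder_input": list(visual_prompt) + prefix,
            "decoder_target": suffix}

visual_prompt = ["<q0>", "<q1>"]  # placeholders for projected query embeddings
caption = ["a", "cat", "sitting", "on", "a", "couch"]

print(decoder_only_example(visual_prompt, caption))
print(encoder_decoder_example(visual_prompt, caption, split_at=2))
```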

After these two pre-training stages, the result was a vision-language model that achieved state-of-the-art results with only a fraction of the trainable parameters of competing models (188M) and correspondingly lower computing cost.

BLIP-2 beat much larger models at diverse multimodal tasks, including visual question answering, image captioning, and image-text retrieval (see the BLIP-2 paper for examples and benchmarks). It offers a flexible architecture, constraints that mitigate catastrophic forgetting, and effective vision-language alignment in a resource-efficient manner.

Visual question answering and reasoning with a BLIP-2 model (ViT-G + FlanT5XXL)

Vicuna: MiniGPT-4's LLM of choice

Ok, that's BLIP-2, and it's what MiniGPT-4 uses for generating language-informed visual representations from images.

Let's briefly cover Vicuna, the open-source chat LLM used by MiniGPT-4.

Vicuna (Chiang et al.) is a 13B-parameter LLM whose output reaches more than 90% of the quality of ChatGPT and Bard, according to its authors. How was this quality judged? The authors let GPT-4 be the judge! (You can read more about it, along with GPT-4's evaluation remarks, through the link in the references section.)

But perhaps most impressive is that Vicuna cost only around $300 to train. Vicuna was created by fine-tuning LLaMA, an LLM from Meta (Touvron et al.), which comes in 7B, 13B, 33B, and 65B variants.

The team behind Vicuna fine-tuned LLaMA on conversations from ShareGPT (ChatGPT conversations shared by users). Like other LLMs, Vicuna is limited in reasoning, mathematics, and ensuring the accuracy of its output.

MiniGPT-4: Surprising simplicity, impressive results

Ok, so here's what we have so far:

  • BLIP-2: a strategy to create vision-language models from frozen unimodal models by pre-training a Q-Former.
  • Vicuna: an open-source chatbot whose output reaches more than 90% of the quality of ChatGPT, as judged by GPT-4.

The researchers behind MiniGPT-4 saw that previous vision-language models rarely exhibited the advanced multimodal capabilities of GPT-4. While the models could describe images at a high level and answer questions about them, they struggled with advanced tasks such as detailed image descriptions and generating websites from hand-written drafts.

Here’s an example from a model created through BLIP-2. While the model correctly identifies the main image subject, it doesn’t output the level of detail the human asks for:

BLIP-2 model performs well for high-level descriptions but lacks detail

The researchers theorized that these models needed a more advanced LLM. To that end, they built MiniGPT-4 by:

  1. Taking a pre-trained Q-Former attached to a ViT image encoder (ViT-G/14 from EVA-CLIP). This Q-Former had already been trained to extract relevant visual features from the image encoder.
  2. Taking the Vicuna LLM.
  3. Freezing both.
  4. Aligning them with just one projection layer to translate the Q-Former query outputs to Vicuna-compatible inputs.
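To get a feel for how small the trainable part of this setup is, we can count the parameters of a single linear projection. The input size of 768 matches the Q-Former query dimension mentioned earlier; the output size of 5120 is our assumption for the hidden size of a 13B Vicuna model, as is the presence of a bias term.

```python
Q_FORMER_DIM = 768            # Q-Former query output dimension
LLM_HIDDEN_DIM = 5120         # assumed hidden size of a 13B Vicuna model
LLM_PARAMS = 13_000_000_000   # "13B" taken at face value

def linear_param_count(in_dim, out_dim, with_bias=True):
    """Parameters in a dense layer: one weight per (in, out) pair,
    plus one bias per output if enabled."""
    return in_dim * out_dim + (out_dim if with_bias else 0)

trainable = linear_param_count(Q_FORMER_DIM, LLM_HIDDEN_DIM)
fraction = trainable / LLM_PARAMS

print(f"{trainable:,} trainable parameters ({fraction:.4%} of the LLM)")
```

Under these assumptions, the only part being trained is a few million parameters sitting between two frozen models with billions.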

The team trained this setup on approximately 5 million image-text pairs (see the MiniGPT-4 paper for dataset information), with the output from the Q-Former fed to Vicuna as a soft prompt. Vicuna's output was then compared against the ground-truth text.

This training took ten hours on four A100 GPUs.
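The "approximately 5 million" figure lines up with the training schedule. The sketch below assumes the step count and batch size as we understand them from the MiniGPT-4 paper (20,000 steps at a batch size of 256); treat both as assumptions and check the paper for the exact setup.

```python
STEPS = 20_000      # assumed number of training steps
BATCH_SIZE = 256    # assumed image-text pairs per step
HOURS = 10          # wall-clock training time from the text above
GPUS = 4            # number of A100 GPUs from the text above

# Pairs processed during training (may include repeats of the same pair).
pairs_seen = STEPS * BATCH_SIZE
gpu_hours = HOURS * GPUS
pairs_per_gpu_hour = pairs_seen / gpu_hours

print(f"{pairs_seen:,} pairs in {gpu_hours} GPU-hours "
      f"({pairs_per_gpu_hour:,.0f} pairs per GPU-hour)")
```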

After this first round of training, the resulting model fell short of qualitative goals. While the model offered reasonable responses to human inquiries, it sometimes produced repetitive words and sentences, or irrelevant content.

The team further fine-tuned the model on a curated collection of 3,500 image-text pairs and a conversational template to make responses more natural.

Remarkably, they created this curated collection using MiniGPT-4 itself and with help from ChatGPT.

They asked MiniGPT-4 to generate image descriptions for 5,000 images (prompting it for more when the descriptions were shorter than 80 tokens). Because this was before the model was fine-tuned, the image descriptions tended to have repetitive words or incoherent statements. To address these issues, they used a specific prompt to ask ChatGPT to refine the descriptions, remove errors, eliminate unnecessary repetition, and rewrite incomplete sentences.
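The "prompt for more until the description is long enough" loop might look something like the sketch below. Everything here is hypothetical: `generate` stands in for a call to the model, whitespace splitting is a crude proxy for the model's real tokenizer, and the `max_rounds` cap is our own safeguard against looping forever.

```python
def describe_with_minimum_length(generate, min_tokens=80, max_rounds=5):
    """Ask for a description, then keep asking for continuations until
    it reaches min_tokens (counted by whitespace) or max_rounds is hit.

    generate(text_so_far) -> str is a hypothetical model call."""
    description = generate("")
    rounds = 1
    while len(description.split()) < min_tokens and rounds < max_rounds:
        continuation = generate(description)
        description = f"{description} {continuation}"
        rounds += 1
    return description

def stub_generate(text_so_far):
    """Stand-in for the model: always returns 30 filler words."""
    return " ".join(["word"] * 30)

result = describe_with_minimum_length(stub_generate)
print(len(result.split()))
```

With the stub returning 30 words per call, the loop stops after the third call, once the description passes the 80-token threshold.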

A final manual verification by humans whittled down the collection from 5,000 to 3,500.

During fine-tuning, MiniGPT-4 was fed pre-defined prompts using this template:

###Human: <Img>{ImageFeatures}</Img>{Instruction} ###Assistant:

{Instruction} here represents a randomly sampled instruction from a predefined set of instructions, like "Describe this image in detail" or "Could you describe the contents of this image for me."
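Filling in the template is straightforward string formatting. In this sketch, `<ImageHere>` is a hypothetical text placeholder marking where the projected image features would be spliced in as embeddings (they are vectors, not text), and the instruction list contains only the two examples quoted above.

```python
import random

TEMPLATE = "###Human: <Img>{ImageFeatures}</Img>{Instruction} ###Assistant:"

INSTRUCTIONS = [
    "Describe this image in detail",
    "Could you describe the contents of this image for me.",
]

def build_prompt(rng, image_placeholder="<ImageHere>"):
    """Fill the template with a randomly sampled instruction.
    image_placeholder marks where image embeddings would be inserted."""
    instruction = rng.choice(INSTRUCTIONS)
    return TEMPLATE.format(ImageFeatures=image_placeholder,
                           Instruction=instruction)

prompt = build_prompt(random.Random(0))
print(prompt)
```

Sampling the instruction at random exposes the model to several phrasings of the same request, so it doesn't latch onto one exact wording.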

The result of this was more natural and capable responses. Fine-tuning itself took only seven minutes on a single A100 GPU.

MiniGPT-4 excels at detailed image descriptions

The MiniGPT-4 paper contains a number of other examples, and the project page has a demo link you can try yourself.

Multimodality in production is a hard problem

The MiniGPT-4 team showed that combining a pre-trained vision component and an advanced LLM through a single projection layer can lead to incredible results. They also demonstrated how composable these models are, and even how we can use models to train and evaluate other models!

It's clear that multimodal LLMs are the future, but using them in industry is still challenging.

The first hurdle is the expertise required to understand and keep up with the blazing pace of development in the space. If your company already has the in-house talent, then great, but your primary focus may be elsewhere, or the cost in time and money to build a team from scratch may be off-putting.

The model itself is just one (small) component. To make cutting-edge multimodal LLMs work in production, you need to address:

  • Model selection. Which models will you use? How will you combine them? Will you use an open-source model or something behind an API?
  • Technology selection. Which databases, servers, and data engineering tooling will you use?
  • Data ingestion. How will you transform your data (in multiple modalities) to make it compatible with your models?
  • Configuration, deployment, and monitoring. How will you configure all your systems, put them in production, and ensure they're working as expected?
  • Performance. How do you ensure low-latency responses to keep experiences snappy? How will you scale your infrastructure cost-effectively as demand increases? Where, how, and what will you cache?
  • Ongoing model training based on feedback. As your users interact with your models, you'll want to fine-tune the models to serve your users better. How will that be done without jeopardizing current performance?
  • Evaluating model results. How will you know whether your model outputs are helping meet business goals?

We're all about solving these headaches.

At Kailua Labs, we've built a turnkey system that understands your media and helps you with search, discovery, recommendations, enrichment, and building next-generation multimodal apps on top of those capabilities.

Our platform handles more than just images. At its core is our content understanding engine, which works on text, images, video, audio, and anything else encoded in bits. It powers everything from semantic podcast search to next-generation e-commerce search and even AI stylists. Check out our demo using stock images (search using plain language descriptions!).

Multimodality is the future. If you’re interested in getting on board, let’s chat!


OpenAI GPT-4 product page (links to paper, demos)
MiniGPT-4 project page (links to paper, demos, code)
BLIP-2 from Salesforce Research (links to paper, code)
Vicuna project page (links to code, demo)
LLaMA from MetaAI (links to paper)
The Illustrated Transformer
Transformers From Scratch (video)
Intro to Vision Transformers

© 2023 Kailua Labs