---
title: BLIP-2
author: Anonym
date: 2023-01-30 00:00:00 +0800
categories: [ICML 2023]
tags: [MM-LLMs]
math: true
pin: false
---
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)
- [GitHub Link](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
- Publisher: `ICML 2023`
- Author Affiliation: `Salesforce Research`
- Functional Division
+ [x] Understanding
+ [ ] Generation
- Design Division
+ [ ] Tool-using
+ [x] End-to-end
- Input Modalities $\rightarrow$ Output Modalities
(I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, ID: Document understanding, IB: Output bounding box, IM: Output segmentation mask, IR: Output retrieved images)
+ I+T $\rightarrow$ T
- Model Architecture
(Input $\rightarrow$ Modality Encoder $\rightarrow$ Input Projector $\rightarrow$ LLM Backbone $\rightarrow$ Output Projector $\rightarrow$ Modality Generator $\rightarrow$ Output; a minimal sketch of this flow follows the list below)
+ Modality Encoder
  * `I: CLIP/EVA-CLIP ViT@224`
+ Input Projector
* `Q-Former w/ Linear Projector`
+ LLM Backbone
* `Flan-T5/OPT`
+ Output Projector
* `None`
+ Modality Generator
* `None`
- Datasets Scale
+ Pre-training Stage
* `129M`
+ Instruction-tuning Stage
  * `Not reported`
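
The architecture flow noted above (frozen image encoder $\rightarrow$ Q-Former $\rightarrow$ linear projector $\rightarrow$ frozen LLM) can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the LAVIS implementation: the class name `QFormerSketch`, the layer count, and the LLM embedding width (`llm_dim=2048`) are assumptions, and the real Q-Former is a BERT-style module with interleaved cross-attention layers trained with additional image-text objectives.

```python
# Minimal sketch of the BLIP-2 information flow (illustrative, not the LAVIS code):
# frozen image encoder -> Q-Former (learned queries + cross-attention) ->
# linear projector -> frozen LLM. Names and sizes below are assumptions.
import torch
import torch.nn as nn


class QFormerSketch(nn.Module):
    """Learned queries attend to frozen image features via cross-attention."""

    def __init__(self, num_queries=32, dim=768, num_layers=2, llm_dim=2048):
        super().__init__()
        # Learnable query tokens that extract visual information
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear projector that maps query outputs into the LLM embedding space
        self.proj = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (B, num_patches, dim) from a frozen ViT (no gradients there)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = self.decoder(tgt=q, memory=image_feats)
        return self.proj(q)  # (B, num_queries, llm_dim) soft visual prompts


if __name__ == "__main__":
    frozen_vit_out = torch.randn(2, 257, 768)  # e.g. patch features from a ViT@224
    qformer = QFormerSketch()
    visual_prompts = qformer(frozen_vit_out)
    # These embeddings would be prepended to the text embeddings of a frozen
    # OPT / Flan-T5 backbone; only the Q-Former and projector receive gradients.
    print(visual_prompts.shape)  # torch.Size([2, 32, 2048])
```

The point of the sketch is the parameter split: the image encoder and the LLM stay frozen, and only the lightweight Q-Former plus the linear projector are trained to bridge the two modalities.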