---
title: LLaVA’s IT
author: Anonym
date: 2023-04-17 00:00:00 +0800
categories: [NeurIPS 2023]
tags: [IT Dataset]
math: true
pin: false
---

- Paper: [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
- [GitHub Link](https://llava-vl.github.io)
- Publisher: `NeurIPS 2023`
- Author Affiliation: `University of Wisconsin–Madison & Microsoft Research & Columbia University`
- Type
  + [x] SFT
  + [ ] RLHF
- Multi-turn
  + [x] ✔
  + [ ] ✖
- Input Modalities $\rightarrow$ Output Modalities
  (I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, B: Bounding box, Tab: Table, Web: Web page)
  + I+T $\rightarrow$ T
- Source
  + `MS-COCO`
- Method
  + `Auto.`
- I/V/A Scale
  + I
    * `81K`
  + V
    * `Not reported`
  + A
    * `Not reported`
- Dialog Turn
  + `2.29`
- Instance Scale
  + `150K`