---
title: MosIT
author: Anonym
date: 2023-09-11 00:00:00 +0800
categories: [ICLR 2024]
tags: [IT Dataset]
math: true
pin: false
---

- Paper: [NExT-GPT: Any-to-Any Multimodal LLM](https://arxiv.org/abs/2309.05519)
- [GitHub Link](https://next-gpt.github.io/)
- Publisher: `ICLR 2024`
- Author Affiliation: `National University of Singapore`
- Type
  + [x] SFT
  + [ ] RLHF
- Multi-turn
  + [x] ✔
  + [ ] ✖
- Input Modalities $\rightarrow$ Output Modalities
  (I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, B: Bounding box, Tab: Table, Web: Web page)
  + I+V+A+T $\rightarrow$ I+V+A+T
- Source
  + `YouTube, Google, Flickr30k, Midjourney, etc.`
- Method
  + `Auto. + Manu.`
- I/V/A Scale
  + I * `4K`
  + V * `4K`
  + A * `4K`
- Dialog Turn (Avg.)
  + `4.8`
- Instance Scale
  + `5K`
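The card above describes multi-turn dialogs whose turns can mix image, video, audio, and text on both the input and output sides, averaging 4.8 turns per instance. A minimal sketch of how one such instance might be represented in code follows; the field names and file references are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical sketch of a MosIT-style instruction-tuning instance.
# All keys and filenames below are illustrative assumptions, not the
# actual schema shipped with the dataset.
instance = {
    "dialog": [
        {
            "role": "user",
            "text": "What is happening in this clip?",
            "inputs": {"video": "clip_001.mp4"},    # any of I/V/A/T
        },
        {
            "role": "assistant",
            "text": "A dog is catching a frisbee; here is a key frame.",
            "outputs": {"image": "frame_001.png"},  # any of I/V/A/T
        },
    ],
}

def avg_dialog_turns(instances):
    """Mean number of dialog turns per instance (the card reports 4.8)."""
    return sum(len(x["dialog"]) for x in instances) / len(instances)
```

With the full 5K-instance collection loaded into a list under this assumed layout, `avg_dialog_turns` would reproduce the "Dialog Turn" statistic reported on the card.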