With AI booming, new AI applications keep entering the public eye, and AI is reshaping industry after industry. In my view, if ChatGPT marked the beginning of the AI revolution, then multimodal large models surely represent the future of AI applications.
This article is a resource hub for multimodal large language models (MLLMs): it collects papers, applications, datasets, and other learning resources, large and small. Feel free to like and bookmark it.
I have previously written about some of the projects listed here; a few of those articles are listed below for anyone interested:
Audiocraft — a PyTorch-based deep learning research library for AI audio generation
Recognize_Anything-Tag2Text — a powerful image tagging model together with Tag2Text
......
🔥🔥🔥 This is a curated list of Multimodal Large Language Models (MLLMs), covering Datasets, Multimodal Instruction Tuning, Multimodal In-Context Learning, Multimodal Chain-of-Thought, LLM-Aided Visual Reasoning, Foundation Models, and Others.
🔥🔥🔥 This list is continuously updated.
🔥🔥🔥 A survey paper on MLLMs is in preparation and will be released soon!
Contents
• Awesome Papers [1]
  • Multimodal Instruction Tuning [2]
  • Multimodal In-Context Learning [3]
  • Multimodal Chain-of-Thought [4]
  • LLM-Aided Visual Reasoning [5]
  • Foundation Models [6]
  • Others [7]
• Awesome Datasets [8]
  • Datasets for Pre-training Alignment [9]
  • Datasets of Multimodal Instruction Tuning [10]
  • Datasets of In-Context Learning [11]
  • Datasets of Multimodal Chain-of-Thought [12]
  • Others [13]
I have Chinese translations of some of the papers below; contact me if you need them.
Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [14] | arXiv | 2023-06-15 | Github[15] | Coming soon[16] |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [17] | arXiv | 2023-06-11 | Github[18] | Demo[19] |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [20] | arXiv | 2023-06-08 | Github[21] | Demo[22] |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning [23] | arXiv | 2023-06-08 | Github[24] | Demo[25] |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning [26] | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [27] | arXiv | 2023-06-05 | Github[28] | Demo[29] |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [30] | arXiv | 2023-06-01 | Github[31] | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [32] | arXiv | 2023-05-30 | Github[33] | Demo[34] |
| PandaGPT: One Model To Instruction-Follow Them All [35] | arXiv | 2023-05-25 | Github[36] | Demo[37] |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [38] | arXiv | 2023-05-25 | Github[39] | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [40] | arXiv | 2023-05-24 | Github[41] | Local Demo |
| DetGPT: Detect What You Need via Reasoning [42] | arXiv | 2023-05-23 | Github[43] | Demo[44] |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [45] | arXiv | 2023-05-18 | Github[46] | Demo[47] |
| Listen, Think, and Understand [48] | arXiv | 2023-05-18 | Github[49] | Demo[50] |
| VisualGLM-6B | - | 2023-05-17 | Github[51] | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [52] | arXiv | 2023-05-17 | Github[53] | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [54] | arXiv | 2023-05-11 | Github[55] | Local Demo |
| VideoChat: Chat-Centric Video Understanding [56] | arXiv | 2023-05-10 | Github[57] | Demo[58] |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans [59] | arXiv | 2023-05-08 | Github[60] | Demo[61] |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages [62] | arXiv | 2023-05-07 | Github[63] | - |
| LMEye: An Interactive Perception Network for Large Language Models [64] | arXiv | 2023-05-05 | Github[65] | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model [66] | arXiv | 2023-04-28 | Github[67] | Demo[68] |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [69] | arXiv | 2023-04-27 | Github[70] | Demo[71] |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [72] | arXiv | 2023-04-20 | Github[73] | - |
| Visual Instruction Tuning [74] | arXiv | 2023-04-17 | GitHub[75] | Demo[76] |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention [77] | arXiv | 2023-03-28 | Github[78] | Demo[79] |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [80] | ACL | 2022-12-21 | Github[81] | - |
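Most of the instruction-tuning work in the table above (LLaVA, MiniGPT-4, and similar projects) trains on conversation-style records that pair an image with multi-turn instructions and responses. The sketch below shows one way such a record can be flattened into a single supervised-finetuning prompt. The field names (`image`, `conversations`, `from`, `value`) follow the LLaVA-Instruct-150K convention, but the role tags and separators are illustrative assumptions rather than any specific repository's exact template.

```python
# Minimal sketch: turn a LLaVA-style instruction record into one training prompt.
# Field names follow the LLaVA-Instruct-150K convention; the role tags and
# separators below are illustrative assumptions, not an exact reproduction of
# any particular codebase.

record = {
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt", "value": "The person is riding a bicycle along a beach path."},
        {"from": "human", "value": "Is it daytime?"},
        {"from": "gpt", "value": "Yes, the bright sunlight suggests it is daytime."},
    ],
}

def build_prompt(rec: dict, system: str = "A chat between a user and an AI assistant.") -> str:
    """Concatenate conversation turns into a single supervised-finetuning prompt."""
    role_map = {"human": "USER", "gpt": "ASSISTANT"}
    parts = [system]
    for turn in rec["conversations"]:
        parts.append(f'{role_map[turn["from"]]}: {turn["value"]}')
    return "\n".join(parts)

print(build_prompt(record))  # the <image> placeholder is later replaced by visual tokens
```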
Chinese Versions of the Papers
I have prepared Chinese translations of some of these papers; message me if you would like a copy.
Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning [82] | arXiv | 2023-06-08 | Github[83] | Demo[84] |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [85] | arXiv | 2023-04-19 | Github[86] | Demo[87] |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [88] | arXiv | 2023-03-30 | Github[89] | Demo[90] |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [91] | arXiv | 2023-03-20 | Github[92] | Demo[93] |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering [94] | CVPR | 2023-03-03 | Github[95] | - |
| Visual Programming: Compositional Visual Reasoning Without Training [96] | CVPR | 2022-11-18 | Github[97] | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [98] | AAAI | 2022-06-28 | Github[99] | - |
| Flamingo: a Visual Language Model for Few-Shot Learning [100] | NeurIPS | 2022-04-29 | Github[101] | Demo[102] |
| Multimodal Few-Shot Learning with Frozen Language Models [103] | NeurIPS | 2021-06-25 | - | - |
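The in-context learning papers above (Flamingo, Frozen, MIMIC-IT/Otter) condition a frozen or lightly tuned model on a handful of interleaved image-text demonstrations instead of updating its weights. Below is a minimal sketch of constructing such an interleaved few-shot prompt; the `<image>` placeholder and the "Output:" layout are assumptions for illustration, and the actual token layout differs from model to model.

```python
# Minimal sketch of multimodal in-context learning: a few (image, text) demonstrations
# are interleaved in front of the query so a frozen model can imitate the pattern.
# The "<image>" placeholder and prompt layout are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Demo:
    image_path: str   # image supplied to the vision encoder
    text: str         # target text for that image

def build_few_shot_prompt(demos: List[Demo], query_image: str) -> Tuple[List[str], str]:
    """Return the ordered image list and the interleaved text prompt."""
    images, chunks = [], []
    for d in demos:
        images.append(d.image_path)
        chunks.append(f"<image> Output: {d.text}")
    images.append(query_image)
    chunks.append("<image> Output:")   # the model continues generating from here
    return images, "\n".join(chunks)

demos = [
    Demo("cat.jpg", "A cat sleeping on a sofa."),
    Demo("dog.jpg", "A dog catching a frisbee in a park."),
]
images, prompt = build_few_shot_prompt(demos, "query.jpg")
print(images)
print(prompt)
```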
Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [104] | arXiv | 2023-05-24 | Github[105] | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction [106] | arXiv | 2023-05-23 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls [107] | arXiv | 2023-05-04 | Github[108] | Demo[109] |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [110] | arXiv | 2023-05-03 | Coming soon[111] | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [112] | arXiv | 2023-04-19 | Github[113] | Demo[114] |
| Chain of Thought Prompt Tuning in Vision Language Models [115] | arXiv | 2023-04-16 | Coming soon | - |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [116] | arXiv | 2023-03-20 | Github[117] | Demo[118] |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [119] | arXiv | 2023-03-08 | Github[120] | Demo[121] |
| Multimodal Chain-of-Thought Reasoning in Language Models [122] | arXiv | 2023-02-02 | Github[123] | - |
| Visual Programming: Compositional Visual Reasoning Without Training [124] | CVPR | 2022-11-18 | Github[125] | Local Demo |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [126] | NeurIPS | 2022-09-20 | Github[127] | - |
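A recurring pattern in the chain-of-thought work listed above (for example Multimodal Chain-of-Thought Reasoning in Language Models) is a two-stage scheme: first generate a rationale conditioned on the image and question, then generate the final answer conditioned on the question plus that rationale. The sketch below captures only that control flow; `generate` is a hypothetical placeholder for any multimodal generation call, not a specific library API.

```python
# Two-stage multimodal CoT sketch (rationale generation, then answer inference),
# following the general scheme of MM-CoT. `generate` is a hypothetical placeholder
# for a multimodal model call; it is not a specific library API.

from typing import Callable

def cot_answer(
    generate: Callable[[str, str], str],  # (image_path, text_prompt) -> generated text
    image_path: str,
    question: str,
) -> str:
    # Stage 1: produce an intermediate rationale grounded in the image.
    rationale = generate(image_path, f"Question: {question}\nLet's reason step by step:")
    # Stage 2: answer the question conditioned on the rationale.
    answer = generate(
        image_path,
        f"Question: {question}\nRationale: {rationale}\nTherefore, the answer is:",
    )
    return answer

# Example with a dummy model so the sketch runs end to end.
dummy = lambda img, prompt: "stub output for: " + prompt.splitlines()[0]
print(cot_answer(dummy, "chart.png", "Which month had the highest sales?"))
```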
LLM-Aided Visual Reasoning

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [128] | arXiv | 2023-05-30 | Github[129] | Demo[130] |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models [131] | arXiv | 2023-05-24 | Github[132] | - |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models [133] | arXiv | 2023-05-24 | Github[134] | Local Demo |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [135] | arXiv | 2023-05-10 | Github[136] | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls [137] | arXiv | 2023-05-04 | Github[138] | Demo[139] |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [140] | arXiv | 2023-04-19 | Github[141] | Demo[142] |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [143] | arXiv | 2023-03-30 | Github[144] | Demo[145] |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [146] | arXiv | 2023-03-20 | Github[147] | Demo[148] |
| ViperGPT: Visual Inference via Python Execution for Reasoning [149] | arXiv | 2023-03-14 | Github[150] | Local Demo |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions [151] | arXiv | 2023-03-12 | Github[152] | Local Demo |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [153] | arXiv | 2023-03-08 | Github[154] | Demo[155] |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners [156] | CVPR | 2023-03-03 | Github[157] | - |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning [158] | CVPR | 2022-11-21 | Github[159] | - |
| Visual Programming: Compositional Visual Reasoning Without Training [160] | CVPR | 2022-11-18 | Github[161] | Local Demo |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [162] | arXiv | 2022-04-01 | Github[163] | - |
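Many of the systems above (Visual ChatGPT, MM-REACT, HuggingGPT, GPT4Tools) share one architecture: the LLM acts as a controller that decides which vision tool to call, reads the tool's output back as text, and repeats until it can answer. The sketch below shows that dispatch loop in a stripped-down form; the tool names, the `llm` callable, and the "TOOL:/ANSWER:" protocol are all assumptions made for illustration, not the protocol of any particular project.

```python
# Stripped-down sketch of LLM-aided visual reasoning: the LLM picks a tool,
# the tool result is appended to the context, and the loop repeats.
# Tool names, the `llm` callable, and the TOOL:/ANSWER: protocol are illustrative.

from typing import Callable, Dict

def run_controller(llm: Callable[[str], str],
                   tools: Dict[str, Callable[[str], str]],
                   question: str,
                   image_path: str,
                   max_steps: int = 5) -> str:
    context = f"Image: {image_path}\nQuestion: {question}\nAvailable tools: {', '.join(tools)}"
    for _ in range(max_steps):
        reply = llm(context)                      # e.g. "TOOL: detect(person)" or "ANSWER: ..."
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("TOOL:"):
            name, _, arg = reply[len("TOOL:"):].strip().partition("(")
            observation = tools.get(name.strip(), lambda a: "unknown tool")(arg.rstrip(")"))
            context += f"\n{reply}\nObservation: {observation}"
    return "No answer within the step budget."

# Dummy tools and a dummy LLM so the control flow can be exercised.
tools = {"caption": lambda _: "two people playing frisbee on a beach",
         "detect": lambda q: f"found 2 instances of '{q}'"}
fake_llm_replies = iter(["TOOL: caption(image)", "ANSWER: Two people are playing frisbee."])
print(run_controller(lambda ctx: next(fake_llm_replies), tools, "What is happening?", "beach.jpg"))
```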
Foundation Models

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| Transfer Visual Prompt Generator across LLMs [164] | arXiv | 2023-05-02 | Github[165] | Demo[166] |
| GPT-4 Technical Report [167] | arXiv | 2023-03-15 | - | - |
| PaLM-E: An Embodied Multimodal Language Model [168] | arXiv | 2023-03-06 | - | Demo[169] |
| Prismer: A Vision-Language Model with An Ensemble of Experts [170] | arXiv | 2023-03-04 | Github[171] | Demo[172] |
| Language Is Not All You Need: Aligning Perception with Language Models [173] | arXiv | 2023-02-27 | Github[174] | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [175] | arXiv | 2023-01-30 | Github[176] | Demo[177] |
| VIMA: General Robot Manipulation with Multimodal Prompts [178] | ICML | 2022-10-06 | Github[179] | - |
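Among the foundation models above, BLIP-2 is easy to try locally because it ships with the Hugging Face transformers library. The snippet below is a minimal captioning/VQA sketch using the `Salesforce/blip2-opt-2.7b` checkpoint; it assumes a CUDA GPU and the "Question: ... Answer:" prompt format suggested in the BLIP-2 documentation, and is meant as a quick starting point rather than a tuned setup.

```python
# Minimal BLIP-2 inference sketch with Hugging Face transformers.
# Assumes a CUDA GPU with enough memory; drop the .to("cuda", ...) calls to run on CPU (slow).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

image = Image.open("example.jpg").convert("RGB")

# Plain captioning: with no text prompt, the model simply describes the image.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True).strip())

# Prompted VQA-style query.
prompt = "Question: how many people are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(answer_ids[0], skip_special_tokens=True).strip())
```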
Others

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? [180] | arXiv | 2023-06-01 | Coming soon[181] | - |
| Contextual Object Detection with Multimodal Large Language Models [182] | arXiv | 2023-05-29 | Github[183] | Demo[184] |
| Generating Images with Multimodal Language Models [185] | arXiv | 2023-05-26 | Github[186] | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models [187] | arXiv | 2023-05-26 | Github[188] | - |
| Evaluating Object Hallucination in Large Vision-Language Models [189] | arXiv | 2023-05-17 | Github[190] | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs [191] | ICML | 2023-01-31 | Github[192] | Demo[193] |
Datasets for Pre-training Alignment

| Name | Paper | Type | Modalities |
| --- | --- | --- | --- |
| MS-COCO | Microsoft COCO: Common Objects in Context [194] | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs [195] | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning [196] | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs [197] | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations [198] | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models [199] | Caption | Image-Text |
| AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding [200] | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark [201] | Caption | Image-Text |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [202] | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language [203] | Caption | Video-Text |
| Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [204] | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [205] | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline [206] | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale [207] | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages [208] | ASR | Image-Audio-Text |
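These alignment corpora all reduce to the same training unit: an (image, caption) pair used to align a vision encoder with the language model. A minimal PyTorch Dataset for such pairs is sketched below; the JSON-lines annotation format (`{"image": ..., "caption": ...}`) is an assumption made for illustration, since each corpus ships with its own on-disk layout.

```python
# Minimal sketch of an image-caption pair dataset for alignment pre-training.
# Assumes a JSON-lines annotation file with {"image": "...", "caption": "..."} records;
# real corpora (COCO, LAION, etc.) each use their own on-disk layout.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    def __init__(self, annotation_file: str, image_root: str, transform=None):
        lines = Path(annotation_file).read_text().splitlines()
        self.records = [json.loads(line) for line in lines if line.strip()]
        self.image_root = Path(image_root)
        self.transform = transform  # e.g. the vision encoder's preprocessing

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        image = Image.open(self.image_root / rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, rec["caption"]
```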
Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [209] | Link [210] | A large-scale multimodal instruction dataset with multi-turn dialogue |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [211] | Link [212] | A comprehensive multimodal instruction-tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [213] | Link [214] | 100K high-quality video instruction data |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning [215] | Coming soon [216] | Multimodal in-context instruction tuning |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning [217] | Link [218] | A large-scale, broad-coverage multimodal instruction-tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [219] | Coming soon [220] | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [221] | Link [222] | Tool-related instruction dataset |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [223] | Coming soon [224] | Multimodal instruction-tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning [225] | Link [226] | Instruction-tuning dataset with 5,000 images and around 30,000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [227] | Coming soon [228] | Large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding [229] | Link [230] | Video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages [231] | Link [232] | Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models [233] | Link [234] | A multimodal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [235] | Link [236] | Multimodal aligned dataset for improving the model's usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning [237] | Link [238] | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [239] | Link [240] | The first multimodal instruction-tuning benchmark dataset |
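To get a feel for what these instruction sets contain, it is usually enough to pull one annotation file and inspect a few records. The sketch below downloads an annotation file from LLaVA-Instruct-150K via the Hugging Face Hub; the filename `llava_instruct_150k.json` is my assumption about the repository layout, so check the dataset page (link [238]) for the files it actually contains.

```python
# Sketch: download one annotation file from LLaVA-Instruct-150K and inspect a record.
# The filename below is an assumption about the repo layout; see the dataset page
# (reference [238]) for the files it actually contains.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K",
    filename="llava_instruct_150k.json",  # assumed filename
    repo_type="dataset",
)
with open(path) as f:
    data = json.load(f)

print(len(data), "records")
print(json.dumps(data[0], indent=2, ensure_ascii=False))  # image id + multi-turn conversations
```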
Datasets of In-Context Learning

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning [241] | Coming soon [242] | Multimodal in-context instruction dataset |
Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [243] | Coming soon [244] | Large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction [245] | Coming soon | An inference-time dataset for evaluating VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [246] | Link [247] | Large-scale multi-choice dataset featuring multimodal science questions across diverse domains |
Other Datasets

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue [248] | Link [249] | Multimodal dialogue dataset |
| LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [250] | Link [251] | A benchmark for quantitatively evaluating MLLMs on a variety of 2D/3D vision tasks |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [252] | Link [253] | A dataset for evaluating multiple capabilities |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [254] | Link [255] | A quantitative evaluation framework for video conversation models |
| LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [256] | Link [257] | An evaluation platform for MLLMs |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [258] | Link [259] | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [260] | Link [261] | A manually-photographed multimodal fine-tuning dataset for learning to reject instructions |
The content of this article is mainly translated and organized from GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models [262]. It will continue to be updated; please like and bookmark!
References
[1] Awesome Papers: #超棒的论文
[2] Multimodal Instruction Tuning: #多模态指令调整
[3] Multimodal In-Context Learning: #多模态情境学习
[4] Multimodal Chain-of-Thought: #多模态思维链条
[5] LLM-Aided Visual Reasoning: #由llm辅助的视觉推理
[6] Foundation Models: #基础模型
[7] Others: #其他
[8] Awesome Datasets: #超棒的数据集
[9] Datasets for Pre-training Alignment: #对齐预训练的数据集
[10] Datasets of Multimodal Instruction Tuning: #多模态指令调整的数据集
[11] Datasets of In-Context Learning: #情境学习的数据集
[12] Datasets of Multimodal Chain-of-Thought: #多模态思维链条的数据集
[13] Others: #其他-1
[14] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[15] Github: https://github.com/lyuchenyang/Macaw-LLM
[16] Coming soon:
[17] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[18] Github: https://github.com/OpenLAMM/LAMM
[19] Demo: https://huggingface.co/spaces/openlamm/LAMM
[20] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[21] Github: https://github.com/mbzuai-oryx/Video-ChatGPT
[22] Demo: https://www.ival-mbzuai.com/video-chatgpt
[23] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[24] Github: https://github.com/Luodian/Otter
[25] Demo: https://otter.cliangyu.com/
[26] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[27] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding: https://arxiv.org/pdf/2306.02858.pdf
[28] Github: https://github.com/DAMO-NLP-SG/Video-LLaMA
[29] Demo: https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
[30] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[31] Github: https://github.com/microsoft/LLaVA-Med
[32] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[33] Github: https://github.com/StevenGrove/GPT4Tools
[34] Demo: https://huggingface.co/spaces/stevengrove/GPT4Tools
[35] PandaGPT: One Model To Instruction-Follow Them All: https://arxiv.org/pdf/2305.16355.pdf
[36] Github: https://github.com/yxuansu/PandaGPT
[37] Demo: https://huggingface.co/spaces/GMFTBY/PandaGPT
[38] ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[39] Github: https://github.com/joez17/ChatBridge
[40] Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models: https://arxiv.org/pdf/2305.15023.pdf
[41] Github: https://github.com/luogen1996/LaVIN
[42] DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[43] Github: https://github.com/OptimalScale/DetGPT
[44] Demo: https://d3c431c0c77b1d9010.gradio.live/
[45] VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks: https://arxiv.org/pdf/2305.11175.pdf
[46] Github: https://github.com/OpenGVLab/VisionLLM
[47] Demo: https://igpt.opengvlab.com/
[48] Listen, Think, and Understand: https://arxiv.org/pdf/2305.10790.pdf
[49] Github: https://github.com/YuanGongND/ltu
[50] Demo: https://github.com/YuanGongND/ltu
[51] Github: https://github.com/THUDM/VisualGLM-6B
[52] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[53] Github: https://github.com/xiaoman-zhang/PMC-VQA
[54] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning: https://arxiv.org/pdf/2305.06500.pdf
[55] Github: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
[56] VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[57] Github: https://github.com/OpenGVLab/Ask-Anything
[58] Demo: https://ask.opengvlab.com/
[59] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans: https://arxiv.org/pdf/2305.04790.pdf
[60] Github: https://github.com/open-mmlab/Multimodal-GPT
[61] Demo: https://mmgpt.openmmlab.org.cn/
[62] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[63] Github: https://github.com/phellonchen/X-LLM
[64] LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[65] Github: https://github.com/YunxinLi/LingCloud
[66] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model: https://arxiv.org/pdf/2304.15010.pdf
[67] Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[68] Demo: http://llama-adapter.opengvlab.com/
[69] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[70] Github: https://github.com/X-PLUG/mPLUG-Owl
[71] Demo: https://huggingface.co/spaces/MAGAer13/mPLUG-Owl
[72] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[73] Github: https://github.com/Vision-CAIR/MiniGPT-4
[74] Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[75] GitHub: https://github.com/haotian-liu/LLaVA
[76] Demo: https://llava.hliu.cc/
[77] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention: https://arxiv.org/pdf/2303.16199.pdf
[78] Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[79] Demo: https://huggingface.co/spaces/csuhan/LLaMA-Adapter
[80] MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[81] Github: https://github.com/VT-NLP/MultiInstruct
[82] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[83] Github: https://github.com/Luodian/Otter
[84] Demo: https://otter.cliangyu.com/
[85] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[86] Github: https://github.com/lupantech/chameleon-llm
[87] Demo: https://chameleon-llm.github.io/
[88] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: https://arxiv.org/pdf/2303.17580.pdf
[89] Github: https://github.com/microsoft/JARVIS
[90] Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[91] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[92] Github: https://github.com/microsoft/MM-REACT
[93] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[94] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering: https://arxiv.org/pdf/2303.01903.pdf
[95] Github: https://github.com/MILVLG/prophet
[96] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[97] Github: https://github.com/allenai/visprog
[98] An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA: https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974
[99] Github: https://github.com/microsoft/PICa
[100] Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/pdf/2204.14198.pdf
[101] Github: https://github.com/mlfoundations/open_flamingo
[102] Demo: https://huggingface.co/spaces/dhansmair/flamingo-mini-cap
[103] Multimodal Few-Shot Learning with Frozen Language Models: https://arxiv.org/pdf/2106.13884.pdf
[104] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[105] Github: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[106] Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[107] Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[108] Github: https://github.com/ttengwang/Caption-Anything
[109] Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[110] Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings: https://arxiv.org/pdf/2305.02317.pdf
[111] Coming soon: https://github.com/dannyrose30/VCOT
[112] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[113] Github: https://github.com/lupantech/chameleon-llm
[114] Demo: https://chameleon-llm.github.io/
[115] Chain of Thought Prompt Tuning in Vision Language Models: https://arxiv.org/pdf/2304.07919.pdf
[116] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[117] Github: https://github.com/microsoft/MM-REACT
[118] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[119] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[120] Github: https://github.com/microsoft/TaskMatrix
[121] Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[122] Multimodal Chain-of-Thought Reasoning in Language Models: https://arxiv.org/pdf/2302.00923.pdf
[123] Github: https://github.com/amazon-science/mm-cot
[124] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[125] Github: https://github.com/allenai/visprog
[126] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[127] Github: https://github.com/lupantech/ScienceQA
[128] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[129] Github: https://github.com/StevenGrove/GPT4Tools
[130] Demo: https://c60eb7e9400930f31b.gradio.live/
[131] LayoutGPT: Compositional Visual Planning and Generation with Large Language Models: https://arxiv.org/pdf/2305.15393.pdf
[132] Github: https://github.com/weixi-feng/LayoutGPT
[133] IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models: https://arxiv.org/pdf/2305.14985.pdf
[134] Github: https://github.com/Hxyou/IdealGPT
[135] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[136] Github: https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat
[137] Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[138] Github: https://github.com/ttengwang/Caption-Anything
[139] Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[140] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[141] Github: https://github.com/lupantech/chameleon-llm
[142] Demo: https://chameleon-llm.github.io/
[143] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: https://arxiv.org/pdf/2303.17580.pdf
[144] Github: https://github.com/microsoft/JARVIS
[145] Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[146] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[147] Github: https://github.com/microsoft/MM-REACT
[148] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[149] ViperGPT: Visual Inference via Python Execution for Reasoning: https://arxiv.org/pdf/2303.08128.pdf
[150] Github: https://github.com/cvlab-columbia/viper
[151] ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions: https://arxiv.org/pdf/2303.06594.pdf
[152] Github: https://github.com/Vision-CAIR/ChatCaptioner
[153] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[154] Github: https://github.com/microsoft/TaskMatrix
[155] Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[156] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners: https://arxiv.org/pdf/2303.02151.pdf
[157] Github: https://github.com/ZrrSkywalker/CaFo
[158] PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning: https://arxiv.org/pdf/2211.11682.pdf
[159] Github: https://github.com/yangyangyang127/PointCLIP_V2
[160] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[161] Github: https://github.com/allenai/visprog
[162] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language: https://arxiv.org/pdf/2204.00598.pdf
[163] Github: https://github.com/google-research/google-research/tree/master/socraticmodels
[164] Transfer Visual Prompt Generator across LLMs: https://arxiv.org/pdf/2305.01278.pdf
[165] Github: https://github.com/VPGTrans/VPGTrans
[166] Demo: https://3fc7715dbc44234a7f.gradio.live/
[167] GPT-4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf
[168] PaLM-E: An Embodied Multimodal Language Model: https://arxiv.org/pdf/2303.03378.pdf
[169] Demo: https://palm-e.github.io/#demo
[170] Prismer: A Vision-Language Model with An Ensemble of Experts: https://arxiv.org/pdf/2303.02506.pdf
[171] Github: https://github.com/NVlabs/prismer
[172] Demo: https://huggingface.co/spaces/lorenmt/prismer
[173] Language Is Not All You Need: Aligning Perception with Language Models: https://arxiv.org/pdf/2302.14045.pdf
[174] Github: https://github.com/microsoft/unilm
[175] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models: https://arxiv.org/pdf/2301.12597.pdf
[176] Github: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[177] Demo: https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb
[178] VIMA: General Robot Manipulation with Multimodal Prompts: https://arxiv.org/pdf/2210.03094.pdf
[179] Github: https://github.com/vimalabs/VIMA
[180] Can Large Pre-trained Models Help Vision Models on Perception Tasks?: https://arxiv.org/pdf/2306.00693.pdf
[181] Coming soon:
[182] Contextual Object Detection with Multimodal Large Language Models: https://arxiv.org/pdf/2305.18279.pdf
[183] Github: https://github.com/yuhangzang/ContextDET
[184] Demo: https://huggingface.co/spaces/yuhangzang/ContextDet-Demo
[185] Generating Images with Multimodal Language Models: https://arxiv.org/pdf/2305.17216.pdf
[186] Github: https://github.com/kohjingyu/gill
[187] On Evaluating Adversarial Robustness of Large Vision-Language Models: https://arxiv.org/pdf/2305.16934.pdf
[188] Github: https://github.com/yunqing-me/AttackVLM
[189] Evaluating Object Hallucination in Large Vision-Language Models: https://arxiv.org/pdf/2305.10355.pdf
[190] Github: https://github.com/RUCAIBox/POPE
[191] Grounding Language Models to Images for Multimodal Inputs and Outputs: https://arxiv.org/pdf/2301.13823.pdf
[192] Github: https://github.com/kohjingyu/fromage
[193] Demo: https://huggingface.co/spaces/jykoh/fromage
[194] Microsoft COCO: Common Objects in Context: https://arxiv.org/pdf/1405.0312.pdf
[195] Im2Text: Describing Images Using 1 Million Captioned Photographs: https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
[196] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning: https://aclanthology.org/P18-1238.pdf
[197] LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs: https://arxiv.org/pdf/2111.02114.pdf
[198] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations: https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
[199] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models: https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf
[200] AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding: https://arxiv.org/pdf/1711.06475.pdf
[201] Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark: https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf
[202] Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks: https://arxiv.org/pdf/2306.04362.pdf
[203] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language: https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf
[204] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval: https://arxiv.org/pdf/2104.00650.pdf
[205] WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research: https://arxiv.org/pdf/2303.17395.pdf
[206] AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline: https://arxiv.org/pdf/1709.05522.pdf
[207] AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale: https://arxiv.org/pdf/1808.10583.pdf
[208] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[209] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[210] Link: https://github.com/lyuchenyang/Macaw-LLM/tree/main/data
[211] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[212] Link: https://github.com/OpenLAMM/LAMM#lamm-dataset
[213] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[214] Link: https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder
[215] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[216] Coming soon: https://github.com/Luodian/Otter
[217] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[218] Link: https://huggingface.co/datasets/MMInstruction/M3IT
[219] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[220] Coming soon: https://github.com/microsoft/LLaVA-Med#llava-med-dataset
[221] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[222] Link: https://github.com/StevenGrove/GPT4Tools#dataset
[223] ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[224] Coming soon: https://iva-chatbridge.github.io/
[225] DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[226] Link: https://github.com/OptimalScale/DetGPT/tree/main/dataset
[227] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[228] Coming soon: https://xiaoman-zhang.github.io/PMC-VQA/
[229] VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[230] Link: https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
[231] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[232] Link: https://github.com/phellonchen/X-LLM
[233] LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[234] Link: https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2
[235] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[236] Link: https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align
[237] Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[238] Link: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
[239] MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[240] Link: https://github.com/VT-NLP/MultiInstruct
[241] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[242] Coming soon: https://github.com/Luodian/Otter
[243] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[244] Coming soon: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[245] Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[246] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[247] Link: https://github.com/lupantech/ScienceQA#ghost-download-the-dataset
[248] IMAD: IMage-Augmented multi-modal Dialogue: https://arxiv.org/pdf/2305.10512.pdf
[249] Link: https://github.com/VityaVitalich/IMAD
[250] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[251] Link: https://github.com/OpenLAMM/LAMM#lamm-benchmark
[252] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[253] Link: https://github.com/X-PLUG/mPLUG-Owl/tree/main/OwlEval
[254] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[255] Link: https://github.com/mbzuai-oryx/Video-ChatGPT#quantitative-evaluation-bar_chart
[256] LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models: https://arxiv.org/pdf/2306.09265.pdf
[257] Link: https://github.com/OpenGVLab/Multi-Modality-Arena
[258] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[259] Link: https://drive.google.com/drive/folders/1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft
[260] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[261] Link: https://drive.google.com/drive/folders/1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0
[262] GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models