An Awesome Collection of Multimodal Large Language Model (MLLM) Resources

Preface

With AI booming, new AI applications keep appearing and AI is reshaping every industry. In my view, if ChatGPT marked the beginning of the AI revolution, then multimodal large models surely represent the future of AI applications.

This post is a resource collection for multimodal large language models. It lists papers, applications, datasets, and other learning resources for a large number of MLLMs, big and small; I recommend bookmarking it.

I have previously written introductions to some of the projects covered here; a few are listed below for interested readers:

GPT4All: a locally deployable AI assistant

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Audiocraft: a PyTorch-based deep learning research library for AI audio generation

Recognize_Anything-Tag2Text: a powerful image tagging model, together with Tag2Text

MLC LLM: deploy any language model natively in local applications

...

Awesome Multimodal Large Language Models

🔥🔥🔥 This is a curated list of Multimodal Large Language Models (MLLMs), covering datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and more.

🔥🔥🔥 This list is updated continuously.

🔥🔥🔥 A survey paper on MLLMs is in preparation and will be released soon!


Table of Contents

• Awesome Papers[1] • Multimodal Instruction Tuning[2] • Multimodal In-Context Learning[3] • Multimodal Chain-of-Thought[4] • LLM-Aided Visual Reasoning[5] • Foundation Models[6] • Others[7]

• Awesome Datasets[8] • Datasets of Pre-Training for Alignment[9] • Datasets for Multimodal Instruction Tuning[10] • Datasets for In-Context Learning[11] • Datasets for Multimodal Chain-of-Thought[12] • Others[13]


Awesome Papers

I have Chinese translations of some of the papers below; contact me if you would like a copy.

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
| :---- | :---- | :--- | :--- | :--- |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [14] | arXiv | 2023-06-15 | Github[15] | Coming soon[16] |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [17] | arXiv | 2023-06-11 | Github[18] | Demo[19] |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [20] | arXiv | 2023-06-08 | Github[21] | Demo[22] |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning [23] | arXiv | 2023-06-08 | Github[24] | Demo[25] |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning [26] | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [27] | arXiv | 2023-06-05 | Github[28] | Demo[29] |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [30] | arXiv | 2023-06-01 | Github[31] | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [32] | arXiv | 2023-05-30 | Github[33] | Demo[34] |
| PandaGPT: One Model To Instruction-Follow Them All [35] | arXiv | 2023-05-25 | Github[36] | Demo[37] |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [38] | arXiv | 2023-05-25 | Github[39] | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [40] | arXiv | 2023-05-24 | Github[41] | Local Demo |
| DetGPT: Detect What You Need via Reasoning [42] | arXiv | 2023-05-23 | Github[43] | Demo[44] |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [45] | arXiv | 2023-05-18 | Github[46] | Demo[47] |
| Listen, Think, and Understand [48] | arXiv | 2023-05-18 | Github[49] | Demo[50] |
| VisualGLM-6B | - | 2023-05-17 | Github[51] | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [52] | arXiv | 2023-05-17 | Github[53] | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [54] | arXiv | 2023-05-11 | Github[55] | Local Demo |
| VideoChat: Chat-Centric Video Understanding [56] | arXiv | 2023-05-10 | Github[57] | Demo[58] |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans [59] | arXiv | 2023-05-08 | Github[60] | Demo[61] |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages [62] | arXiv | 2023-05-07 | Github[63] | - |
| LMEye: An Interactive Perception Network for Large Language Models [64] | arXiv | 2023-05-05 | Github[65] | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model [66] | arXiv | 2023-04-28 | Github[67] | Demo[68] |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [69] | arXiv | 2023-04-27 | Github[70] | Demo[71] |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [72] | arXiv | 2023-04-20 | Github[73] | - |
| Visual Instruction Tuning [74] | arXiv | 2023-04-17 | GitHub[75] | Demo[76] |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention [77] | arXiv | 2023-03-28 | Github[78] | Demo[79] |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [80] | ACL | 2022-12-21 | Github[81] | - |
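Most of the instruction-tuned models above can be tried through the linked repos and demos. As a rough illustration of how such a model is typically queried, here is a minimal sketch that runs InstructBLIP through the Hugging Face transformers library; the checkpoint name, the example image URL, and the generation settings are my own assumptions rather than anything specified by this list.

```python
# Minimal sketch of querying an instruction-tuned vision-language model
# (assumptions: checkpoint name, image URL, generation settings).
import requests
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_name = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint name
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained(model_name)
model = InstructBlipForConditionalGeneration.from_pretrained(model_name).to(device)

# Load an image and ask an instruction-style question about it.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder URL
prompt = "Describe the unusual aspects of this image."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Generate an answer conditioned on both the image and the instruction.
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```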

Chinese Versions of the Papers

I have prepared Chinese translations of some of these papers; message me if you would like a copy.


Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
| :---- | :---- | :--- | :--- | :--- |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning [82] | arXiv | 2023-06-08 | Github[83] | Demo[84] |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [85] | arXiv | 2023-04-19 | Github[86] | Demo[87] |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace [88] | arXiv | 2023-03-30 | Github[89] | Demo[90] |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [91] | arXiv | 2023-03-20 | Github[92] | Demo[93] |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering [94] | CVPR | 2023-03-03 | Github[95] | - |
| Visual Programming: Compositional Visual Reasoning Without Training [96] | CVPR | 2022-11-18 | Github[97] | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [98] | AAAI | 2022-06-28 | Github[99] | - |
| Flamingo: a Visual Language Model for Few-Shot Learning [100] | NeurIPS | 2022-04-29 | Github[101] | Demo[102] |
| Multimodal Few-Shot Learning with Frozen Language Models [103] | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
| :---- | :---- | :--- | :--- | :--- |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [104] | arXiv | 2023-05-24 | Github[105] | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction [106] | arXiv | 2023-05-23 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls [107] | arXiv | 2023-05-04 | Github[108] | Demo[109] |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [110] | arXiv | 2023-05-03 | Coming soon[111] | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [112] | arXiv | 2023-04-19 | Github[113] | Demo[114] |
| Chain of Thought Prompt Tuning in Vision Language Models [115] | arXiv | 2023-04-16 | Coming soon | - |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [116] | arXiv | 2023-03-20 | Github[117] | Demo[118] |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [119] | arXiv | 2023-03-08 | Github[120] | Demo[121] |
| Multimodal Chain-of-Thought Reasoning in Language Models [122] | arXiv | 2023-02-02 | Github[123] | - |
| Visual Programming: Compositional Visual Reasoning Without Training [124] | CVPR | 2022-11-18 | Github[125] | Local Demo |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [126] | NeurIPS | 2022-09-20 | Github[127] | - |
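A pattern shared by several of the papers above (Multimodal Chain-of-Thought Reasoning and ScienceQA in particular) is a two-stage prompt: first elicit a free-form rationale grounded in the image, then generate the final answer conditioned on that rationale. The sketch below only illustrates that generic pattern; `query_vlm`, the prompts, and the example question are placeholders of mine, not any paper's actual code.

```python
# Schematic two-stage "rationale, then answer" prompting. query_vlm is a stand-in
# for any vision-language model call; the prompts and example data are illustrative.
from typing import List

def query_vlm(image_path: str, prompt: str) -> str:
    # Replace this stub with a real vision-language model call.
    return f"[model output for: {prompt[:40]}...]"

def answer_with_cot(image_path: str, question: str, options: List[str]) -> str:
    # Stage 1: ask the model for step-by-step reasoning grounded in the image.
    rationale = query_vlm(
        image_path,
        f"Question: {question}\nOptions: {options}\nExplain your reasoning step by step.",
    )
    # Stage 2: generate the final answer conditioned on question + rationale.
    return query_vlm(
        image_path,
        f"Question: {question}\nOptions: {options}\nReasoning: {rationale}\nAnswer with one option:",
    )

print(answer_with_cot("diagram.png", "Which force pulls the ball downward?", ["gravity", "magnetism"]))
```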

LLM-Aided Visual Reasoning

| Title | Venue | Date | Code | Demo |
| :---- | :---- | :--- | :--- | :--- |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [128] | arXiv | 2023-05-30 | Github[129] | Demo[130] |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models [131] | arXiv | 2023-05-24 | Github[132] | - |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models [133] | arXiv | 2023-05-24 | Github[134] | Local Demo |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [135] | arXiv | 2023-05-10 | Github[136] | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls [137] | arXiv | 2023-05-04 | Github[138] | Demo[139] |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [140] | arXiv | 2023-04-19 | Github[141] | Demo[142] |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace [143] | arXiv | 2023-03-30 | Github[144] | Demo[145] |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [146] | arXiv | 2023-03-20 | Github[147] | Demo[148] |
| ViperGPT: Visual Inference via Python Execution for Reasoning [149] | arXiv | 2023-03-14 | Github[150] | Local Demo |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions [151] | arXiv | 2023-03-12 | Github[152] | Local Demo |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [153] | arXiv | 2023-03-08 | Github[154] | Demo[155] |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners [156] | CVPR | 2023-03-03 | Github[157] | - |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning [158] | CVPR | 2022-11-21 | Github[159] | - |
| Visual Programming: Compositional Visual Reasoning Without Training [160] | CVPR | 2022-11-18 | Github[161] | Local Demo |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [162] | arXiv | 2022-04-01 | Github[163] | - |
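A pattern that recurs across these systems (Visual ChatGPT, MM-REACT, HuggingGPT, Chameleon, and others) is that a text-only LLM acts as a controller: it picks a visual tool, reads the tool's textual output, and then answers. The sketch below is only a schematic of that controller loop; the tool names, prompts, and `fake_llm` stub are illustrative assumptions of mine, not the implementation of any paper above.

```python
# Schematic sketch of the "LLM as controller of visual tools" pattern.
# Tool names, prompts, and fake_llm are illustrative stand-ins.
from typing import Callable, Dict

def caption_image(image_path: str) -> str:
    # Placeholder for a captioning model such as BLIP-2.
    return f"A living room photo ({image_path}) showing a cat sitting on a sofa."

def detect_objects(image_path: str) -> str:
    # Placeholder for an open-vocabulary object detector.
    return f"Detections in {image_path}: cat x1, sofa x1, lamp x1."

TOOLS: Dict[str, Callable[[str], str]] = {"caption": caption_image, "detect": detect_objects}

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call; a real system would send `prompt` to an LLM.
    if "Pick one tool" in prompt:
        return "detect" if "how many" in prompt.lower() else "caption"
    return "Answer based on the observation above."

def answer(question: str, image_path: str) -> str:
    # 1) The LLM chooses which visual tool the question requires.
    tool_name = fake_llm(f"Question: {question}\nTools: {list(TOOLS)}\nPick one tool:")
    # 2) The chosen tool runs on the image; its text output becomes the observation.
    observation = TOOLS[tool_name](image_path)
    # 3) The LLM answers the question conditioned on the observation.
    return fake_llm(f"Question: {question}\nObservation: {observation}\nAnswer:")

print(answer("How many cats are on the sofa?", "living_room.jpg"))
```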

Foundation Models

| Title | Venue | Date | Code | Demo |
| :---- | :---- | :--- | :--- | :--- |
| Transfer Visual Prompt Generator across LLMs [164] | arXiv | 2023-05-02 | Github[165] | Demo[166] |
| GPT-4 Technical Report [167] | arXiv | 2023-03-15 | - | - |
| PaLM-E: An Embodied Multimodal Language Model [168] | arXiv | 2023-03-06 | - | Demo[169] |
| Prismer: A Vision-Language Model with Multiple Experts [170] | arXiv | 2023-03-04 | Github[171] | Demo[172] |
| Language Is Not All You Need: Aligning Perception with Language Models [173] | arXiv | 2023-02-27 | Github[174] | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [175] | arXiv | 2023-01-30 | Github[176] | Demo[177] |
| VIMA: General Robot Manipulation with Multimodal Prompts [178] | ICML | 2022-10-06 | Github[179] | - |

Others

| Title | Venue | Date | Code | Demo |
| :---- | :---- | :--- | :--- | :--- |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? [180] | arXiv | 2023-06-01 | Coming soon[181] | - |
| Contextual Object Detection with Multimodal Large Language Models [182] | arXiv | 2023-05-29 | Github[183] | Demo[184] |
| Generating Images with Multimodal Language Models [185] | arXiv | 2023-05-26 | Github[186] | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models [187] | arXiv | 2023-05-26 | Github[188] | - |
| Evaluating Object Hallucination in Large Vision-Language Models [189] | arXiv | 2023-05-17 | Github[190] | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs [191] | ICML | 2023-01-31 | Github[192] | Demo[193] |


Awesome Datasets

Datasets of Pre-Training for Alignment

| Name | Paper | Type | Modalities |
| :--- | :---- | :--- | :--------- |
| MS-COCO | Microsoft COCO: Common Objects in Context[194] | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs[195] | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning[196] | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs[197] | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[198] | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models[199] | Caption | Image-Text |
| AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding[200] | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark[201] | Caption | Image-Text |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks[202] | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language[203] | Caption | Video-Text |
| Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval[204] | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research[205] | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline[206] | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale[207] | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[208] | ASR | Image-Audio-Text |
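Most of the image-text pairs in this table are shipped as caption annotations keyed to image files. MS-COCO, for example, stores captions in a JSON file whose `images` entries carry an `id` and `file_name` and whose `annotations` entries pair an `image_id` with a `caption`. The snippet below is a small sketch of reading that layout; the annotation path is an assumption about where the file was downloaded.

```python
# Minimal sketch of reading MS-COCO style caption annotations (image-text pairs).
# The annotation path below is an assumption about the local download location.
import json
from collections import defaultdict

with open("annotations/captions_train2017.json", "r", encoding="utf-8") as f:
    coco = json.load(f)

# Map image_id -> file name, then group the caption strings per image.
id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
captions = defaultdict(list)
for ann in coco["annotations"]:
    captions[ann["image_id"]].append(ann["caption"])

# Print the first few image-text pairs.
for image_id, caps in list(captions.items())[:3]:
    print(id_to_file[image_id], "->", caps[0])
```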

Datasets for Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
| :--- | :---- | :--- | :---- |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[209] | Link[210] | A large-scale multimodal instruction dataset with multi-turn dialogue |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[211] | Link[212] | A comprehensive multimodal instruction-tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[213] | Link[214] | A dataset of 100K high-quality video instructions |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[215] | Coming soon[216] | Multimodal in-context instruction tuning |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[217] | Link[218] | A large-scale, broad-coverage multimodal instruction-tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[219] | Coming soon[220] | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[221] | Link[222] | A tool-related instruction dataset |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[223] | Coming soon[224] | A multimodal instruction-tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning[225] | Link[226] | An instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[227] | Coming soon[228] | A large-scale medical visual question answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding[229] | Link[230] | A video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[231] | Link[232] | A Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models[233] | Link[234] | A multimodal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[235] | Link[236] | A multimodal aligned dataset for improving model usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning[237] | Link[238] | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[239] | Link[240] | The first multimodal instruction-tuning benchmark dataset |
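Many of the instruction-tuning sets above, LLaVA-Instruct-150K in particular, are released as a JSON list in which each record ties an image to a multi-turn conversation. The sketch below assumes the LLaVA-style layout (an `image` field plus a `conversations` list of `from`/`value` turns); treat the file name and field names as assumptions if you work with a different release.

```python
# Minimal sketch of iterating over LLaVA-style instruction-tuning records.
# The file name and the field names ("image", "conversations", "from", "value")
# are assumptions based on the LLaVA-Instruct-150K release.
import json

with open("llava_instruct_150k.json", "r", encoding="utf-8") as f:
    records = json.load(f)

for record in records[:2]:
    print("image:", record.get("image"))
    for turn in record["conversations"]:
        speaker = "USER" if turn["from"] == "human" else "ASSISTANT"
        print(f"  {speaker}: {turn['value'][:80]}")
```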

Datasets for In-Context Learning

| Name | Paper | Link | Notes |
| :--- | :---- | :--- | :---- |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[241] | Coming soon[242] | A multimodal in-context instruction dataset |

Datasets for Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
| :--- | :---- | :--- | :---- |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[243] | Coming soon[244] | A large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[245] | Coming soon | An inference-time dataset for evaluating VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[246] | Link[247] | A large-scale multiple-choice dataset of multimodal science questions spanning many domains |

Other Datasets

| Name | Paper | Link | Notes |
| :--- | :---- | :--- | :---- |
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue[248] | Link[249] | A multimodal dialogue dataset |
| LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[250] | Link[251] | A benchmark for quantitatively evaluating MLLMs on various 2D/3D vision tasks |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[252] | Link[253] | A dataset for evaluating multiple capabilities |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[254] | Link[255] | A quantitative evaluation framework for video conversation models |
| LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models[256] | Link[257] | An evaluation platform for MLLMs |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[258] | Link[259] | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[260] | Link[261] | A manually photographed multimodal fine-tuning dataset for learning to reject instructions |

Statement

The content of this post is mainly translated and organized from GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models[262]. It will continue to be updated; please like and bookmark!

References

[1] Awesome Papers: #awesome-papers
[2] Multimodal Instruction Tuning: #multimodal-instruction-tuning
[3] Multimodal In-Context Learning: #multimodal-in-context-learning
[4] Multimodal Chain-of-Thought: #multimodal-chain-of-thought
[5] LLM-Aided Visual Reasoning: #llm-aided-visual-reasoning
[6] Foundation Models: #foundation-models
[7] Others: #others
[8] Awesome Datasets: #awesome-datasets
[9] Datasets of Pre-Training for Alignment: #datasets-of-pre-training-for-alignment
[10] Datasets for Multimodal Instruction Tuning: #datasets-for-multimodal-instruction-tuning
[11] Datasets for In-Context Learning: #datasets-for-in-context-learning
[12] Datasets for Multimodal Chain-of-Thought: #datasets-for-multimodal-chain-of-thought
[13] Others: #other-datasets
[14] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[15] Github: https://github.com/lyuchenyang/Macaw-LLM
[16] Coming soon:
[17] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[18] Github: https://github.com/OpenLAMM/LAMM
[19] Demo: https://huggingface.co/spaces/openlamm/LAMM
[20] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[21] Github: https://github.com/mbzuai-oryx/Video-ChatGPT
[22] Demo: https://www.ival-mbzuai.com/video-chatgpt
[23] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[24] Github: https://github.com/Luodian/Otter
[25] Demo: https://otter.cliangyu.com/
[26] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[27] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding: https://arxiv.org/pdf/2306.02858.pdf
[28] Github: https://github.com/DAMO-NLP-SG/Video-LLaMA
[29] Demo: https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
[30] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[31] Github: https://github.com/microsoft/LLaVA-Med
[32] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[33] Github: https://github.com/StevenGrove/GPT4Tools
[34] Demo: https://huggingface.co/spaces/stevengrove/GPT4Tools
[35] PandaGPT: One Model To Instruction-Follow Them All: https://arxiv.org/pdf/2305.16355.pdf
[36] Github: https://github.com/yxuansu/PandaGPT
[37] Demo: https://huggingface.co/spaces/GMFTBY/PandaGPT
[38] ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[39] Github: https://github.com/joez17/ChatBridge
[40] Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models: https://arxiv.org/pdf/2305.15023.pdf
[41] Github: https://github.com/luogen1996/LaVIN
[42] DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[43] Github: https://github.com/OptimalScale/DetGPT
[44] Demo: https://d3c431c0c77b1d9010.gradio.live/
[45] VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks: https://arxiv.org/pdf/2305.11175.pdf
[46] Github: https://github.com/OpenGVLab/VisionLLM
[47] Demo: https://igpt.opengvlab.com/
[48] Listen, Think, and Understand: https://arxiv.org/pdf/2305.10790.pdf
[49] Github: https://github.com/YuanGongND/ltu
[50] Demo: https://github.com/YuanGongND/ltu
[51] Github: https://github.com/THUDM/VisualGLM-6B
[52] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[53] Github: https://github.com/xiaoman-zhang/PMC-VQA
[54] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning: https://arxiv.org/pdf/2305.06500.pdf
[55] Github: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
[56] VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[57] Github: https://github.com/OpenGVLab/Ask-Anything
[58] Demo: https://ask.opengvlab.com/
[59] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans: https://arxiv.org/pdf/2305.04790.pdf
[60] Github: https://github.com/open-mmlab/Multimodal-GPT
[61] Demo: https://mmgpt.openmmlab.org.cn/
[62] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[63] Github: https://github.com/phellonchen/X-LLM
[64] LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[65] Github: https://github.com/YunxinLi/LingCloud
[66] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model: https://arxiv.org/pdf/2304.15010.pdf
[67] Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[68] Demo: http://llama-adapter.opengvlab.com/
[69] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[70] Github: https://github.com/X-PLUG/mPLUG-Owl
[71] Demo: https://huggingface.co/spaces/MAGAer13/mPLUG-Owl
[72] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[73] Github: https://github.com/Vision-CAIR/MiniGPT-4
[74] Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[75] GitHub: https://github.com/haotian-liu/LLaVA
[76] Demo: https://llava.hliu.cc/
[77] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention: https://arxiv.org/pdf/2303.16199.pdf
[78] Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[79] Demo: https://huggingface.co/spaces/csuhan/LLaMA-Adapter
[80] MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[81] Github: https://github.com/VT-NLP/MultiInstruct
[82] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[83] Github: https://github.com/Luodian/Otter
[84] Demo: https://otter.cliangyu.com/
[85] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[86] Github: https://github.com/lupantech/chameleon-llm
[87] Demo: https://chameleon-llm.github.io/
[88] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[89] Github: https://github.com/microsoft/JARVIS
[90] Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[91] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[92] Github: https://github.com/microsoft/MM-REACT
[93] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[94] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering: https://arxiv.org/pdf/2303.01903.pdf
[95] Github: https://github.com/MILVLG/prophet
[96] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[97] Github: https://github.com/allenai/visprog
[98] An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA: https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974
[99] Github: https://github.com/microsoft/PICa
[100] Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/pdf/2204.14198.pdf
[101] Github: https://github.com/mlfoundations/open_flamingo
[102] Demo: https://huggingface.co/spaces/dhansmair/flamingo-mini-cap
[103] Multimodal Few-Shot Learning with Frozen Language Models: https://arxiv.org/pdf/2106.13884.pdf
[104] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[105] Github: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[106] Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[107] Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[108] Github: https://github.com/ttengwang/Caption-Anything
[109] Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[110] Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings: https://arxiv.org/pdf/2305.02317.pdf
[111] Coming soon: https://github.com/dannyrose30/VCOT
[112] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[113] Github: https://github.com/lupantech/chameleon-llm
[114] Demo: https://chameleon-llm.github.io/
[115] Chain of Thought Prompt Tuning in Vision Language Models: https://arxiv.org/pdf/2304.07919.pdf
[116] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[117] Github: https://github.com/microsoft/MM-REACT
[118] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[119] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[120] Github: https://github.com/microsoft/TaskMatrix
[121] Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[122] Multimodal Chain-of-Thought Reasoning in Language Models: https://arxiv.org/pdf/2302.00923.pdf
[123] Github: https://github.com/amazon-science/mm-cot
[124] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[125] Github: https://github.com/allenai/visprog
[126] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[127] Github: https://github.com/lupantech/ScienceQA
[128] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[129] Github: https://github.com/StevenGrove/GPT4Tools
[130] Demo: https://c60eb7e9400930f31b.gradio.live/
[131] LayoutGPT: Compositional Visual Planning and Generation with Large Language Models: https://arxiv.org/pdf/2305.15393.pdf
[132] Github: https://github.com/weixi-feng/LayoutGPT
[133] IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models: https://arxiv.org/pdf/2305.14985.pdf
[134] Github: https://github.com/Hxyou/IdealGPT
[135] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[136] Github: https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat
[137] Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[138] Github: https://github.com/ttengwang/Caption-Anything
[139] Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[140] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[141] Github: https://github.com/lupantech/chameleon-llm
[142] Demo: https://chameleon-llm.github.io/
[143] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[144] Github: https://github.com/microsoft/JARVIS
[145] Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[146] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[147] Github: https://github.com/microsoft/MM-REACT
[148] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[149] ViperGPT: Visual Inference via Python Execution for Reasoning: https://arxiv.org/pdf/2303.08128.pdf
[150] Github: https://github.com/cvlab-columbia/viper
[151] ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions: https://arxiv.org/pdf/2303.06594.pdf
[152] Github: https://github.com/Vision-CAIR/ChatCaptioner
[153] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[154] Github: https://github.com/microsoft/TaskMatrix
[155] Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[156] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners: https://arxiv.org/pdf/2303.02151.pdf
[157] Github: https://github.com/ZrrSkywalker/CaFo
[158] PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning: https://arxiv.org/pdf/2211.11682.pdf
[159] Github: https://github.com/yangyangyang127/PointCLIP_V2
[160] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[161] Github: https://github.com/allenai/visprog
[162] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language: https://arxiv.org/pdf/2204.00598.pdf
[163] Github: https://github.com/google-research/google-research/tree/master/socraticmodels
[164] Transfer Visual Prompt Generator across LLMs: https://arxiv.org/pdf/2305.01278.pdf
[165] Github: https://github.com/VPGTrans/VPGTrans
[166] Demo: https://3fc7715dbc44234a7f.gradio.live/
[167] GPT-4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf
[168] PaLM-E: An Embodied Multimodal Language Model: https://arxiv.org/pdf/2303.03378.pdf
[169] Demo: https://palm-e.github.io/#demo
[170] Prismer: A Vision-Language Model with Multiple Experts: https://arxiv.org/pdf/2303.02506.pdf
[171] Github: https://github.com/NVlabs/prismer
[172] Demo: https://huggingface.co/spaces/lorenmt/prismer
[173] Language Is Not All You Need: Aligning Perception with Language Models: https://arxiv.org/pdf/2302.14045.pdf
[174] Github: https://github.com/microsoft/unilm
[175] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models: https://arxiv.org/pdf/2301.12597.pdf
[176] Github: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[177] Demo: https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb
[178] VIMA: General Robot Manipulation with Multimodal Prompts: https://arxiv.org/pdf/2210.03094.pdf
[179] Github: https://github.com/vimalabs/VIMA
[180] Can Large Pre-trained Models Help Vision Models on Perception Tasks?: https://arxiv.org/pdf/2306.00693.pdf
[181] Coming soon:
[182] Contextual Object Detection with Multimodal Large Language Models: https://arxiv.org/pdf/2305.18279.pdf
[183] Github: https://github.com/yuhangzang/ContextDET
[184] Demo: https://huggingface.co/spaces/yuhangzang/ContextDet-Demo
[185] Generating Images with Multimodal Language Models: https://arxiv.org/pdf/2305.17216.pdf
[186] Github: https://github.com/kohjingyu/gill
[187] On Evaluating Adversarial Robustness of Large Vision-Language Models: https://arxiv.org/pdf/2305.16934.pdf
[188] Github: https://github.com/yunqing-me/AttackVLM
[189] Evaluating Object Hallucination in Large Vision-Language Models: https://arxiv.org/pdf/2305.10355.pdf
[190] Github: https://github.com/RUCAIBox/POPE
[191] Grounding Language Models to Images for Multimodal Inputs and Outputs: https://arxiv.org/pdf/2301.13823.pdf
[192] Github: https://github.com/kohjingyu/fromage
[193] Demo: https://huggingface.co/spaces/jykoh/fromage
[194] Microsoft COCO: Common Objects in Context: https://arxiv.org/pdf/1405.0312.pdf
[195] Im2Text: Describing Images Using 1 Million Captioned Photographs: https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
[196] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning: https://aclanthology.org/P18-1238.pdf
[197] LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs: https://arxiv.org/pdf/2111.02114.pdf
[198] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations: https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
[199] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models: https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf
[200] AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding: https://arxiv.org/pdf/1711.06475.pdf
[201] Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark: https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf
[202] Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks: https://arxiv.org/pdf/2306.04362.pdf
[203] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language: https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf
[204] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval: https://arxiv.org/pdf/2104.00650.pdf
[205] WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research: https://arxiv.org/pdf/2303.17395.pdf
[206] AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline: https://arxiv.org/pdf/1709.05522.pdf
[207] AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale: https://arxiv.org/pdf/1808.10583.pdf
[208] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[209] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[210] Link: https://github.com/lyuchenyang/Macaw-LLM/tree/main/data
[211] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[212] Link: https://github.com/OpenLAMM/LAMM#lamm-dataset
[213] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[214] Link: https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder
[215] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[216] Coming soon: https://github.com/Luodian/Otter
[217] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[218] Link: https://huggingface.co/datasets/MMInstruction/M3IT
[219] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[220] Coming soon: https://github.com/microsoft/LLaVA-Med#llava-med-dataset
[221] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[222] Link: https://github.com/StevenGrove/GPT4Tools#dataset
[223] ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[224] Coming soon: https://iva-chatbridge.github.io/
[225] DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[226] Link: https://github.com/OptimalScale/DetGPT/tree/main/dataset
[227] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[228] Coming soon: https://xiaoman-zhang.github.io/PMC-VQA/
[229] VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[230] Link: https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
[231] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[232] Link: https://github.com/phellonchen/X-LLM
[233] LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[234] Link: https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2
[235] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[236] Link: https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align
[237] Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[238] Link: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
[239] MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[240] Link: https://github.com/VT-NLP/MultiInstruct
[241] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[242] Coming soon: https://github.com/Luodian/Otter
[243] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[244] Coming soon: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[245] Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[246] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[247] Link: https://github.com/lupantech/ScienceQA#ghost-download-the-dataset
[248] IMAD: IMage-Augmented multi-modal Dialogue: https://arxiv.org/pdf/2305.10512.pdf
[249] Link: https://github.com/VityaVitalich/IMAD
[250] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[251] Link: https://github.com/OpenLAMM/LAMM#lamm-benchmark
[252] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[253] Link: https://github.com/X-PLUG/mPLUG-Owl/tree/main/OwlEval
[254] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[255] Link: https://github.com/mbzuai-oryx/Video-ChatGPT#quantitative-evaluation-bar_chart
[256] LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models: https://arxiv.org/pdf/2306.09265.pdf
[257] Link: https://github.com/OpenGVLab/Multi-Modality-Arena
[258] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[259] Link: https://drive.google.com/drive/folders/1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft
[260] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[261] Link: https://drive.google.com/drive/folders/1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0
[262] GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
