Document Intelligence Parsing Approaches: A Progress Update (OCR Pipeline, Layout+VLM, and Pure Multimodal End-to-End Parsing)


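Many new open-source document parsing projects have appeared recently, so here is another progress update. Many of the models and technical approaches mentioned below are covered in the "Document Intelligence" column.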

OCR-pipeline document parsing (layout + reading order + small expert OCR models)


MinerU1.x: https://github.com/opendatalab/MinerU
ppstructure: https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/algorithm/PP-StructureV3/PP-StructureV3.md
Docling: https://github.com/docling-project/docling
Marker: https://github.com/VikParuchuri/marker

...

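Summary: the OCR-pipeline approach is highly interpretable and closer to what gets deployed in production, but its generalization ability is limited.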
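The three stages named in this section's heading (layout, reading order, expert models) can be sketched as below. This is a minimal illustration, not the implementation of any project listed above: `Region`, `detect_layout`, `reading_order`, `EXPERTS`, and `parse_page` are all hypothetical names, the detector is a stub, the reading order is a naive top-to-bottom sort where real systems typically use a dedicated model, and the "experts" are string formatters standing in for OCR, table, and formula recognizers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    bbox: BBox
    kind: str     # assumed label set: "title", "text", "table", "formula"
    payload: str  # stand-in for the cropped image content


def detect_layout(page: List[Region]) -> List[Region]:
    """Stub layout detector: the 'page' is already a list of regions."""
    return list(page)


def reading_order(regions: List[Region]) -> List[Region]:
    """Naive single-column heuristic: sort by top edge, then left edge."""
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))


# One small "expert" per region type. Each stage can be inspected on its
# own, but a region type missing from this table cannot be handled at all.
EXPERTS: Dict[str, Callable[[Region], str]] = {
    "title": lambda r: f"# {r.payload}",
    "text": lambda r: r.payload,
    "table": lambda r: f"<table>{r.payload}</table>",
    "formula": lambda r: f"$${r.payload}$$",
}


def parse_page(page: List[Region]) -> str:
    """Run layout detection, ordering, and per-region recognition."""
    ordered = reading_order(detect_layout(page))
    return "\n\n".join(EXPERTS[r.kind](r) for r in ordered)


if __name__ == "__main__":
    demo = [
        Region((0, 100, 200, 150), "text", "Body paragraph"),
        Region((0, 0, 200, 50), "title", "Introduction"),
        Region((0, 200, 200, 260), "formula", "E=mc^2"),
    ]
    print(parse_page(demo))
```

The fixed `EXPERTS` table mirrors the trade-off in the summary: every intermediate result is easy to inspect, while an unseen region type raises a `KeyError` instead of degrading gracefully.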

Layout+VLM


MinerU2.5 (1.2B): https://github.com/opendatalab/MinerU
MonkeyOCR (1.2B~3B): https://github.com/Yuliang-Liu/MonkeyOCR
PaddleOCR-VL (0.9B): https://github.com/PaddlePaddle/PaddleOCR
chandra (8B): https://github.com/datalab-to/chandra

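Some of these pair a conventional object-detection model with a VLM that parses the content of each detected region; others use a single model for both detection and recognition.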
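The two variants differ only in which component supplies the layout boxes, which the sketch below tries to make concrete. Everything here is hypothetical and for illustration only: `parse`, `PROMPTS`, `FakeDetector`, and `FakeVLM` are invented names, the prompts are made up, and no real model from the list above is being called.

```python
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1)
Detection = Tuple[BBox, str]      # (box, region label)

# Assumed per-region prompts; real projects define their own.
PROMPTS = {
    "text": "Transcribe the text.",
    "table": "Convert the table to HTML.",
    "formula": "Transcribe the formula as LaTeX.",
}


def parse(
    page: str,
    detect: Callable[[str], List[Detection]],
    recognize: Callable[[str, BBox, str], str],
) -> List[str]:
    """Stage 1 finds regions; stage 2 recognizes each one with a typed prompt."""
    return [recognize(page, bbox, PROMPTS[label]) for bbox, label in detect(page)]


class FakeDetector:
    """Stand-in for a conventional object-detection layout model."""

    def __call__(self, page: str) -> List[Detection]:
        return [((0, 0, 100, 20), "text"), ((0, 30, 100, 90), "table")]


class FakeVLM:
    """Stand-in for a VLM that can both locate and recognize regions."""

    def detect(self, page: str) -> List[Detection]:
        return [((0, 0, 100, 20), "text"), ((0, 30, 100, 90), "table")]

    def recognize(self, page: str, bbox: BBox, prompt: str) -> str:
        return f"{prompt} @ {bbox}"


vlm = FakeVLM()
# Variant A: separate detector, VLM used only for recognition.
two_model = parse("page.png", FakeDetector(), vlm.recognize)
# Variant B: the same model handles detection and recognition.
one_model = parse("page.png", vlm.detect, vlm.recognize)
```

Keeping `detect` and `recognize` as separate callables lets the same loop cover both designs; whether a given project runs them as two passes of one model or as two separate models is something to confirm in that project's documentation.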

Multimodal end-to-end document parsing (fine-tuned)


Dolphin: https://github.com/bytedance/Dolphin
olmOCR: https://github.com/allenai/olmocr
GOT-OCR: https://github.com/Ucas-HaoranWei/GOT-OCR2.0
SmolDocling: https://huggingface.co/ds4sd/SmolDocling-256M-preview
Unstructured: https://github.com/Unstructured-IO/unstructured
OpenParse: https://github.com/Filimoa/open-parse
Mistral-OCR: https://mistral.ai/news/mistral-ocr?utm_source=ai-bot.cn
Nougat: https://github.com/facebookresearch/nougat
DeepSeek-OCR: https://github.com/deepseek-ai/DeepSeek-OCR

...

Representative general-purpose multimodal large models

GPT-4o
Gemini
Qwen2.5-VL-72B

...
