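Many new open-source document parsing projects have appeared recently, so here is an updated overview. The technical approaches behind many of the models mentioned here are covered in the Document Intelligence column (《文档智能专栏》).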
OCR-pipeline document parsing (layout + reading order + small specialist OCR models)
- MinerU1.x: https://github.com/opendatalab/MinerU
- ppstructure: https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/algorithm/PP-StructureV3/PP-StructureV3.md
- Docling: https://github.com/docling-project/docling
- Marker: https://github.com/VikParuchuri/marker
...
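Summary: OCR pipelines are highly interpretable and closer to what actually gets deployed in practice, but their generalization ability is limited.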
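The three stages named above (layout, reading order, specialist recognizers) can be sketched as follows. This is a minimal illustration and not the code of any listed project: `Region`, `reading_order`, `parse_page`, and the stub detector and experts are all hypothetical names, and the reading-order step is a naive top-to-bottom sort standing in for a real reading-order model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    kind: str          # e.g. "text", "table", "formula"
    box: Box
    content: str = ""


def reading_order(regions: List[Region]) -> List[Region]:
    # Naive single-column heuristic: top-to-bottom, then left-to-right.
    # Assumption: a real pipeline would use a dedicated reading-order model.
    return sorted(regions, key=lambda r: (r.box[1], r.box[0]))


def parse_page(
    page: object,
    detect_layout: Callable[[object], List[Region]],
    experts: Dict[str, Callable[[object, Box], str]],
) -> List[Region]:
    # 1) layout detection, 2) reading order, 3) one specialist per region type.
    regions = reading_order(detect_layout(page))
    for region in regions:
        region.content = experts[region.kind](page, region.box)
    return regions


# Hypothetical stand-ins so the sketch runs without any model weights.
def fake_layout(page: object) -> List[Region]:
    return [
        Region("table", (0, 200, 100, 300)),
        Region("text", (0, 0, 100, 50)),
    ]


fake_experts = {
    "text": lambda page, box: f"text@{box}",
    "table": lambda page, box: f"table@{box}",
}

result = parse_page(None, fake_layout, fake_experts)
print([r.kind for r in result])  # ['text', 'table']
```

Because each stage is a separate component, a failure can be traced to one stage, which is where the interpretability comes from. The fixed set of region types in `experts` also hints at why generalization is limited: a region type without a specialist cannot be handled.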
Layout+VLM
- MinerU2.5(1.2B): https://github.com/opendatalab/MinerU
- MonkeyOCR(1.2B~3B): https://github.com/Yuliang-Liu/MonkeyOCR
- PaddleOCR-VL(0.9B): https://github.com/PaddlePaddle/PaddleOCR
- chandra(8B): https://github.com/datalab-to/chandra
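Some of these pair a traditional object-detection model with a VLM that parses the content of each region; others handle both detection and recognition in a single model.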
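The first variant (a detector followed by a VLM) might look roughly like the sketch below. Everything here is an assumption made for illustration: `PROMPTS`, `parse_with_vlm`, and the stub `detect`, `crop`, and `vlm` callables are hypothetical and do not correspond to the API of any project listed above. The page is modeled as a list of text rows so the example runs without images or model weights.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)

# Hypothetical per-type prompts: one general VLM replaces several specialists.
PROMPTS: Dict[str, str] = {
    "text": "Transcribe the text in this region.",
    "table": "Convert this table to HTML.",
    "formula": "Convert this formula to LaTeX.",
}


def parse_with_vlm(
    page: Sequence[str],
    detect: Callable[[Sequence[str]], List[Tuple[str, Box]]],
    crop: Callable[[Sequence[str], Box], List[str]],
    vlm: Callable[[List[str], str], str],
) -> List[str]:
    outputs = []
    for kind, box in detect(page):
        # Unknown region types fall back to plain transcription.
        prompt = PROMPTS.get(kind, PROMPTS["text"])
        outputs.append(vlm(crop(page, box), prompt))
    return outputs


# Stubs standing in for a real detector, image cropper, and VLM.
page = ["Title     ", "a | b     ", "1 | 2     "]


def stub_detect(page: Sequence[str]) -> List[Tuple[str, Box]]:
    return [("text", (0, 0, 5, 1)), ("table", (0, 1, 5, 3))]


def stub_crop(page: Sequence[str], box: Box) -> List[str]:
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in page[y0:y1]]


def stub_vlm(image: List[str], prompt: str) -> str:
    return f"{prompt} -> {image}"


outputs = parse_with_vlm(page, stub_detect, stub_crop, stub_vlm)
for line in outputs:
    print(line)
```

In the second variant described above, `detect` and `vlm` would be the same model, so the separate detection step disappears.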
End-to-end multimodal document parsing (fine-tuned)
- Dolphin: https://github.com/bytedance/Dolphin
- olmOCR: https://github.com/allenai/olmocr
- GOT-OCR: https://github.com/Ucas-HaoranWei/GOT-OCR2.0
- SmolDocling: https://huggingface.co/ds4sd/SmolDocling-256M-preview
- Unstructured: https://github.com/Unstructured-IO/unstructured
- OpenParse: https://github.com/Filimoa/open-parse
- Mistral-OCR: https://mistral.ai/news/mistral-ocr?utm_source=ai-bot.cn
- Nougat: https://github.com/facebookresearch/nougat
- DeepSeek-OCR: https://github.com/deepseek-ai/DeepSeek-OCR
...
Representative general-purpose multimodal large models
- GPT-4o
- Gemini
- Qwen2.5-VL-72B
...
