A Fairly Polished Open-Source Tool for Layout-Preserving Paper Translation

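Basic information:

Source code: GitHub - Byaidu/PDFMathTranslate (AI-based full-text bilingual translation of PDF documents that fully preserves layout; supports Google/DeepL/Ollama/OpenAI and other services): https://github.com/Byaidu/PDFMathTranslate. The repository already has 24K stars.
Related model: wybxc/DocLayout-YOLO-DocStructBench-onnx is mentioned in the documentation, but it did not appear to be used in the runs described in this article.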

Deployment

Local installation

Using an isolated conda environment as an example:

  
conda create -n pdf2zh python=3.12  
conda activate pdf2zh  
pip install pdf2zh

To install from source instead:

  
git clone https://github.com/Byaidu/PDFMathTranslate.git && cd PDFMathTranslate  
pip install .

After installation, check the versions:

  
➜  pip list | grep pdf  
gradio_pdf                     0.0.22  
pdf2zh                         1.9.10  
pdfminer.six                   20250416  
pikepdf                        9.8.1

Container deployment

Start the container:

  
docker run -itd --name pdf2zh  \  
  -v /data/ai/workspace:/workspace \  
  -p 7860:7860  byaidu/pdf2zh:latest  
  
# Enter the container  
docker exec -it pdf2zh bash  
root@9fc66b81e22b:/app#

The default configuration file in a freshly started container:

  
root@9fc66b81e22b:/app# cat ~/.config/PDFMathTranslate/config.json  
{  
    "PDF2ZH_LANG_FROM": "English",  
    "PDF2ZH_LANG_TO": "Simplified Chinese",  
    "PDF2ZH_VFONT": ""  
}

This file changes as the tool is used: it stores the configuration chosen during use.
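Because the file changes over time, it can be handy to inspect it from a script. The following is a minimal sketch, assuming the path and the key names shown in the container session above; load_config and summarize are hypothetical helper names and are not part of pdf2zh.

```python
import json
from pathlib import Path

# Path as seen in the container session above; adjust it if your install differs.
CONFIG_PATH = Path.home() / ".config" / "PDFMathTranslate" / "config.json"


def load_config(path=CONFIG_PATH):
    """Return the parsed config, or an empty dict if the file does not exist yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize(config):
    """Pick out the two language settings present in the default file."""
    return {
        "from": config.get("PDF2ZH_LANG_FROM"),
        "to": config.get("PDF2ZH_LANG_TO"),
    }


if __name__ == "__main__":
    print(summarize(load_config()))
```

Running it before and after a translation session is a simple way to see which settings the tool has written back.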

Kubernetes (k8s) deployment

The key part of the k8s YAML:

  
    spec:  
      containers:  
        - name: pdf2zh  
          image: byaidu/pdf2zh:latest  
          ports:  
            - containerPort: 7860  
          volumeMounts:  
            - name: workspace-volume  
              mountPath: /workspace  
          command: ["sh", "-c", "pdf2zh -i"]

Usage

Command-line usage

Translate by calling deepseek-r1 through the official DeepSeek API:

  
DEEPSEEK_API_KEY=sk-xxx \  
DEEPSEEK_MODEL=deepseek-reasoner \  
pdf2zh -s deepseek -p 1 "/path/to/2504.16084v2.pdf"

Translate with a locally deployed Qwen3-32B-AWQ:

  
OPENAILIKED_BASE_URL=http://10.96.0.180:7869/v1 \  
OPENAILIKED_API_KEY=sk-xxx \  
OPENAILIKED_MODEL=qwen3 \  
pdf2zh -s openailiked /path/to/2504.16084v2.pdf

When calling a local model, the inference server's log shows the prompt that is sent to the LLM:

INFO 06-10 09:27:05 [logger.py:39] Received request chatcmpl-39a65288-24af-96ab-bb73-304408bed474: prompt: '<|im_start|>user\nYou are a professional, authentic machine translation engine. Only Output the translated text, do not include any other text.\n\nTranslate the following markdown source text to zh. Keep the formula notation {v*} unchanged. Output translation directly without any additional text.\n\nSource Text: This paper investigates Reinforcement Learning (RL) on data {v19} ex- plicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learn- ing (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approxi- mately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demon- strated performance to consistently surpass the upper limit of the initial model maj@n, and approach the performance of models trained directly on test data with ground-truth labels. 
Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL’s potential for broader tasks and domains.\n\nTranslated Text:<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16234, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, pr
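The structure of that request is easier to see when rebuilt by hand. The sketch below reconstructs the prompt text purely from the log above and wraps it in an OpenAI-style chat payload; it is not pdf2zh's actual code, and PROMPT_TEMPLATE and build_request are hypothetical names. The model name qwen3 and temperature 0.0 are taken from the command and the log shown earlier.

```python
# Prompt wording copied from the logged request; {{v*}} renders as the literal {v*}.
PROMPT_TEMPLATE = (
    "You are a professional, authentic machine translation engine. "
    "Only Output the translated text, do not include any other text.\n\n"
    "Translate the following markdown source text to {lang_out}. "
    "Keep the formula notation {{v*}} unchanged. "
    "Output translation directly without any additional text.\n\n"
    "Source Text: {text}\n\nTranslated Text:"
)


def build_request(text, lang_out="zh", model="qwen3"):
    """Build an OpenAI-style chat payload with a single user message."""
    content = PROMPT_TEMPLATE.format(lang_out=lang_out, text=text)
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.0,  # matches temperature=0.0 in the logged SamplingParams
    }


if __name__ == "__main__":
    req = build_request("RL on data {v19} ex- plicit labels")
    print(req["messages"][0]["content"])
```

Note that placeholders such as {v19} inside the source text pass through unchanged, since they are supplied as an argument rather than as part of the template.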

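Web UI usage

Running pdf2zh -i starts the web UI service. Open the corresponding address to see a simple interface, where you can upload a PDF file and choose a Service, that is, which LLM inference service performs the translation. For example, to use DeepSeek's V3 model, fill in the following: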

picture.image

Comparing results

Original:

picture.image

Result with deepseek-chat (which maps to DeepSeek-V3-0324):

picture.image

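The translation produced directly by fast thinking is still not very good. For example, TTS (Test-Time Scaling) was translated as “测试时间缩放”, which treats "time" as a quantity being scaled rather than meaning "at test time".

Result with deepseek-reasoner (which maps to DeepSeek-R1-0528):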

picture.image

With R1's slow-thinking reasoning, the translation quality improves noticeably.

Result with Qwen3-32B-AWQ deployed on two 4090 GPUs:

picture.image

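This result looks even better than the DeepSeek-R1 translation. Its flaw is that pass@1 was not kept as a term: it was translated as 通过率@1, which is not quite appropriate.

The translations above share one problem: the word "without" was not translated and was instead treated as if it were a dataset name. This is not the LLM's fault; it is an issue in PDFMathTranslate. In the original PDF, "without" is set in italics for emphasis, which makes the program treat it as an image element. The prompt sent to the LLM therefore skips the word and substitutes a placeholder, {v19} (the prompt itself calls these placeholders "formula notation"). What is actually sent for this sentence is: This paper investigates Reinforcement Learning (RL) on data {v19} ex- plicit labels for reasoning tasks in Large Language Models (LLMs). The prompt tells the LLM to leave the placeholder untouched, and afterwards the placeholder in the translated result is swapped back for the original "without" element. That is why every model's final translation went wrong here. It is clearly a case of being too clever by half.

There are two ways to read this. Either the tool tries too hard: if it did not analyze the page at such fine granularity and did not single out special elements inside paragraph text, the result would probably be fine. ByteDance's recently released Dolphin model, for example, ignores this difference and extracts the whole paragraph as uniform text. Or the tool does not go far enough: when a span is recognized as a special font or element but still contains text, it could be fed back recursively for translation, with the formatting reapplied to the result. That would be closer to perfect, but also considerably more complex to implement.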
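The failure mode is easy to reproduce in miniature. The sketch below is a toy illustration of mask-then-restore placeholder handling as described above, not PDFMathTranslate's real implementation; mask, restore, and the segment format are hypothetical, and the "translation" is a hand-written stand-in for an LLM response.

```python
import re


def mask(segments, start=0):
    """Replace special segments with {vN} placeholders.

    segments is a list of (text, is_special) pairs.
    Returns the masked text and a placeholder -> original text table.
    """
    parts, table = [], {}
    idx = start
    for text, is_special in segments:
        if is_special:
            key = f"{{v{idx}}}"  # e.g. "{v19}"
            table[key] = text
            parts.append(key)
            idx += 1
        else:
            parts.append(text)
    return "".join(parts), table


def restore(translated, table):
    """Put the original elements back; unknown placeholders are left as-is."""
    return re.sub(r"\{v\d+\}", lambda m: table.get(m.group(0), m.group(0)), translated)


# The italic word is flagged as special, so the model never sees it.
segments = [("RL on data ", False), ("without", True), (" explicit labels", False)]
masked, table = mask(segments, start=19)
print(masked)  # RL on data {v19} explicit labels

# Hand-written stand-in for a model output that kept the placeholder.
fake_translation = "在{v19}显式标签的数据上进行强化学习"
print(restore(fake_translation, table))  # 在without显式标签的数据上进行强化学习
```

The restored sentence contains the untranslated English word, and because the model never saw "without", it also had no way to know the sentence was a negation.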

Sending a screenshot to Yuanbao, which recognizes the text and then calls R1 to translate it, gives the following result:

摘要

本文研究在大型语言模型（LLMs）的推理任务中，对没有明确标签数据进行强化学习（Reinforcement Learning, RL）的方法。该问题的核心挑战在于推理过程中，在无法获取真实答案（ground-truth）信息的情况下进行奖励估计（reward estimation）。尽管这一场景似乎困难重重，但我们发现，测试时缩放（Test-Time Scaling, TTS）中常见的做法（例如多数投票（majority voting））能够产生令人惊讶的有效奖励（reward），这些奖励足以驱动强化学习的训练。

在这项工作中，我们提出了测试时强化学习（Test-Time Reinforcement Learning, TTRL），这是一种利用未标注数据对大型语言模型进行强化学习的新方法。TTRL通过利用预训练模型中的先验知识，实现了大型语言模型的自我进化（self-evolution）。

我们的实验表明，TTRL在各种任务和模型上都能持续提升性能。值得注意的是，TTRL仅使用未标注的测试数据，就将 Qwen-2.5-Math-7B 模型在 AIME 2024 竞赛上的 pass@1 性能提升了约 211%。此外，尽管TTRL 仅由 maj@n 指标作为监督信号，但它展现的性能能够持续超越初始模型 maj@n 指标的上限，并接近直接使用带真实标签测试数据训练模型的性能。

我们的实验结果验证了 TTRL 在各种任务中普遍有效性，并凸显了 TTRL 在更广泛任务和领域中应用的潜力。

This translation is the more accurate one.

Now a layout example:

Original:

picture.image

Result with Qwen3-32B-AWQ:

picture.image

The Chinese and English layouts stay essentially the same, so the tool is already quite close to product quality.