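Picking up where the last post left off: with the open-source release approaching, more and more information keeps leaking. The weights were already downloadable, and more repositories were uploaded to HF, though all of them were deleted soon afterwards. Plenty of download mirrors can still be found online, and archived snapshots of the model card are still viewable. Today some detailed information is visible, including benchmark results, and the numbers look dominant!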
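The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a set of pretrained and instruction-tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction-tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks. All models have a 128k context length, and the data cutoff is December 2023.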
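Model developer: Meta
Model architecture: Llama 3.1 is an auto-regressive language model that uses an optimized Transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.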
| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 (text only) | A new mix of publicly available online data. | 8B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |
| | | 70B | Multilingual Text | Multilingual Text and code | 128k | Yes | | |
| | | 405B | Multilingual Text | Multilingual Text and code | 128k | Yes | | |
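Training used custom training libraries, Meta's custom-built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.
Training energy use: training used a cumulative 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, as shown in the table below. Training time is the total GPU time required to train each model, and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.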
| | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
|---|---|---|---|---|
| Llama 3.1 8B | 1.46M | 700 | 420 | 0 |
| Llama 3.1 70B | 7.0M | 700 | 2,040 | 0 |
| Llama 3.1 405B | 30.84M | 700 | 8,930 | 0 |
| Total | 39.3M | | 11,390 | 0 |
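
The totals in the energy table can be sanity-checked with a few lines of arithmetic. The sketch below is only a rough illustration: it copies the GPU hours and location-based emissions from the table, and it assumes every GPU draws its full 700 W TDP for the whole run, ignoring the power usage efficiency adjustment mentioned above. The names `runs` and `peak_energy_gwh` are made up for this example and do not come from the model card.

```python
# GPU hours and location-based emissions copied from the table above.
TDP_WATTS = 700  # peak power per H100-80GB, as stated in the model card

runs = {
    "Llama 3.1 8B": {"gpu_hours": 1.46e6, "tco2eq": 420},
    "Llama 3.1 70B": {"gpu_hours": 7.0e6, "tco2eq": 2040},
    "Llama 3.1 405B": {"gpu_hours": 30.84e6, "tco2eq": 8930},
}


def peak_energy_gwh(gpu_hours, watts=TDP_WATTS):
    """Upper-bound energy in GWh, assuming peak power draw the whole time."""
    # W * h = Wh, and 1 GWh = 1e9 Wh. The PUE adjustment is NOT applied here.
    return gpu_hours * watts / 1e9


total_hours = sum(r["gpu_hours"] for r in runs.values())
total_tco2eq = sum(r["tco2eq"] for r in runs.values())
total_gwh = peak_energy_gwh(total_hours)

for name, r in runs.items():
    print(f"{name}: {peak_energy_gwh(r['gpu_hours']):.2f} GWh at peak power")
print(f"Total: {total_hours / 1e6:.1f}M GPU hours, "
      f"{total_gwh:.2f} GWh, {total_tco2eq} t CO2eq (location-based)")
```

The per-model rows add up to the 39.3M GPU hours and 11,390 tons CO2eq shown in the Total row, and under the peak-power assumption the whole run comes to roughly 27.5 GWh.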
Benchmark scores
Base pretrained models
| Category | Benchmark | # Shots | Metric | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|---|---|---|---|---|
| General | MMLU | 5 | macro_avg/acc_char | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |
| | MMLU PRO (CoT) | 5 | macro_avg/acc_char | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |
| | AGIEval English | 3-5 | average/acc_char | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |
| | CommonSenseQA | 7 | acc_char | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |
| | Winogrande | 5 | acc_char | - | 60.5 | - | 83.3 | 86.7 |
| | BIG-Bench Hard (CoT) | 3 | average/em | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |
| | ARC-Challenge | 25 | acc_char | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |
| Knowledge reasoning | TriviaQA-Wiki | 5 | em | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |
| Reading comprehension | SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |
| | QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |
| | BoolQ | 0 | acc_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |
| | DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |
Instruction tuned models
| Category | Benchmark | # Shots | Metric | Llama 3 8B Instruct | Llama 3.1 8B Instruct | Llama 3 70B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |
|---|---|---|---|---|---|---|---|---|
| General | MMLU | 5 | macro_avg/acc | 68.5 | 69.4 | 82.0 | 83.6 | 87.3 |
| | MMLU (CoT) | 0 | macro_avg/acc | 65.3 | 73.0 | 80.9 | 86.0 | 88.6 |
| | MMLU PRO (CoT) | 5 | micro_avg/acc_char | 45.5 | 48.3 | 63.4 | 65.1 | 73.3 |
| | IFEval | | | 76.8 | 80.4 | 82.9 | 87.5 | 88.6 |
| Reasoning | ARC-C | 0 | acc | 82.4 | 83.4 | 94.4 | 94.8 | 96.9 |
| | GPQA | 0 | em | 34.6 | 30.4 | 39.5 | 41.7 | 50.7 |
| | MuSR | 0 | correct | 56.3 | 45.7 | 55.1 | 58.1 | 56.7 |
| Code | HumanEval | 0 | pass@1 | 60.4 | 72.6 | 81.7 | 80.5 | 89.0 |
| | MBPP ++ base version | 0 | pass@1 | 70.6 | 72.8 | 82.5 | 86.0 | 88.6 |
| | Multipl-E HumanEval | 0 | pass@1 | - | 50.8 | - | 65.5 | 75.2 |
| | Multipl-E MBPP | 0 | pass@1 | - | 52.4 | - | 62.0 | 65.7 |
| Math | GSM-8K (CoT) | 8 | em_maj1@1 | 80.6 | 84.5 | 93.0 | 95.1 | 96.8 |
| | MATH (CoT) | 0 | final_em | 29.1 | 51.9 | 51.0 | 68.0 | 73.8 |
| Tool Use | API-Bank | 0 | acc | 83.6 | 82.6 | 85.1 | 90.0 | 92.0 |
| | Berkeley Function Calling | 0 | acc | 76.1 | 76.1 | 83.0 | 85.1 | 88.5 |
| | Gorilla Benchmark API Bench | 0 | acc | 8.8 | 8.2 | 14.7 | 29.7 | 35.3 |
| | Nexus (0-shot) | 0 | macro_avg/acc | 37.6 | 38.5 | 47.8 | 56.7 | 58.7 |
| Multilingual | Multilingual MGSM | 8 | em | - | 68.2 | - | 85.6 | 90.3 |
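
To see where Llama 3.1 actually moves relative to Llama 3, it helps to compute the per-benchmark deltas instead of eyeballing the table. The snippet below is a small sketch over a handful of rows copied from the instruction-tuned table above. The selection of rows and the names `scores`, `deltas` and `regressions` are choices made for this example, not part of the model card.

```python
# (Llama 3 8B, Llama 3.1 8B, Llama 3 70B, Llama 3.1 70B) Instruct scores,
# copied from a few rows of the instruction-tuned table above.
scores = {
    "MMLU (CoT)": (65.3, 73.0, 80.9, 86.0),
    "IFEval": (76.8, 80.4, 82.9, 87.5),
    "GPQA": (34.6, 30.4, 39.5, 41.7),
    "MuSR": (56.3, 45.7, 55.1, 58.1),
    "HumanEval": (60.4, 72.6, 81.7, 80.5),
    "MATH (CoT)": (29.1, 51.9, 51.0, 68.0),
}


def deltas(row):
    """Return (8B change, 70B change) going from Llama 3 to Llama 3.1."""
    old_8b, new_8b, old_70b, new_70b = row
    return round(new_8b - old_8b, 1), round(new_70b - old_70b, 1)


changes = {name: deltas(row) for name, row in scores.items()}

# Collect every (benchmark, model size) pair where the score went down.
regressions = [
    (name, size)
    for name, (d_8b, d_70b) in changes.items()
    for size, d in (("8B", d_8b), ("70B", d_70b))
    if d < 0
]

for name, (d_8b, d_70b) in changes.items():
    print(f"{name:12s} 8B: {d_8b:+.1f}   70B: {d_70b:+.1f}")
print("Regressions:", regressions)
```

On these rows the biggest jump is MATH (CoT), while a few cells go down rather than up: GPQA and MuSR for the 8B model, and HumanEval for the 70B model.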
Multilingual benchmarks
| Category | Benchmark | Language | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|---|---|
| General | MMLU (5-shot, macro_avg/acc) | Portuguese | 62.12 | 80.13 | 84.95 |
| | | Spanish | 62.45 | 80.05 | 85.08 |
| | | Italian | 61.63 | 80.4 | 85.04 |
| | | German | 60.59 | 79.27 | 84.36 |
| | | French | 62.34 | 79.82 | 84.66 |
| | | Hindi | 50.88 | 74.52 | 80.31 |
| | | Thai | 50.32 | 72.95 | 78.21 |
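
One quick way to read the multilingual table is to look at each model's average and at the spread between its strongest and weakest language. The following is a minimal sketch using only the numbers above; `mmlu`, `models` and `summary` are names invented for this illustration.

```python
from statistics import mean

# Multilingual MMLU (5-shot) scores copied from the table above,
# ordered as (Llama 3.1 8B, Llama 3.1 70B, Llama 3.1 405B).
mmlu = {
    "Portuguese": (62.12, 80.13, 84.95),
    "Spanish": (62.45, 80.05, 85.08),
    "Italian": (61.63, 80.4, 85.04),
    "German": (60.59, 79.27, 84.36),
    "French": (62.34, 79.82, 84.66),
    "Hindi": (50.88, 74.52, 80.31),
    "Thai": (50.32, 72.95, 78.21),
}
models = ("Llama 3.1 8B", "Llama 3.1 70B", "Llama 3.1 405B")

summary = {}
for i, model in enumerate(models):
    column = {lang: row[i] for lang, row in mmlu.items()}
    best = max(column, key=column.get)
    worst = min(column, key=column.get)
    summary[model] = {
        "mean": round(mean(column.values()), 2),
        "best": best,
        "worst": worst,
        "gap": round(column[best] - column[worst], 2),
    }

for model, s in summary.items():
    print(f"{model}: mean {s['mean']}, best {s['best']}, "
          f"worst {s['worst']}, gap {s['gap']}")
```

In these numbers Thai is the weakest language at every size, and the gap between the best and worst language shrinks as the model gets larger.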
Full details: https://web.archive.org/web/20240722214257/https://huggingface.co/huggingface-test1/test-model-1