Llama 405B approaches GPT-4o; the distilled 8B and 70B models improve sharply!

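Picking up where the last post left off: with the open-source release imminent, more and more information has leaked. The weights became downloadable, and additional repositories were uploaded to Hugging Face (HF) and then deleted. Many download mirrors can still be found online, and an archived snapshot of the model card is still viewable. Some details are now available, including benchmark results, and the numbers look very strong.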

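The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a set of pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in/text out). The Llama 3.1 instruction-tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many available open-source and closed chat models on common industry benchmarks. All models have a 128k context length, and the training data cutoff is December 2023.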

Model developer: Meta

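Model architecture: Llama 3.1 is an autoregressive language model that uses an optimized Transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.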


Model	Training Data	Params	Input modalities	Output modalities	Context length	GQA	Token count	Knowledge cutoff
Llama 3.1 (text only)	A new mix of publicly available online data.	8B	Multilingual Text	Multilingual Text and code	128k	Yes	15T+	December 2023
Llama 3.1 (text only)	A new mix of publicly available online data.	70B	Multilingual Text	Multilingual Text and code	128k	Yes	15T+	December 2023
Llama 3.1 (text only)	A new mix of publicly available online data.	405B	Multilingual Text	Multilingual Text and code	128k	Yes	15T+	December 2023

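Training used custom training libraries, Meta's custom-built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.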

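Training energy use: training used a cumulative 39.3M GPU hours of computation on H100-80GB hardware (TDP of 700W), as shown in the table below. Training time is the total GPU time required to train each model, and power consumption is the peak power capacity of each GPU device used, adjusted for power usage efficiency.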


Model	Training Time (GPU hours)	Training Power Consumption (W)	Training Location-Based Greenhouse Gas Emissions (tons CO2eq)	Training Market-Based Greenhouse Gas Emissions (tons CO2eq)
Llama 3.1 8B	1.46M	700	420	0
Llama 3.1 70B	7.0M	700	2,040	0
Llama 3.1 405B	30.84M	700	8,930	0
Total	39.3M	-	11,390	0
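
As a quick sanity check on the table above, the following Python sketch re-adds the per-model GPU hours and emissions and derives a rough energy figure. The energy estimate is an assumption, not a number from the model card: it supposes every GPU draws its full 700 W for every hour and ignores both real utilization and power usage efficiency. The name `peak_energy_gwh` is hypothetical.

```python
# GPU hours (in millions) and location-based emissions (tons CO2eq),
# copied from the training table above.
rows = {
    "Llama 3.1 8B": (1.46, 420),
    "Llama 3.1 70B": (7.0, 2040),
    "Llama 3.1 405B": (30.84, 8930),
}
GPU_PEAK_WATTS = 700  # H100-80GB TDP as listed in the table

total_gpu_hours_m = sum(hours for hours, _ in rows.values())
total_emissions = sum(tons for _, tons in rows.values())


def peak_energy_gwh(gpu_hours_millions, watts=GPU_PEAK_WATTS):
    """Rough energy in GWh, ASSUMING constant draw at peak power.

    hours * watts gives watt-hours; dividing by 1e9 converts to GWh.
    """
    return gpu_hours_millions * 1e6 * watts / 1e9


if __name__ == "__main__":
    print(f"Total GPU hours: {total_gpu_hours_m:.2f}M")
    print(f"Total location-based emissions: {total_emissions} tons CO2eq")
    print(f"Peak-power energy estimate: {peak_energy_gwh(total_gpu_hours_m):.2f} GWh")
    for name, (hours, _) in rows.items():
        share = hours / total_gpu_hours_m
        print(f"{name}: {share:.1%} of GPU hours")
```

The sums match the Total row, and the 405B model accounts for the large majority of the compute.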

Benchmark scores

Base pretrained models


Category	Benchmark	# Shots	Metric	Llama 3 8B	Llama 3.1 8B	Llama 3 70B	Llama 3.1 70B	Llama 3.1 405B
General	MMLU	5	macro_avg/acc_char	66.7	66.7	79.5	79.3	85.2
General	MMLU PRO (CoT)	5	macro_avg/acc_char	36.2	37.1	55.0	53.8	61.6
General	AGIEval English	3-5	average/acc_char	47.1	47.8	63.0	64.6	71.6
General	CommonSenseQA	7	acc_char	72.6	75.0	83.8	84.1	85.8
General	Winogrande	5	acc_char	-	60.5	-	83.3	86.7
General	BIG-Bench Hard (CoT)	3	average/em	61.1	64.2	81.3	81.6	85.9
General	ARC-Challenge	25	acc_char	79.4	79.7	93.1	92.9	96.1
Knowledge reasoning	TriviaQA-Wiki	5	em	78.5	77.6	89.7	89.8	91.8
Reading comprehension	SQuAD	1	em	76.4	77.0	85.6	81.8	89.3
Reading comprehension	QuAC (F1)	1	f1	44.4	44.9	51.1	51.1	53.6
Reading comprehension	BoolQ	0	acc_char	75.7	75.0	79.0	79.4	80.0
Reading comprehension	DROP (F1)	3	f1	58.4	59.5	79.7	79.6	84.8

Instruction tuned models


Category	Benchmark	# Shots	Metric	Llama 3 8B Instruct	Llama 3.1 8B Instruct	Llama 3 70B Instruct	Llama 3.1 70B Instruct	Llama 3.1 405B Instruct
General	MMLU	5	macro_avg/acc	68.5	69.4	82.0	83.6	87.3
General	MMLU (CoT)	0	macro_avg/acc	65.3	73.0	80.9	86.0	88.6
General	MMLU PRO (CoT)	5	micro_avg/acc_char	45.5	48.3	63.4	65.1	73.3
General	IFEval	-	-	76.8	80.4	82.9	87.5	88.6
Reasoning	ARC-C	0	acc	82.4	83.4	94.4	94.8	96.9
Reasoning	GPQA	0	em	34.6	30.4	39.5	41.7	50.7
Reasoning	MuSR	0	correct	56.3	45.7	55.1	58.1	56.7
Code	HumanEval	0	pass@1	60.4	72.6	81.7	80.5	89.0
Code	MBPP ++ base version	0	pass@1	70.6	72.8	82.5	86.0	88.6
Code	Multipl-E HumanEval	0	pass@1	-	50.8	-	65.5	75.2
Code	Multipl-E MBPP	0	pass@1	-	52.4	-	62.0	65.7
Math	GSM-8K (CoT)	8	em_maj1@1	80.6	84.5	93.0	95.1	96.8
Math	MATH (CoT)	0	final_em	29.1	51.9	51.0	68.0	73.8
Tool Use	API-Bank	0	acc	83.6	82.6	85.1	90.0	92.0
Tool Use	Berkeley Function Calling	0	acc	76.1	76.1	83.0	85.1	88.5
Tool Use	Gorilla Benchmark API Bench	0	acc	8.8	8.2	14.7	29.7	35.3
Tool Use	Nexus (0-shot)	0	macro_avg/acc	37.6	38.5	47.8	56.7	58.7
Multilingual	Multilingual MGSM	8	em	-	68.2	-	85.6	90.3
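
To see where the 8B and 70B models actually moved between Llama 3 and Llama 3.1, a small Python sketch like the one below can compute the per-benchmark differences. It uses a hand-picked subset of rows from the instruction-tuned table above; the selection and the helper name `deltas` are illustrative assumptions, not part of the model card.

```python
# Scores copied from the instruction-tuned table above, in the order:
# (Llama 3 8B, Llama 3.1 8B, Llama 3 70B, Llama 3.1 70B).
scores = {
    "MMLU (CoT)": (65.3, 73.0, 80.9, 86.0),
    "GPQA": (34.6, 30.4, 39.5, 41.7),
    "HumanEval": (60.4, 72.6, 81.7, 80.5),
    "GSM-8K (CoT)": (80.6, 84.5, 93.0, 95.1),
    "MATH (CoT)": (29.1, 51.9, 51.0, 68.0),
}


def deltas(table):
    """Return {benchmark: (8B change, 70B change)} in points, 3.1 minus 3."""
    result = {}
    for name, (old_8b, new_8b, old_70b, new_70b) in table.items():
        result[name] = (round(new_8b - old_8b, 1), round(new_70b - old_70b, 1))
    return result


if __name__ == "__main__":
    for name, (d8, d70) in deltas(scores).items():
        print(f"{name:14s} 8B: {d8:+.1f}  70B: {d70:+.1f}")
```

On this subset the largest gains are on MATH (CoT), while a few entries move down, such as GPQA for the 8B model and HumanEval for the 70B model, so the improvement is large but not uniform.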

Multilingual benchmarks


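Category	Benchmark	Language	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B
General	MMLU (5-shot, macro_avg/acc)	Portuguese	62.12	80.13	84.95
General	MMLU (5-shot, macro_avg/acc)	Spanish	62.45	80.05	85.08
General	MMLU (5-shot, macro_avg/acc)	Italian	61.63	80.4	85.04
General	MMLU (5-shot, macro_avg/acc)	German	60.59	79.27	84.36
General	MMLU (5-shot, macro_avg/acc)	French	62.34	79.82	84.66
General	MMLU (5-shot, macro_avg/acc)	Hindi	50.88	74.52	80.31
General	MMLU (5-shot, macro_avg/acc)	Thai	50.32	72.95	78.21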

Details: https://web.archive.org/web/20240722214257/https://huggingface.co/huggingface-test1/test-model-1
