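In recent years, ByteDance's infrastructure team has invested steadily in AI for Infra/System, aiming to use AI techniques to optimize cloud computing systems, and has achieved notable results. Only four months into 2025, the infrastructure ByteBrain team has already had 11 papers published or accepted at top venues in the AI for Infra field: 10 at CCF-A conferences (SIGMOD x3, VLDB x4, and one each at EuroSys, FSE, and WWW) and 1 at ICLR (ICLR is not yet on the CCF list, but it is widely recognized as one of the three top machine learning conferences).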
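Academic papers are only a by-product of the ByteBrain team's work; in industry, what matters most is business impact. ByteBrain uses large language models (LLMs) to improve the stability of Volcano Engine, raising efficiency on critical on-call work by 26%, and applies operations research optimization algorithms to system costs, saving more than 1 billion RMB over the past three years. Beyond that, ByteBrain has made solid progress in anomaly detection, root cause analysis, AI for DB, DB for AI, Text2SQL, LLM multi-agent systems, and other directions. For example, the team applied a pre-trained language model to NDV (Number of Distinct Values) prediction, which makes it possible to estimate NDV without sampling the data. This is the first work in the field to estimate NDV with a language model, and it works out of the box without access to the raw data. The results were published at SIGMOD 2025 and are being integrated into production.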
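In the AI era, ByteDance has applied large models and related technologies at scale to the optimization of cloud computing and IT infrastructure, and is glad to share its latest research with the open-source community and at top academic conferences (see the appendix of this article). The publication of these results also indicates that ByteDance is at the forefront of AI for Infra.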
About the ByteDance ByteBrain Team
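ByteBrain is ByteDance's AI for Infra service platform. It aims to use AI, in particular machine learning, large models, and operations research optimization, to automatically optimize infrastructure and systems across their full lifecycle. Optimization targets include databases, storage, big data systems, virtual machines, containers, networking, operations, and stability. ByteBrain focuses on four main directions: AIOps, AI4DB, operations research optimization, and LLM4Infra. Its functional modules include capacity planning, resource scheduling, system parameter tuning, anomaly detection, root cause analysis, slow SQL optimization, Text2SQL, and LLM-Agent.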
The ByteBrain team is hiring researchers and interns in these areas. Contact: tieying.zhang@bytedance.com
ByteBrain team academic papers as of April 2025 (* corresponding author):
- PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models: https://arxiv.org/pdf/2504.00608
SIGMOD, 2025
Xianghong Xu, Xiao He, Tieying Zhang*, Rui Shi, Lei Zhang, Jianjun Chen
- AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators: https://arxiv.org/pdf/2502.16190
VLDB, 2025
Xianghong Xu, Tieying Zhang*, Xiao He, Haoyang Li, Rong Kang, Shuai Wang, Linhui Xu, Zhimin Liang, Shangyu Luo, Lei Zhang, Jianjun Chen
- Adaptive and Efficient Log Parsing as a Cloud Service: https://www.arxiv.org/pdf/2504.09113
SIGMOD, 2025
Zeyan Li, Jie Song, Tieying Zhang*, Tao Yang, Yingjie Ye, Pengfei Duan, Jianjun Chen
- Data-Agnostic Cardinality Learning from Imperfect Workloads
VLDB, 2025
Peizhi Wu, Rong Kang, Tieying Zhang*, Jianjun Chen, Ryan Marcus, Zachary G. Ives
- TickIt: Leveraging Large Language Models for Automated Ticket Escalation: https://arxiv.org/pdf/2504.08475
FSE, 2025
Fengrui Liu, Xiao He, Tieying Zhang*, Jianjun Chen, Yi Li, Lihua Yi, Haipeng Zhang, Gang Wu, Rui Shi
- ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning: https://arxiv.org/pdf/2412.03104
VLDB, 2025
Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang*, Jianjun Chen, Rui Shi, Dan Pei*
- Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis: https://www.arxiv.org/pdf/2502.08224
WWW, 2025
Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang*, Jianjun Chen, Jianhui Li*, Gaogang Xie, Dan Pei
- E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model: https://arxiv.org/pdf/2404.11581
VLDB, 2025
Xinmei Huang, Haoyang Li, Jing Zhang*, Xinxin Zhao, Zhiming Yao, Yiyan Li, Tieying Zhang*, Jianjun Chen, Hong Chen, Cuiping Li
- Learning to Communicate Through Implicit Communication Channels: https://arxiv.org/pdf/2411.01553
ICLR, 2025
Han Wang, Binbin Chen, Tieying Zhang, Baoxiang Wang
- ABase: The Multi-Tenant NoSQL Serverless Database for Diverse and Dynamic Workloads in Large-scale Cloud Environments
SIGMOD, 2025
Rong Kang, Yanbin Chen, Ye Liu, Fuxin Jiang, Qingshuo Li, Miao Ma, Jian Liu, Guangling Zhao, Tieying Zhang, Jianjun Chen, Lei Zhang
- Towards VM Rescheduling Optimization Through Deep Reinforcement Learning: https://drive.google.com/file/d/1mKMh0HUMSu1JsUhtbck4pnZgO11VBopJ/view
EuroSys, 2025
Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang*, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du*