The best-known dataset in the world may well be ImageNet, created in 2009 by Fei-Fei Li, the "godmother of AI". The project, and the world-class visual recognition challenge later built on it (ILSVRC), greatly accelerated the development of AI, witnessing and validating its progress on visual recognition tasks from barely passable to beyond human level. In this post we train a recent SOTA CNN model on ImageNet data with a single RTX 4090 and see how well it does.
There are several imagenet datasets on HuggingFace. The official one is ILSVRC/imagenet-1k. From its description:
ILSVRC 2012, commonly known as "ImageNet", is an image dataset organized according to the WordNet hierarchy. In WordNet, each meaningful concept, possibly described by multiple words or phrases, is called a "synonym set", or "synset" for short. WordNet contains more than 100,000 synsets, the majority of which are nouns (80,000+). ImageNet aims to provide an average of about 1,000 images to illustrate each synset. Images of every concept are quality-controlled and human-annotated.
The official dataset is fairly large at 200+ GB. To speed up training and keep costs down, we use the timm/mini-imagenet dataset instead: a slimmed-down version of ImageNet-1k with 100 classes (100 of the original 1,000). Unlike other "mini" variants, it keeps the original image resolution (many comparable datasets downsample images to 84x84 or other low resolutions).
Data splits:
- Training set: 50,000 samples (from the ImageNet-1k train split)
- Validation set: 10,000 samples (from the ImageNet-1k train split)
- Test set: 5,000 samples (from the ImageNet-1k validation split, 50 per class)
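As a quick sanity check of these splits, one can load them with the datasets library and print their sizes (a minimal sketch; assumes the Hub is reachable or the data is already cached locally):
from datasets import load_dataset

# Load all three splits of timm/mini-imagenet and print their sizes
ds = load_dataset("timm/mini-imagenet")
for split in ("train", "validation", "test"):
    print(split, len(ds[split]))
# expected: train 50000, validation 10000, test 5000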
Download the dataset:
nohup huggingface-cli download --resume-download \
--repo-type dataset timm/mini-imagenet \
--local-dir timm/mini-imagenet \
> download.log 2>&1 &
Once the download completes, the dataset directory looks like this:
mini-imagenet/
├── data
│ ├── test-00000-of-00002.parquet
│ ├── test-00001-of-00002.parquet
│ ├── train-00000-of-00013.parquet
│ ├── train-00001-of-00013.parquet
│ ├── train-00002-of-00013.parquet
│ ├── train-00003-of-00013.parquet
│ ├── train-00004-of-00013.parquet
│ ├── train-00005-of-00013.parquet
│ ├── train-00006-of-00013.parquet
│ ├── train-00007-of-00013.parquet
│ ├── train-00008-of-00013.parquet
│ ├── train-00009-of-00013.parquet
│ ├── train-00010-of-00013.parquet
│ ├── train-00011-of-00013.parquet
│ ├── train-00012-of-00013.parquet
│ ├── validation-00000-of-00003.parquet
│ ├── validation-00001-of-00003.parquet
│ └── validation-00002-of-00003.parquet
└── README.md
It is only 7 GB in total:
# du -sh mini-imagenet/
7.0G mini-imagenet/
Inspect the metadata of one of the parquet files:
# parquet-tools inspect /data/ai/datasets/timm/mini-imagenet/data/train-00000-of-00013.parquet
############ file meta data ############
created_by: parquet-cpp-arrow version 17.0.0
num_columns: 3
num_rows: 3847
num_row_groups: 39
format_version: 2.6
serialized_size: 16326
############ Columns ############
bytes
path
label
############ Column(bytes) ############
name: bytes
path: image.bytes
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)
############ Column(path) ############
name: path
path: image.path
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 66%)
############ Column(label) ############
name: label
path: label
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)
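To confirm that the image.bytes column holds encoded image files rather than raw pixels, one can decode the first row of a shard; a minimal sketch using the same file as the inspection above:
import io
import pyarrow.parquet as pq
from PIL import Image

# Read one shard and flatten the nested `image` struct into image.bytes / image.path
table = pq.read_table("/data/ai/datasets/timm/mini-imagenet/data/train-00000-of-00013.parquet").flatten()
columns = table.to_pydict()

# Decode the first image and show its format, size, and label
img = Image.open(io.BytesIO(columns["image.bytes"][0]))
print(img.format, img.size, "label:", columns["label"][0])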
Start and enter the container:
docker run --name train -itd \
--gpus '"device=4"' \
--shm-size=128gb \
-e LANG=C.UTF-8 -e LC_ALL=C.UTF-8 \
-v /data/ai/datasets:/datasets \
-v /data/ai/workspace/train:/workspace \
pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel bash
docker exec -it train bash
Install the dependencies inside the container:
root@cb82d3b9af0a:/workspace# export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/
root@cb82d3b9af0a:/workspace# pip install timm datasets tensorboard
Below is the complete code for training a model based on ConvNeXt, a recent SOTA CNN architecture, on the mini-ImageNet data with a 24 GB RTX 4090:
import io
import time
from datetime import datetime

import pyarrow.parquet as pq
import torch
import torch.nn as nn
from torch.amp import autocast, GradScaler
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import transforms
from PIL import Image
from tqdm.auto import tqdm

from timm.models import convnext_base
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from timm.optim import AdamP
from timm.scheduler import CosineLRScheduler

# TensorBoard writer for logging training curves
writer = SummaryWriter(log_dir='runs/convnext-' + datetime.now().strftime("%m%d%H%M"))
class ParquetImageDataset(Dataset):
    """Loads encoded images and labels from a list of parquet shards into memory."""
    def __init__(self, parquet_files, transform=None):
        self.samples = []
        for file in parquet_files:
            table = pq.read_table(file).flatten()  # flatten the nested image struct
            columns = table.to_pydict()
            self.samples.extend(zip(columns['image.bytes'], columns['label']))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_bytes, label = self.samples[idx]
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, label
# Training configuration
config = {
    'epochs': 300,
    'batch_size': 160,    # fits in the 4090's memory with fp16
    'lr': 4e-3,
    'weight_decay': 0.05,
    'min_lr': 2e-6,
    'warmup_epochs': 5,
    'warmup_lr': 1e-6,
    'num_classes': 1000,  # one-hot size for Mixup; matches the model's default 1000-way head
    'clip_grad_norm': 1.0,
    'amp': True,          # Automatic Mixed Precision: mixed FP16/FP32 compute, faster and lighter on memory, at some risk of numeric instability
    'num_workers': 8,
    'pin_memory': True,
    'drop_path_rate': 0.2,
    'data_path': '/datasets/timm/mini-imagenet/data'
}
# Select the GPU (single machine, single card)
torch.cuda.set_device(0)
device = torch.device('cuda:0')
# Data augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
# Locate the dataset shards
train_files = [f"{config['data_path']}/train-{i:05d}-of-00013.parquet" for i in range(13)]
val_files = [f"{config['data_path']}/validation-{i:05d}-of-00003.parquet" for i in range(3)]
# Build the Dataset and DataLoader objects
train_dataset = ParquetImageDataset(train_files, transform=train_transform)
val_dataset = ParquetImageDataset(val_files, transform=val_transform)
train_loader = DataLoader(
    train_dataset, batch_size=config['batch_size'], shuffle=True,
    persistent_workers=True, prefetch_factor=6,
    num_workers=config['num_workers'], pin_memory=config['pin_memory']
)
val_loader = DataLoader(
    val_dataset, batch_size=config['batch_size'], shuffle=False,
    persistent_workers=True, prefetch_factor=6,
    num_workers=config['num_workers'], pin_memory=config['pin_memory']
)
# Mixed-precision GradScaler (the torch.amp API available since PyTorch 2.4)
scaler = GradScaler(device='cuda', enabled=config['amp'])
# Enable cuDNN benchmark so cuDNN auto-selects the fastest convolution
# algorithms, speeding up the forward/backward passes
torch.backends.cudnn.benchmark = True
# MixUp augmentation: blends pairs of samples and their labels by linear interpolation
mixup_fn = Mixup(
    mixup_alpha=0.8,
    cutmix_alpha=1.0,
    prob=1.0,             # apply MixUp/CutMix to every batch (100% probability)
    switch_prob=0.5,      # switch randomly between MixUp and CutMix
    mode='batch',
    label_smoothing=0.1,  # epsilon-smooth the soft labels to avoid hard 0/1 probabilities
    num_classes=config['num_classes']
)
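# Conceptually, in 'batch' mode Mixup draws lam ~ Beta(mixup_alpha, mixup_alpha) and
# mixes each sample with its counterpart in the flipped batch (a sketch, not timm's exact code):
#   mixed_x = lam * x + (1 - lam) * x.flip(0)
#   mixed_y = lam * one_hot(y) + (1 - lam) * one_hot(y).flip(0)  # then label-smoothed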
# Build the model
model = convnext_base(
    pretrained=False,
    drop_path_rate=config['drop_path_rate']
)
model = model.to(device)
model = torch.compile(model)  # PyTorch 2.0+ graph compilation for speed
# Loss functions: soft-target CE for MixUp training, standard CE for validation
criterion_train = SoftTargetCrossEntropy().to(device)
criterion_val = nn.CrossEntropyLoss().to(device)
# Optimizer
optimizer = AdamP(
    model.parameters(),
    lr=config['lr'],
    weight_decay=config['weight_decay']
)
# Learning-rate scheduler: cosine annealing with linear warmup, stepped per batch
total_steps = len(train_loader) * config['epochs']
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=total_steps,
    warmup_t=len(train_loader) * config['warmup_epochs'],
    warmup_lr_init=config['warmup_lr'],
    warmup_prefix=True,
    lr_min=config['min_lr']
)
# Training loop for one epoch
def train_epoch(epoch):
    model.train()
    total_loss = 0.0
    start_time = time.time()
    loop = tqdm(enumerate(train_loader), total=len(train_loader), desc=f"Epoch {epoch}", unit="batch")
    for step, (inputs, targets) in loop:
        inputs, targets = inputs.to(device), targets.to(device)
        inputs, targets = mixup_fn(inputs, targets)
        # Forward pass under mixed precision
        with autocast('cuda', enabled=config['amp']):
            outputs = model(inputs)
            loss = criterion_train(outputs, targets)
        if not torch.isfinite(loss):
            print(f"----- Warning: skipped NaN loss at step {step}")
            continue
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), config['clip_grad_norm'])
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        # Step the LR schedule by global batch index
        scheduler.step(step + epoch * len(train_loader))
        # Accumulate loss
        total_loss += loss.item()
        # TensorBoard: log train loss and current lr per step
        global_step = epoch * len(train_loader) + step
        writer.add_scalar('train/loss', loss.item(), global_step)
        writer.add_scalar('train/lr', optimizer.param_groups[0]['lr'], global_step)
        # Update the progress bar
        avg_loss = total_loss / (step + 1)
        loop.set_postfix({
            "loss": f"{loss.item():.4f}",
            "avg": f"{avg_loss:.4f}",
            "lr": f"{optimizer.param_groups[0]['lr']:.6f}"
        })
    epoch_loss = total_loss / len(train_loader)
    writer.add_scalar('train/epoch_loss', epoch_loss, epoch)
    elapsed = time.time() - start_time
    print(f"Train Epoch: {epoch} | Average Loss: {epoch_loss:.4f} | Time: {elapsed:.1f}s")
# Validation loop
@torch.no_grad()
def validate(epoch=None):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    for inputs, targets in val_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        with autocast('cuda', enabled=config['amp']):
            outputs = model(inputs)
            loss = criterion_val(outputs, targets)
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        correct += predicted.eq(targets).sum().item()
        total += targets.size(0)
    val_loss = total_loss / total
    val_acc = 100. * correct / total
    writer.add_scalar('val/loss', val_loss, epoch)
    writer.add_scalar('val/accuracy', val_acc, epoch)
    print(f'\nValidation: Loss: {val_loss:.4f} | Acc: {val_acc:.2f}%\n')
    return val_acc
# Main training loop
best_acc = 0.0
for epoch in range(config['epochs']):
    train_epoch(epoch)
    # Validate every 10 epochs, and after the final epoch
    if epoch % 10 == 0 or epoch == config['epochs'] - 1:
        acc = validate(epoch)
        # Keep the best checkpoint
        if acc > best_acc:
            best_acc = acc
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'acc': acc,
            }, 'best_model.pth')
        print(f'Best Accuracy: {best_acc:.2f}%')

# Final evaluation with the best checkpoint
model.load_state_dict(torch.load('best_model.pth', weights_only=True)['model_state_dict'])
acc = validate()
print(f'Final Model Accuracy: {acc:.2f}%')
The code above pulls together current best practices for ImageNet training:
- High-performing ConvNeXt-Base as the backbone, currently among the SOTA CNN architectures on ImageNet
- Stochastic depth (DropPath) regularization
- Automatic mixed precision (AMP): lower memory use and faster training
- The AdamP optimizer: better suited to vision tasks than plain Adam
- LR schedule: cosine annealing with warmup
- MixUp: stronger regularization
- torch.compile: graph compilation to speed up training
- Batch-size tuning: makes full use of the 4090's 24 GB of memory and compute
The backbone used above is ConvNeXt, whose source lives at https://github.com/huggingface/pytorch-image-models ; the factory for convnext_base is in timm/models/convnext.py:
@register_model
def convnext_base(pretrained=False, **kwargs):
    model_args = dict(
        depths=[3, 3, 27, 3],        # number of blocks per stage
        dims=[128, 256, 512, 1024],  # channel width per stage
        **kwargs
    )
    model = _create_convnext('convnext_base', pretrained=pretrained, **model_args)
    return model
- • "convnext_base" 使用 4 个 stage,分别包含 3、3、27、3 个 block;
- • 对应通道维为 128、256、512、1024;
- • "kwargs" 包含用户传入的 "drop_path_rate" 等参数 ([github.com][1], [blog.csdn.net][2])。
The corresponding network structure:
1. Patch stem
- The first layer is "Conv2d(in_ch=3, out_ch=128, kernel_size=4, stride=4)";
- followed by "LayerNorm";
- mapping the original 224×224 input down to 56×56.
2. Four main stages
Each stage contains:
- Downsampling (every stage except the first, i > 0):
  - "LayerNorm(prev_chs)";
  - "Conv2d(prev_chs → dims[i], kernel_size=2, stride=2)";
  - halving the spatial size and doubling the channels.
- A run of ConvNeXtBlocks (count given by depths), each made of:
  1. Depthwise Conv (7×7): enlarges the receptive field;
  2. LayerNorm (channels-last implementation);
  3. MLP (pointwise conv or Linear, structured Linear → GELU → Linear), usually a 4× channel expansion;
  4. LayerScale: a learnable per-channel scale, initialized to a small value;
  5. DropPath (stochastic depth);
  6. residual add.
- The per-stage block counts are "[3, 3, 27, 3]", and dp_rate ramps up over the blocks within "[0, drop_path_rate]"; a sketch of this schedule follows the list.
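The drop-path ramp can be reproduced in a few lines; this is a sketch of the linear schedule, not timm's internal code verbatim:
import torch

# Per-block stochastic-depth rates increase linearly from 0 to drop_path_rate
depths = [3, 3, 27, 3]
drop_path_rate = 0.2
dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
print([round(r, 3) for r in dp_rates[:3]])   # earliest blocks: near 0
print([round(r, 3) for r in dp_rates[-3:]])  # deepest blocks: near 0.2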
Parameter details:
| Parameter | Base setting |
| --- | --- |
| depths | [3, 3, 27, 3] |
| dims | [128, 256, 512, 1024] |
| drop_path_rate | user-supplied (0.2 in our config) |
| patch_size | 4 |
| ls_init_value | 1e-6 by default (LayerScale init) |
This architecture has several strengths:
- Large depthwise kernels (7×7) provide a Transformer-like local field of view;
- simplified normalization (LayerNorm) is more stable and efficient;
- LayerScale plus stochastic depth helps train deep networks;
- the patch stem and lightweight head keep the classic ConvNet strengths while gaining Transformer-style flexibility.
Run the following inside the container to start training:
nohup python cnn-ImageNet.py > cnn-ImageNet.log 2>&1 &
tail -f cnn-ImageNet.log
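Metrics are written to the runs/ directory by the SummaryWriter above; to watch them live, start TensorBoard against it (standard TensorBoard CLI; host and port are arbitrary choices):
tensorboard --logdir runs --host 0.0.0.0 --port 6006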
Resource usage:
After raising batch_size from 128 to 160, memory use sits at 21112MiB / 24564MiB, GPU utilization fluctuates between 70%+ and 90%+, and each epoch takes a steady Time: 87.5s or so.
Training log:
...
Epoch 290: 100%|██████████| 313/313 [01:27<00:00, 3.57batch/s, loss=1.6936, avg=2.0393, lr=0.000023]
Train Epoch: 290 | Average Loss: 2.0393 | Time: 87.7s
Validation: Loss: 0.8412 | Acc: 83.64%
Best Accuracy: 83.64%
Epoch 291: 100%|██████████| 313/313 [01:27<00:00, 3.58batch/s, loss=2.8667, avg=2.0122, lr=0.000021]
Train Epoch: 291 | Average Loss: 2.0122 | Time: 87.4s
Epoch 292: 100%|██████████| 313/313 [01:28<00:00, 3.55batch/s, loss=1.7073, avg=2.0276, lr=0.000018]
Train Epoch: 292 | Average Loss: 2.0276 | Time: 88.2s
Epoch 293: 100%|██████████| 313/313 [01:27<00:00, 3.57batch/s, loss=1.8093, avg=1.9481, lr=0.000015]
Train Epoch: 293 | Average Loss: 1.9481 | Time: 87.8s
Epoch 294: 100%|██████████| 313/313 [01:27<00:00, 3.57batch/s, loss=1.3597, avg=1.9482, lr=0.000013]
Train Epoch: 294 | Average Loss: 1.9482 | Time: 87.7s
Epoch 295: 100%|██████████| 313/313 [01:27<00:00, 3.56batch/s, loss=2.3218, avg=1.9985, lr=0.000011]
Train Epoch: 295 | Average Loss: 1.9985 | Time: 88.0s
Epoch 296: 100%|██████████| 313/313 [01:27<00:00, 3.57batch/s, loss=2.6687, avg=2.0095, lr=0.000009]
Train Epoch: 296 | Average Loss: 2.0095 | Time: 87.6s
Epoch 297: 100%|██████████| 313/313 [01:27<00:00, 3.58batch/s, loss=1.6288, avg=2.0022, lr=0.000007]
Train Epoch: 297 | Average Loss: 2.0022 | Time: 87.5s
Epoch 298: 100%|██████████| 313/313 [01:27<00:00, 3.58batch/s, loss=1.9060, avg=2.0297, lr=0.000006]
Train Epoch: 298 | Average Loss: 2.0297 | Time: 87.4s
Epoch 299: 100%|██████████| 313/313 [01:27<00:00, 3.58batch/s, loss=2.7793, avg=2.0196, lr=0.000005]
Train Epoch: 299 | Average Loss: 2.0196 | Time: 87.5s
Validation: Loss: 0.8486 | Acc: 83.55%
Best Accuracy: 83.64%
Validation: Loss: 0.8412 | Acc: 83.64%
Final Model Accuracy: 83.64%
As the log shows, the model reaches a final accuracy of 83.64% on this mini-ImageNet subset (100 of the ImageNet-1k classes).
TensorBoard charts: