Image Classification
Objective
Image classification
- Solve an image classification problem with convolutional neural networks.
- Improve performance with data augmentation.
- Understand popular image-model techniques such as residual connections.
Task Introduction
Food Classification
- The images are collected from the food-11 dataset and fall into 11 classes.
- Training set: 9866 labeled images
- Validation set: 3430 labeled images
- Testing set: 3347 images
Approach
Sample Baseline
Score: 0.63047 Private score: 0.61416 (n_epochs = 10)
Just run the sample code once and submit the result; if it does not reach the Sample Baseline, train a few more epochs. Judging from the training logs, about 5 epochs are enough to reach the Sample Baseline.
Medium Baseline
Score: 0.77788 Private score: 0.76056
Following the TA's hints, apply data augmentation and then train for more epochs. In practice I tried three data-augmentation schemes, ordered from simple to advanced.
Since torchvision.transforms.v2 offers richer functionality and is faster than torchvision.transforms, data augmentation is implemented with torchvision.transforms.v2.
ref: https://pytorch.org/vision/0.21/transforms.html
import torch
import torchvision.transforms.v2 as transforms
- Using the basic API
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(128),  # random crop & resize to 128x128
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip with 50% probability
    transforms.RandomRotation(degrees=15),  # rotate within ±15°
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),  # color jitter
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # random translation up to 10%
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),  # random sharpening
    transforms.RandomPosterize(bits=4, p=0.5),  # posterize (reduce color bits)
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # random perspective transform
    transforms.ToImage(),  # convert to a tensor image
    transforms.ToDtype(torch.float32, scale=True),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # normalize with ImageNet statistics
])
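As a quick sanity check, the pipeline can be applied to a single image. A minimal sketch, assuming a hypothetical local file food.jpg:
from PIL import Image

img = Image.open("food.jpg")  # hypothetical sample image
x = train_tfm(img)            # apply the augmentation pipeline above
print(x.shape, x.dtype)       # expected: torch.Size([3, 128, 128]) torch.float32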
- Auto-Augmentation
We do not know in advance which data-augmentation methods suit the current data, so Auto-Augmentation is used to search automatically for suitable augmentations. In practice I used TrivialAugmentWide(); other auto-augmentation policies can be chosen as well.
ref: https://pytorch.org/vision/0.21/transforms.html#auto-augmentation
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(128, antialias=True),
    transforms.RandomHorizontalFlip(0.5),
    transforms.TrivialAugmentWide(),
    transforms.PILToTensor(),
    transforms.ConvertImageDtype(torch.float),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
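Augmentation should only be applied at training time; the validation/test pipeline stays deterministic. A minimal counterpart, my own sketch rather than the sample code's test_tfm:
test_tfm = transforms.Compose([
    transforms.Resize((128, 128), antialias=True),  # deterministic resize, no randomness
    transforms.PILToTensor(),
    transforms.ConvertImageDtype(torch.float),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])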
- Auto-Augmentation + CutMix & MixUp Transform
Add the CutMix & MixUp transforms on top of TrivialAugmentWide().
ref: https://pytorch.org/vision/main/auto_examples/transforms/plot_cutmix_mixup.html#sphx-glr-auto-examples-transforms-plot-cutmix-mixup-py
- Define a collate_fn(batch) function:
from torch.utils.data import default_collate

NUM_CLASSES = 11
cutmix = transforms.CutMix(num_classes=NUM_CLASSES)
mixup = transforms.MixUp(num_classes=NUM_CLASSES)
cutmix_or_mixup = transforms.RandomChoice([cutmix, mixup])

def collate_fn(batch):
    # Apply CutMix or MixUp (picked at random) to every collated batch
    return cutmix_or_mixup(*default_collate(batch))
- Pass the extra collate_fn=collate_fn argument to the train_loader:
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True, drop_last=True, collate_fn=collate_fn) # CutMix and MixUp Transform
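Note that CutMix/MixUp replace the integer labels with soft (mixed) label vectors of shape [batch_size, NUM_CLASSES], so the training loop has to handle them. A minimal sketch of the affected lines, assuming model, criterion = nn.CrossEntropyLoss(), and device from the sample training loop:
for imgs, labels in train_loader:
    logits = model(imgs.to(device))
    # nn.CrossEntropyLoss accepts probability targets, so the mixed labels work directly
    loss = criterion(logits, labels.to(device))
    # For accuracy logging, reduce the soft labels back to the dominant class
    acc = (logits.argmax(dim=-1) == labels.to(device).argmax(dim=-1)).float().mean()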
Summary
Auto-Augmentation + CutMix & MixUp Transform is the fastest and performs best. As the figure below shows, the Private Score and Public Score are close, so the model does not overfit and generalizes well. In total, 370 epochs of training are enough to pass the Medium Baseline (Private Score: 0.71361, Public Score: 0.73207).
The basic-API and Auto-Augmentation schemes need roughly 1000 epochs of training to pass the Medium Baseline, their models overfit, and their scores are lower.
Strong Baseline
Score: 0.87948 Private score: 0.87323
Load the model pre-trained in the Medium Baseline with the Auto-Augmentation + CutMix & MixUp Transform scheme, then use version_2 of Test Time Augmentation (TTA) in the 'Testing and generate prediction CSV' stage to generate the predictions. This alone beats the Strong Baseline and comes very close to the Boss Baseline, quite a surprise❗
A good prediction scheme at inference time can also greatly improve test-set accuracy; training a stronger and more complex model is not the only way.
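For reference, the core idea of TTA is to combine the model's predictions over a deterministic view and several augmented views of each test image. The following is my own sketch, not the sample code's version_2; the weight w, the view count n_aug, and the reuse of train_tfm/test_tfm from above are assumptions:
@torch.no_grad()
def tta_predict(model, imgs_pil, n_aug=5, w=0.6):
    # Weight w on the deterministic view, (1 - w) spread over the augmented views
    base = model(torch.stack([test_tfm(im) for im in imgs_pil]).to(device)).softmax(-1)
    aug = sum(
        model(torch.stack([train_tfm(im) for im in imgs_pil]).to(device)).softmax(-1)
        for _ in range(n_aug)
    ) / n_aug
    return (w * base + (1 - w) * aug).argmax(dim=-1)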
Boss Baseline
Score: 0.88745 Private score: 0.87878
- On top of the Medium setup, train a ResNet50 model with dropout and batch normalization for about 500 epochs.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResNet50(nn.Module):
    def __init__(self, num_classes=11):
        super(ResNet50, self).__init__()
        # Load a ResNet50 backbone (dropping the final fc layer);
        # use weights=ResNet50_Weights.IMAGENET1K_V1 to load pretrained weights
        resnet = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # everything except the fc layer
        # Number of input features of the original fc layer
        num_features = resnet.fc.in_features
        # New classification head
        self.fc = nn.Sequential(
            nn.Linear(num_features, 1024),
            nn.BatchNorm1d(1024),  # batch normalization
            nn.ReLU(),
            nn.Dropout(0.5),  # dropout
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),  # batch normalization
            nn.ReLU(),
            nn.Dropout(0.5),  # dropout
            nn.Linear(512, num_classes)  # output layer
        )

    def forward(self, x):
        # Extract features
        out = self.cnn(x)
        # Flatten the features
        out = torch.flatten(out, 1)  # torch.flatten instead of view
        # Classify
        out = self.fc(out)
        return out
- Pick the three best submissions and ensemble them; in practice I used majority voting, sketched below.
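A minimal sketch of the voting step, assuming the Id,Category submission format; the file names are hypothetical, and when all three submissions disagree it falls back to the first (best) one:
import pandas as pd

dfs = [pd.read_csv(f) for f in ["pred_1.csv", "pred_2.csv", "pred_3.csv"]]  # hypothetical names
votes = pd.concat([df["Category"] for df in dfs], axis=1)
majority = votes.mode(axis=1)[0]           # most frequent label per row
tie = votes.nunique(axis=1) == 3           # all three disagree -> no majority
majority[tie] = dfs[0]["Category"][tie]    # fall back to the best single submission
ensemble = pd.DataFrame({"Id": dfs[0]["Id"], "Category": majority.astype(int)})
ensemble.to_csv("ensemble.csv", index=False)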
Code
Report Questions
Q1. Augmentation Implementation (2%)
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224, antialias=True),  # crop size raised from 128 to 224
    transforms.RandomHorizontalFlip(0.5),
    transforms.TrivialAugmentWide(),
    transforms.PILToTensor(),
    transforms.ConvertImageDtype(torch.float),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
Q2. Residual Connection Implementation (2%)
import torch
import torch.nn as nn

class Residual_Network(nn.Module):
    def __init__(self):
        super(Residual_Network, self).__init__()
        self.cnn_layer1 = nn.Sequential(
            nn.Conv2d(3, 64, 3, 1, 1),
            nn.BatchNorm2d(64),
        )
        self.cnn_layer2 = nn.Sequential(
            nn.Conv2d(64, 64, 3, 1, 1),
            nn.BatchNorm2d(64),
        )
        self.cnn_layer3 = nn.Sequential(
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.BatchNorm2d(128),
        )
        self.cnn_layer4 = nn.Sequential(
            nn.Conv2d(128, 128, 3, 1, 1),
            nn.BatchNorm2d(128),
        )
        self.cnn_layer5 = nn.Sequential(
            nn.Conv2d(128, 256, 3, 2, 1),
            nn.BatchNorm2d(256),
        )
        self.cnn_layer6 = nn.Sequential(
            nn.Conv2d(256, 256, 3, 1, 1),
            nn.BatchNorm2d(256),
        )
        self.fc_layer = nn.Sequential(
            nn.Linear(256 * 32 * 32, 256),
            nn.ReLU(),
            nn.Linear(256, 11)
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # input (x): [batch_size, 3, 128, 128]
        # output: [batch_size, 11]

        # Extract features by convolutional layers.
        x1 = self.cnn_layer1(x)
        x1 = self.relu(x1)

        x2 = self.cnn_layer2(x1)
        x2 = self.relu(x2)
        # Residual connection: x1 and x2 share the shape [batch_size, 64, 128, 128]
        x2 = x2 + x1

        # No residual here: cnn_layer3 changes both the channel count (64 -> 128)
        # and the spatial size (stride 2), so x3 and x2 have incompatible shapes.
        x3 = self.cnn_layer3(x2)
        x3 = self.relu(x3)

        x4 = self.cnn_layer4(x3)
        x4 = self.relu(x4)
        # Residual connection: x3 and x4 share the shape [batch_size, 128, 64, 64]
        x4 = x4 + x3

        # No residual here: cnn_layer5 changes both the channel count (128 -> 256)
        # and the spatial size (stride 2), so x5 and x4 have incompatible shapes.
        x5 = self.cnn_layer5(x4)
        x5 = self.relu(x5)

        x6 = self.cnn_layer6(x5)
        x6 = self.relu(x6)
        # Residual connection: x5 and x6 share the shape [batch_size, 256, 32, 32]
        x6 = x6 + x5

        # The extracted feature map must be flattened before going to fully-connected layers.
        xout = x6.flatten(1)
        # The features are transformed by fully-connected layers to obtain the final logits.
        xout = self.fc_layer(xout)
        return xout
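A quick shape check (a sketch) confirms that the residual connections are only applied where the tensor shapes match:
model = Residual_Network()
dummy = torch.zeros(2, 3, 128, 128)  # [batch_size, 3, 128, 128]
print(model(dummy).shape)            # expected: torch.Size([2, 11])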