# 1

A CNN takes a $128\times 128$ RGB image (3 channels) as input. It is passed through the following sequence of layers:

  1. Conv layer with 16 filters, each of size $5\times 5$, stride 1, padding same;
  2. Max Pooling layer with $2\times 2$ window, stride 2;
  3. Conv layer with 32 filters, each of size $3\times 3$, stride 1, padding valid;
  4. Max Pooling layer with $2\times 2$ window, stride 2;
  5. A Flatten layer;
  6. A final Dense (Fully Connected) layer with 100 units.
  • Calculate the dimensions (height, width, number of channels) of the output after each of the first four layers. Show your work.
  • What is the total number of elements after the Flatten layer?
  • Calculate the number of trainable parameters in the first convolutional layer. (Remember: parameters include weights and biases.)

(1) Output dimensions after each of the first four layers:

  • Conv layer I: with same padding (and stride 1) the spatial size is preserved, so only the number of filters matters:

$$\#\text{Output1} = 128\times 128\times 16$$

  • Pooling layer I: the $2\times 2$ pooling windows do not overlap, so the spatial size is halved and the channel count is unchanged:

$$\#\text{Output2} = 64\times 64\times 16$$

  • Conv layer II: with no (valid) padding and stride 1, the loss in height and width is set by the $3\times 3$ filter size, and the channel count equals the number of filters:

$$\#\text{Output3} = (64-3+1)\times(64-3+1)\times 32 = 62\times 62\times 32$$

  • Pooling layer II: again the pooling windows do not overlap, so the spatial size is halved and the channel count is unchanged:

$$\#\text{Output4} = 31\times 31\times 32$$

(2) The Flatten layer simply stretches the previous output into a one-dimensional vector, so the total number of elements is

$$\#\text{Flatten} = 31\times 31\times 32 = 30752$$

(3) In Conv layer I, by the convolution formula

$$(I*K)(i,j,k) = \sum_c\sum_m\sum_n I_{i+m,\,j+n,\,c}\,K_{m,n,c,k} + b_k$$

where $I$ is the input image, $K$ is the kernel, and $b_k$ is the bias, the number of output channels equals the number of filters, and the $k$-th filter holds one weight per spatial position and input channel, plus one bias. The total number of trainable parameters in Conv layer I is therefore

$$\#\text{Params} = (5\times 5\times 3 + 1)\times 16 = 1216$$
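
These shape and parameter calculations can be checked quickly in PyTorch; the snippet below is a minimal sketch of the described stack (the layer names are ours, not from the assignment):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=1, padding='same')
pool = nn.MaxPool2d(kernel_size=2, stride=2)
conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=0)  # 'valid' padding

x = torch.randn(1, 3, 128, 128)      # one 128x128 RGB image (NCHW layout)
x = conv1(x); print(tuple(x.shape))  # (1, 16, 128, 128)
x = pool(x);  print(tuple(x.shape))  # (1, 16, 64, 64)
x = conv2(x); print(tuple(x.shape))  # (1, 32, 62, 62)
x = pool(x);  print(tuple(x.shape))  # (1, 32, 31, 31)
print(x.flatten(1).shape[1])                       # 30752 elements after Flatten
print(sum(p.numel() for p in conv1.parameters()))  # 1216 trainable parameters
```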

# 2

Build your own residual net and train on the CIFAR-10 dataset; try your best (within your computing capability) to get the test accuracy to 90 percent. Submit your code and training log.

The basic idea of ResNet is to introduce residual (shortcut) connections that mitigate the vanishing-gradient problem. Below is the ResNet implementation, written with AI assistance:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np
import time
import copy
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import MultiStepLR, CosineAnnealingLR
# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Set random seeds for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
# ==================== Data augmentation and preprocessing ====================
# Training-set augmentation
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
# Test-set preprocessing
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
# Load the CIFAR-10 dataset
batch_size = 128
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=train_transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=test_transform
)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")
# ==================== ResNet model definition ====================
class BasicBlock(nn.Module):
    """Basic residual block for ResNet"""
    expansion = 1
    
    def __init__(self, in_channels, out_channels, stride=1):
        super(BasicBlock, self).__init__()
        
        # First convolutional layer
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3, 
            stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        
        # Second convolutional layer
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Shortcut connection (1x1 projection when the shape changes)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != self.expansion * out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(
                    in_channels, self.expansion * out_channels,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(self.expansion * out_channels)
            )
    
    def forward(self, x):
        identity = x
        
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        out += self.shortcut(identity)
        out = F.relu(out)
        
        return out
class ResNet(nn.Module):
    """Custom ResNet model"""
    def __init__(self, block, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        
        self.in_channels = 64
        
        # Initial convolutional layer
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        
        # Residual stages
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        
        # Global average pooling and fully connected classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        
        # Dropout (optional, to reduce overfitting)
        self.dropout = nn.Dropout(0.3)
    
    def _make_layer(self, block, out_channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        
        for stride in strides:
            layers.append(block(self.in_channels, out_channels, stride))
            self.in_channels = out_channels * block.expansion
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        
        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.dropout(out)
        out = self.fc(out)
        
        return out
def resnet18():
    """Create a ResNet-18 model"""
    return ResNet(BasicBlock, [2, 2, 2, 2])
def resnet34():
    """Create a ResNet-34 model"""
    return ResNet(BasicBlock, [3, 4, 6, 3])
# ==================== Training and evaluation functions ====================
def train(model, device, train_loader, optimizer, criterion, epoch):
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        train_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()
        
        if batch_idx % 100 == 0:
            print(f'Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} '
                  f'({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}')
    
    train_accuracy = 100. * correct / total
    avg_loss = train_loss / len(train_loader)
    
    return avg_loss, train_accuracy
def test(model, device, test_loader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()
    
    test_accuracy = 100. * correct / total
    avg_loss = test_loss / len(test_loader)
    
    return avg_loss, test_accuracy
# ==================== Main training loop ====================
def main():
    # Create the model
    print("Creating ResNet model...")
    model = resnet18().to(device)
    
    # Print parameter counts
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    
    # Define the loss function and optimizer
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing to reduce overfitting
    optimizer = optim.SGD(
        model.parameters(), 
        lr=0.1,  # initial learning rate
        momentum=0.9, 
        weight_decay=5e-4  # L2 regularization
    )
    
    # Learning-rate scheduler
    scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
    # Alternatively, use cosine annealing:
    # scheduler = CosineAnnealingLR(optimizer, T_max=200)
    
    num_epochs = 200
    best_accuracy = 0
    best_model_state = None
    
    # Records of the training history
    train_losses, train_accs = [], []
    test_losses, test_accs = [], []
    
    print(f"\nStarting training for {num_epochs} epochs...")
    print("="*60)
    
    for epoch in range(1, num_epochs + 1):
        start_time = time.time()
        
        # Train for one epoch
        train_loss, train_acc = train(model, device, train_loader, optimizer, criterion, epoch)
        train_losses.append(train_loss)
        train_accs.append(train_acc)
        
        # Evaluate on the test set
        test_loss, test_acc = test(model, device, test_loader, criterion)
        test_losses.append(test_loss)
        test_accs.append(test_acc)
        
        # Update the learning rate
        scheduler.step()
        
        # Save the best model (deep-copied so later training steps cannot mutate the snapshot)
        if test_acc > best_accuracy:
            best_accuracy = test_acc
            best_model_state = copy.deepcopy(model.state_dict())
            torch.save({
                'epoch': epoch,
                'model_state_dict': best_model_state,
                'optimizer_state_dict': optimizer.state_dict(),
                'accuracy': best_accuracy,
            }, 'best_model.pth')
        
        epoch_time = time.time() - start_time
        
        print(f'Epoch {epoch:3d}/{num_epochs} | '
              f'Time: {epoch_time:.2f}s | '
              f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | '
              f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}% | '
              f'LR: {optimizer.param_groups[0]["lr"]:.6f}')
        
        # Save an intermediate checkpoint every 10 epochs
        if epoch % 10 == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'train_losses': train_losses,
                'test_accs': test_accs,
            }, f'checkpoint_epoch_{epoch}.pth')
    
    # Load the best model weights
    model.load_state_dict(best_model_state)
    
    # Final evaluation
    final_loss, final_accuracy = test(model, device, test_loader, criterion)
    print(f"\n{'='*60}")
    print("Training complete!")
    print(f"Best test accuracy: {best_accuracy:.2f}%")
    print(f"Final test accuracy: {final_accuracy:.2f}%")
    
    # Plot the training curves
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.plot(train_losses, label='Train Loss')
    plt.plot(test_losses, label='Test Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Training and Test Loss')
    
    plt.subplot(1, 2, 2)
    plt.plot(train_accs, label='Train Accuracy')
    plt.plot(test_accs, label='Test Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy (%)')
    plt.legend()
    plt.title('Training and Test Accuracy')
    
    plt.tight_layout()
    plt.savefig('training_curves.png')
    plt.show()
    
    return model, best_accuracy
if __name__ == '__main__':
    model, best_acc = main()

Part of the training log is shown below:

Using device: cuda
Training set size: 50000
Test set size: 10000
Creating ResNet model...
Total parameters: 11,173,962
Trainable parameters: 11,173,962
Starting training for 200 epochs...
============================================================
Epoch: 1 [0/50000 (0%)] Loss: 2.598158
Epoch: 1 [12800/50000 (26%)]    Loss: 1.886283
Epoch: 1 [25600/50000 (51%)]    Loss: 1.757755
Epoch: 1 [38400/50000 (77%)]    Loss: 1.580445
Epoch   1/200 | Time: 53.87s | Train Loss: 1.8160 | Train Acc: 39.76% | Test Loss: 1.4789 | Test Acc: 56.14% | LR: 0.100000
Epoch: 2 [0/50000 (0%)] Loss: 1.444861
Epoch: 2 [12800/50000 (26%)]    Loss: 1.479313
Epoch: 2 [25600/50000 (51%)]    Loss: 1.316082
Epoch: 2 [38400/50000 (77%)]    Loss: 1.387875
Epoch   2/200 | Time: 53.57s | Train Loss: 1.3979 | Train Acc: 59.62% | Test Loss: 1.3437 | Test Acc: 62.63% | LR: 0.100000
Epoch: 3 [0/50000 (0%)] Loss: 1.537805
Epoch: 3 [12800/50000 (26%)]    Loss: 1.199809
Epoch: 3 [25600/50000 (51%)]    Loss: 1.230053
Epoch: 3 [38400/50000 (77%)]    Loss: 1.389225
Epoch   3/200 | Time: 53.94s | Train Loss: 1.2286 | Train Acc: 67.93% | Test Loss: 1.2040 | Test Acc: 69.61% | LR: 0.100000
Epoch: 4 [0/50000 (0%)] Loss: 1.127333
Epoch: 4 [12800/50000 (26%)]    Loss: 1.309616
Epoch: 4 [25600/50000 (51%)]    Loss: 1.030421
Epoch: 4 [38400/50000 (77%)]    Loss: 1.079045
Epoch   4/200 | Time: 53.66s | Train Loss: 1.1251 | Train Acc: 72.60% | Test Loss: 1.2645 | Test Acc: 68.30% | LR: 0.100000
Epoch: 5 [0/50000 (0%)] Loss: 1.137778
Epoch: 5 [12800/50000 (26%)]    Loss: 1.047550
Epoch: 5 [25600/50000 (51%)]    Loss: 0.974705
Epoch: 5 [38400/50000 (77%)]    Loss: 1.086789
Epoch   5/200 | Time: 53.51s | Train Loss: 1.0631 | Train Acc: 75.61% | Test Loss: 1.0473 | Test Acc: 75.64% | LR: 0.100000
Epoch: 6 [0/50000 (0%)] Loss: 1.064054
Epoch: 6 [12800/50000 (26%)]    Loss: 0.963964
Epoch: 6 [25600/50000 (51%)]    Loss: 1.007503
Epoch: 6 [38400/50000 (77%)]    Loss: 0.969994
Epoch   6/200 | Time: 54.46s | Train Loss: 1.0097 | Train Acc: 77.95% | Test Loss: 1.1400 | Test Acc: 72.96% | LR: 0.100000
Epoch: 7 [0/50000 (0%)] Loss: 0.895288
Epoch: 7 [12800/50000 (26%)]    Loss: 0.995895
Epoch: 7 [25600/50000 (51%)]    Loss: 0.936781
Epoch: 7 [38400/50000 (77%)]    Loss: 1.027178
Epoch   7/200 | Time: 53.85s | Train Loss: 0.9801 | Train Acc: 79.23% | Test Loss: 1.0211 | Test Acc: 76.88% | LR: 0.100000
Epoch: 8 [0/50000 (0%)] Loss: 0.983641
Epoch: 8 [12800/50000 (26%)]    Loss: 0.804812
Epoch: 8 [25600/50000 (51%)]    Loss: 0.898298
Epoch: 8 [38400/50000 (77%)]    Loss: 0.971728
Epoch   8/200 | Time: 55.00s | Train Loss: 0.9553 | Train Acc: 80.29% | Test Loss: 0.9534 | Test Acc: 80.80% | LR: 0.100000
Epoch: 9 [0/50000 (0%)] Loss: 0.874495
Epoch: 9 [12800/50000 (26%)]    Loss: 0.851649
Epoch: 9 [25600/50000 (51%)]    Loss: 0.977242
Epoch: 9 [38400/50000 (77%)]    Loss: 1.067239
Epoch   9/200 | Time: 79.66s | Train Loss: 0.9405 | Train Acc: 81.11% | Test Loss: 0.9553 | Test Acc: 80.36% | LR: 0.100000
Epoch: 10 [0/50000 (0%)]        Loss: 0.989358
Epoch: 10 [12800/50000 (26%)]   Loss: 0.845767
Epoch: 10 [25600/50000 (51%)]   Loss: 0.907775
Epoch: 10 [38400/50000 (77%)]   Loss: 1.033314
Epoch  10/200 | Time: 53.40s | Train Loss: 0.9259 | Train Acc: 81.70% | Test Loss: 0.9687 | Test Acc: 79.80% | LR: 0.100000
Epoch: 11 [0/50000 (0%)]        Loss: 1.002646
Epoch: 11 [12800/50000 (26%)]   Loss: 0.807637
Epoch: 11 [25600/50000 (51%)]   Loss: 0.912228
Epoch: 11 [38400/50000 (77%)]   Loss: 0.950749
Epoch  11/200 | Time: 53.85s | Train Loss: 0.9147 | Train Acc: 82.18% | Test Loss: 0.9289 | Test Acc: 81.39% | LR: 0.100000
Epoch: 12 [0/50000 (0%)]        Loss: 0.924864
Epoch: 12 [12800/50000 (26%)]   Loss: 0.942824
Epoch: 12 [25600/50000 (51%)]   Loss: 1.009633
Epoch: 12 [38400/50000 (77%)]   Loss: 0.862141
Epoch  12/200 | Time: 54.52s | Train Loss: 0.9080 | Train Acc: 82.51% | Test Loss: 1.1401 | Test Acc: 72.76% | LR: 0.100000
Epoch: 13 [0/50000 (0%)]        Loss: 0.934913
Epoch: 13 [12800/50000 (26%)]   Loss: 0.898304
Epoch: 13 [25600/50000 (51%)]   Loss: 0.896272
Epoch: 13 [38400/50000 (77%)]   Loss: 0.865075
Epoch  13/200 | Time: 54.63s | Train Loss: 0.9009 | Train Acc: 82.75% | Test Loss: 0.8914 | Test Acc: 83.25% | LR: 0.100000
Epoch: 14 [0/50000 (0%)]        Loss: 0.891441
Epoch: 14 [12800/50000 (26%)]   Loss: 0.907482
Epoch: 14 [25600/50000 (51%)]   Loss: 0.857055
Epoch: 14 [38400/50000 (77%)]   Loss: 0.914766
Epoch  14/200 | Time: 54.15s | Train Loss: 0.8900 | Train Acc: 83.20% | Test Loss: 0.9459 | Test Acc: 81.01% | LR: 0.100000
Epoch: 15 [0/50000 (0%)]        Loss: 0.800900
Epoch: 15 [12800/50000 (26%)]   Loss: 0.867119
Epoch: 15 [25600/50000 (51%)]   Loss: 0.853828
Epoch: 15 [38400/50000 (77%)]   Loss: 0.828357
Epoch  15/200 | Time: 54.15s | Train Loss: 0.8804 | Train Acc: 83.69% | Test Loss: 0.9595 | Test Acc: 80.49% | LR: 0.100000
Epoch: 16 [0/50000 (0%)]        Loss: 0.906890
Epoch: 16 [12800/50000 (26%)]   Loss: 0.887977
Epoch: 16 [25600/50000 (51%)]   Loss: 0.770198
Epoch: 16 [38400/50000 (77%)]   Loss: 0.843562
Epoch  16/200 | Time: 54.12s | Train Loss: 0.8753 | Train Acc: 83.86% | Test Loss: 0.9450 | Test Acc: 80.98% | LR: 0.100000
Epoch: 17 [0/50000 (0%)]        Loss: 0.875333
Epoch: 17 [12800/50000 (26%)]   Loss: 0.940536
Epoch: 17 [25600/50000 (51%)]   Loss: 0.926693
Epoch: 17 [38400/50000 (77%)]   Loss: 0.918440
Epoch  17/200 | Time: 54.22s | Train Loss: 0.8747 | Train Acc: 83.97% | Test Loss: 0.9690 | Test Acc: 79.95% | LR: 0.100000
Epoch: 18 [0/50000 (0%)]        Loss: 0.857610
Epoch: 18 [12800/50000 (26%)]   Loss: 0.872131
Epoch: 18 [25600/50000 (51%)]   Loss: 0.981058
Epoch: 18 [38400/50000 (77%)]   Loss: 0.863057
Epoch  18/200 | Time: 53.98s | Train Loss: 0.8674 | Train Acc: 84.26% | Test Loss: 1.0064 | Test Acc: 78.28% | LR: 0.100000
Epoch: 19 [0/50000 (0%)]        Loss: 0.860547
Epoch: 19 [12800/50000 (26%)]   Loss: 0.903203
Epoch: 19 [25600/50000 (51%)]   Loss: 0.822991
Epoch: 19 [38400/50000 (77%)]   Loss: 0.921520
Epoch  19/200 | Time: 54.02s | Train Loss: 0.8631 | Train Acc: 84.58% | Test Loss: 1.0266 | Test Acc: 78.11% | LR: 0.100000
Epoch: 20 [0/50000 (0%)]        Loss: 0.864497
Epoch: 20 [12800/50000 (26%)]   Loss: 0.826816
Epoch: 20 [25600/50000 (51%)]   Loss: 0.836563
Epoch: 20 [38400/50000 (77%)]   Loss: 0.812488
Epoch  20/200 | Time: 54.56s | Train Loss: 0.8608 | Train Acc: 84.60% | Test Loss: 0.9077 | Test Acc: 82.32% | LR: 0.100000
Epoch: 21 [0/50000 (0%)]        Loss: 0.795630
Epoch: 21 [12800/50000 (26%)]   Loss: 0.895831
Epoch: 21 [25600/50000 (51%)]   Loss: 0.943226
Epoch: 21 [38400/50000 (77%)]   Loss: 0.773665
Epoch  21/200 | Time: 54.25s | Train Loss: 0.8601 | Train Acc: 84.66% | Test Loss: 1.0858 | Test Acc: 75.16% | LR: 0.100000
Epoch: 22 [0/50000 (0%)]        Loss: 0.938416
Epoch: 22 [12800/50000 (26%)]   Loss: 0.888500
Epoch: 22 [25600/50000 (51%)]   Loss: 0.756124
Epoch: 22 [38400/50000 (77%)]   Loss: 0.843318
Epoch  22/200 | Time: 56.05s | Train Loss: 0.8569 | Train Acc: 84.87% | Test Loss: 0.8957 | Test Acc: 83.12% | LR: 0.100000
Epoch: 23 [0/50000 (0%)]        Loss: 0.914953
Epoch: 23 [12800/50000 (26%)]   Loss: 0.881937
Epoch: 23 [25600/50000 (51%)]   Loss: 0.935297
Epoch: 23 [38400/50000 (77%)]   Loss: 0.873078
Epoch  23/200 | Time: 54.64s | Train Loss: 0.8535 | Train Acc: 84.86% | Test Loss: 0.9544 | Test Acc: 80.63% | LR: 0.100000
Epoch: 24 [0/50000 (0%)]        Loss: 0.803802
Epoch: 24 [12800/50000 (26%)]   Loss: 0.791671
Epoch: 24 [25600/50000 (51%)]   Loss: 0.809881
Epoch: 24 [38400/50000 (77%)]   Loss: 0.838971
Epoch  24/200 | Time: 54.93s | Train Loss: 0.8514 | Train Acc: 84.83% | Test Loss: 0.9577 | Test Acc: 79.69% | LR: 0.100000
Epoch: 25 [0/50000 (0%)]        Loss: 0.858007
Epoch: 25 [12800/50000 (26%)]   Loss: 0.756386
Epoch: 25 [25600/50000 (51%)]   Loss: 0.863666
Epoch: 25 [38400/50000 (77%)]   Loss: 0.860152
Epoch  25/200 | Time: 54.81s | Train Loss: 0.8496 | Train Acc: 85.14% | Test Loss: 0.9546 | Test Acc: 80.74% | LR: 0.100000
Epoch: 26 [0/50000 (0%)]        Loss: 0.831551
Epoch: 26 [12800/50000 (26%)]   Loss: 0.916696
Epoch: 26 [25600/50000 (51%)]   Loss: 1.014461
Epoch: 26 [38400/50000 (77%)]   Loss: 0.929880
Epoch  26/200 | Time: 54.58s | Train Loss: 0.8450 | Train Acc: 85.26% | Test Loss: 0.8793 | Test Acc: 83.58% | LR: 0.100000
Epoch: 27 [0/50000 (0%)]        Loss: 0.840893
Epoch: 27 [12800/50000 (26%)]   Loss: 0.878673
Epoch: 27 [25600/50000 (51%)]   Loss: 0.806746
Epoch: 27 [38400/50000 (77%)]   Loss: 0.888433
Epoch  27/200 | Time: 56.18s | Train Loss: 0.8430 | Train Acc: 85.37% | Test Loss: 0.8076 | Test Acc: 86.60% | LR: 0.100000
Epoch: 28 [0/50000 (0%)]        Loss: 0.789495
Epoch: 28 [12800/50000 (26%)]   Loss: 0.762469
Epoch: 28 [25600/50000 (51%)]   Loss: 0.848810
Epoch: 28 [38400/50000 (77%)]   Loss: 0.874455
Epoch  28/200 | Time: 72.53s | Train Loss: 0.8402 | Train Acc: 85.43% | Test Loss: 0.9131 | Test Acc: 82.04% | LR: 0.100000
Epoch: 29 [0/50000 (0%)]        Loss: 0.784175
Epoch: 29 [12800/50000 (26%)]   Loss: 0.763392
Epoch: 29 [25600/50000 (51%)]   Loss: 0.715707
Epoch: 29 [38400/50000 (77%)]   Loss: 0.715120
Epoch  29/200 | Time: 54.82s | Train Loss: 0.8369 | Train Acc: 85.72% | Test Loss: 0.8513 | Test Acc: 84.56% | LR: 0.100000
Epoch: 30 [0/50000 (0%)]        Loss: 0.800729
Epoch: 30 [12800/50000 (26%)]   Loss: 0.822383
Epoch: 30 [25600/50000 (51%)]   Loss: 0.916324
Epoch: 30 [38400/50000 (77%)]   Loss: 0.796236
Epoch  30/200 | Time: 54.51s | Train Loss: 0.8414 | Train Acc: 85.39% | Test Loss: 0.8714 | Test Acc: 84.37% | LR: 0.100000
Epoch: 31 [0/50000 (0%)]        Loss: 0.848700
Epoch: 31 [12800/50000 (26%)]   Loss: 0.787045
Epoch: 31 [25600/50000 (51%)]   Loss: 0.847590
Epoch: 31 [38400/50000 (77%)]   Loss: 0.907628
Epoch  31/200 | Time: 54.08s | Train Loss: 0.8357 | Train Acc: 85.66% | Test Loss: 0.9253 | Test Acc: 81.90% | LR: 0.100000
Epoch: 32 [0/50000 (0%)]        Loss: 0.898938
Epoch: 32 [12800/50000 (26%)]   Loss: 0.834288
Epoch: 32 [25600/50000 (51%)]   Loss: 0.800032
Epoch: 32 [38400/50000 (77%)]   Loss: 0.911632
Epoch  32/200 | Time: 54.14s | Train Loss: 0.8388 | Train Acc: 85.57% | Test Loss: 1.0133 | Test Acc: 78.52% | LR: 0.100000
Epoch: 33 [0/50000 (0%)]        Loss: 0.774994
Epoch: 33 [12800/50000 (26%)]   Loss: 0.882923
Epoch: 33 [25600/50000 (51%)]   Loss: 0.812098
Epoch: 33 [38400/50000 (77%)]   Loss: 0.824945
Epoch  33/200 | Time: 54.45s | Train Loss: 0.8301 | Train Acc: 86.01% | Test Loss: 0.9056 | Test Acc: 82.61% | LR: 0.100000
Epoch: 34 [0/50000 (0%)]        Loss: 0.845768
Epoch: 34 [12800/50000 (26%)]   Loss: 0.827564
Epoch: 34 [25600/50000 (51%)]   Loss: 0.769887
Epoch: 34 [38400/50000 (77%)]   Loss: 0.800083
Epoch  34/200 | Time: 54.80s | Train Loss: 0.8357 | Train Acc: 85.66% | Test Loss: 0.8552 | Test Acc: 84.82% | LR: 0.100000
Epoch: 35 [0/50000 (0%)]        Loss: 0.846894
Epoch: 35 [12800/50000 (26%)]   Loss: 0.789980
Epoch: 35 [25600/50000 (51%)]   Loss: 0.807628
Epoch: 35 [38400/50000 (77%)]   Loss: 0.902017
Epoch  35/200 | Time: 54.73s | Train Loss: 0.8322 | Train Acc: 85.80% | Test Loss: 0.8848 | Test Acc: 83.20% | LR: 0.100000
Epoch: 36 [0/50000 (0%)]        Loss: 0.831893
Epoch: 36 [12800/50000 (26%)]   Loss: 0.780592
Epoch: 36 [25600/50000 (51%)]   Loss: 0.749251
Epoch: 36 [38400/50000 (77%)]   Loss: 0.876861
Epoch  36/200 | Time: 54.61s | Train Loss: 0.8310 | Train Acc: 86.02% | Test Loss: 0.9535 | Test Acc: 81.04% | LR: 0.100000
Epoch: 37 [0/50000 (0%)]        Loss: 0.917584
Epoch: 37 [12800/50000 (26%)]   Loss: 0.752700
Epoch: 37 [25600/50000 (51%)]   Loss: 0.834917
Epoch: 37 [38400/50000 (77%)]   Loss: 0.849060
Epoch  37/200 | Time: 54.78s | Train Loss: 0.8295 | Train Acc: 85.89% | Test Loss: 0.8575 | Test Acc: 84.79% | LR: 0.100000
Epoch: 38 [0/50000 (0%)]        Loss: 0.808229
Epoch: 38 [12800/50000 (26%)]   Loss: 0.764553
Epoch: 38 [25600/50000 (51%)]   Loss: 0.869413
Epoch: 38 [38400/50000 (77%)]   Loss: 0.975918
Epoch  38/200 | Time: 54.38s | Train Loss: 0.8325 | Train Acc: 85.92% | Test Loss: 1.0467 | Test Acc: 77.26% | LR: 0.100000
Epoch: 39 [0/50000 (0%)]        Loss: 0.792865
Epoch: 39 [12800/50000 (26%)]   Loss: 0.740830
Epoch: 39 [25600/50000 (51%)]   Loss: 0.903176
Epoch: 39 [38400/50000 (77%)]   Loss: 0.849468
Epoch  39/200 | Time: 54.15s | Train Loss: 0.8291 | Train Acc: 85.95% | Test Loss: 0.8530 | Test Acc: 84.60% | LR: 0.100000
Epoch: 40 [0/50000 (0%)]        Loss: 0.794266
Epoch: 40 [12800/50000 (26%)]   Loss: 0.965265
Epoch: 40 [25600/50000 (51%)]   Loss: 0.970391
Epoch: 40 [38400/50000 (77%)]   Loss: 0.820289
Epoch  40/200 | Time: 54.60s | Train Loss: 0.8263 | Train Acc: 86.14% | Test Loss: 0.9158 | Test Acc: 82.22% | LR: 0.100000
Epoch: 41 [0/50000 (0%)]        Loss: 0.859779
Epoch: 41 [12800/50000 (26%)]   Loss: 0.823970
Epoch: 41 [25600/50000 (51%)]   Loss: 0.845126
Epoch: 41 [38400/50000 (77%)]   Loss: 0.894197
Epoch  41/200 | Time: 54.31s | Train Loss: 0.8295 | Train Acc: 86.02% | Test Loss: 0.9632 | Test Acc: 80.73% | LR: 0.100000
Epoch: 42 [0/50000 (0%)]        Loss: 0.751721
Epoch: 42 [12800/50000 (26%)]   Loss: 0.927409
Epoch: 42 [25600/50000 (51%)]   Loss: 0.779975
Epoch: 42 [38400/50000 (77%)]   Loss: 0.827765
Epoch  42/200 | Time: 55.41s | Train Loss: 0.8295 | Train Acc: 86.01% | Test Loss: 0.9369 | Test Acc: 81.37% | LR: 0.100000
Epoch: 43 [0/50000 (0%)]        Loss: 0.812640
Epoch: 43 [12800/50000 (26%)]   Loss: 0.851625
Epoch: 43 [25600/50000 (51%)]   Loss: 0.821735
Epoch: 43 [38400/50000 (77%)]   Loss: 0.775307
Epoch  43/200 | Time: 54.73s | Train Loss: 0.8267 | Train Acc: 86.21% | Test Loss: 0.8423 | Test Acc: 85.28% | LR: 0.100000
Epoch: 44 [0/50000 (0%)]        Loss: 0.806267
Epoch: 44 [12800/50000 (26%)]   Loss: 0.920151
Epoch: 44 [25600/50000 (51%)]   Loss: 1.011200
Epoch: 44 [38400/50000 (77%)]   Loss: 0.837567
Epoch  44/200 | Time: 54.33s | Train Loss: 0.8230 | Train Acc: 86.32% | Test Loss: 0.8519 | Test Acc: 84.80% | LR: 0.100000
Epoch: 45 [0/50000 (0%)]        Loss: 0.808053
Epoch: 45 [12800/50000 (26%)]   Loss: 0.894567
Epoch: 45 [25600/50000 (51%)]   Loss: 0.771580
Epoch: 45 [38400/50000 (77%)]   Loss: 0.710069
Epoch  45/200 | Time: 54.43s | Train Loss: 0.8222 | Train Acc: 86.32% | Test Loss: 0.8485 | Test Acc: 85.10% | LR: 0.100000
Epoch: 46 [0/50000 (0%)]        Loss: 0.789186
Epoch: 46 [12800/50000 (26%)]   Loss: 0.834816
Epoch: 46 [25600/50000 (51%)]   Loss: 0.878065
Epoch: 46 [38400/50000 (77%)]   Loss: 0.829138
Epoch  46/200 | Time: 54.23s | Train Loss: 0.8225 | Train Acc: 86.18% | Test Loss: 0.8551 | Test Acc: 84.87% | LR: 0.100000
Epoch: 47 [0/50000 (0%)]        Loss: 0.801989
Epoch: 47 [12800/50000 (26%)]   Loss: 0.917647
Epoch: 47 [25600/50000 (51%)]   Loss: 0.761201
Epoch: 47 [38400/50000 (77%)]   Loss: 0.815080
Epoch  47/200 | Time: 55.31s | Train Loss: 0.8207 | Train Acc: 86.42% | Test Loss: 0.8701 | Test Acc: 83.91% | LR: 0.100000
Epoch: 48 [0/50000 (0%)]        Loss: 0.852317
Epoch: 48 [12800/50000 (26%)]   Loss: 0.847572
Epoch: 48 [25600/50000 (51%)]   Loss: 0.910867
Epoch: 48 [38400/50000 (77%)]   Loss: 0.904468
Epoch  48/200 | Time: 54.46s | Train Loss: 0.8178 | Train Acc: 86.49% | Test Loss: 0.8769 | Test Acc: 83.87% | LR: 0.100000
Epoch: 49 [0/50000 (0%)]        Loss: 0.732170
Epoch: 49 [12800/50000 (26%)]   Loss: 0.854018
Epoch: 49 [25600/50000 (51%)]   Loss: 0.850230
Epoch: 49 [38400/50000 (77%)]   Loss: 0.806912
Epoch  49/200 | Time: 54.66s | Train Loss: 0.8210 | Train Acc: 86.41% | Test Loss: 0.9036 | Test Acc: 82.96% | LR: 0.100000
Epoch: 50 [0/50000 (0%)]        Loss: 0.855476
Epoch: 50 [12800/50000 (26%)]   Loss: 0.802625
Epoch: 50 [25600/50000 (51%)]   Loss: 0.878814
Epoch: 50 [38400/50000 (77%)]   Loss: 0.725013
Epoch  50/200 | Time: 54.50s | Train Loss: 0.8196 | Train Acc: 86.33% | Test Loss: 0.8937 | Test Acc: 83.23% | LR: 0.100000
Epoch: 51 [0/50000 (0%)]        Loss: 0.792211
Epoch: 51 [12800/50000 (26%)]   Loss: 0.828319
Epoch: 51 [25600/50000 (51%)]   Loss: 0.758899
Epoch: 51 [38400/50000 (77%)]   Loss: 0.966560
Epoch  51/200 | Time: 54.87s | Train Loss: 0.8188 | Train Acc: 86.33% | Test Loss: 0.9148 | Test Acc: 82.11% | LR: 0.100000
Epoch: 52 [0/50000 (0%)]        Loss: 0.814625
Epoch: 52 [12800/50000 (26%)]   Loss: 0.810442
Epoch: 52 [25600/50000 (51%)]   Loss: 0.809587
Epoch: 52 [38400/50000 (77%)]   Loss: 0.787845
Epoch  52/200 | Time: 54.67s | Train Loss: 0.8217 | Train Acc: 86.33% | Test Loss: 0.8117 | Test Acc: 86.61% | LR: 0.100000
Epoch: 53 [0/50000 (0%)]        Loss: 0.751319
Epoch: 53 [12800/50000 (26%)]   Loss: 0.826305
Epoch: 53 [25600/50000 (51%)]   Loss: 0.735571
Epoch: 53 [38400/50000 (77%)]   Loss: 0.803900
Epoch  53/200 | Time: 55.02s | Train Loss: 0.8160 | Train Acc: 86.40% | Test Loss: 0.8388 | Test Acc: 85.01% | LR: 0.100000
Epoch: 54 [0/50000 (0%)]        Loss: 0.884010
Epoch: 54 [12800/50000 (26%)]   Loss: 0.844160
Epoch: 54 [25600/50000 (51%)]   Loss: 0.874899
Epoch: 54 [38400/50000 (77%)]   Loss: 0.785140
Epoch  54/200 | Time: 55.07s | Train Loss: 0.8208 | Train Acc: 86.37% | Test Loss: 0.8847 | Test Acc: 83.48% | LR: 0.100000
Epoch: 55 [0/50000 (0%)]        Loss: 0.747851
Epoch: 55 [12800/50000 (26%)]   Loss: 0.796169
Epoch: 55 [25600/50000 (51%)]   Loss: 0.835408
Epoch: 55 [38400/50000 (77%)]   Loss: 1.008027
Epoch  55/200 | Time: 54.39s | Train Loss: 0.8207 | Train Acc: 86.27% | Test Loss: 0.8297 | Test Acc: 85.93% | LR: 0.100000

# 3

Consider a simple RNN cell defined by the following equations:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

where:

  • $x_t$ is the input at time step $t$;
  • $h_t$ is the hidden state at time step $t$ ($h_0$ is initialized to zeros);
  • $y_t$ is the output at time step $t$.

Given an input sequence $\pmb x = [x_1, x_2, x_3]$:

  1. Draw the computational graph by unrolling the RNN for all three time steps. Clearly show the inputs $x_t$, hidden states $h_t$, outputs $y_t$, and the sharing of weights $W_{xh}$, $W_{hh}$, and $W_{hy}$ across time.
  2. Using your graph, explain how the hidden state $h_3$ depends on the entire input sequence $[x_1, x_2, x_3]$. Why is the hidden state considered the "memory" of the network?

(1) The computational graph unrolled over three time steps is shown below.

[Figure: unrolled RNN. Each input $x_t$ feeds $h_t$ through the shared weights $W_{xh}$; the hidden states form a chain $h_0 \to h_1 \to h_2 \to h_3$ through the shared weights $W_{hh}$ (with bias $b_h$); each $h_t$ emits an output $y_t$ through the shared weights $W_{hy}$.]

(2) Expanding the computation of $h_3$:

$$\begin{aligned}h_3 &= \tanh(W_{xh} x_3 + W_{hh} h_2 + b_h) \\ &= \tanh(W_{xh} x_3 + W_{hh} \tanh(W_{xh} x_2 + W_{hh} h_1 + b_h) + b_h) \\ &= \tanh(W_{xh} x_3 + W_{hh} \tanh(W_{xh} x_2 + W_{hh} \tanh(W_{xh} x_1 + W_{hh} h_0 + b_h) + b_h) + b_h) \\ &= \tanh(W_{xh} x_3 + W_{hh} \tanh(W_{xh} x_2 + W_{hh} \tanh(W_{xh} x_1 + b_h) + b_h) + b_h)\end{aligned}$$

where the last line uses $h_0 = 0$.

Alternatively, from the graphical-model point of view, there are directed paths from each of $x_1, x_2, x_3$ to $h_3$, so $h_3$ depends on the entire input sequence $[x_1, x_2, x_3]$. The expansion above is written for $h_3$, but the same holds for every $h_t$: each hidden state aggregates information from all previous time steps, so it plays the role of remembering the past, which is why it is called the network's "memory". How that memory is shaped is learned through the weight matrices $W_{xh}$ and $W_{hh}$: the $\tanh$ activation re-encodes the information, while the weights determine what tends to be stored and what tends to be forgotten.
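
This dependence is easy to see by unrolling the recurrence in code (a minimal sketch with arbitrary dimensions and random weights, not part of the original answer):

```python
import torch

torch.manual_seed(0)
d_in, d_h = 3, 4
W_xh, W_hh = torch.randn(d_h, d_in), torch.randn(d_h, d_h)
b_h = torch.zeros(d_h)

xs = [torch.randn(d_in) for _ in range(3)]  # the sequence [x1, x2, x3]
h = torch.zeros(d_h)                        # h0 initialized to zeros
for x in xs:                                # the same weights are reused at every step
    h = torch.tanh(W_xh @ x + W_hh @ h + b_h)
print(h)  # h3, a function of all of x1, x2, x3
```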

# 4

The RNN network was described in the previous problem.

  1. During Backpropagation Through Time (BPTT), the gradient of the loss $L$ with respect to an early hidden state $h_1$ involves a long chain of derivatives. Write the expression for $\dfrac{\partial L}{\partial h_1}$ in terms of $\dfrac{\partial L}{\partial h_3}$, showing the chain rule through $h_2$.
  2. The tanh derivative is $\dfrac{\mathrm d}{\mathrm d h}\tanh(h) = 1 - \tanh^2(h)$, which is always $\leq 1$. Explain how repeated multiplication of these derivatives (and the weight matrix $W_{hh}$) during BPTT can cause the vanishing gradient problem.
  3. What is the negative consequence of vanishing gradients for an RNN's ability to learn long-range dependencies in sequences?

(1) By the chain rule on the computational graph, backpropagating through time gives

$$\dfrac{\partial L}{\partial h_1} = \dfrac{\partial L}{\partial h_3} \cdot \dfrac{\partial h_3}{\partial h_2} \cdot \dfrac{\partial h_2}{\partial h_1}$$

Since the $h_t$ are in general vectors, the products above are really Jacobian matrices and matrix multiplications. The intermediate factors are computed as follows. Writing $A = W_{xh} x_t + W_{hh} h_{t-1} + b_h$ and using the fact that $\tanh$ is applied element-wise,

$$\begin{aligned}\dfrac{\partial h_t}{\partial h_{t-1}} &= \dfrac{\partial}{\partial h_{t-1}} \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \\ &= \dfrac{\partial}{\partial A}\tanh(A) \cdot \dfrac{\partial A}{\partial h_{t-1}} \\ &= \mathrm{diag}(1 - \tanh^2(A)) \cdot W_{hh} \\ &= \mathrm{diag}(1 - h_t^2) \cdot W_{hh}\end{aligned}$$

Substituting these factors back into the chain rule gives

$$\dfrac{\partial L}{\partial h_1} = \dfrac{\partial L}{\partial h_3} \cdot \mathrm{diag}(1 - h_3^2)\, W_{hh} \cdot \mathrm{diag}(1 - h_2^2)\, W_{hh}$$
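
As a sanity check, the Jacobian formula $\partial h_t / \partial h_{t-1} = \mathrm{diag}(1 - h_t^2)\,W_{hh}$ can be verified numerically with autograd (a sketch with small random tensors, not part of the original answer):

```python
import torch

torch.manual_seed(0)
d = 3
W_xh, W_hh = torch.randn(d, d), torch.randn(d, d)
x_t, b_h = torch.randn(d), torch.randn(d)
h_prev = torch.randn(d)

f = lambda h: torch.tanh(W_xh @ x_t + W_hh @ h + b_h)
J = torch.autograd.functional.jacobian(f, h_prev)  # numerical dh_t/dh_{t-1}
h_t = f(h_prev)
print(torch.allclose(J, torch.diag(1 - h_t**2) @ W_hh, atol=1e-6))  # True
```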

(2) In BPTT the gradient takes the product form shown above, and every element of $1 - h_t^2$ is at most $1$. If this architecture is applied to a long sequence, the accumulated product makes the computed gradient decay exponentially with the number of time steps, and in finite-precision arithmetic it can underflow to $0$. This is the vanishing gradient problem.
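
The exponential decay is easy to demonstrate empirically (a sketch with random weights normalized to spectral norm $0.9 < 1$; the input terms are omitted since they do not enter the Jacobian chain):

```python
import torch

torch.manual_seed(0)
d_h = 4
W_hh = torch.randn(d_h, d_h)
W_hh *= 0.9 / torch.linalg.matrix_norm(W_hh, ord=2)  # spectral norm 0.9

for T in (5, 20, 50):
    h1 = torch.randn(d_h, requires_grad=True)
    h = h1
    for _ in range(T):
        h = torch.tanh(W_hh @ h)      # repeated tanh and W_hh, as in the product above
    h.sum().backward()
    print(T, h1.grad.norm().item())   # the gradient norm shrinks rapidly as T grows
```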

(3) Vanishing gradients prevent the network from effectively updating, via gradient descent, the parameters associated with early time steps, so the model struggles to learn and retain long-range dependencies; intuitively, its "memory capacity" is limited. As a result, an RNN processing long sequences may fail to capture important relationships between distant elements, hurting both performance and generalization. LSTM and GRU alleviate this by introducing gating mechanisms that turn the multiplicative gradient path into something closer to a linear (additive) combination.

# 5

Using the attached dataset (input.txt), train a language model using an RNN with GRU cells. Submit your model and the sampling results.

See Best_GRU_Model.pth and single_sample.txt in the attachments.
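
For reference, a character-level GRU language model typically follows the minimal pattern sketched below (the class name and hyperparameters are illustrative assumptions, not the submitted configuration):

```python
import torch.nn as nn

class CharGRU(nn.Module):
    """Minimal character-level GRU language model (illustrative only)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h=None):
        out, h = self.gru(self.embed(x), h)  # out: (batch, seq, hidden_dim)
        return self.fc(out), h               # logits over the next character
```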

# 6

Given the fundamental attention equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$$

Consider the following input matrices:

$$Q = \begin{bmatrix}1 & 2\\ 3 & 4\\ 5 & 6\end{bmatrix}, \quad K = \begin{bmatrix}2 & 1\\ 4 & 3\end{bmatrix}, \quad V = \begin{bmatrix}0.5 & 1.0\\ 1.5 & 2.0\end{bmatrix}$$

where $Q \in \mathbb R^{3\times 2}$, $K \in \mathbb R^{2\times 2}$, $V \in \mathbb R^{2\times 2}$, and $d_k = 2$.

  1. Compute the raw attention scores $QK^T$ and show all steps.
  2. Apply the scaling factor $\dfrac{1}{\sqrt{d_k}}$ to the raw scores.
  3. Calculate the attention weights by applying the softmax function to each row of the scaled scores. Show your work for at least one row.
  4. Compute the final output by multiplying the attention weights with $V$.
  5. What is the shape of the final output matrix? Explain why this shape makes sense given the inputs.

(1) Compute $QK^T$:

$$\begin{bmatrix}1 & 2\\ 3 & 4\\ 5 & 6\end{bmatrix}\begin{bmatrix}2 & 4\\ 1 & 3\end{bmatrix} = \begin{bmatrix}1\cdot 2 + 2\cdot 1 & 1\cdot 4 + 2\cdot 3\\ 3\cdot 2 + 4\cdot 1 & 3\cdot 4 + 4\cdot 3\\ 5\cdot 2 + 6\cdot 1 & 5\cdot 4 + 6\cdot 3\end{bmatrix} = \begin{bmatrix}4 & 10\\ 10 & 24\\ 16 & 38\end{bmatrix}$$

(2) Apply the scaling factor $\dfrac{1}{\sqrt 2}$:

$$\begin{bmatrix}4 & 10\\ 10 & 24\\ 16 & 38\end{bmatrix} \cdot \dfrac{1}{\sqrt 2} = \begin{bmatrix}\dfrac{4}{\sqrt 2} & \dfrac{10}{\sqrt 2}\\[10pt] \dfrac{10}{\sqrt 2} & \dfrac{24}{\sqrt 2}\\[10pt] \dfrac{16}{\sqrt 2} & \dfrac{38}{\sqrt 2}\end{bmatrix} \approx \begin{bmatrix}2.828 & 7.071\\ 7.071 & 16.971\\ 11.314 & 26.870\end{bmatrix}$$

(3) The softmax function is

$$\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$$

Applying the softmax to the first row:

$$\begin{aligned}\mathrm{softmax}\left(\left[\dfrac{4}{\sqrt 2},\ \dfrac{10}{\sqrt 2}\right]\right) &= \left[\dfrac{e^{4/\sqrt 2}}{e^{4/\sqrt 2} + e^{10/\sqrt 2}},\ \dfrac{e^{10/\sqrt 2}}{e^{4/\sqrt 2} + e^{10/\sqrt 2}}\right]\\[6pt] &= \left[\dfrac{1}{1 + e^{6/\sqrt 2}},\ \dfrac{e^{6/\sqrt 2}}{1 + e^{6/\sqrt 2}}\right]\\[6pt] &\approx [0.0142,\ 0.9858]\end{aligned}$$

Treating the remaining two rows in the same way gives the attention-weight matrix

$$A \approx \begin{bmatrix}0.0142 & 0.9858\\ 5.02\times 10^{-5} & 0.99995\\ 1.75\times 10^{-7} & 0.9999998\end{bmatrix}$$

(4) Compute the final output:

$$AV \approx \begin{bmatrix}0.0142 & 0.9858\\ 5.02\times 10^{-5} & 0.99995\\ 1.75\times 10^{-7} & 0.9999998\end{bmatrix}\begin{bmatrix}0.5 & 1.0\\ 1.5 & 2.0\end{bmatrix} \approx \begin{bmatrix}1.4858 & 1.9858\\ 1.49995 & 1.99995\\ 1.4999998 & 1.9999998\end{bmatrix}$$

(5) The final output matrix has shape $3\times 2$. This makes sense because the query matrix $Q$ has 3 rows (3 query vectors), while the value matrix $V$ has 2 columns (the dimension of each value vector). The attention output should have one row per query and preserve the value dimension, so a $3\times 2$ output is exactly what we expect.
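
The hand computation can be verified directly (a short PyTorch check, not part of the original answer):

```python
import torch
import torch.nn.functional as F

Q = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])
K = torch.tensor([[2., 1.], [4., 3.]])
V = torch.tensor([[0.5, 1.0], [1.5, 2.0]])

scores = Q @ K.T / Q.shape[-1] ** 0.5  # scaled scores, shape (3, 2)
A = F.softmax(scores, dim=-1)          # attention weights, each row sums to 1
print(A)                               # ≈ [[0.0142, 0.9858], ...]
print(A @ V)                           # ≈ [[1.4858, 1.9858], [1.5, 2.0], [1.5, 2.0]]
```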

# 7

The multi-head attention mechanism is defined as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$$

where each head is computed as:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Given:

  • $Q, K, V \in \mathbb R^{4\times 8}$ (sequence length 4, embedding dimension 8),
  • Number of heads $h = 2$,
  • $d_k = d_v = d_{\mathrm{model}}/h = 4$.
  1. What must be the dimensions of the projection matrices $W_i^Q, W_i^K, W_i^V$, and $W^O$?
  2. If the output of the concatenated heads is $C \in \mathbb R^{4\times 8}$, what operations are needed to ensure the final output has the same dimensions as the input?
  3. Write the complete mathematical expression for one head of the multi-head attention, showing all matrix dimensions.

(1) With $h = 2$ heads, each head has dimension $d_k = d_v = 4$, so the projection matrices map the token representations as follows:

  • $W_i^Q$ projects the queries from dimension 8 to dimension 4, so $W_i^Q \in \mathbb R^{8\times 4}$;
  • $W_i^K$ projects the keys from dimension 8 to dimension 4, so $W_i^K \in \mathbb R^{8\times 4}$;
  • $W_i^V$ projects the values from dimension 8 to dimension 4, so $W_i^V \in \mathbb R^{8\times 4}$;
  • $W^O$ projects the concatenated heads from dimension 8 back to dimension 8, so $W^O \in \mathbb R^{8\times 8}$.

(2) The concatenated heads $C$ already form a $4\times 8$ matrix, matching the dimensions of the inputs $Q, K, V$, so no extra reshaping is needed. Multiplying $C$ by the output projection matrix $W^O \in \mathbb R^{8\times 8}$ gives the final output while keeping the shape $4\times 8$.

(3) For one head $\mathrm{head}_i$, the complete expression, followed by the dimension bookkeeping, is

$$\begin{aligned}\mathrm{head}_i &= \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)\\[6pt] &= \mathrm{softmax}\left(\dfrac{(QW_i^Q)(KW_i^K)^T}{\sqrt{d_k}}\right)(VW_i^V)\end{aligned}$$

where $QW_i^Q$, $KW_i^K$, and $VW_i^V$ are each $(4\times 8)(8\times 4) = (4\times 4)$; the score matrix $(QW_i^Q)(KW_i^K)^T$ is therefore $(4\times 4)(4\times 4)^T = (4\times 4)$, the softmax preserves this shape, and multiplying by $VW_i^V$ gives $(4\times 4)(4\times 4) = (4\times 4)$. Concatenating the $h = 2$ heads then yields $C \in \mathbb R^{4\times 8}$.
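
A quick shape check of this bookkeeping (random weights; both heads reuse the same matrices here since only the shapes matter):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = K = V = torch.randn(4, 8)                        # sequence length 4, d_model = 8
W_q, W_k, W_v = (torch.randn(8, 4) for _ in range(3))
W_o = torch.randn(8, 8)

def head(Q, K, V, d_k=4):
    scores = (Q @ W_q) @ (K @ W_k).T / d_k ** 0.5    # (4, 4)
    return F.softmax(scores, dim=-1) @ (V @ W_v)     # (4, 4)

C = torch.cat([head(Q, K, V), head(Q, K, V)], dim=-1)  # concat h=2 heads: (4, 8)
print((C @ W_o).shape)                               # torch.Size([4, 8]), same as the input
```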

# 8

Consider the gradient flow through the attention mechanism. Let $L$ be the loss function.

  1. Using the chain rule, express $\dfrac{\partial L}{\partial Q}$ in terms of
    • $\dfrac{\partial L}{\partial \mathrm{Output}}$,
    • the attention weights matrix $A = \mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)$,
    • the value matrix $V$,
    • the key matrix $K$.
  2. The softmax derivative for a vector $z$ is given below, where $\delta_{ij}$ is the Kronecker delta. How does this affect gradient computation in attention?

$$\dfrac{\partial\, \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i \left(\delta_{ij} - \mathrm{softmax}(z)_j\right)$$

  3. Explain why the scaling factor $\dfrac{1}{\sqrt{d_k}}$ helps with gradient stability during training.

(1) By the chain rule (written informally; the factors below are shorthand for the underlying entry-wise Jacobians rather than literal matrix products),

$$\begin{aligned}\dfrac{\partial L}{\partial Q} &= \dfrac{\partial L}{\partial \mathrm{Output}} \cdot \dfrac{\partial \mathrm{Output}}{\partial A} \cdot \dfrac{\partial A}{\partial Q}\\[6pt] &= \dfrac{\partial L}{\partial \mathrm{Output}} \cdot V^T \cdot \dfrac{\partial}{\partial Q}\,\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)\\[6pt] &= \dfrac{\partial L}{\partial \mathrm{Output}} \cdot V^T \cdot \mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)\left(I - \mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)^{T}\right) \cdot \dfrac{K}{\sqrt{d_k}}\end{aligned}$$
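
In practice this gradient is rarely formed by hand; autograd chains through the matmuls, scaling, and softmax automatically. A small sketch (random tensors; a plain sum stands in for the loss):

```python
import torch

torch.manual_seed(0)
Q = torch.randn(3, 2, requires_grad=True)
K, V = torch.randn(2, 2), torch.randn(2, 2)
d_k = 2

A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # attention weights
L = (A @ V).sum()                                # stand-in scalar loss on the output
L.backward()                                     # chains through matmul, softmax, scaling
print(Q.grad)                                    # dL/dQ, shape (3, 2)
```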

(2) The softmax derivative shows that the gradient with respect to each score depends not only on that element's own softmax value but also on the softmax values of all the other elements: the attention weights are coupled during backpropagation. In particular, when the softmax saturates (one weight near $1$ and the rest near $0$), every entry $\mathrm{softmax}(z)_i(\delta_{ij} - \mathrm{softmax}(z)_j)$ is close to $0$, so very little gradient flows through the attention scores. This can make gradient computation unstable, especially with large-scale data or long sequences.

(3) The scaling factor $\dfrac{1}{\sqrt{d_k}}$ helps stabilize gradients because it reduces the magnitude of the dot products in $QK^T$ and keeps them from growing too large. If $q$ and $k$ have independent zero-mean, unit-variance entries, the dot product $q^T k$ has variance $d_k$, so dividing by $\sqrt{d_k}$ brings it back to unit variance. Overly large dot products push the softmax toward extreme outputs (close to $0$ or $1$), where its derivative is nearly zero and gradients vanish. With the scaling, the dot products stay in a reasonable range, the softmax output remains better spread out, gradients propagate stably, and the parameters can be updated effectively.
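
A small experiment illustrates the effect (a sketch, assuming random unit-variance queries and keys):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q, K = torch.randn(d_k), torch.randn(10, d_k)

raw = K @ q                                      # dot products with variance ~ d_k
print(F.softmax(raw, dim=0).max())               # close to 1: saturated, near-zero softmax gradient
print(F.softmax(raw / d_k ** 0.5, dim=0).max())  # scaled: noticeably less peaked
```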