Regularization

Difficulty: intermediate

1. Prediction

A football team has one star player who scores 90% of its goals. When the star is injured, the team loses game after game. What should the coach do?

2. Explore

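[Interactive figure: choose a regularization technique. Initial state, no regularization: the weights are unconstrained, which overfits easily with a large network and little data. Readouts: weights equal to 0 (0/12), mean |w| (0.950), max |w| (2.100), ‖w‖₂ (3.873).]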

L1 vs L2 vs ElasticNet: sparse vs smooth

[Interactive figure: 100 features, only 10 of which truly matter. Panels: L1 (Lasso), L2 (Ridge), ElasticNet (L1 + L2), each reporting the number of non-zero weights and the RMSE.]

On a sparse problem L1 wins because it pushes the redundant features exactly to 0, automatically discarding the 90 that carry no signal. L2 shrinks every weight evenly, so it eliminates no feature.

Dropout visualization: neurons are switched off at random on every step

[Interactive figure: training step 1, p = 0.4. A different set of neurons is dropped at each step.]

At inference, ALL neurons are active. The output scale is already correct because PyTorch uses inverted dropout: activations are divided by (1-p) at training time, and nothing is done at eval time.

3. Aha moment

Regularization is a way of telling a neural network "don't get too complicated". L1 removes redundant weights (cutting bench players), L2 spreads the roles evenly (everyone plays), and Dropout hands out random days off (the team stays strong in any lineup). BatchNorm normalizes the pressure, Data Augmentation widens the training ground, Early Stopping calls the team in when it starts to tire, and Label Smoothing reminds players not to be overconfident.

4. Challenge 1

You train with Dropout(0.5). When you deploy the model to users (inference), how does Dropout behave?

5. Detailed explanation

Regularization is a family of techniques that add constraints to training so the model does not fit the training data too closely (overfitting). The shared principle is the bias-variance tradeoff: accept a small increase in bias in exchange for a large reduction in variance. The general form is:

L_{\text{total}}(\theta) = L_{\text{data}}(\theta) + \lambda \cdot R(\theta)

Here $L_{\text{data}}$ is the main loss (MSE, cross-entropy, ...), $R(\theta)$ is the regularizer, and $\lambda$ controls its strength. The different techniques amount to different choices of $R$, or different ways of injecting it into the training pipeline.

L1 Regularization (Lasso):

R_{L1}(\theta) = \sum_{i} |w_i|, \quad L = L_{\text{data}} + \lambda \sum_{i}|w_i|

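The (sub)gradient of $|w|$ is $\text{sign}(w)$, which does not depend on magnitude. This puts a constant pressure toward 0 on every weight, including small ones. The result is sparsity (many $w_i = 0$), so the model performs feature selection automatically. Exact zeros come from the proximal form of the update: after the gradient step on $L_{\text{data}}$, each weight is soft-thresholded with step size $\eta$, whereas a plain subgradient step tends to oscillate around 0 without landing on it: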

w \leftarrow \text{sign}(w) \cdot \max(|w| - \eta\lambda, 0)
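The soft-thresholding update above can be sketched in a few lines. This is a minimal illustration, not a full Lasso solver; the helper name `soft_threshold` and the values of `eta` and `lam` are assumptions chosen for the example.

```python
import torch

def soft_threshold(w: torch.Tensor, threshold: float) -> torch.Tensor:
    """Shrink every weight toward 0 by `threshold`; weights smaller than it become 0."""
    return torch.sign(w) * torch.clamp(w.abs() - threshold, min=0.0)

w = torch.tensor([0.50, -0.03, 0.02, -1.20])
eta, lam = 0.1, 0.5                      # hypothetical step size and L1 strength
w_new = soft_threshold(w, eta * lam)     # threshold = eta * lam = 0.05

print(w_new)                             # the two small weights are zeroed
print(int((w_new == 0).sum()), "weights are exactly zero")
```

Large weights lose a fixed amount (0.05 here) regardless of their size, while weights below the threshold are set to exactly 0. That fixed-size pull is what separates L1 from the proportional shrinkage of L2.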

L2 Regularization (Ridge / Weight Decay):

R_{L2}(\theta) = \frac{1}{2}\sum_{i} w_i^2, \quad L = L_{\text{data}} + \frac{\lambda}{2} \sum_{i}w_i^2

The gradient of $\tfrac{1}{2}w^2$ is $w$, so larger weights are pulled harder. The update takes the form of a multiplicative decay:

w \leftarrow w(1 - \eta\lambda) - \eta \nabla L_{\text{data}}

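It shrinks all weights proportionally but rarely drives them to exactly 0, so roles are spread evenly. It is the most widely used technique in practice because it is easy to tune and every common optimizer supports it. For plain SGD the L2 penalty and weight decay coincide; for adaptive optimizers such as Adam they do not, which is why AdamW (Adam + decoupled weight decay) is the usual modern default.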
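As a quick check of the multiplicative-decay form, the sketch below compares a hand-written step with one step of `torch.optim.SGD` using `weight_decay`. The toy loss `data_loss` and the values of `eta` and `lam` are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
eta, lam = 0.1, 0.01                       # hypothetical learning rate and decay strength
w = torch.nn.Parameter(torch.randn(5))
x = torch.randn(5)

def data_loss(weights: torch.Tensor) -> torch.Tensor:
    return (weights * x).sum() ** 2        # toy stand-in for L_data

# Manual update: w <- w(1 - eta*lam) - eta * grad L_data
grad = torch.autograd.grad(data_loss(w), w)[0]
w_manual = w.detach() * (1 - eta * lam) - eta * grad

# The same step through SGD, which adds lam * w to the gradient internally
opt = torch.optim.SGD([w], lr=eta, weight_decay=lam)
opt.zero_grad()
data_loss(w).backward()
opt.step()

matches = torch.allclose(w.detach(), w_manual, atol=1e-6)
print("manual step matches SGD weight_decay:", matches)
```

The equivalence holds for plain SGD without momentum. With Adam the penalty gradient would be rescaled by the adaptive step size, which is the mismatch AdamW removes by applying the decay directly to the weights.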

ElasticNet (L1 + L2):

R_{EN}(\theta) = \alpha \sum_i |w_i| + (1-\alpha) \frac{1}{2} \sum_i w_i^2

Combines the sparsity of L1 with the stability of L2, with $\alpha \in [0, 1]$ setting the mix. It is especially useful when features are correlated: L1 alone tends to pick one feature from a correlated group more or less arbitrarily and drop the rest, while ElasticNet tends to keep the whole group and shrink its members together.

L1 vs L2 vs ElasticNet: when to use which

Sparse problem: only a small fraction of the features truly matter → L1 wins (automatic feature selection). Examples: genomics (tens of thousands of genes, only a few dozen relevant), text classification (millions of n-grams).

Smooth problem: every feature contributes a little, roughly equally → L2 wins (even shrinkage). Examples: image classification with a CNN (each weight contributes slightly to a feature detector) and many standard deep learning tasks.

Correlated features: groups of correlated features (for example several variables measuring the same phenomenon) → ElasticNet. L1 alone keeps an arbitrary single member of the group; ElasticNet keeps the group because the L2 term stabilizes the solution.
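The sparse-versus-smooth contrast can be reproduced on synthetic data. The sketch below is an assumption-laden toy: 100 features of which 10 are informative, Ridge solved in closed form, and Lasso solved with ISTA (a gradient step followed by soft-thresholding). The data, the penalty strength `lam`, and the iteration count are all hypothetical choices.

```python
import torch

torch.manual_seed(0)
n, d, k = 200, 100, 10                     # samples, features, truly useful features
X = torch.randn(n, d)
w_true = torch.zeros(d)
w_true[:k] = 3.0 * torch.randn(k)
y = X @ w_true + 0.1 * torch.randn(n)

lam = 0.1                                  # hypothetical penalty strength
gram = X.T @ X / n

# Ridge: closed-form minimizer of (1/2n)||Xw - y||^2 + (lam/2)||w||^2
w_ridge = torch.linalg.solve(gram + lam * torch.eye(d), X.T @ y / n)

# Lasso via ISTA: gradient step on the data loss, then soft-thresholding
step = 1.0 / torch.linalg.eigvalsh(gram).max()   # 1 / Lipschitz constant of the gradient
w_lasso = torch.zeros(d)
for _ in range(500):
    z = w_lasso - step * (X.T @ (X @ w_lasso - y) / n)
    w_lasso = torch.sign(z) * torch.clamp(z.abs() - step * lam, min=0.0)

nnz_lasso = int((w_lasso != 0).sum())
nnz_ridge = int((w_ridge != 0).sum())
print("non-zero weights, Lasso:", nnz_lasso, "Ridge:", nnz_ridge)
```

Ridge leaves every coefficient non-zero, only smaller, while the thresholding step in ISTA sets most of the uninformative coefficients to exactly 0.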

Dropout:

h^{(l)}_{\text{train}} = m \odot h^{(l)}, \quad m_i \sim \text{Bernoulli}(1-p)

On every training forward pass a random mask $m$ is sampled, so on average a fraction $p$ of the neurons is switched off (output = 0). At inference every neuron is active. To keep the expected output the same, PyTorch uses inverted dropout: it divides by $(1-p)$ at training time instead of multiplying by $(1-p)$ at eval time. Dropout can be viewed as approximately averaging the $2^n$ subnetworks that share weights, a very large and cheap form of model averaging.
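A hand-written version of inverted dropout makes the train/eval asymmetry concrete. This is a sketch of the idea, not PyTorch's internal implementation; the function name `inverted_dropout` is an assumption, and in practice `nn.Dropout` does this for you.

```python
import torch

def inverted_dropout(h: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Zero each unit with probability p and rescale the survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h                                          # inference: identity
    mask = torch.bernoulli(torch.full_like(h, 1.0 - p))   # m_i ~ Bernoulli(1-p)
    return h * mask / (1.0 - p)

torch.manual_seed(0)
h = torch.ones(100_000)
out_train = inverted_dropout(h, p=0.4, training=True)
out_eval = inverted_dropout(h, p=0.4, training=False)

print("train mean:", out_train.mean().item())             # stays close to 1.0
print("fraction dropped:", (out_train == 0).float().mean().item())  # close to p
print("eval unchanged:", torch.equal(out_eval, h))
```

Because the rescaling happens during training, the inference path needs no correction at all, which is why forgetting `model.eval()` changes predictions but forgetting a scale factor is not a risk.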

DropConnect: Instead of dropping neurons (outputs), drop connections (weights). The mask is applied to $W$ rather than to the activations. It is more general than Dropout but uses more memory, because every connection needs its own mask entry. It is rare in practice because Dropout is simpler and usually good enough.

BatchNorm as a regularizer: BatchNorm normalizes activations over the batch; it was originally motivated as a way to reduce internal covariate shift. Its regularizing effect comes from $\mu_B, \sigma_B$ being computed on a random batch: different batches give slightly different statistics, which injects mild noise into every forward pass. The noise is dropout-like, so the dropout rate can often be lowered when BN is used.
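The batch-dependent noise can be observed directly: the same sample gets a different normalized output depending on its batch-mates in train mode, but not in eval mode. The sketch below uses a bare `nn.BatchNorm1d` layer and random tensors as hypothetical inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(1, 4)                          # one fixed sample
batch_a = torch.cat([x, torch.randn(7, 4)])    # same sample, different batch-mates
batch_b = torch.cat([x, torch.randn(7, 4)])

bn.train()                                     # normalize with per-batch statistics
train_differs = not torch.allclose(bn(batch_a)[0], bn(batch_b)[0])

bn.eval()                                      # normalize with running statistics
eval_same = torch.allclose(bn(batch_a)[0], bn(batch_b)[0])

print("train-mode output depends on the batch:", train_differs)
print("eval-mode output is batch-independent:", eval_same)
```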

Data Augmentation: Increase the effective number of training samples with label-preserving transformations:

\tilde{x} = T(x), \quad T \in \mathcal{T}_{\text{invariance}}

For images: flip, crop, rotate, color jitter, cutout, mixup, CutMix (the last two also blend the labels). For audio: pitch shift, time stretch, SpecAugment. For text: synonym replacement, back-translation. This is arguably the STRONGEST regularization technique: when augmentation covers the variation in the data well, L2 or dropout is sometimes unnecessary.

Early Stopping:

\theta^* = \theta_{t^*}, \quad t^* = \arg\min_{t} L_{\text{val}}(\theta_t)

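Monitor the validation loss every epoch and stop training once it stops decreasing (allowing some patience). In theory early stopping acts like an implicit L2 penalty, although the correspondence is exact only in simplified settings such as linear models with a quadratic loss: limited training time means limited effective capacity.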

Label Smoothing: Instead of a hard one-hot target, use a soft target:

y^{\text{smooth}}_k = (1-\alpha) \cdot y^{\text{onehot}}_k + \frac{\alpha}{K}

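With $K$ classes and $\alpha \approx 0.1$ (here $\alpha$ is the smoothing strength, unrelated to the ElasticNet mixing weight). This keeps the output from becoming overconfident (logit → ∞) and often improves calibration and generalization. It is common in CV (ResNet, Vision Transformer) and NLP (Transformer). With cross-entropy and a one-hot target, the loss only approaches 0 as the correct logit grows without bound relative to the others, so the model keeps pushing logits toward ±∞, which is unstable in finite precision.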
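A small worked example of the smoothing formula, assuming $K = 4$ classes and $\alpha = 0.1$: the true class gets $0.9 + 0.1/4 = 0.925$ and every other class gets $0.1/4 = 0.025$. The sketch also checks the hand-computed loss against `nn.CrossEntropyLoss(label_smoothing=...)` (available since PyTorch 1.10); the logits are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, alpha = 4, 0.1
target = torch.tensor([2])                             # true class index
one_hot = F.one_hot(target, num_classes=K).float()
smooth = (1 - alpha) * one_hot + alpha / K             # soft target from the formula

torch.manual_seed(0)
logits = torch.randn(1, K)                             # placeholder model output
manual = -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
builtin = nn.CrossEntropyLoss(label_smoothing=alpha)(logits, target)

print("soft target:", smooth)
print("manual loss:", manual.item(), "built-in loss:", builtin.item())
```

The soft target still sums to 1, so it remains a valid probability distribution.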

Summary: the techniques and when to use them

L1: sparse problems, feature selection, models that must be interpretable.

L2 / Weight Decay: the default for deep models; almost always on (λ ≈ 1e-4 to 1e-2).

ElasticNet: highly correlated features (for example statistics, biomedical data).

Dropout: fully-connected layers, RNNs. Less common in CNNs once BN is used.

DropConnect: rare in practice; use it when Dropout is not enough and you want finer-grained noise.

BatchNorm: the default for CNNs. LayerNorm for Transformers/RNNs.

Data Augmentation: use it whenever possible; strongest in CV.

Early Stopping: use it almost always; cheap and safe.

Label Smoothing: classification with many classes; helps calibration.

regularization_full.py

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms

# ───────────────────────────────────────────────────────────────
# 1) L2 weight decay: set it on the optimizer
#    (weight decay done properly with Adam = AdamW, i.e. decoupled WD)
#    `model` is assumed to be defined already, e.g. as in section 4.
# ───────────────────────────────────────────────────────────────
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,  # λ: decoupled weight-decay strength
)

# ───────────────────────────────────────────────────────────────
# 2) L1 regularization: add it to the loss by hand
# ───────────────────────────────────────────────────────────────
def l1_penalty(model, lambda_l1=1e-5):
    # λ * Σ|w| over every trainable parameter (biases included)
    return lambda_l1 * sum(
        p.abs().sum() for p in model.parameters() if p.requires_grad
    )

# Inside the training loop (criterion, output, target come from that loop):
loss = criterion(output, target) + l1_penalty(model)

# ───────────────────────────────────────────────────────────────
# 3) ElasticNet: combine L1 + L2
# ───────────────────────────────────────────────────────────────
def elastic_penalty(model, l1=1e-5, l2=1e-4, alpha=0.5):
    # alpha mixes the two terms; l1 and l2 are their separate strengths
    l1_term = sum(p.abs().sum() for p in model.parameters())
    l2_term = 0.5 * sum((p ** 2).sum() for p in model.parameters())  # ½ Σ w², as in R_EN
    return alpha * l1 * l1_term + (1 - alpha) * l2 * l2_term

# ───────────────────────────────────────────────────────────────
# 4) Dropout + BatchNorm in the model definition
# ───────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # BN before the activation
    nn.ReLU(),
    nn.Dropout(0.3),       # drop 30% of units after the activation
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 10),
)

# Important: switch Dropout and BN between train and eval mode
model.train()  # before training: Dropout ON, BN uses batch stats
model.eval()   # before validation/inference: Dropout OFF, BN uses running stats

# ───────────────────────────────────────────────────────────────
# 5) Data Augmentation (torchvision)
# ───────────────────────────────────────────────────────────────
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),  # Cutout-style erasing; works on tensors, so it follows ToTensor
])

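# Mixup (call inside the training loop)
def mixup_data(x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing weight in [0, 1]
    idx = torch.randperm(x.size(0), device=x.device)              # random pairing within the batch
    mixed_x = lam * x + (1 - lam) * x[idx]
    y_a, y_b = y, y[idx]
    # Train with: lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
    return mixed_x, y_a, y_b, lam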
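The mixup helper returns two label tensors and a mixing weight, so the loss has to be blended the same way as the inputs. The sketch below shows one such training step; the tiny `nn.Linear` model and the random batch are stand-ins, and `mixup_data` is repeated so the block runs on its own.

```python
import torch
import torch.nn as nn

def mixup_data(x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam

torch.manual_seed(0)
model = nn.Linear(20, 3)                       # toy stand-in for a real network
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))

mixed_x, y_a, y_b, lam = mixup_data(x, y)
logits = model(mixed_x)
# Blend the two losses with the same weight used to blend the inputs
loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
loss.backward()

print("lam:", lam, "loss:", loss.item())
```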

early_stopping_label_smoothing.py

# ───────────────────────────────────────────────────────────────
# 6) Early Stopping helper
# ───────────────────────────────────────────────────────────────
class EarlyStopping:
    def __init__(self, patience=7, min_delta=0.0, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.counter = 0
        self.best_loss = float("inf")
        self.best_state = None
        self.should_stop = False

    def step(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_state = {
                    k: v.detach().clone()
                    for k, v in model.state_dict().items()
                }
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
                if self.restore_best and self.best_state is not None:
                    model.load_state_dict(self.best_state)

# Usage (train_one_epoch and evaluate are placeholders for your own loops):
stopper = EarlyStopping(patience=5)
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    stopper.step(val_loss, model)
    if stopper.should_stop:
        print(f"Early stopping at epoch {epoch}")
        break

# ───────────────────────────────────────────────────────────────
# 7) Label Smoothing
# ───────────────────────────────────────────────────────────────
# Option 1: built in since PyTorch 1.10
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Option 2: hand-written (usable inside a custom loss)
def smooth_cross_entropy(logits, target, smoothing=0.1):
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft target: (1 - α) * one-hot + α / K, matching the formula and Option 1
    with torch.no_grad():
        true_dist = torch.full_like(log_probs, smoothing / n_classes)
        true_dist.scatter_(
            1, target.unsqueeze(1), 1 - smoothing + smoothing / n_classes
        )
    return -(true_dist * log_probs).sum(dim=-1).mean()

# ───────────────────────────────────────────────────────────────
# 8) DropConnect (not built into PyTorch, so implemented by hand)
# ───────────────────────────────────────────────────────────────
class DropConnectLinear(nn.Module):
    def __init__(self, in_features, out_features, p=0.3):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.p = p

    def forward(self, x):
        if self.training and self.p > 0:
            mask = torch.bernoulli(
                torch.full_like(self.linear.weight, 1 - self.p)
            )
            W = self.linear.weight * mask / (1 - self.p)  # inverted
            return F.linear(x, W, self.linear.bias)
        return self.linear(x)

The most common anti-overfitting combo

L2 (AdamW weight_decay=1e-4) + Dropout(0.1-0.3) + BatchNorm (LayerNorm in Transformers) + Data Augmentation + Early Stopping + Label Smoothing(0.1) is the standard "six-pack" for large CNNs/Transformers. Each technique regularizes through a different channel (weight magnitude, activation noise, batch-statistics noise, input diversity, training time, output sharpness), so they do NOT replace one another; they complement one another.

When regularization is not enough

If the validation loss stays far above the training loss (heavy overfitting) even with all of these techniques in place, the usual causes are:

1. Not enough data: collect more; that is the strongest "regularization" of all.

2. Model too large: reduce depth/width; not every problem needs ResNet-152.

3. Data leakage: the validation set leaks into training (duplicates, feature lookahead), so validation numbers can no longer be trusted.

4. Distribution mismatch: train and val/test do not come from the same distribution; regularization cannot fix that.
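For the duplicate form of data leakage in point 3, a first check is to look for samples that appear in both splits. The sketch below only catches exact duplicates and uses tiny made-up tensors; the helper name `sample_keys` is an assumption, and near-duplicates (resized or re-encoded copies) would need a fuzzier comparison.

```python
import torch

def sample_keys(data: torch.Tensor) -> set:
    """Turn each sample into a hashable tuple so exact duplicates compare equal."""
    flat = data.reshape(data.size(0), -1)
    return {tuple(row.tolist()) for row in flat}

# Hypothetical splits: the second training sample was copied into validation
train = torch.arange(12, dtype=torch.float32).reshape(4, 3)
val = torch.cat([train[1:2], torch.full((2, 3), -1.0)])

overlap = sample_keys(train) & sample_keys(val)
print("samples shared by train and val:", len(overlap))
```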

Real-world applications

Computer Vision: ResNet/ViT setups typically combine weight decay 1e-4 + label smoothing 0.1 + heavy augmentation (RandAugment, Mixup, CutMix). Dropout is uncommon in modern CNNs (BN replaces most of it), but ViT still uses Dropout in attention and the MLP.

NLP: Transformers use dropout 0.1 in attention and feedforward layers, label smoothing in seq2seq, weight decay 0.01-0.1, and early stopping on a dev metric.

Tabular: L1/L2 are the main tools (scikit-learn LogisticRegression, Ridge, Lasso). XGBoost/LightGBM expose reg_alpha and reg_lambda, the L1 and L2 penalties on leaf weights.

Vietnamese: VinAI's PhoBERT training reportedly used weight decay 0.01, dropout 0.1, and data augmentation via back-translation (Vietnamese → English → Vietnamese).

Pitfalls: common mistakes

Regularizing BN params: Do not apply weight decay to the $\gamma, \beta$ parameters of BatchNorm or LayerNorm; put them in a separate param group with wd=0.

Dropout before softmax: Don't. Dropout applied to the output layer's logits destroys probabilistic information. Put dropout in the hidden layers.

Forgetting model.eval(): Very common. Inference with dropout still on gives random outputs and an inexplicable drop in accuracy.

Strong L1 when the features are not sparse: L1 also removes important features if each contributes only a little.

Early stopping without restoring the best weights: Stopping at epoch N without restoring the weights from the best epoch throws away much of the benefit.

Increasing dropout when underfitting: Dropout reduces effective capacity. If the model has not learned enough yet (high training loss), more dropout makes it worse.
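The first pitfall (weight decay on normalization parameters) is usually avoided with optimizer param groups. The sketch below is one possible way to build them; the small model is a stand-in, and putting biases in the no-decay group as well is a common convention assumed here, not something the text above requires.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 10),
)

norm_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)
decay, no_decay = [], []
for module in model.modules():
    # recurse=False: visit only the parameters owned directly by this module
    for name, param in module.named_parameters(recurse=False):
        if isinstance(module, norm_types) or name == "bias":
            no_decay.append(param)      # gamma/beta of norm layers, plus biases (assumed convention)
        else:
            decay.append(param)         # Linear weight matrices

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
print(len(decay), "tensors with decay,", len(no_decay), "without")
```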

6. Challenge 2

You have 100 features but suspect only 10 are actually useful. Which regularization should you use?

7. Summary

Regularization: key takeaways

Regularization = adding constraints to training to prevent overfitting, trading a little bias for a large reduction in variance.
L1 (Lasso): sparse, feature selection. L2 (Ridge/weight decay): smooth, the default. ElasticNet: the combination, for correlated features.
Dropout: randomly switches off a fraction p of the neurons during training; all neurons are active at inference. DropConnect: drops weights instead of neurons.
BatchNorm has a mild regularizing effect thanks to noise in the batch statistics. Data Augmentation is arguably the STRONGEST regularizer, acting through the input.
Early Stopping halts training when the validation loss stops improving, which acts like an implicit L2 penalty. Label Smoothing keeps outputs from becoming overconfident and improves calibration.
Standard combo: AdamW(wd=1e-4) + Dropout(0.1-0.3) + BN + Augmentation + EarlyStopping + LabelSmoothing(0.1); the techniques complement rather than replace one another.

8. Quiz

Check your understanding

Question 1/8

L1 regularization produces sparse weights. What is the benefit?

Related topics

Overfitting & Underfitting: the two ways a model can learn wrongly
Loss Functions: the model's score
Batch Normalization: normalizing per batch
