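Gradient Descent in Practice

Learning Objectives

After completing this lesson, you will be able to:

• Explain the core idea of gradient descent intuitively
• Distinguish Batch, Mini-batch, and SGD, and implement each one yourself
• Explain how Momentum works and compare it with plain SGD
• Explain why the Adam optimizer is the usual default
• Explain why learning rate scheduling matters
• Apply practical tips you can use right away

Key Message

"Gradient descent is like walking down a mountain blindfolded. How you take each step determines both how fast you get down and where you end up!"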

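1. Analogy: Descending a Mountain Blindfolded

Imagine standing on a mountaintop in thick fog. You cannot see ahead, so the only thing you can sense is the slope (gradient) under your feet.

How do you get down the mountain?

Strategy: "Take one step at a time in the direction where the ground drops most steeply!"

This is exactly gradient descent.

Core formula: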

$w_{new} = w_{old} - \eta \cdot \nabla L$

$\text{new position} = \text{current position} - \text{learning rate} \times \text{gradient}$

Here $\eta$ is the learning rate and $\nabla L$ is the gradient of the loss function.
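
As a quick numeric check of the update rule, here is a minimal sketch on a made-up one-dimensional loss $L(w) = (w - 3)^2$. The loss, the starting point, and the learning rate are illustrative assumptions, not part of the lesson's experiments. Each step moves `w` against the gradient, so it walks toward the minimum at 3.

```python
def grad(w):
    # dL/dw for the assumed loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0      # assumed starting position
eta = 0.1    # assumed learning rate

history = [w]
for _ in range(50):
    w = w - eta * grad(w)   # w_new = w_old - eta * gradient
    history.append(w)

print("after 1 step:  ", history[1])
print("after 50 steps:", history[-1])
```

The first step moves from 0 to 0.6 because the gradient at 0 is -6; later steps shrink as the slope flattens near the minimum.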

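This raises three important questions:

• How much data should we use to compute the gradient? --> Batch vs Mini-batch vs SGD
• Should we remember the direction of past steps? --> Momentum
• Should the step size adjust itself automatically? --> Adam

Let's look at each in turn!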

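2. Batch vs Mini-batch vs SGD

Suppose we have 10,000 data points. The three approaches differ in how much of that data goes into each gradient computation.

Method	Data used per gradient computation	Pros	Cons
Batch GD	All 10,000 samples	Stable, exact gradient	Very slow per update, high memory use
SGD	1 sample	Fast updates	Noisy, zigzag path
Mini-batch GD	32 to 256 samples (one batch)	Balances speed and stability	Batch size must be chosen

Modern deep learning almost always uses Mini-batch GD! When people say "SGD", they usually mean the mini-batch version in practice.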
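
To make the trade-off concrete, the following sketch counts how many parameter updates one pass over the 10,000 samples produces for each batch size. The helper name `updates_per_epoch` is a hypothetical name introduced for this illustration only.

```python
import math

n_data = 10_000  # dataset size used in the text

def updates_per_epoch(n_samples, batch_size):
    # One parameter update per batch; the last batch may be smaller.
    return math.ceil(n_samples / batch_size)

for name, bs in [("Batch GD", n_data), ("Mini-batch (256)", 256),
                 ("Mini-batch (32)", 32), ("SGD", 1)]:
    n_updates = updates_per_epoch(n_data, bs)
    print(f"{name:18s} batch_size={bs:6d} -> {n_updates:6d} updates/epoch")
```

Batch GD gets a single (exact) update per epoch, SGD gets 10,000 noisy ones, and mini-batches sit in between.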

Try it: comparing the three methods directly

python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simple problem: data from y = 3x + 2 with added noise ---
n_data = 200
X = np.random.randn(n_data)
y_true = 3 * X + 2 + np.random.randn(n_data) * 0.5

# Loss function: MSE, with gradients computed by hand
def compute_loss_and_grad(X_batch, y_batch, w, b):
    pred = w * X_batch + b
    loss = np.mean((pred - y_batch) ** 2)
    grad_w = np.mean(2 * (pred - y_batch) * X_batch)
    grad_b = np.mean(2 * (pred - y_batch))
    return loss, grad_w, grad_b

# --- 1) Batch GD: use the full dataset for every update ---
w_batch, b_batch = 0.0, 0.0
lr = 0.05
batch_losses = []

for epoch in range(50):
    loss, gw, gb = compute_loss_and_grad(X, y_true, w_batch, b_batch)
    w_batch -= lr * gw
    b_batch -= lr * gb
    batch_losses.append(loss)

# --- 2) SGD: use one sample per update ---
w_sgd, b_sgd = 0.0, 0.0
sgd_losses = []

for epoch in range(50):
    indices = np.random.permutation(n_data)
    for i in indices:
        _, gw, gb = compute_loss_and_grad(X[i:i+1], y_true[i:i+1], w_sgd, b_sgd)
        w_sgd -= lr * gw
        b_sgd -= lr * gb
    loss = np.mean((w_sgd * X + b_sgd - y_true) ** 2)
    sgd_losses.append(loss)

# --- 3) Mini-batch GD: use batches of 32 samples ---
w_mini, b_mini = 0.0, 0.0
batch_size = 32
mini_losses = []

for epoch in range(50):
    indices = np.random.permutation(n_data)
    for start in range(0, n_data, batch_size):
        idx = indices[start:start+batch_size]
        _, gw, gb = compute_loss_and_grad(X[idx], y_true[idx], w_mini, b_mini)
        w_mini -= lr * gw
        b_mini -= lr * gb
    loss = np.mean((w_mini * X + b_mini - y_true) ** 2)
    mini_losses.append(loss)

# --- Comparison plot ---
plt.figure(figsize=(9, 5))
plt.plot(batch_losses, linewidth=2, label='Batch GD (all data)')
plt.plot(sgd_losses, linewidth=2, label='SGD (1 sample)')
plt.plot(mini_losses, linewidth=2, label='Mini-batch GD (32)')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Batch vs SGD vs Mini-batch Gradient Descent', fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('Final weights:')
print('  Batch GD:     w={:.3f}, b={:.3f}'.format(w_batch, b_batch))
print('  SGD:          w={:.3f}, b={:.3f}'.format(w_sgd, b_sgd))
print('  Mini-batch:   w={:.3f}, b={:.3f}'.format(w_mini, b_mini))
print('  True values:  w=3.000, b=2.000')

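All three methods end up close to the true values (w=3, b=2), but the paths they take to get there differ!

3. Momentum: Like a Ball Rolling Downhill

Analogy: rolling a ball

Plain SGD looks only at "the gradient at this very moment" for each step. In a narrow valley it therefore tends to zigzag and descend slowly.

Momentum is different. It behaves like a ball rolling down the mountain:

• The ball speeds up on a downhill stretch (momentum builds up)
• It can carry over a small uphill bump on momentum alone
• It moves more smoothly, with less zigzagging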

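Method	Formula	Description
SGD	$w = w - \eta \cdot \nabla L$	Uses only the current gradient
Momentum	$v = \mu \cdot v - \eta \cdot \nabla L$	Past direction + current gradient
	$w = w + v$	Step that includes the accumulated momentum

💡 Typically $\mu = 0.9$ (90% of the previous velocity is kept)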
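
To see what "momentum builds up" means numerically, here is a small sketch that applies the velocity update under a constant gradient. The gradient value and learning rate are assumptions chosen for illustration. The velocity settles at $-\eta g / (1 - \mu)$, so with $\mu = 0.9$ the steady-state step is 10 times the plain SGD step.

```python
mu = 0.9      # momentum coefficient from the text
eta = 0.1     # assumed learning rate
g = 1.0       # assumed constant gradient

v = 0.0
for step in range(200):
    v = mu * v - eta * g       # velocity update: v = mu * v - eta * grad

limit = -eta * g / (1.0 - mu)  # fixed point of v = mu * v - eta * g
print("velocity after 200 steps:", v)
print("predicted limit:         ", limit)
print("plain SGD step:          ", -eta * g)
```

The same amplification is why momentum can overshoot: when the gradient changes sign, the stored velocity takes several steps to turn around.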

Try it: comparing SGD and Momentum convergence

python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Elliptical loss function (a narrow valley, the kind of surface where plain SGD struggles)
# f(w1, w2) = 10*w1^2 + w2^2  (much steeper along w1)
def loss_fn(w1, w2):
    return 10 * w1**2 + w2**2

def grad_fn(w1, w2):
    return np.array([20 * w1, 2 * w2])

# --- SGD ---
w_sgd = np.array([3.0, 3.0])
lr = 0.04
sgd_path = [w_sgd.copy()]

for _ in range(30):
    g = grad_fn(w_sgd[0], w_sgd[1])
    w_sgd = w_sgd - lr * g
    sgd_path.append(w_sgd.copy())

sgd_path = np.array(sgd_path)

# --- SGD + Momentum ---
w_mom = np.array([3.0, 3.0])
velocity = np.array([0.0, 0.0])
mom_rate = 0.9
mom_path = [w_mom.copy()]

for _ in range(30):
    g = grad_fn(w_mom[0], w_mom[1])
    velocity = mom_rate * velocity - lr * g
    w_mom = w_mom + velocity
    mom_path.append(w_mom.copy())

mom_path = np.array(mom_path)

# --- Plot both paths on a contour map ---
w1_range = np.linspace(-4, 4, 200)
w2_range = np.linspace(-4, 4, 200)
W1, W2 = np.meshgrid(w1_range, w2_range)
Z = 10 * W1**2 + W2**2

plt.figure(figsize=(9, 6))
plt.contour(W1, W2, Z, levels=20, cmap='RdYlBu_r', alpha=0.6)
plt.colorbar(label='Loss')

plt.plot(sgd_path[:, 0], sgd_path[:, 1], 'o-', color='red',
         markersize=4, linewidth=1.5, label='SGD')
plt.plot(mom_path[:, 0], mom_path[:, 1], 's-', color='blue',
         markersize=4, linewidth=1.5, label='SGD + Momentum')
plt.plot(0, 0, 'k*', markersize=15, label='Goal (0, 0)')

plt.xlabel('w1', fontsize=12)
plt.ylabel('w2', fontsize=12)
plt.title('SGD vs Momentum on an elliptical loss surface', fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

print('SGD final:      w1={:.4f}, w2={:.4f}'.format(sgd_path[-1, 0], sgd_path[-1, 1]))
print('Momentum final: w1={:.4f}, w2={:.4f}'.format(mom_path[-1, 0], mom_path[-1, 1]))

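Compare the red (SGD) and blue (Momentum) paths: SGD makes slow progress along the shallow w2 direction, while Momentum builds up speed there (and, with $\mu = 0.9$, may overshoot before settling). Zigzagging in plain SGD becomes visible as the learning rate is raised toward the stability limit of the steep w1 direction.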
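
For the quadratic loss in the code above, a plain gradient step multiplies each coordinate by $1 - \eta h$, where $h$ is the curvature along that axis (20 for w1, 2 for w2). The sketch below tabulates this factor for a few learning rates. The values other than 0.04 are illustrative assumptions, and `step_factor` is a hypothetical helper name. A negative factor means the coordinate flips sign every step (zigzag), and a magnitude of 1 or more means it diverges.

```python
curvatures = {"w1": 20.0, "w2": 2.0}   # second derivatives of 10*w1^2 + w2^2

def step_factor(lr, h):
    # Plain gradient descent on 0.5*h*w^2 maps w -> (1 - lr*h) * w
    return 1.0 - lr * h

for lr in [0.04, 0.09, 0.11]:
    parts = []
    for name, h in curvatures.items():
        f = step_factor(lr, h)
        if abs(f) >= 1.0:
            kind = "diverges"
        elif f < 0.0:
            kind = "zigzags"
        else:
            kind = "monotone"
        parts.append(f"{name}: factor={f:+.2f} ({kind})")
    print(f"lr={lr:.2f}  " + ", ".join(parts))
```

Changing `lr` in the experiment above to one of these values is a quick way to see each regime on the contour plot.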

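4. Adam: The Default Optimizer in Modern Deep Learning

What is Adam?

Adam stands for Adaptive Moment Estimation. It combines two ideas: Momentum and RMSprop.

The two key ideas behind Adam:

1. Momentum effect (first moment)

• Tracks a moving average of the gradient → stabilizes the direction
• $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

2. Adaptive learning rate (second moment)

• Tracks a moving average of the squared gradient → adjusts the step size automatically
• $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Final update:

$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Here $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$ are the bias-corrected estimates, which compensate for $m$ and $v$ starting at zero. (Note that $v_t$ here is the second moment, not the velocity $v$ from Section 3.)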
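
One consequence of the bias correction is easy to check by hand: on the very first step, $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the update has size close to $\eta$ no matter how large or small the gradient is. The sketch below verifies this for a scalar parameter. The function name `adam_first_step` and the test gradients are assumptions made for illustration; the hyperparameter defaults are the commonly used ones.

```python
import math

def adam_first_step(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moments start at zero, so after one update they are scaled-down
    # copies of g and g^2.
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    # Bias correction at t = 1 divides that scaling back out.
    m_hat = m / (1 - beta1 ** 1)
    v_hat = v / (1 - beta2 ** 1)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in [100.0, 1.0, 0.01]:
    print(f"gradient={g:8.2f} -> first step size={adam_first_step(g):.6f}")
```

This scale invariance is what "adjusts the step size automatically" refers to: the step depends on the gradient's direction and consistency, not its raw magnitude.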

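Why is Adam the default?

Optimizer	Pros	Cons
SGD	Simple, often generalizes well	Sensitive to learning rate, slow
SGD + Momentum	Faster convergence, less zigzagging	Still sensitive to learning rate
Adam	Less sensitive to learning rate, fast convergence, adaptive	Stores two extra values per parameter, sometimes generalizes slightly worse

Practical tip: start with Adam. It works well in most cases!

Try it: comparing SGD, Momentum, and Adam directly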

python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Tricky loss function: elliptical bowl plus a sinusoidal ripple
def tricky_loss(w1, w2):
    return 10 * w1**2 + 0.5 * w2**2 + 3 * np.sin(w1 * 2) * np.cos(w2)

def tricky_grad(w1, w2):
    dw1 = 20 * w1 + 6 * np.cos(w1 * 2) * np.cos(w2)
    dw2 = 1.0 * w2 - 3 * np.sin(w1 * 2) * np.sin(w2)
    return np.array([dw1, dw2])

start = np.array([3.0, 3.0])
n_steps = 60

# --- SGD ---
w = start.copy()
lr_sgd = 0.01
sgd_losses = []
for _ in range(n_steps):
    sgd_losses.append(tricky_loss(w[0], w[1]))
    g = tricky_grad(w[0], w[1])
    w = w - lr_sgd * g

# --- SGD + Momentum ---
w = start.copy()
v = np.zeros(2)
lr_mom = 0.01
mom_losses = []
for _ in range(n_steps):
    mom_losses.append(tricky_loss(w[0], w[1]))
    g = tricky_grad(w[0], w[1])
    v = 0.9 * v - lr_mom * g
    w = w + v

# --- Adam ---
w = start.copy()
m = np.zeros(2)  # 1st moment
s = np.zeros(2)  # 2nd moment
lr_adam = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
adam_losses = []
for t in range(1, n_steps + 1):
    adam_losses.append(tricky_loss(w[0], w[1]))
    g = tricky_grad(w[0], w[1])
    m = beta1 * m + (1 - beta1) * g          # update 1st moment
    s = beta2 * s + (1 - beta2) * g**2       # update 2nd moment
    m_hat = m / (1 - beta1**t)               # bias correction
    s_hat = s / (1 - beta2**t)               # bias correction
    w = w - lr_adam * m_hat / (np.sqrt(s_hat) + eps)

# --- Comparison plot ---
plt.figure(figsize=(9, 5))
plt.plot(sgd_losses, linewidth=2, label='SGD (lr=0.01)')
plt.plot(mom_losses, linewidth=2, label='Momentum (lr=0.01, m=0.9)')
plt.plot(adam_losses, linewidth=2, label='Adam (lr=0.1)')
plt.xlabel('Step', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('SGD vs Momentum vs Adam on a tricky surface', fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('Final loss:')
print('  SGD:      {:.4f}'.format(sgd_losses[-1]))
print('  Momentum: {:.4f}'.format(mom_losses[-1]))
print('  Adam:     {:.4f}'.format(adam_losses[-1]))

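In this setup Adam should reach a lower loss within the same number of steps; check the printed values. Note that Adam runs with a base learning rate of 0.1 here versus 0.01 for the other two, so this illustrates the behavior rather than serving as a controlled benchmark.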

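5. Learning Rate Scheduling

Gradually reducing the learning rate often works better than keeping it fixed from start to finish.

Analogy: finding a house

Stage	Stride	Learning rate
🏃 Start	Big strides to get to the right neighborhood	Large
🚶 Middle	Shorter strides to fine-tune the position	Decreasing
🎯 End	Tiny strides to land on the exact spot	Very small

Main scheduling methods

Method	Description	When to use
Step Decay	Halve the learning rate every fixed number of epochs	Simplest option, widely used
Cosine Annealing	Decrease smoothly along a cosine curve	Popular in recent papers
Warm-up	Start small and increase gradually	Large models (BERT, GPT)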
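
The three schedules in the table can each be written as a small function of the epoch number. This is a minimal sketch: the function names, the default `step_size`, `gamma`, and `warmup_epochs` values, and the linear shape of the warm-up are all assumptions chosen for illustration, not a reference implementation of any library.

```python
import math

def step_decay(lr0, epoch, step_size=30, gamma=0.5):
    # Multiply the learning rate by gamma once every step_size epochs.
    return lr0 * gamma ** (epoch // step_size)

def cosine_annealing(lr_max, epoch, total_epochs, lr_min=0.0):
    # Follow half a cosine wave from lr_max down to lr_min.
    cos_term = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term

def linear_warmup(lr_target, epoch, warmup_epochs=5):
    # Ramp up linearly over warmup_epochs, then hold at lr_target.
    return lr_target * min(1.0, (epoch + 1) / warmup_epochs)

print("epoch  step    cosine  warmup")
for epoch in [0, 4, 30, 50, 99]:
    print(f"{epoch:5d}  "
          f"{step_decay(0.3, epoch):.4f}  "
          f"{cosine_annealing(0.3, epoch, 100):.4f}  "
          f"{linear_warmup(0.3, epoch):.4f}")
```

In practice warm-up is usually combined with a decay schedule: ramp up for the first few epochs, then hand over to step decay or cosine annealing.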

Try it: the effect of learning rate scheduling

python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Minimize a simple quadratic: f(w) = (w - 5)^2
def loss_fn(w):
    return (w - 5.0) ** 2

def grad_fn(w):
    return 2 * (w - 5.0)

n_steps = 100
w_start = -3.0

# --- Fixed learning rate ---
w = w_start
lr_fixed = 0.1
fixed_losses = []
for _ in range(n_steps):
    fixed_losses.append(loss_fn(w))
    w = w - lr_fixed * grad_fn(w)

# --- Step Decay (halve at steps 30 and 60) ---
w = w_start
lr = 0.3
step_losses = []
for t in range(n_steps):
    step_losses.append(loss_fn(w))
    if t == 30:
        lr *= 0.5
    if t == 60:
        lr *= 0.5
    w = w - lr * grad_fn(w)

# --- Cosine Annealing ---
w = w_start
lr_max = 0.3
cosine_losses = []
for t in range(n_steps):
    cosine_losses.append(loss_fn(w))
    lr = lr_max * 0.5 * (1 + np.cos(np.pi * t / n_steps))
    w = w - lr * grad_fn(w)

plt.figure(figsize=(9, 5))
plt.plot(fixed_losses, linewidth=2, label='Fixed LR (0.1)')
plt.plot(step_losses, linewidth=2, label='Step Decay (0.3, halve at 30,60)')
plt.plot(cosine_losses, linewidth=2, label='Cosine Annealing (max=0.3)')
plt.xlabel('Step', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Effect of Learning Rate Scheduling', fontsize=13)
plt.legend(fontsize=11)
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

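With a schedule, you can converge quickly early on and fine-tune in the later stages! In this noise-free example most of the speedup comes from the larger initial learning rate (0.3 vs 0.1); the benefit of shrinking the learning rate late in training shows up more clearly when gradients are noisy, as in mini-batch training.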

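6. Practical Tips

Optimizer selection guide

Situation	Recommended optimizer	Learning rate
Getting started, quick experiments	Adam	0.001 (default)
Image classification (CNN)	SGD + Momentum	0.01 to 0.1
NLP (Transformer)	Adam or AdamW	0.0001 to 0.001
Training is unstable	Reduce the learning rate	1/10 of the current value
Severe overfitting	AdamW (weight decay)	0.001 + wd=0.01

Using optimizers in PyTorch (for reference)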

python
import torch.optim as optim

# ═══════════════════════════════════════════════════════════════
# 🔧 Using PyTorch optimizers
# (`model` and `train_one_epoch` are placeholders defined elsewhere)
# ═══════════════════════════════════════════════════════════════

# Pick ONE of the optimizers below; each line overwrites the previous one.

# 1️⃣ SGD (basic)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# 2️⃣ SGD + Momentum (adds momentum)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 3️⃣ Adam (the most widely used! ⭐)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4️⃣ AdamW (includes weight decay, which acts as regularization)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# ─────────────────────────────────────────────────────────────────
# 📉 Learning rate scheduler
# ─────────────────────────────────────────────────────────────────
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# 🔄 Training loop
for epoch in range(100):
    loss = train_one_epoch()   # assumed to call optimizer.step() for each batch
    scheduler.step()           # ← adjust the learning rate once per epoch

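7. Putting It All Together

Let's review the whole gradient descent process once more:

Step	What happens	Code
1️⃣ Forward pass	Compute predictions with the current weights	`y_pred = model(x)`
2️⃣ Loss computation	Measure the gap between predictions and targets	`loss = loss_fn(y_pred, y_true)`
3️⃣ Backpropagation	Compute the gradient of the loss	`loss.backward()`
4️⃣ Weight update	The optimizer adjusts the weights	`optimizer.step()`
🔄 Repeat	Repeat steps 1-4 until the loss is small enough	-

In PyTorch, also call `optimizer.zero_grad()` before `loss.backward()` so that gradients from the previous iteration do not accumulate.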
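
The same four steps can be sketched without any framework. The example below uses plain NumPy on assumed toy data (a noisy line $y = 3x + 2$), with the gradients derived by hand in place of `loss.backward()`. It is an illustration of the loop structure, not a substitute for the PyTorch calls in the table.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_true = 3.0 * x + 2.0 + 0.5 * rng.normal(size=200)   # assumed toy data

w, b = 0.0, 0.0
lr = 0.1   # assumed learning rate

for step in range(200):
    y_pred = w * x + b                            # 1) forward pass
    loss = np.mean((y_pred - y_true) ** 2)        # 2) loss (MSE)
    grad_w = np.mean(2 * (y_pred - y_true) * x)   # 3) gradients, derived by hand
    grad_b = np.mean(2 * (y_pred - y_true))
    w -= lr * grad_w                              # 4) weight update
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}, loss={loss:.4f}")
```

Swapping step 4 for a Momentum or Adam update changes only those two lines; the forward pass, loss, and gradient computation stay the same.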

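Key Summary

Concept	Description	Analogy
Batch GD	Computes the gradient from all data	Reading every student's answer sheet before revising the grading rubric
SGD	Computes the gradient from 1 sample	Revising right after each answer sheet
Mini-batch	Computes from a batch (32 to 256)	Grading one class at a time, then revising
Momentum	Remembers the direction of past steps	A ball rolling down a hill
Adam	Adaptive learning rate + Momentum	A smart autopilot
LR Scheduling	Gradually reduces the learning rate	Big steps at first, small steps later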

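Learning Checklist

• I can explain the differences between Batch, Mini-batch, and SGD
• I can explain why Momentum is often faster than plain SGD
• I know the key ideas behind the Adam optimizer
• I understand why learning rate scheduling is needed
• I can choose an optimizer to fit the situation

Next Lecture Preview

"Gradient Descent Variants" - we will look at more optimizers such as RMSprop and AdaGrad, along with their mathematical background!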
