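Implementing Sentiment Analysis

Learning Objectives

•Understand the IMDB dataset
•Frame sentiment analysis as a classification problem
•Implement an LSTM-based sentiment classifier
•Train and evaluate the model
•Run predictions on real reviews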

What Is Sentiment Analysis?

**Sentiment analysis** is an NLP task that extracts the emotion or opinion expressed in a text.

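Types of Sentiment Analysis

Type	Description	Example
Binary classification	Positive / negative	Movie reviews
Multi-class classification	Very negative to very positive	Star-rating prediction
Aspect-based	Separate sentiment for each aspect	"The food is good, but the service is bad"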

Applications

•Product review analysis
•Social media monitoring
•Customer feedback analysis
•Brand reputation management

The IMDB Dataset

The IMDB movie review dataset is a standard benchmark for sentiment analysis.

Dataset Composition

50,000 reviews in total
├── Train: 25,000
│   ├── Positive: 12,500
│   └── Negative: 12,500
└── Test: 25,000
    ├── Positive: 12,500
    └── Negative: 12,500

Example Data

Positive review:
"This movie was absolutely fantastic! Great acting and story."
→ Label: 1 (positive)

Negative review:
"Terrible waste of time. The plot made no sense."
→ Label: 0 (negative)

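Loading IMDB with PyTorch

python
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Tokenizer: lowercases the text and applies basic English normalization
tokenizer = get_tokenizer('basic_english')

# Load the dataset (each item is a (label, text) pair)
train_iter, test_iter = IMDB(split=('train', 'test'))

# Build the vocabulary from the training split only
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    specials=['<unk>', '<pad>'],  # <unk> gets index 0, <pad> gets index 1
    min_freq=5                    # drop tokens seen fewer than 5 times
)
# Map out-of-vocabulary tokens to <unk>
vocab.set_default_index(vocab['<unk>'])

print(f"Vocabulary size: {len(vocab)}")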
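
To make the effect of `specials` and `min_freq` concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is not torchtext's implementation: `build_toy_vocab`, `encode`, and the sample sentences are made up for illustration, and the tie-breaking order is an assumption of this sketch.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>"), min_freq=2):
    """Map tokens to indices: specials first, then sufficiently frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically; an assumption of this sketch).
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, unk="<unk>"):
    """Replace each token with its index, falling back to <unk>."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

corpus = [
    "great movie great acting".split(),
    "terrible movie".split(),
    "great plot".split(),
]
toy_vocab = build_toy_vocab(corpus, min_freq=2)
encoded = encode("great boring movie".split(), toy_vocab)
print(toy_vocab)
print(encoded)
```

Only "great" and "movie" appear at least twice, so every other word, including the unseen "boring", is encoded as the `<unk>` index.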

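LSTM Sentiment Classifier Model

Model Architecture

Input text → Embedding → BiLSTM → Fully connected → Class scores
      │           │         │            │
  [batch, seq]  [batch,   [batch,     [batch, 2]
              seq, emb]  hidden*2]

The LSTM is bidirectional, so the vector passed to the fully connected layer has size hidden * 2.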

Model Implementation

python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim,
                 num_layers, num_classes, dropout=0.5):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(
            vocab_size,
            embed_dim,
            padding_idx=1  # index of <pad> (second entry in specials)
        )

        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,  # only applies between stacked layers
            bidirectional=True
        )

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Output layer (hidden_dim * 2 because the LSTM is bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len)

        # Embedding: (batch, seq_len, embed_dim)
        embedded = self.embedding(x)
        embedded = self.dropout(embedded)

        # LSTM output: (batch, seq_len, hidden_dim * 2)
        lstm_out, (h_n, c_n) = self.lstm(embedded)

        # Concatenate the final hidden states of both directions
        # h_n: (num_layers * 2, batch, hidden_dim)
        # Note: sequences are not packed, so for padded rows the forward
        # state is taken after the trailing <pad> tokens.
        h_forward = h_n[-2]   # last layer, forward direction
        h_backward = h_n[-1]  # last layer, backward direction
        hidden = torch.cat([h_forward, h_backward], dim=1)

        # Dropout
        hidden = self.dropout(hidden)

        # Classification logits: (batch, num_classes)
        output = self.fc(hidden)

        return output

# Create the model
model = SentimentLSTM(
    vocab_size=len(vocab),
    embed_dim=128,
    hidden_dim=256,
    num_layers=2,
    num_classes=2,
    dropout=0.5
)
print(model)
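
The indexing `h_n[-2]` and `h_n[-1]` is easy to get wrong, so the following sketch traces the tensor shapes of a small bidirectional LSTM on random input. The sizes are arbitrary values chosen for illustration, not the lesson's hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, embed_dim, hidden_dim, num_layers = 4, 7, 8, 16, 2

lstm = nn.LSTM(
    input_size=embed_dim,
    hidden_size=hidden_dim,
    num_layers=num_layers,
    batch_first=True,
    bidirectional=True,
)

x = torch.randn(batch, seq_len, embed_dim)  # stands in for embedded tokens
lstm_out, (h_n, c_n) = lstm(x)

# h_n stacks (layer, direction) pairs along dim 0, so the last two
# entries are the forward and backward states of the top layer.
h_forward, h_backward = h_n[-2], h_n[-1]
hidden = torch.cat([h_forward, h_backward], dim=1)

print(lstm_out.shape)  # (batch, seq_len, hidden_dim * 2)
print(h_n.shape)       # (num_layers * 2, batch, hidden_dim)
print(hidden.shape)    # (batch, hidden_dim * 2)

# The forward state equals the forward half of the last time step;
# the backward state equals the backward half of the first time step.
fwd_matches = torch.allclose(h_forward, lstm_out[:, -1, :hidden_dim])
bwd_matches = torch.allclose(h_backward, lstm_out[:, 0, hidden_dim:])
print(fwd_matches, bwd_matches)
```

The two `allclose` checks show why concatenating these states summarizes the whole sequence: one has read it left to right, the other right to left.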

Data Preprocessing

Data Pipeline

python ⚠️ Requires local execution
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Text → list of token indices
def text_pipeline(text):
    return vocab(tokenizer(text))

# Label → 0 (negative) or 1 (positive)
# Older torchtext releases yield 'neg'/'pos'; newer ones yield 1 (neg) / 2 (pos),
# so check what your installed version returns.
def label_pipeline(label):
    return 1 if label in ('pos', 2) else 0

# Collate function: turns a list of (label, text) pairs into padded tensors
def collate_batch(batch):
    label_list, text_list = [], []

    for label, text in batch:
        label_list.append(label_pipeline(label))
        processed_text = torch.tensor(
            text_pipeline(text),
            dtype=torch.int64
        )
        text_list.append(processed_text)

    # Label tensor
    labels = torch.tensor(label_list, dtype=torch.int64)

    # Pad every review to the length of the longest one in the batch
    texts = pad_sequence(
        text_list,
        batch_first=True,
        padding_value=vocab['<pad>']
    )

    return labels, texts

# Create the data loaders (reload the iterators; building the vocabulary consumed train_iter)
train_iter, test_iter = IMDB(split=('train', 'test'))
train_dataloader = DataLoader(
    list(train_iter),
    batch_size=64,
    shuffle=True,
    collate_fn=collate_batch
)
test_dataloader = DataLoader(
    list(test_iter),
    batch_size=64,
    shuffle=False,
    collate_fn=collate_batch
)
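
As a quick illustration of what the padding step inside `collate_batch` produces, the sketch below pads three toy index sequences. The index values are invented, and `PAD_IDX = 1` is an assumption that matches the `specials=['<unk>', '<pad>']` ordering used earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed <pad> index, matching specials=['<unk>', '<pad>']

sequences = [
    torch.tensor([5, 8, 2], dtype=torch.int64),
    torch.tensor([7], dtype=torch.int64),
    torch.tensor([4, 9], dtype=torch.int64),
]

# Every sequence is right-padded to the longest length in the batch
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
lengths = torch.tensor([len(s) for s in sequences])
mask = padded != PAD_IDX  # True where a real token is present

print(padded)
print(lengths)
print(mask)
```

If padding turns out to hurt accuracy, the `lengths` tensor could be passed to `torch.nn.utils.rnn.pack_padded_sequence` so the LSTM skips the padded positions; the lesson's model does not do this.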

Training and Evaluation

Training Function

python
import torch.optim as optim

# Device, loss function, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, dataloader, criterion, optimizer):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for labels, texts in dataloader:
        labels = labels.to(device)
        texts = texts.to(device)

        # Forward pass
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

        # Running statistics
        total_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    avg_loss = total_loss / len(dataloader)  # mean of per-batch losses
    accuracy = 100 * correct / total         # percentage
    return avg_loss, accuracy
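
The statistics lines in the training function can be checked by hand on a tiny batch. The logits and labels below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import torch

# Hypothetical logits for a batch of 4 reviews, columns = [negative, positive]
outputs = torch.tensor([
    [2.0, -1.0],
    [0.3, 0.9],
    [-0.5, 1.5],
    [1.2, 0.4],
])
labels = torch.tensor([0, 1, 0, 0])

_, predicted = torch.max(outputs, 1)  # index of the larger logit in each row
correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)

print(predicted.tolist())
print(accuracy)
```

Three of the four predictions match their labels, so the accuracy is 75%. Note that `avg_loss` in the training function is an average over batches, so a smaller final batch counts slightly more per sample than the others.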

Evaluation Function

python
def evaluate(model, dataloader, criterion):
    model.eval()  # disables dropout
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():  # no gradients needed for evaluation
        for labels, texts in dataloader:
            labels = labels.to(device)
            texts = texts.to(device)

            outputs = model(texts)
            loss = criterion(outputs, labels)

            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    avg_loss = total_loss / len(dataloader)
    accuracy = 100 * correct / total
    return avg_loss, accuracy

Training Loop

python
num_epochs = 10

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(
        model, train_dataloader, criterion, optimizer
    )
    # The test set is used here only to monitor progress; to choose
    # hyperparameters or a stopping epoch, hold out a validation split
    # from the training data instead.
    test_loss, test_acc = evaluate(
        model, test_dataloader, criterion
    )

    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")
    print("-" * 50)
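
The loop above reports test accuracy every epoch. If those numbers are also used to pick hyperparameters or a stopping point, the test set stops being an unbiased estimate. One common remedy, sketched here on stand-in data, is to carve a validation split out of the training set with `torch.utils.data.random_split`. The 80/20 ratio, the seed, and `toy_train` are assumptions of this sketch, not part of the lesson.

```python
import torch
from torch.utils.data import random_split

# Stand-in for list(train_iter): 100 (label, text) pairs
toy_train = [(i % 2, f"review {i}") for i in range(100)]

# A seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(42)
train_subset, val_subset = random_split(toy_train, [80, 20], generator=generator)

print(len(train_subset), len(val_subset))
print(val_subset[0])  # a (label, text) pair, same format as the original items
```

Both subsets can be passed to `DataLoader` with the same `collate_fn`, and the test loader can then be evaluated once at the very end.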

Making Predictions

Predicting a New Review

python
def predict_sentiment(model, text, vocab, tokenizer):
    model.eval()

    # Preprocess: tokenize, map to indices, add a batch dimension
    tokens = tokenizer(text)
    indices = torch.tensor(
        [vocab[token] for token in tokens],
        dtype=torch.int64
    ).unsqueeze(0).to(device)

    # Predict
    with torch.no_grad():
        output = model(indices)
        probabilities = torch.softmax(output, dim=1)
        prediction = torch.argmax(output, dim=1).item()

    sentiment = "positive" if prediction == 1 else "negative"
    confidence = probabilities[0][prediction].item()

    return sentiment, confidence

# Try it out
reviews = [
    "This movie was absolutely amazing! Best film I've seen this year.",
    "Terrible movie. Complete waste of time and money.",
    "It was okay, nothing special but not bad either."
]

for review in reviews:
    sentiment, confidence = predict_sentiment(
        model, review, vocab, tokenizer
    )
    print(f"Review: {review[:50]}...")
    print(f"Prediction: {sentiment} (confidence: {confidence:.2%})")
    print()

Prediction Output Example

The percentages below are illustrative; actual values depend on the trained weights and vary from run to run.

Review: This movie was absolutely amazing! Best film I've ...
Prediction: positive (confidence: 95.23%)

Review: Terrible movie. Complete waste of time and money....
Prediction: negative (confidence: 92.87%)

Review: It was okay, nothing special but not bad either....
Prediction: positive (confidence: 58.34%)
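
The confidence value is simply the softmax probability of the predicted class, so it reflects how far apart the two logits are. The pure-Python sketch below uses hypothetical logits to show how a large gap yields a high confidence while a small gap lands close to 50%, as with the lukewarm third review.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in [negative, positive] order
confident = softmax([-1.5, 1.5])   # logits far apart
borderline = softmax([0.0, 0.3])   # logits nearly equal

for name, probs in [("confident", confident), ("borderline", borderline)]:
    pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
    print(f"{name}: class={pred}, confidence={probs[pred]:.2%}")
```

A confidence near 50% is better read as "the model cannot decide" than as weak positive sentiment, since a binary classifier has no neutral class to fall back on.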

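Ways to Improve the Model

1. Use Pretrained Embeddings

python
# Load GloVe embeddings
# NOTE: load_glove_embeddings is not defined in this lesson. It stands for a helper
# that returns a (len(vocab), embed_dim) float tensor aligned with the vocabulary indices.
pretrained_embeddings = load_glove_embeddings(vocab)
model.embedding.weight.data.copy_(pretrained_embeddings)
model.embedding.weight.requires_grad = False  # freeze the embedding weights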
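
The snippet above relies on a helper that the lesson does not define. One possible shape for such a helper is sketched below, assuming the standard GloVe text format (one token per line followed by its space-separated vector). The name `glove_matrix_from_file`, its signature, and the toy data are hypothetical; with a torchtext vocabulary, the token-to-index dictionary could come from `vocab.get_stoi()`.

```python
import io
import torch

def glove_matrix_from_file(stoi, glove_file, embed_dim):
    """Build a (len(stoi), embed_dim) matrix aligned with vocabulary indices.

    Tokens missing from the GloVe file keep a small random vector.
    Returns the matrix and the number of tokens that were found.
    """
    weights = torch.randn(len(stoi), embed_dim) * 0.1
    found = 0
    for line in glove_file:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if token in stoi and len(values) == embed_dim:
            weights[stoi[token]] = torch.tensor([float(v) for v in values])
            found += 1
    return weights, found

# Toy stand-ins: a 3-dimensional "GloVe file" and a tiny vocabulary
toy_glove = io.StringIO("great 0.1 0.2 0.3\nmovie -0.4 0.5 0.6\nunused 1 1 1\n")
toy_stoi = {"<unk>": 0, "<pad>": 1, "great": 2, "movie": 3}

weights, found = glove_matrix_from_file(toy_stoi, toy_glove, embed_dim=3)
weights[toy_stoi["<pad>"]] = 0.0  # keep the padding vector at zero

embedding = torch.nn.Embedding(len(toy_stoi), 3, padding_idx=1)
embedding.weight.data.copy_(weights)
embedding.weight.requires_grad = False  # freeze
print(found, embedding.weight.shape)
```

The embedding dimension has to match the pretrained vectors. The commonly distributed GloVe files come in sizes such as 50, 100, 200, and 300, so the lesson's `embed_dim=128` would need to change accordingly.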

2. Bidirectional LSTM + Attention

python
class AttentionLSTM(nn.Module):
    # Placeholder: add an attention mechanism over the LSTM outputs
    pass
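
The lesson leaves `AttentionLSTM` as a stub. As one way it could be filled in, the sketch below pools the LSTM outputs with a simple learned attention score per time step and masks out padding. This particular scoring function, the single-layer LSTM, and the returned `(logits, weights)` tuple are choices made for this sketch, not the course's reference solution.

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_idx=1):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.score = nn.Linear(hidden_dim * 2, 1)  # one score per time step
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        mask = x != self.pad_idx                                 # (batch, seq_len)
        lstm_out, _ = self.lstm(self.embedding(x))               # (batch, seq_len, 2H)
        scores = self.score(torch.tanh(lstm_out)).squeeze(-1)    # (batch, seq_len)
        scores = scores.masked_fill(~mask, float("-inf"))        # ignore padding
        weights = torch.softmax(scores, dim=1)                   # sums to 1 per row
        context = (weights.unsqueeze(-1) * lstm_out).sum(dim=1)  # (batch, 2H)
        return self.fc(context), weights

torch.manual_seed(0)
attn_model = AttentionLSTM(vocab_size=50, embed_dim=8, hidden_dim=16, num_classes=2)
batch = torch.tensor([[5, 9, 3, 1, 1],
                      [7, 2, 4, 6, 8]])  # first row ends with two <pad> tokens
logits, weights = attn_model(batch)
print(logits.shape, weights.shape)
```

Because the forward pass returns a tuple, the training and evaluation functions would need `outputs, _ = model(texts)`. A row consisting only of padding would produce NaN weights, so empty reviews should be filtered out beforehand.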

3. Data Augmentation

•Synonym replacement
•Back-translation
•Random insertion/deletion
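
Two of the techniques listed above, synonym replacement and random deletion, can be sketched in a few lines of pure Python. The synonym table, function names, and probabilities below are invented for illustration; a real project would draw synonyms from a thesaurus or from embedding neighbors.

```python
import random

# Toy synonym table (hypothetical)
SYNONYMS = {
    "great": ["excellent", "wonderful"],
    "terrible": ["awful", "dreadful"],
    "movie": ["film"],
}

def synonym_replace(tokens, rng, p=0.5):
    """Swap each token that has synonyms with probability p."""
    return [
        rng.choice(SYNONYMS[tok]) if tok in SYNONYMS and rng.random() < p else tok
        for tok in tokens
    ]

def random_delete(tokens, rng, p=0.2):
    """Drop each token with probability p, but never return an empty list."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "this movie was great".split()
print(synonym_replace(tokens, rng))
print(random_delete(tokens, rng))
```

Augmentation should preserve the label. Deleting a word such as "not" can flip the sentiment of a review, so it is worth inspecting a sample of augmented texts before training on them.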

Key Takeaways

Step	Details
Data loading	IMDB dataset
Preprocessing	Tokenize → encode → pad
Model	Embedding → LSTM → classifier
Training	CrossEntropyLoss + Adam
Evaluation	Accuracy, loss
Prediction	softmax → argmax

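Practice Exercises

•Model tuning
  - Change hidden_dim and num_layers
  - Adjust the dropout rate
•GRU comparison
  - Replace the LSTM with a GRU
  - Compare accuracy and training speed
•Korean sentiment analysis
  - Use the Naver movie review data
  - Apply a Korean tokenizer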

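Level 6 Complete!

Congratulations! You have completed every lesson in Level 6.

What You Learned

•Characteristics of sequence data
•The structure and limitations of RNNs
•The gating mechanisms of LSTM and GRU
•Text preprocessing pipelines
•A working sentiment analysis implementation

Next Steps

Level 7 covers the attention mechanism and the **Transformer**!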
