Key Challenges of LLM Serving

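Serving an LLM is different from serving a typical web service. The main concerns are GPU memory management, KV cache optimization, and handling concurrent requests.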

mermaid


flowchart TB
    Client["Client requests"] --> LB[Load Balancer]
    LB --> Pod1["vLLM Pod<br/>GPU A100 x2"]
    LB --> Pod2["vLLM Pod<br/>GPU A100 x2"]
    LB --> Pod3["vLLM Pod<br/>GPU A100 x2"]
    HPA["HPA<br/>Autoscaling"] --> Pod1
    HPA --> Pod2
    HPA --> Pod3
    Pod1 --> Model[("Model storage<br/>PVC/NFS")]
    Pod2 --> Model
    Pod3 --> Model

vLLM: The Production Standard

vLLM uses PagedAttention to keep GPU memory utilization as high as possible.

yaml

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "2"              # use 2 GPUs in parallel
        - "--max-model-len"
        - "8192"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--enable-chunked-prefill"  # process long prompts in chunks
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "2"
            memory: "80Gi"
          requests:
            nvidia.com/gpu: "2"
            memory: "60Gi"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
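        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        - name: shm              # shared memory used by tensor-parallel workers
          mountPath: /dev/shm
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"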
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
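
As a rough sanity check on `--max-model-len` and `--gpu-memory-utilization`, the sketch below estimates how much KV cache a single token and a single full-length sequence occupy. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumed from the published Llama-3.1-8B configuration, FP16 storage is assumed, and the function name is hypothetical. vLLM's actual memory accounting also covers weights, activations, and other overhead, so treat this as back-of-the-envelope arithmetic only.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence = per_token * 8192  # one sequence at the full --max-model-len of 8192

print(f"{per_token / 1024:.0f} KiB per token")                       # 128 KiB per token
print(f"{per_sequence / 1024**3:.1f} GiB per 8192-token sequence")   # 1.0 GiB per 8192-token sequence
```

Under these assumptions, every concurrent full-length sequence costs about 1 GiB of GPU memory on top of the weights, which is why the number of requests a pod can hold in flight depends so heavily on context length.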

Using vLLM's OpenAI-compatible API:

python

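from openai import OpenAI

# vLLM exposes an OpenAI-compatible API
client = OpenAI(
    base_url="http://vllm-service:8000/v1",
    api_key="token-abc123",  # placeholder; only checked if the server was started with --api-key
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
    temperature=0.7,
    stream=True,  # stream tokens as they are generated
)

for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)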

HuggingFace TGI (Text Generation Inference)

TGI achieves high throughput through continuous batching.

yaml

# tgi-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-server
spec:
  replicas: 1
  selector:              # required for apps/v1 Deployments
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:latest
        args:
        - "--model-id"
        - "mistralai/Mistral-7B-Instruct-v0.3"
        - "--num-shard"
        - "1"
        - "--max-concurrent-requests"
        - "128"
        - "--max-batch-prefill-tokens"
        - "4096"
        - "--quantize"
        - "bitsandbytes-nf4"  # 4-bit quantization
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: "1"
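
To see why the continuous batching mentioned at the top of this section raises throughput, here is a toy scheduler comparison. It is a sketch under simplifying assumptions, not TGI's implementation: each request only decodes (no prefill cost), every decode step produces one token per running request, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A waiting request takes over a slot as soon as the previous one finishes."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Refill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One decode step emits one token per running request; drop finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


output_lengths = [2, 8, 2, 8, 2, 2]  # tokens each request generates
print(static_batching_steps(output_lengths, batch_size=2))      # 18
print(continuous_batching_steps(output_lengths, batch_size=2))  # 12
```

In this example the short requests no longer wait for the long ones in their batch, so the same six requests finish in 12 decode steps instead of 18.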

Autoscaling (KEDA)

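For GPU serving, scaling on request queue depth is more effective than a standard HPA driven by CPU metrics, because CPU utilization says little about how busy the GPU is.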

yaml

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300  # allow for GPU warm-up
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_requests_waiting
      threshold: "10"  # target of 10 waiting requests per replica; scales out above that
      query: sum(vllm:num_requests_waiting)
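
The replica count this trigger aims for can be approximated with a few lines of arithmetic. The sketch below assumes the trigger uses an average-value target (the query result divided by the threshold, rounded up) and ignores the HPA's tolerance band and stabilization windows; the function name is hypothetical.

```python
import math


def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the average-value rule: ceil(metric / threshold), clamped to bounds."""
    wanted = math.ceil(total_waiting / threshold)
    return max(min_replicas, min(max_replicas, wanted))


# Values from the ScaledObject above: threshold 10, between 1 and 8 replicas.
for waiting in (0, 10, 35, 200):
    print(waiting, "->", desired_replicas(waiting, 10, 1, 8))
# 0 -> 1
# 10 -> 1
# 35 -> 4
# 200 -> 8
```

Because the query sums the waiting requests across all pods, the threshold behaves as a per-replica target rather than a single on/off trip point.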

Preloading Model Weights (Init Container)

An init container pattern for cutting model download time:

yaml

initContainers:
- name: model-downloader
  image: python:3.11-slim
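  command:
  - sh
  - -c
  - |
    set -e
    # python:3.11-slim does not include huggingface_hub, so install it first
    pip install --no-cache-dir huggingface_hub
    python - <<'EOF'
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="meta-llama/Llama-3.1-8B-Instruct",
        local_dir="/models/llama-3.1-8b",
        ignore_patterns=["*.msgpack", "*.h5"],  # skip non-PyTorch weight formats
    )
    EOF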
  volumeMounts:
  - name: model-storage
    mountPath: /models
  env:
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token

Framework Comparison

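Item	vLLM	TGI	Triton
Key optimization	PagedAttention	Continuous Batching	Multi-model management
Peak throughput	★★★★★	★★★★	★★★★
Setup difficulty	Easy	Easy	Complex
OpenAI compatibility	✅ Full support	✅ Supported	❌ gRPC
Quantization	AWQ, GPTQ	bitsandbytes	TensorRT
Multi-model	Limited	Limited	✅ Strength
Recommended for	Single model, high throughput	Quick start	Multi-model serving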

Technical Deep Dive

PagedAttention (the Core of vLLM)

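Conventional LLM serving allocates each request's KV cache in one contiguous region of memory, which causes severe fragmentation. PagedAttention instead stores the cache in non-contiguous memory pages, much like virtual memory in an operating system, and improves GPU memory efficiency by roughly 50-70%.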
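
The paging idea can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class and method names are hypothetical, the block size of 16 tokens is an assumption, and only the bookkeeping is modeled (no actual tensors).

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical blocks
        self.lengths: dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(16):
    cache.append_token("a")   # fills block 0
cache.append_token("b")       # takes block 1
cache.append_token("a")       # 17th token spills into block 2
print(cache.block_tables)     # {'a': [0, 2], 'b': [1]}
```

Sequence `a` ends up in blocks 0 and 2, which are not adjacent, and at most one partially filled block per sequence is wasted. A contiguous allocator would instead have to reserve space for the longest possible sequence up front.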

Tensor Parallelism vs Pipeline Parallelism

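Tensor Parallelism: Splits each layer's weights across GPUs. Low latency, but needs a fast interconnect (NVLink).
Pipeline Parallelism: Places consecutive groups of layers on different GPUs. Lower interconnect requirements and high throughput.
In practice: use tensor parallelism across GPUs in the same node and pipeline parallelism across nodes.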
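
The difference between the two strategies can be sketched in plain Python, with lists standing in for tensors and function calls standing in for GPUs. This is an illustrative toy under those assumptions (all names are hypothetical, and the column count is assumed to divide evenly by the number of GPUs), not how any serving framework implements it.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product, standing in for one layer's computation."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def tensor_parallel(x: Matrix, w: Matrix, num_gpus: int) -> Matrix:
    """Split one layer's weight by columns; each 'GPU' computes a slice of the output."""
    step = len(w[0]) // num_gpus  # assumes the columns divide evenly
    shards = [[row[g * step:(g + 1) * step] for row in w] for g in range(num_gpus)]
    partials = [matmul(x, shard) for shard in shards]  # concurrent on real GPUs
    # All-gather: concatenate the output slices (the interconnect-heavy step).
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def pipeline_parallel(x: Matrix, stages: list[list[Matrix]]) -> Matrix:
    """Each 'GPU' owns whole layers; only activations cross a stage boundary."""
    for stage_layers in stages:
        for w in stage_layers:
            x = matmul(x, w)
    return x


x = [[1.0, 2.0]]
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
w2 = [[1.0], [1.0], [1.0], [1.0]]

print(tensor_parallel(x, w1, num_gpus=2))            # [[1.0, 2.0, 4.0, 7.0]]
print(pipeline_parallel(x, stages=[[w1], [w2]]))     # [[14.0]]
```

Tensor parallelism needs a gather step inside every layer, which is why it benefits from NVLink. Pipeline parallelism only hands activations from one stage to the next, so it tolerates slower links between nodes.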

Quantization Selection Guide

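FP16/BF16: The default. Choose it when accuracy matters most.
AWQ (4-bit): Pairs best with vLLM. Roughly 2x faster with minimal quality loss.
GPTQ (4-bit): Weights are quantized ahead of time with a calibration pass, which makes pre-quantized checkpoints convenient for offline deployment.
NF4 (bitsandbytes): Pairs well with TGI. Convenient for development environments.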
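
The memory side of this trade-off is simple arithmetic. The sketch below estimates weight footprint from parameter count and bits per weight. It assumes a round 8 billion parameters and ignores quantization metadata such as scales and zero points, so real checkpoints will be somewhat larger; the function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (10^9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


params = 8e9  # assumed round parameter count for an 8B model
for name, bits in [("FP16/BF16", 16), ("AWQ / GPTQ / NF4", 4)]:
    print(f"{name}: {weight_memory_gb(params, bits):.0f} GB")
# FP16/BF16: 16 GB
# AWQ / GPTQ / NF4: 4 GB
```

The memory freed by 4-bit weights becomes available for KV cache, which is often what allows more concurrent requests on the same GPU.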
