Why LLM Monitoring Is Necessary

LLM applications need a different kind of observability than conventional software. A single prompt change can affect both quality and cost.

mermaid


flowchart LR
    App["LLM app"] --> Collector[Monitoring collector]
    Collector --> Cost["💰 Cost tracking<br/>tokens/API cost"]
    Collector --> Quality["🎯 Quality measurement<br/>accuracy/satisfaction"]
    Collector --> Latency["⚡ Latency<br/>TTFT/TPS"]
    Collector --> Usage["📊 Usage patterns<br/>per user/feature"]
    Cost --> Dashboard[Dashboard]
    Quality --> Dashboard
    Latency --> Dashboard
    Usage --> Dashboard
    Dashboard --> Alert[Alerts]

Langfuse: An Open-Source LLM Observability Platform

Langfuse is a self-hostable tool for tracing and monitoring LLM applications.

python

# Decorator-based API from the Langfuse Python SDK v2 (langfuse.decorators)
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
import anthropic

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"
)

@observe()
def answer_question(user_id: str, question: str) -> str:
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["production", "qa-bot"],
        metadata={"feature": "customer-support"}
    )

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": question}]
    )

    result = response.content[0].text

    # Attach the model output and token usage to the current observation
    langfuse_context.update_current_observation(
        output=result,
        usage={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens
        }
    )

    return result

# Collect user feedback and attach it to an existing trace
def record_feedback(trace_id: str, score: float, comment: str = ""):
    langfuse.score(
        trace_id=trace_id,
        name="user-satisfaction",
        value=score,  # 0.0 ~ 1.0
        comment=comment
    )
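
The score passed to record_feedback is expected in the 0.0 to 1.0 range. If the UI collects star ratings instead, a small normalization step is needed first. The sketch below assumes a 1 to 5 star widget; the helper name stars_to_score is hypothetical and is not part of Langfuse.

```python
def stars_to_score(stars: int, max_stars: int = 5) -> float:
    """Map a 1..max_stars rating linearly onto 0.0..1.0 (hypothetical helper)."""
    if max_stars < 2:
        raise ValueError("max_stars must be at least 2")
    if not 1 <= stars <= max_stars:
        raise ValueError(f"stars must be between 1 and {max_stars}")
    # 1 star maps to 0.0, max_stars maps to 1.0, everything else is linear.
    return (stars - 1) / (max_stars - 1)
```

The result could then be passed as the score argument, for example record_feedback(trace_id, stars_to_score(4)).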

Custom Metrics: OpenTelemetry + Prometheus

Using standard OpenTelemetry lets you plug LLM metrics into your existing monitoring stack.

python

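from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Initialize metrics: expose a Prometheus scrape endpoint and register the provider.
# Without a registered MeterProvider, get_meter returns a no-op meter.
start_http_server(port=8000)  # example port
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))
meter = metrics.get_meter("llm-app")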

llm_request_counter = meter.create_counter(
    "llm_requests_total",
    description="Number of LLM API calls"
)
llm_token_counter = meter.create_counter(
    "llm_tokens_total",
    description="Number of tokens used"
)
llm_latency_histogram = meter.create_histogram(
    "llm_latency_seconds",
    description="LLM API response time (seconds)"
)
llm_cost_counter = meter.create_counter(
    "llm_cost_usd_total",
    description="Total LLM API cost (USD)"
)

# Cost per 1K tokens in USD (example values; check each provider's current pricing)
MODEL_COSTS = {
    "claude-sonnet-4-6-20251001": {"input": 0.003, "output": 0.015},
    "claude-haiku-4-5-20251001":  {"input": 0.0008, "output": 0.004},
    "gpt-4o-2024-11-20":          {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini":                 {"input": 0.00015, "output": 0.0006},
}

def track_llm_call(model: str, input_tokens: int, output_tokens: int,
                   latency: float, feature: str = "unknown"):
    labels = {"model": model, "feature": feature}

    llm_request_counter.add(1, labels)
    llm_token_counter.add(input_tokens, {**labels, "type": "input"})
    llm_token_counter.add(output_tokens, {**labels, "type": "output"})
    llm_latency_histogram.record(latency, labels)

    # Compute cost; models missing from MODEL_COSTS are skipped
    if model in MODEL_COSTS:
        costs = MODEL_COSTS[model]
        cost = (input_tokens / 1000 * costs["input"] +
                output_tokens / 1000 * costs["output"])
        llm_cost_counter.add(cost, labels)
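
To make the cost arithmetic in track_llm_call concrete, here is a minimal standalone sketch of the same per-1K-token formula. The function name estimate_cost_usd and the sample token counts are illustrative assumptions; the two prices are copied from the MODEL_COSTS table above and should be verified before use.

```python
# Per-1K-token prices in USD, copied from the article's table (assumed, not verified).
PRICES_PER_1K = {
    "claude-haiku-4-5-20251001": {"input": 0.0008, "output": 0.004},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one call; raises KeyError for unknown models."""
    price = PRICES_PER_1K[model]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


# 2,000 input tokens and 500 output tokens on the Haiku entry:
# 2.0 * 0.0008 + 0.5 * 0.004 = 0.0016 + 0.0020 = 0.0036 USD
example_cost = estimate_cost_usd("claude-haiku-4-5-20251001", 2000, 500)
```

Unlike track_llm_call, which silently skips unknown models, this sketch raises on them so that a missing price entry is noticed early.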

Cost Optimization Strategies

1. Model Routing (Task-based Routing)

python

from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"    # classification, summarization, simple QA
    MEDIUM = "medium"    # analysis, multi-step reasoning
    COMPLEX = "complex"  # long-document processing, complex code generation

MODEL_ROUTING = {
    TaskComplexity.SIMPLE:  "claude-haiku-4-5-20251001",   # $0.0008/1K input tokens
    TaskComplexity.MEDIUM:  "gpt-4o-mini",                  # $0.00015/1K input tokens
    TaskComplexity.COMPLEX: "claude-sonnet-4-6-20251001",   # $0.003/1K input tokens
}

def classify_complexity(prompt: str) -> TaskComplexity:
    prompt_len = len(prompt.split())
    if prompt_len < 50:
        return TaskComplexity.SIMPLE
    elif prompt_len < 500:
        return TaskComplexity.MEDIUM
    return TaskComplexity.COMPLEX

def smart_route(prompt: str) -> str:
    complexity = classify_complexity(prompt)
    return MODEL_ROUTING[complexity]
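
As a rough way to reason about what routing saves, the sketch below computes a blended input price for a traffic mix and compares it with sending everything to the most expensive tier. The prices come from the routing table above; the 60/30/10 traffic mix, the names blended_input_price and TRAFFIC_MIX, and the simplification that every tier sees the same token volume per request are all assumptions for illustration.

```python
# Input prices per 1K tokens (USD) for each tier, taken from the routing table.
INPUT_PRICE_PER_1K = {"simple": 0.0008, "medium": 0.00015, "complex": 0.003}


def blended_input_price(traffic_mix: dict[str, float]) -> float:
    """Weighted average input price per 1K tokens for a given traffic mix."""
    total = sum(traffic_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(share * INPUT_PRICE_PER_1K[tier] for tier, share in traffic_mix.items())


# Hypothetical mix: 60% simple, 30% medium, 10% complex requests.
TRAFFIC_MIX = {"simple": 0.6, "medium": 0.3, "complex": 0.1}

routed = blended_input_price(TRAFFIC_MIX)   # 0.00048 + 0.000045 + 0.0003
baseline = INPUT_PRICE_PER_1K["complex"]    # everything sent to the largest model
savings = 1 - routed / baseline             # fraction of input cost avoided
```

Real savings depend on output tokens and on how token volume differs between tiers, so a figure like this is best treated as an upper-level estimate to be checked against the cost metrics collected earlier.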

2. Semantic Caching

python

import anthropic
from openai import OpenAI

openai = OpenAI()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.cache: list[dict] = []

    def get_embedding(self, text: str) -> list[float]:
        return openai.embeddings.create(
            input=text, model="text-embedding-3-small"
        ).data[0].embedding

    def find(self, query: str) -> str | None:
        import numpy as np
        query_vec = self.get_embedding(query)

        for item in self.cache:
            similarity = np.dot(query_vec, item["embedding"]) / (
                np.linalg.norm(query_vec) * np.linalg.norm(item["embedding"])
            )
            if similarity >= self.threshold:
                return item["response"]
        return None

    def store(self, query: str, response: str):
        self.cache.append({
            "query": query,
            "embedding": self.get_embedding(query),
            "response": response
        })

cache = SemanticCache()

def cached_llm_call(prompt: str) -> str:
    cached = cache.find(prompt)
    if cached is not None:
        return cached  # Cache hit: the generation call is skipped (the embedding lookup still runs)

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text
    cache.store(prompt, result)
    return result

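Monitoring Tool Comparison

Tool	Open source	Self-hosting	Cost tracking	Evaluation/feedback	Notes
Langfuse	✅	✅	✅	✅	Most complete feature set
Helicone	❌	Limited	✅	✅	Simple setup, proxy-based
Phoenix (Arize)	✅	✅	❌	✅	Optimized for local development
LangSmith	❌	❌	✅	✅	LangChain-native
OpenLIT	✅	✅	✅	❌	Built on the OpenTelemetry standard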

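Technical Deep Dive

Core LLM Metrics

TTFT (Time to First Token): Time spent waiting for the first token. It directly determines how fast streaming feels.
TPS (Tokens Per Second): Number of tokens generated per second. A throughput metric.
E2E Latency: Time until the full response has completed.
Token Usage: Number of input and output tokens. The most direct driver of cost.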
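
The first three metrics can all be derived from token arrival timestamps recorded while streaming. The following is a minimal sketch under one assumed definition of TPS (tokens after the first, divided by the time elapsed since the first token); the names StreamTiming and summarize_stream are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    ttft: float         # seconds from request start to the first token
    e2e_latency: float  # seconds from request start to the last token
    tps: float          # tokens per second during generation


def summarize_stream(request_start: float, token_times: list[float]) -> StreamTiming:
    """Derive TTFT, E2E latency, and TPS from per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    e2e_latency = token_times[-1] - request_start
    generation_time = token_times[-1] - token_times[0]
    # With a single token there is no generation interval to divide by.
    tps = (len(token_times) - 1) / generation_time if generation_time > 0 else 0.0
    return StreamTiming(ttft=ttft, e2e_latency=e2e_latency, tps=tps)


# Five tokens arriving every 0.5 s, the first one 0.5 s after the request.
timing = summarize_stream(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
```

In practice the timestamps would come from time.perf_counter() calls inside the streaming loop, and timing.e2e_latency is the value that would feed a latency histogram such as llm_latency_seconds.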

Automated Quality Evaluation (LLM-as-Judge)

python

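import json

import anthropic

client = anthropic.Anthropic()

def llm_judge(question: str, answer: str, reference: str | None = None) -> dict:
    # Build the evaluation prompt; the reference answer is optional
    prompt = f"Question: {question}\nAnswer: {answer}"
    if reference:
        prompt += f"\nReference answer: {reference}"
    prompt += (
        "\n\nRate the answer above on a scale of 1 to 5 and return the reason as JSON: "
        '{"score": N, "reason": "..."}'
    )

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    # Assumes the model replies with bare JSON; json.loads raises otherwise
    return json.loads(response.content[0].text)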
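
Judge models do not always reply with bare JSON; the object is sometimes wrapped in extra prose or a code fence, in which case a direct json.loads call fails. The sketch below shows one defensive way to parse such a reply and validate the 1 to 5 score. The function name parse_judge_output is hypothetical, and the regex approach is an assumption that works for a single flat JSON object like the one requested above.

```python
import json
import re


def parse_judge_output(text: str) -> dict:
    """Extract and validate {"score": 1-5, "reason": str} from a judge reply."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score missing or outside 1-5: {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}
```

Inside llm_judge, the final line could then call parse_judge_output(response.content[0].text) so that malformed replies surface as a single, predictable ValueError.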
