What Is Multimodal AI?

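Multimodal AI processes several kinds of data together: not just text, but also images, audio, and video. This makes tasks such as describing an image, interpreting a chart, or extracting information from a document possible.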

mermaid

flowchart LR
    Image["Image/PDF"] --> Encode["Encoding<br/>Base64/URL"]
    Text["Text question"] --> API[Multimodal LLM API]
    Encode --> API
    API --> GPT["GPT-4o<br/>Vision"]
    API --> Claude["Claude 3.5<br/>Sonnet"]
    API --> Gemini["Gemini 2.5<br/>Pro"]
    GPT --> Result["Analysis result<br/>(text)"]
    Claude --> Result
    Gemini --> Result

GPT-4o Vision API

python

import base64
from openai import OpenAI
from pathlib import Path

client = OpenAI()

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_image(image_path: str, question: str) -> str:
    base64_image = encode_image(image_path)
    ext = Path(image_path).suffix.lower()
    # Path.suffix includes the leading dot, so every key must start with "."
    media_type = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                  ".png": "image/png", ".gif": "image/gif",
                  ".webp": "image/webp"}.get(ext, "image/jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{media_type};base64,{base64_image}",
                        "detail": "high"  # low/high/auto
                    }
                },
                {"type": "text", "text": question}
            ]
        }],
        max_tokens=1024
    )
    return response.choices[0].message.content

# Analyze an image directly from its URL
def analyze_image_url(url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": url}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.choices[0].message.content

# Usage example
result = analyze_image("chart.png", "What is the highest value in this chart, and what trend does it show?")
print(result)

Claude Vision API

Claude is particularly strong at analyzing long documents and complex layouts.

python

import anthropic
import base64
import httpx

client = anthropic.Anthropic()

def analyze_with_claude(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    ext = image_path.rsplit(".", 1)[-1].lower()
    media_type_map = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
                      "png": "image/png", "gif": "image/gif", "webp": "image/webp"}
    media_type = media_type_map.get(ext, "image/jpeg")

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",  # dated snapshot ID for Claude Sonnet 4.5
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

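# Download the image from a URL, then analyze it
def analyze_url_with_claude(url: str, question: str) -> str:
    resp = httpx.get(url, follow_redirects=True)
    resp.raise_for_status()  # fail early rather than Base64-encoding an error page
    image_data = base64.standard_b64encode(resp.content).decode("utf-8")
    # Use the server-reported type instead of assuming JPEG; fall back if it is missing
    media_type = resp.headers.get("content-type", "image/jpeg").split(";")[0].strip()

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data},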
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

PDF Document Analysis (Multi-Page)

python

import fitz  # PyMuPDF
import base64
from anthropic import Anthropic

client = Anthropic()

def pdf_to_images(pdf_path: str, dpi: int = 150) -> list[str]:
    doc = fitz.open(pdf_path)
    images = []
    for page in doc:
        mat = fitz.Matrix(dpi/72, dpi/72)
        pix = page.get_pixmap(matrix=mat)
        img_bytes = pix.tobytes("png")
        images.append(base64.standard_b64encode(img_bytes).decode("utf-8"))
    doc.close()
    return images

def analyze_pdf(pdf_path: str, question: str, max_pages: int = 10) -> str:
    images = pdf_to_images(pdf_path)[:max_pages]

    content = []
    for img_data in images:
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": img_data}
        })
    content.append({"type": "text", "text": question})

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",  # dated snapshot ID for Claude Sonnet 4.5
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# Extract the key clauses from a contract
result = analyze_pdf(
    "contract.pdf",
    "Summarize the penalty, contract term, and confidentiality clauses of this contract in a table"
)
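
As an alternative to rendering each page as an image, the Anthropic Messages API also accepts a PDF directly as a `document` content block. The sketch below only builds the request content and makes no API call; it assumes the chosen model supports PDF input, and `build_pdf_content` is a hypothetical helper name, not part of the SDK.

```python
import base64


def build_pdf_content(pdf_bytes: bytes, question: str) -> list[dict]:
    """Build a Messages API `content` list that sends a PDF as a document block."""
    pdf_data = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    return [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_data,
            },
        },
        # The question goes after the document, mirroring the image examples.
        {"type": "text", "text": question},
    ]
```

The returned list can be passed as the `content` of a user message in `client.messages.create(...)`, in place of the per-page image list built in `analyze_pdf`.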

Practical Use Cases

Receipt/Invoice OCR

python

def extract_invoice_data(image_path: str) -> dict:
    prompt = '''Extract the following information from this receipt/invoice as JSON:
    {
        "vendor": "store or company name",
        "date": "date (YYYY-MM-DD)",
        "total": total amount (number),
        "tax": tax (number),
        "items": [{"name": "item name", "qty": quantity, "price": price}]
    }'''

    result = analyze_with_claude(image_path, prompt)
    import json
    return json.loads(result.replace("```json", "").replace("```", "").strip())

# Product photo → detailed description
def generate_product_description(image_url: str) -> str:
    return analyze_image_url(
        image_url,
        "Write a detailed description of this product in Korean, within 200 characters, covering its features, materials, and uses."
    )

Chart Data Extraction

python

def extract_chart_data(chart_path: str) -> dict:
    prompt = '''Return the following from this chart as JSON:
    {
        "chart_type": "chart type",
        "title": "title",
        "x_axis": "X-axis label",
        "y_axis": "Y-axis label",
        "data_points": [{"label": "label", "value": value}],
        "trend": "description of the overall trend"
    }'''

    import json
    result = analyze_with_claude(chart_path, prompt)
    return json.loads(result.replace("```json", "").replace("```", "").strip())
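
Stripping Markdown code fences with chained `replace` calls works when the model returns nothing but a fenced JSON object, but it fails as soon as the reply contains surrounding prose. The following is a minimal sketch of a more tolerant parser; `parse_json_response` is a hypothetical helper, and it assumes the reply contains exactly one top-level JSON object.

```python
import json
import re

# Built programmatically so this example contains no literal fence marker.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_response(text: str) -> dict:
    """Extract one JSON object from a model reply, ignoring fences and prose."""
    # Prefer the contents of a fenced block if the reply has one.
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    # Otherwise take the span between the first "{" and the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    return json.loads(candidate[start:end + 1])
```

In `extract_invoice_data` and `extract_chart_data`, the final `json.loads(...)` line could call this helper instead. Malformed JSON still raises `json.JSONDecodeError`, so callers should be prepared to retry or report the failure.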

Multimodal Feature Comparison by Model

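Model	Images per request	Resolution	PDF	Video	Strengths
GPT-4o	Up to 20	Up to 2048px	❌ (must be converted to images)	❌	Diagrams, UI analysis
Claude 3.5 Sonnet	Up to 20	Up to 8000px	✅ Native	❌	Long documents, complex layouts
Gemini 2.5 Pro	Up to 16	No limit	✅	✅	Video, very large documents
Gemini 2.5 Flash	Up to 16	No limit	✅	✅	Best value for money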

Technical Deep Dive

How Vision Models Work Internally

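The internals of GPT-4V and Claude Vision are not public, but models of this kind are generally described as using a CLIP-style vision encoder: the image is split into small fixed-size patches (ViT-style encoders typically take a 224×224 px input and cut it into 14×14 or 16×16 px patches), each patch is converted into an embedding vector, and those vectors are fed to the transformer alongside the text tokens. OpenAI's detail: "high" mode splits the image into more tiles, which allows finer-grained analysis but also raises the token cost.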
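
To make the patching step concrete, the sketch below splits a tiny grayscale image into non-overlapping patches and flattens each one into a vector. It is illustrative only: pure Python, no learned projection, and not the actual pipeline of any particular model. `patchify` is a hypothetical name.

```python
def patchify(image: list[list[int]], patch: int) -> list[list[int]]:
    """Split an H x W grayscale image into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Row-major flatten of one tile. A real encoder would next apply a
            # learned linear projection to turn this vector into an embedding.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


# A 4x4 image cut into 2x2 patches gives 4 patch vectors of length 4.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
```

Each flattened patch plays the role of one "visual token"; the number of patches, and therefore the sequence length seen by the transformer, grows with image resolution.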

Image Token Costs

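OpenAI GPT-4o detail: "low": a fixed 85 tokens per image
OpenAI GPT-4o detail: "high": 170 tokens per 512×512 px tile (170 to 1,360 tokens depending on image size), plus the 85-token base
Claude: proportional to image size/resolution, roughly width × height / 750 (1024×1024 ≈ 1,400 tokens)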
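
These costs can be estimated before sending a request. The sketch below assumes the tile-based rule OpenAI has documented for GPT-4o (resize to fit 2048×2048, shrink the shortest side to 768 px, then charge 85 base tokens plus 170 per 512×512 tile) and Anthropic's published width × height / 750 approximation. Both function names are hypothetical, the "never upscale" behavior is an assumption, and actual billing may differ.

```python
import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens: 85 base + 170 per 512 px tile."""
    if detail == "low":
        return 85
    # Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px (assumed: no upscaling).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def claude_image_tokens(width: int, height: int) -> int:
    """Rough Claude estimate, ignoring any server-side resizing of large images."""
    return round(width * height / 750)
```

Under these rules `openai_image_tokens(1024, 1024)` evaluates to 765 (four tiles plus the base), which shows why downscaling images before upload can cut costs noticeably when fine detail is not needed.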

Multimodal Prompting Tips

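Ask specific questions: prefer "What was the 2023 revenue in the left-hand graph?" over "Describe the image"
Request structured output: ask for JSON, a table, or a list
Use spatial references: point to regions with phrases such as "in the top left..."
Chain analysis: supply multiple images in order for comparative analysis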
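
The chain-analysis tip can be sketched as a small content builder for the Anthropic message format used earlier. It interleaves a short text label before each image so the question can refer to "Image 1", "Image 2", and so on. `build_comparison_content` is a hypothetical helper, and it assumes every image is PNG-encoded.

```python
import base64


def build_comparison_content(
    images: list[tuple[str, bytes]], question: str
) -> list[dict]:
    """Interleave 'Image N' labels with PNG image blocks, then append the question."""
    content = []
    for index, (label, png_bytes) in enumerate(images, start=1):
        # A text label right before each image gives the model a stable name for it.
        content.append({"type": "text", "text": f"Image {index} ({label}):"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.standard_b64encode(png_bytes).decode("utf-8"),
            },
        })
    content.append({"type": "text", "text": question})
    return content
```

A question such as "How did revenue change between Image 1 and Image 2? Answer as JSON." then combines three of the tips above: a specific question, a structured output request, and ordered images.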
