infrastructure

Cost Optimization

Tối ưu chi phí: AI không đốt tiền

Độ khóadvanced

1Dự đoán1/7

Dự đoán

Startup Việt Nam chạy chatbot GPT-4o, 100K request/ngày, hoá đơn $90K/tháng. CEO yêu cầu giảm còn $20K mà không hy sinh chất lượng rõ rệt. Khả thi không?

2Khám phá2/7

Hãy tưởng tượng hệ thống AI của bạn giống một nhà hàng cao cấp. Mỗi request đến là một bàn đặt món. Đầu bếp chính (GPT-4o) làm được mọi món nhưng đắt. lương cao, quầy đặc biệt. Phụ bếp (Haiku/Flash) làm được 70% món đơn giản (salad, mì gói, bánh mì) với chi phí chỉ bằng 1/20.

Câu hỏi không phải là "sa thải đầu bếp chính". Câu hỏi đúng là: ai làm món nào? Khi nào nên dùng món cũ trong tủ lạnh (cache)? Khi nào nên chuẩn bị trước hàng loạt (batch) để giảm thao tác riêng lẻ? Có cần đọc hết cuốn menu 100 trang mỗi lần không (context compression)?

Thử ngay với máy tính dưới đây. Kéo thanh trượt, chọn model, và bật tắt từng tối ưu để thấy hoá đơn thay đổi theo thời gian thực.

Hình minh họa

Requests / ngày

100,000 req

Input tokens / request

800 token

Output tokens / request

300 token

Model chính (flagship)

Đa năng, mạnh về code + suy luận dài.

Model nhỏ (dùng khi bật routing)

Nhanh, rẻ, phù hợp tác vụ phân loại + tóm tắt ngắn.

Chi phí tháng (baseline)

$25,500

Sau tối ưu (0/4 bật)

$25,500

Tiết kiệm

Nén context (LLMLingua / re-rank)

Routing sang model nhỏ

Prompt / Semantic caching

Batch API

Lưu ý: các hệ số tối ưu là trung bình ngành. Thực tế dao động theo workload, tỉ lệ cache hit và phân bố độ khó request.

Mẹo khám phá

Thử tắt tất cả toggle → bật dần từng cái. Toggle nào đóng góp nhiều nhất cho workload của bạn? Gợi ý: khi request ngắn (< 500 input token), caching & routing thắng; khi request dài, compression thắng.

3Khoảnh khắc Aha3/7

Tối ưu chi phí LLM không phải là "tìm model rẻ nhất". Đó là một cascade nhiều lớp: đầu tiên đừng hỏi những gì có thể cache lại, kế đến hỏi model nhỏ trước, model lớn sau nếu cần, và cuối cùng khi phải gọi. hãy gọi với ít token nhất có thể.

4Thử thách4/7

Workload A: 500K request/ngày, mỗi request 200 input token + 50 output token (phân loại intent). Workload B: 5K request/ngày, mỗi request 6K input token + 1K output token (tóm tắt báo cáo dài). Workload nào hưởng lợi NHIỀU NHẤT từ context compression?

Team của bạn đang cache HTTP response 1 giờ cho /chat endpoint. User report: 'Chatbot trả lời sai thông tin sản phẩm mới'. Nguyên nhân gì?

5Lý thuyết5/7

Giải thích

Tối ưu chi phí LLM là tập hợp các kỹ thuật giảm chi tiêu trên một hệ thống AI mà vẫn duy trì chất lượng (đo bằng các metric nội dung. accuracy, user rating, task success rate. chứ không chỉ bằng latency).

Công thức chi phí tổng:

\text{Cost}_{\text{total}} = \underbrace{N_{\text{req}} \times C_{\text{per\_req}}}_{\text{Inference}} + \underbrace{N_{\text{GPU}} \times T \times P_{\text{GPU}}}_{\text{Infrastructure}} + \underbrace{C_{\text{ops}}}_{\text{Operations}}

Với API LLM, trọng tâm là C_{per_req}:

C_{\text{per\_req}} = \frac{\text{input}_{\text{tokens}} \cdot P_{\text{in}} + \text{output}_{\text{tokens}} \cdot P_{\text{out}}}{1000}

Trong đó P_in, P_out là giá mỗi 1K token của model tương ứng.

5 chiến lược cốt lõi:

1. Caching. đừng gọi LLM nếu câu hỏi đã được trả lời. Hai loại:

Prompt / prefix caching. giảm giá input cho phần prefix trùng khớp chính xác. Hỗ trợ native bởi OpenAI, Anthropic, Gemini.
Semantic caching. encode câu hỏi thành embedding, tìm entry gần nhất trong vector DB. Hit khi similarity vượt ngưỡng (thường 0.92–0.97).

\text{Cost}_{\text{cached}} = (1 - \text{hit\_rate}) \cdot \text{Cost}_{\text{original}}

2. Model routing. một classifier/router chọn model phù hợp:

\text{Cost}_{\text{routed}} = p_{\text{simple}} \cdot C_{\text{small}} + (1 - p_{\text{simple}}) \cdot C_{\text{large}}

Với p_simple ≈ 0.7 (phân bố 80/20 trong thực tế) và C_small / C_large ≈ 0.05 (Haiku vs GPT-4o), chi phí giảm hơn một nửa mà chất lượng giảm chưa tới 2%.

3. Context compression. giảm số input token trước khi gửi:

LLMLingua. LM nhỏ chấm điểm mỗi token, cắt token có perplexity thấp.
Re-rank top-k. RAG truy xuất 20 chunk, re-rank, chỉ nhét 3 chunk vào prompt.
Summarize then generate. model nhỏ tóm tắt history dài, model lớn đọc tóm tắt.

4. Batch API. OpenAI Batch, Anthropic Batch, Gemini Batch đều giảm 50% giá. Đánh đổi: kết quả trả về trong 24h, không phù hợp real-time.

5. Prompt / output engineering. output token đắt gấp 2-4 lần input. Mỗi lần cắt 100 output token = tiết kiệm trực tiếp. Kỹ thuật:

JSON schema strict → model không xả text thừa.
max_tokens = số thực tế cần, không để mặc định 4096.
Stop sequences để dừng sớm.
System prompt "ngắn gọn, không giải thích lại câu hỏi".

Thứ tự áp dụng tối ưu

Theo kinh nghiệm thực tế, thứ tự ROI cao → thấp: caching → routing → prompt engineering → compression → batch. Bắt đầu từ cái đơn giản nhất (caching. chỉ cần Redis) trước khi đụng đến routing (cần classifier).

Đo trước khi tối ưu

Không có log thì không thể tối ưu. Tối thiểu cần track: request_id, model, input_tokens, output_tokens, latency_ms, cost_usd, user_id. Dashboard hằng ngày theo model + theo endpoint.

Code mẫu 1. đếm token và ước lượng chi phí trước khi gọi:

token_counter.py. ước lượng chi phí trước khi gọi API

"""Đếm token và ước lượng chi phí trước khi gọi LLM API.

Chức năng:
  - Đếm token chính xác bằng tokenizer chính thức (tiktoken cho OpenAI).
  - Ước lượng chi phí USD theo bảng giá cập nhật.
  - Từ chối request nếu vượt budget/request hoặc budget/ngày.
  - Ghi log để dashboard chi phí phân tích sau.
"""

from __future__ import annotations

import json
import time
from dataclasses import dataclass, asdict
from typing import Iterable

import tiktoken  # pip install tiktoken


# Bảng giá USD / 1K token. đồng bộ với bảng chính thức trước khi dùng prod.
MODEL_PRICES = {
    "gpt-4o":          {"in": 0.005,    "out": 0.015},
    "gpt-4o-mini":     {"in": 0.00015,  "out": 0.00060},
    "claude-3-5-sonnet": {"in": 0.003,  "out": 0.015},
    "claude-3-haiku":  {"in": 0.00025,  "out": 0.00125},
    "gemini-1.5-flash": {"in": 0.000075,"out": 0.0003},
}


@dataclass
class TokenEstimate:
    model: str
    input_tokens: int
    estimated_output_tokens: int
    cost_input_usd: float
    cost_output_usd: float
    total_usd: float

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Đếm token chính xác bằng tokenizer của model tương ứng."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


def estimate_cost(
    prompt: str,
    model: str = "gpt-4o",
    expected_output_tokens: int = 300,
) -> TokenEstimate:
    """Ước lượng chi phí dựa trên prompt và số output dự kiến."""
    if model not in MODEL_PRICES:
        raise ValueError(f"Unknown model: {model}")

    price = MODEL_PRICES[model]
    input_tokens = count_tokens(prompt, model)

    cost_in = input_tokens / 1000 * price["in"]
    cost_out = expected_output_tokens / 1000 * price["out"]
    total = cost_in + cost_out

    return TokenEstimate(
        model=model,
        input_tokens=input_tokens,
        estimated_output_tokens=expected_output_tokens,
        cost_input_usd=round(cost_in, 6),
        cost_output_usd=round(cost_out, 6),
        total_usd=round(total, 6),
    )


class BudgetGuard:
    """Từ chối request nếu vượt budget/request hoặc budget/ngày."""

    def __init__(
        self,
        max_per_request_usd: float = 0.50,
        max_per_day_usd: float = 500.0,
    ) -> None:
        self.max_per_request = max_per_request_usd
        self.max_per_day = max_per_day_usd
        self._spent_today_usd = 0.0
        self._day_start = time.time()

    def _reset_if_new_day(self) -> None:
        if time.time() - self._day_start > 86400:
            self._spent_today_usd = 0.0
            self._day_start = time.time()

    def check(self, estimate: TokenEstimate) -> None:
        self._reset_if_new_day()
        if estimate.total_usd > self.max_per_request:
            raise RuntimeError(
                f"Request cost $"
                f"{estimate.total_usd:.4f} > limit $"
                f"{self.max_per_request:.2f}"
            )
        if self._spent_today_usd + estimate.total_usd > self.max_per_day:
            raise RuntimeError(
                f"Daily budget exceeded: $"
                f"{self._spent_today_usd + estimate.total_usd:.2f} > $"
                f"{self.max_per_day:.2f}"
            )

    def record(self, estimate: TokenEstimate) -> None:
        self._spent_today_usd += estimate.total_usd


# --- Ví dụ sử dụng ---
if __name__ == "__main__":
    prompt = "Tóm tắt báo cáo Q3 này thành 5 gạch đầu dòng: ..."
    est = estimate_cost(prompt, model="gpt-4o", expected_output_tokens=200)
    print(est.to_json())

    guard = BudgetGuard(max_per_request_usd=0.10, max_per_day_usd=100.0)
    try:
        guard.check(est)
        # ... gọi LLM thật ở đây ...
        guard.record(est)
    except RuntimeError as e:
        print("Rejected:", e)

Code mẫu 2. semantic cache + model router đơn giản:

router_cache.py. cache + routing

"""Pipeline tối ưu chi phí cho chatbot:
    1) Semantic cache (Redis + embedding).
    2) Classifier định tuyến small vs flagship model.
    3) Ghi log chi phí để dashboard phân tích.
"""

from __future__ import annotations

import hashlib
import json
import os
import time
from typing import Literal, Optional

import numpy as np
import redis
from sentence_transformers import SentenceTransformer


EMBEDDER = SentenceTransformer("all-MiniLM-L6-v2")
CACHE = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))
SIMILARITY_THRESHOLD = 0.95


def _embed(text: str) -> np.ndarray:
    return EMBEDDER.encode(text, normalize_embeddings=True)


def _cache_key(prefix: str, text: str) -> str:
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}:{digest}"


def semantic_cache_lookup(query: str) -> Optional[str]:
    """Tìm câu trả lời gần nhất trong cache. Trả None nếu miss."""
    query_emb = _embed(query).astype(np.float32)

    # Quét tối đa 200 entry gần nhất. production nên dùng FAISS / pgvector.
    for key in CACHE.scan_iter(match="cache:*", count=200):
        raw = CACHE.get(key)
        if raw is None:
            continue
        entry = json.loads(raw)
        cached_emb = np.asarray(entry["embedding"], dtype=np.float32)
        similarity = float(np.dot(query_emb, cached_emb))
        if similarity >= SIMILARITY_THRESHOLD:
            return entry["response"]
    return None


def semantic_cache_store(query: str, response: str, ttl: int = 3600) -> None:
    key = _cache_key("cache", query)
    payload = {
        "embedding": _embed(query).astype(np.float32).tolist(),
        "response": response,
        "stored_at": time.time(),
    }
    CACHE.setex(key, ttl, json.dumps(payload))


# ---------------------------------------------------------------------------
# Classifier định tuyến. trong thực tế có thể là logistic regression, small
# LM, hoặc gọi một API rẻ (Haiku). Ở đây mô phỏng bằng heuristic độ dài.
# ---------------------------------------------------------------------------

def classify_complexity(query: str) -> Literal["simple", "complex"]:
    word_count = len(query.split())
    has_code_hint = any(k in query.lower() for k in ["code", "python", "sql"])
    if word_count < 40 and not has_code_hint:
        return "simple"
    return "complex"


def call_small_model(query: str) -> str:
    # ... gọi Haiku / Flash / Llama-8B self-host ...
    return f"[small] Trả lời cho: {query[:40]}..."


def call_flagship_model(query: str) -> str:
    # ... gọi GPT-4o / Claude Sonnet ...
    return f"[flagship] Trả lời cho: {query[:40]}..."


def chat(query: str) -> dict:
    """Pipeline chính."""
    # 1. Thử cache.
    cached = semantic_cache_lookup(query)
    if cached:
        return {"response": cached, "source": "cache", "model": None}

    # 2. Route theo độ phức tạp.
    complexity = classify_complexity(query)
    if complexity == "simple":
        answer = call_small_model(query)
        model = "claude-haiku"
    else:
        answer = call_flagship_model(query)
        model = "gpt-4o"

    # 3. Lưu cache để lần sau khỏi gọi nữa.
    semantic_cache_store(query, answer)
    return {"response": answer, "source": "llm", "model": model}


if __name__ == "__main__":
    for q in [
        "Hà Nội hôm nay thời tiết thế nào?",
        "Thời tiết ở Hà Nội hôm nay ra sao?",
        "Viết Python function tính fibonacci bằng DP bottom-up",
    ]:
        r = chat(q)
        print(q, "->", r["source"], r["model"])

Thực tế tại startup Việt Nam

Mẫu phổ biến: self-host một model nhỏ (Llama-3-8B, Qwen2-7B) trên FPT Cloud / Zadark / VNG Cloud cho 70-80% request, fallback GPT-4o hoặc Claude Sonnet cho 20-30% phức tạp. Kết hợp semantic cache trên Redis. Từ $50K → $5-7K/tháng mà chất lượng giảm dưới 3%.

Trong thực tế, hãy thiết lập một pipeline tối giản: (1) mọi request đều đi qua layer budget-guard, (2) layer cache, (3) layer router, (4) layer model call. Log mỗi bước. Sau 2 tuần có dữ liệu, xem dashboard để biết layer nào đang ăn tiền nhiều nhất và tối ưu đúng chỗ đó.

Các chủ đề liên quan bạn có thể xem sau: tối ưu inference, tối ưu GPU, triển khai model.

6Tóm tắt6/7

Những điều cần nhớ về tối ưu chi phí LLM

Output token đắt hơn input token 2-4 lần. cắt output bằng JSON schema và max_tokens là ROI cao nhất / công sức thấp nhất.
Caching là layer tối ưu đầu tiên: prompt caching giảm 50% input cho prefix chung, semantic caching bắt cả câu hỏi tương tự.
Model routing 70/30 (nhỏ/lớn) dựa trên classifier giảm 60-80% chi phí với chất lượng giảm chưa tới 2% trên workload điển hình.
Context compression (LLMLingua, re-rank top-k) giảm 30-40% input token. thắng lớn trên RAG và tài liệu dài.
Batch API giảm 50% giá cho workload offline (embedding, phân loại hàng loạt, gán nhãn dataset).
Đo trước khi tối ưu: log request_id, model, tokens, latency, cost, user_id. Không có dashboard thì đang tối ưu trong bóng tối.

7Kiểm tra7/7

Kiểm tra hiểu biết

Câu 1/8

Startup chạy 200K request/ngày qua GPT-4o. Trước khi đổi model, bạn nên làm gì đầu tiên để giảm chi phí?

Chủ đề liên quan

Quantization: Lượng tử hóa mô hình GPU Optimization: Tối ưu GPU: đọc profiler trước khi sửa Edge AI: Edge AI: chạy model ngay trên thiết bị

Hình minh họa

Requests / ngày

100,000 req

Input tokens / request

800 token

Output tokens / request

300 token

Model chính (flagship)

Đa năng, mạnh về code + suy luận dài.

Model nhỏ (dùng khi bật routing)

Nhanh, rẻ, phù hợp tác vụ phân loại + tóm tắt ngắn.

Chi phí tháng (baseline)

$25,500

Sau tối ưu (0/4 bật)

$25,500

Tiết kiệm

Nén context (LLMLingua / re-rank)

Routing sang model nhỏ

Prompt / Semantic caching

Batch API

Lưu ý: các hệ số tối ưu là trung bình ngành. Thực tế dao động theo workload, tỉ lệ cache hit và phân bố độ khó request.

Mẹo khám phá

Giải thích

Công thức chi phí tổng:

\text{Cost}_{\text{total}} = \underbrace{N_{\text{req}} \times C_{\text{per\_req}}}_{\text{Inference}} + \underbrace{N_{\text{GPU}} \times T \times P_{\text{GPU}}}_{\text{Infrastructure}} + \underbrace{C_{\text{ops}}}_{\text{Operations}}

Với API LLM, trọng tâm là C_{per_req}:

C_{\text{per\_req}} = \frac{\text{input}_{\text{tokens}} \cdot P_{\text{in}} + \text{output}_{\text{tokens}} \cdot P_{\text{out}}}{1000}

Trong đó P_in, P_out là giá mỗi 1K token của model tương ứng.

5 chiến lược cốt lõi:

1. Caching. đừng gọi LLM nếu câu hỏi đã được trả lời. Hai loại:

Prompt / prefix caching. giảm giá input cho phần prefix trùng khớp chính xác. Hỗ trợ native bởi OpenAI, Anthropic, Gemini.
Semantic caching. encode câu hỏi thành embedding, tìm entry gần nhất trong vector DB. Hit khi similarity vượt ngưỡng (thường 0.92–0.97).

\text{Cost}_{\text{cached}} = (1 - \text{hit\_rate}) \cdot \text{Cost}_{\text{original}}

2. Model routing. một classifier/router chọn model phù hợp:

\text{Cost}_{\text{routed}} = p_{\text{simple}} \cdot C_{\text{small}} + (1 - p_{\text{simple}}) \cdot C_{\text{large}}

Với p_simple ≈ 0.7 (phân bố 80/20 trong thực tế) và C_small / C_large ≈ 0.05 (Haiku vs GPT-4o), chi phí giảm hơn một nửa mà chất lượng giảm chưa tới 2%.

3. Context compression. giảm số input token trước khi gửi:

LLMLingua. LM nhỏ chấm điểm mỗi token, cắt token có perplexity thấp.
Re-rank top-k. RAG truy xuất 20 chunk, re-rank, chỉ nhét 3 chunk vào prompt.
Summarize then generate. model nhỏ tóm tắt history dài, model lớn đọc tóm tắt.

4. Batch API. OpenAI Batch, Anthropic Batch, Gemini Batch đều giảm 50% giá. Đánh đổi: kết quả trả về trong 24h, không phù hợp real-time.

5. Prompt / output engineering. output token đắt gấp 2-4 lần input. Mỗi lần cắt 100 output token = tiết kiệm trực tiếp. Kỹ thuật:

JSON schema strict → model không xả text thừa.
max_tokens = số thực tế cần, không để mặc định 4096.
Stop sequences để dừng sớm.
System prompt "ngắn gọn, không giải thích lại câu hỏi".

Thứ tự áp dụng tối ưu

Đo trước khi tối ưu

Code mẫu 1. đếm token và ước lượng chi phí trước khi gọi:

token_counter.py. ước lượng chi phí trước khi gọi API

"""Đếm token và ước lượng chi phí trước khi gọi LLM API.

Chức năng:
  - Đếm token chính xác bằng tokenizer chính thức (tiktoken cho OpenAI).
  - Ước lượng chi phí USD theo bảng giá cập nhật.
  - Từ chối request nếu vượt budget/request hoặc budget/ngày.
  - Ghi log để dashboard chi phí phân tích sau.
"""

from __future__ import annotations

import json
import time
from dataclasses import dataclass, asdict
from typing import Iterable

import tiktoken  # pip install tiktoken


# Bảng giá USD / 1K token. đồng bộ với bảng chính thức trước khi dùng prod.
MODEL_PRICES = {
    "gpt-4o":          {"in": 0.005,    "out": 0.015},
    "gpt-4o-mini":     {"in": 0.00015,  "out": 0.00060},
    "claude-3-5-sonnet": {"in": 0.003,  "out": 0.015},
    "claude-3-haiku":  {"in": 0.00025,  "out": 0.00125},
    "gemini-1.5-flash": {"in": 0.000075,"out": 0.0003},
}


@dataclass
class TokenEstimate:
    model: str
    input_tokens: int
    estimated_output_tokens: int
    cost_input_usd: float
    cost_output_usd: float
    total_usd: float

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Đếm token chính xác bằng tokenizer của model tương ứng."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


def estimate_cost(
    prompt: str,
    model: str = "gpt-4o",
    expected_output_tokens: int = 300,
) -> TokenEstimate:
    """Ước lượng chi phí dựa trên prompt và số output dự kiến."""
    if model not in MODEL_PRICES:
        raise ValueError(f"Unknown model: {model}")

    price = MODEL_PRICES[model]
    input_tokens = count_tokens(prompt, model)

    cost_in = input_tokens / 1000 * price["in"]
    cost_out = expected_output_tokens / 1000 * price["out"]
    total = cost_in + cost_out

    return TokenEstimate(
        model=model,
        input_tokens=input_tokens,
        estimated_output_tokens=expected_output_tokens,
        cost_input_usd=round(cost_in, 6),
        cost_output_usd=round(cost_out, 6),
        total_usd=round(total, 6),
    )


class BudgetGuard:
    """Từ chối request nếu vượt budget/request hoặc budget/ngày."""

    def __init__(
        self,
        max_per_request_usd: float = 0.50,
        max_per_day_usd: float = 500.0,
    ) -> None:
        self.max_per_request = max_per_request_usd
        self.max_per_day = max_per_day_usd
        self._spent_today_usd = 0.0
        self._day_start = time.time()

    def _reset_if_new_day(self) -> None:
        if time.time() - self._day_start > 86400:
            self._spent_today_usd = 0.0
            self._day_start = time.time()

    def check(self, estimate: TokenEstimate) -> None:
        self._reset_if_new_day()
        if estimate.total_usd > self.max_per_request:
            raise RuntimeError(
                f"Request cost $"
                f"{estimate.total_usd:.4f} > limit $"
                f"{self.max_per_request:.2f}"
            )
        if self._spent_today_usd + estimate.total_usd > self.max_per_day:
            raise RuntimeError(
                f"Daily budget exceeded: $"
                f"{self._spent_today_usd + estimate.total_usd:.2f} > $"
                f"{self.max_per_day:.2f}"
            )

    def record(self, estimate: TokenEstimate) -> None:
        self._spent_today_usd += estimate.total_usd


# --- Ví dụ sử dụng ---
if __name__ == "__main__":
    prompt = "Tóm tắt báo cáo Q3 này thành 5 gạch đầu dòng: ..."
    est = estimate_cost(prompt, model="gpt-4o", expected_output_tokens=200)
    print(est.to_json())

    guard = BudgetGuard(max_per_request_usd=0.10, max_per_day_usd=100.0)
    try:
        guard.check(est)
        # ... gọi LLM thật ở đây ...
        guard.record(est)
    except RuntimeError as e:
        print("Rejected:", e)

Code mẫu 2. semantic cache + model router đơn giản:

router_cache.py. cache + routing

"""Pipeline tối ưu chi phí cho chatbot:
    1) Semantic cache (Redis + embedding).
    2) Classifier định tuyến small vs flagship model.
    3) Ghi log chi phí để dashboard phân tích.
"""

from __future__ import annotations

import hashlib
import json
import os
import time
from typing import Literal, Optional

import numpy as np
import redis
from sentence_transformers import SentenceTransformer


EMBEDDER = SentenceTransformer("all-MiniLM-L6-v2")
CACHE = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))
SIMILARITY_THRESHOLD = 0.95


def _embed(text: str) -> np.ndarray:
    return EMBEDDER.encode(text, normalize_embeddings=True)


def _cache_key(prefix: str, text: str) -> str:
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}:{digest}"


def semantic_cache_lookup(query: str) -> Optional[str]:
    """Tìm câu trả lời gần nhất trong cache. Trả None nếu miss."""
    query_emb = _embed(query).astype(np.float32)

    # Quét tối đa 200 entry gần nhất. production nên dùng FAISS / pgvector.
    for key in CACHE.scan_iter(match="cache:*", count=200):
        raw = CACHE.get(key)
        if raw is None:
            continue
        entry = json.loads(raw)
        cached_emb = np.asarray(entry["embedding"], dtype=np.float32)
        similarity = float(np.dot(query_emb, cached_emb))
        if similarity >= SIMILARITY_THRESHOLD:
            return entry["response"]
    return None


def semantic_cache_store(query: str, response: str, ttl: int = 3600) -> None:
    key = _cache_key("cache", query)
    payload = {
        "embedding": _embed(query).astype(np.float32).tolist(),
        "response": response,
        "stored_at": time.time(),
    }
    CACHE.setex(key, ttl, json.dumps(payload))


# ---------------------------------------------------------------------------
# Classifier định tuyến. trong thực tế có thể là logistic regression, small
# LM, hoặc gọi một API rẻ (Haiku). Ở đây mô phỏng bằng heuristic độ dài.
# ---------------------------------------------------------------------------

def classify_complexity(query: str) -> Literal["simple", "complex"]:
    word_count = len(query.split())
    has_code_hint = any(k in query.lower() for k in ["code", "python", "sql"])
    if word_count < 40 and not has_code_hint:
        return "simple"
    return "complex"


def call_small_model(query: str) -> str:
    # ... gọi Haiku / Flash / Llama-8B self-host ...
    return f"[small] Trả lời cho: {query[:40]}..."


def call_flagship_model(query: str) -> str:
    # ... gọi GPT-4o / Claude Sonnet ...
    return f"[flagship] Trả lời cho: {query[:40]}..."


def chat(query: str) -> dict:
    """Pipeline chính."""
    # 1. Thử cache.
    cached = semantic_cache_lookup(query)
    if cached:
        return {"response": cached, "source": "cache", "model": None}

    # 2. Route theo độ phức tạp.
    complexity = classify_complexity(query)
    if complexity == "simple":
        answer = call_small_model(query)
        model = "claude-haiku"
    else:
        answer = call_flagship_model(query)
        model = "gpt-4o"

    # 3. Lưu cache để lần sau khỏi gọi nữa.
    semantic_cache_store(query, answer)
    return {"response": answer, "source": "llm", "model": model}


if __name__ == "__main__":
    for q in [
        "Hà Nội hôm nay thời tiết thế nào?",
        "Thời tiết ở Hà Nội hôm nay ra sao?",
        "Viết Python function tính fibonacci bằng DP bottom-up",
    ]:
        r = chat(q)
        print(q, "->", r["source"], r["model"])

Thực tế tại startup Việt Nam

Các chủ đề liên quan bạn có thể xem sau: tối ưu inference, tối ưu GPU, triển khai model.

Kiểm tra hiểu biết

Câu 1/8

Startup chạy 200K request/ngày qua GPT-4o. Trước khi đổi model, bạn nên làm gì đầu tiên để giảm chi phí?