Free Tool · No Email Required

LLM Cost Calculator

Project your monthly LLM API bill across OpenAI, Anthropic, Gemini, and self-hosted Llama. The calculator models prompt caching and cache hit rate. It runs 100% client-side, so your numbers stay in your browser.

Calculations shown in USD. Reference: $1,000/month ≈ 25,000,000 ₫

Workload inputs

Model / provider

Requests per month

Avg input tokens / request

Includes system prompt + retrieved context + user query.

Avg output tokens / request

Prompt caching enabled?

Yes — system prompt is cached (5-min window for OpenAI, configurable for Anthropic)

Cache hit rate: 30%

Fraction of input tokens served from cache. Typical: 30–60% for stable system prompts; 0–20% for highly dynamic prompts.
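To make the effect of the hit rate concrete, here is a minimal sketch of the blended input price it implies. The function name and the example figures ($2.50 per million input tokens, a 50% cached-token discount) are illustrative assumptions, not values read from the calculator's code.

```python
def effective_input_price(base_price_per_mtok: float,
                          cache_hit_rate: float,
                          cache_discount: float) -> float:
    """Blended USD price per million input tokens.

    cache_hit_rate: fraction of input tokens served from cache (0..1).
    cache_discount: fraction taken off the base price for cached tokens (0..1).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_price = base_price_per_mtok * (1.0 - cache_discount)
    return ((1.0 - cache_hit_rate) * base_price_per_mtok
            + cache_hit_rate * cached_price)


# Assumed example: $2.50 base price, 50% discount on cached tokens.
for hit_rate in (0.0, 0.30, 0.60):
    price = effective_input_price(2.50, hit_rate, 0.50)
    print(f"hit rate {hit_rate:.0%}: ${price:.3f} per million input tokens")
```

Under these assumptions, moving from a 0% to a 60% hit rate lowers the blended input price from $2.50 to $1.75 per million tokens. Output tokens are unaffected.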

Monthly cost

$92.50

OpenAI GPT-4o

Per request

$0.0092

Annual

$1,110

Same workload, other models

Cheapest first. Mini/Haiku/Flash models typically work for simple classification or extraction tasks; flagship models for reasoning + synthesis. Model routing combines them.

Model	Monthly	Vs current
Gemini 1.5 Flash	$2.66	−97%
Llama 3.3 8B (self-hosted)	$3.75	−96%
OpenAI GPT-4o-mini	$5.55	−94%
Llama 3.3 70B (self-hosted)	$15.00	−84%
Claude 3.5 Haiku	$31.68	−66%
Gemini 1.5 Pro	$44.38	−52%
Claude 3.5 Sonnet	$119	+28%
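
The "Vs current" column appears to be the relative difference between each model's monthly cost and the currently selected model's cost ($92.50 above). A minimal sketch of that arithmetic follows. The formula is an assumption that happens to match the rows shown, and the function name is hypothetical.

```python
def vs_current(other_monthly: float, current_monthly: float) -> int:
    """Percent difference of another model's monthly cost vs the current one."""
    if current_monthly <= 0:
        raise ValueError("current_monthly must be positive")
    return round((other_monthly - current_monthly) / current_monthly * 100)


current = 92.50  # monthly cost of the selected model, from the result panel
rows = {
    "OpenAI GPT-4o-mini": 5.55,
    "Gemini 1.5 Flash": 2.66,
    "Llama 3.3 70B (self-hosted)": 15.00,
}

# Cheapest first, as in the table.
for model, monthly in sorted(rows.items(), key=lambda kv: kv[1]):
    print(f"{model}\t${monthly:.2f}\t{vs_current(monthly, current):+d}%")
```

The Claude 3.5 Sonnet row is left out because its displayed monthly figure is rounded to whole dollars, so recomputing from it can be off by a percentage point.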

Spending over $2K/month? Get a free cost audit.

30-minute call with an NKKTech senior engineer. We'll review your top three workloads, identify the biggest savings, and project the reduction. No pitch, just architecture advice.

Book a free cost-optimization call

How this works

The calculator uses public 2026 list pricing for each provider's standard tier. Input + output tokens are billed separately at different rates. When prompt caching is enabled, the cached portion of input tokens bills at a reduced rate — 50% off for OpenAI, 90% off for Anthropic, 75% off for Gemini. Self-hosted Llama figures are approximate fully-loaded GPU costs at steady utilization; actual cost varies with volume and reserved-capacity discounts.
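
The billing rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the calculator's actual source. The `Pricing` type, the per-million-token rates ($2.50 input, $10.00 output), and the workload (10,000 requests, 2,000 input and 500 output tokens each) are all assumed values. They reproduce the $92.50 figure shown for GPT-4o at a 30% hit rate, but the page does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float    # USD per million input tokens (assumed rate)
    output_per_mtok: float   # USD per million output tokens (assumed rate)
    cache_discount: float    # fraction off for cached input tokens


def monthly_cost(pricing: Pricing,
                 requests: int,
                 input_tokens: int,
                 output_tokens: int,
                 cache_hit_rate: float = 0.0) -> float:
    """Steady-state monthly cost in USD for a fixed workload."""
    input_mtok = requests * input_tokens / 1_000_000
    output_mtok = requests * output_tokens / 1_000_000

    # Split input tokens into cached and uncached portions.
    cached_mtok = input_mtok * cache_hit_rate
    fresh_mtok = input_mtok - cached_mtok

    input_cost = (fresh_mtok * pricing.input_per_mtok
                  + cached_mtok * pricing.input_per_mtok
                  * (1.0 - pricing.cache_discount))
    output_cost = output_mtok * pricing.output_per_mtok
    return input_cost + output_cost


# Assumed rates and workload; check the provider's current price list.
gpt4o = Pricing(input_per_mtok=2.50, output_per_mtok=10.00, cache_discount=0.50)
cost = monthly_cost(gpt4o, requests=10_000, input_tokens=2_000,
                    output_tokens=500, cache_hit_rate=0.30)
print(f"Monthly: ${cost:.2f}")            # Monthly: $92.50
print(f"Per request: ${cost / 10_000:.5f}")
print(f"Annual: ${cost * 12:,.0f}")       # Annual: $1,110
```

With caching switched off (`cache_hit_rate=0.0`), the same assumed workload comes to $100.00, so the 30% hit rate accounts for the $7.50 difference.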

What it does: projects steady-state cost for a fixed workload, compares across models, and suggests common optimizations (model routing, caching, batching).
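
As a rough illustration of the model-routing optimization, the sketch below blends two of the monthly figures from the comparison table. It assumes requests are uniform in size, so sending a share of traffic to a model costs that share of the model's full-workload bill. The 70% share and the function name are hypothetical.

```python
def routed_monthly_cost(cheap_monthly: float,
                        flagship_monthly: float,
                        cheap_share: float) -> float:
    """Blended monthly cost when cheap_share of requests go to the cheap model.

    Both monthly figures are the cost of running the FULL workload on that
    model, so each is scaled by the share of traffic it actually receives.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return (cheap_share * cheap_monthly
            + (1.0 - cheap_share) * flagship_monthly)


# Figures from the table: GPT-4o-mini $5.55, GPT-4o $92.50 (full workload).
blended = routed_monthly_cost(5.55, 92.50, cheap_share=0.70)
savings = 1.0 - blended / 92.50
print(f"Blended: ${blended:.1f}/month, about {savings:.0%} below GPT-4o alone")
```

The achievable share depends on how many requests the smaller model handles acceptably, which the calculator cannot measure.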

What it doesn't do: capture spiky workloads, long-context anomalies, retry storms, or the cost of supporting services (embeddings, fine-tuning, vector database, observability). Real production bills are typically 10–30% higher than the calculator's projection because of these.
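
One way to budget for that gap is to turn the projection into a range. The sketch below simply applies the 10–30% uplift quoted above. The function name is hypothetical and the uplift is a rule of thumb, not a measurement.

```python
def production_range(projected_monthly: float,
                     low_uplift: float = 0.10,
                     high_uplift: float = 0.30) -> tuple[float, float]:
    """Return (low, high) monthly budget bounds around a projection."""
    if projected_monthly < 0 or low_uplift > high_uplift:
        raise ValueError("invalid projection or uplift bounds")
    return (projected_monthly * (1.0 + low_uplift),
            projected_monthly * (1.0 + high_uplift))


low, high = production_range(92.50)  # the GPT-4o projection shown above
print(f"Budget between ${low:.2f} and ${high:.2f} per month")
```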

For a real cost audit on a system you're already running, including the architecture-level optimizations the calculator can't see, book a free cost-optimization call.

Deep dive

LLM Cost Optimization: Routing, Caching, Quantization (2026)

Concrete strategies that cut 50–80% cost without quality loss. Real numbers from client audits.

Pillar #5

LLM Fine-Tuning vs RAG vs Prompt Engineering

Decision framework for when to spend on which technique — and which one to pick first.

Related tool

AI Readiness Quiz

10-question score across 7 readiness dimensions. Best taken before scoping a specific build when you're not sure where to start.

Related tool

RAG ROI Calculator

3-year TCO + payback for RAG systems with retrieval. Vector DB comparison built in.
