Project your monthly LLM API bill across OpenAI, Anthropic, Gemini, and self-hosted Llama. Models prompt caching + cache hit rate. 100% client-side — your numbers stay in your browser.
Cheapest first. Mini/Haiku/Flash models typically work for simple classification or extraction tasks; flagship models for reasoning + synthesis. Model routing combines them.
| Model | Monthly | Vs current |
|---|---|---|
| Gemini 1.5 Flash | $2.66 | −97% |
| Llama 3.3 8B (self-hosted) | $3.75 | −96% |
| OpenAI GPT-4o-mini | $5.55 | −94% |
| Llama 3.3 70B (self-hosted) | $15.00 | −84% |
| Claude 3.5 Haiku | $31.68 | −66% |
| Gemini 1.5 Pro | $44.38 | −52% |
| Claude 3.5 Sonnet | $119 | +28% |
30-minute call with a NKKTech senior engineer. We'll review your top three workloads, identify the biggest savings, and project the reduction. No pitch — just architecture advice.
Book a free cost-optimization callThe calculator uses public 2026 list pricing for each provider's standard tier. Input + output tokens are billed separately at different rates. When prompt caching is enabled, the cached portion of input tokens bills at a reduced rate — 50% off for OpenAI, 90% off for Anthropic, 75% off for Gemini. Self-hosted Llama figures are approximate fully-loaded GPU costs at steady utilization; actual cost varies with volume and reserved-capacity discounts.
What it does: projects steady-state cost for a fixed workload, compares across models, and suggests common optimizations (model routing, caching, batching).
What it doesn't do: capture spikey workloads, long-context anomalies, retry storms, or the cost of supporting services (embeddings, fine-tuning, vector database, observability). Real production bills are typically 10–30% higher than the calculator's projection because of these.
For a real cost audit on a system you're already running — including the architecture-level optimizations the calculator can't see — book the call below.
Concrete strategies that cut 50–80% cost without quality loss. Real numbers from client audits.
Decision framework for when to spend on which technique — and which one to pick first.
10-question score across 7 readiness dimensions. Best taken before scoping a specific build when you're not sure where to start.
3-year TCO + payback for RAG systems with retrieval. Vector DB comparison built in.