After running 20+ fine-tuning engagements at NKKTech in 2024-2026, the data is clear: LoRA wins for ~90% of production workloads, QLoRA wins when GPU memory is the binding constraint, and full fine-tuning wins for ~10% of cases where the workload truly needs the base model's full parameter update. Most teams overweight full fine-tuning because it sounds more 'serious' — they end up spending 10x the compute for negligible quality lift. Here's the production decision framework with the eval numbers that actually matter.
Why the Default Should Be LoRA (Not Full Fine-Tuning)
LoRA (Low-Rank Adaptation) trains a small set of additional parameters (~0.1-2% of the base model's parameter count) while freezing the base model. Computational cost: ~5-10x cheaper than full fine-tuning. Storage cost per fine-tuned variant: ~50-500MB vs 70GB+ for a full fine-tune of a 70B model. Quality: in our eval data across 20+ engagements, LoRA matches full fine-tuning on ~85% of tasks within 1-2 evaluation points (on standardized benchmarks like MMLU, GSM8K, and our custom client-specific evals). Where it falls short: tasks that require fundamental changes to the model's reasoning patterns (not just instruction-following or style adaptation). The defaults should be: try LoRA first, escalate to full fine-tuning only when the eval data clearly shows LoRA hitting a quality ceiling. This is the opposite of what most teams do — they default to full fine-tuning because that's what the tutorials show, then waste compute when LoRA would have worked.
LoRA: The 90% Solution for Most Production Workloads
Production LoRA configuration that works for ~90% of fine-tuning engagements: rank=16, alpha=32, target modules=[q_proj, k_proj, v_proj, o_proj] (the attention layers), dropout=0.05, learning rate=2e-4, 3 epochs over the training set. This config works without much tuning on Llama 3.3 70B, Mistral 7B/24B, Qwen 2.5 14B/32B, and most other modern decoders. Training infrastructure: a single A100 80GB handles up to 13B model + LoRA training. Two A100s or one H100 80GB handle up to 70B + LoRA. Training time: typically 4-12 hours for a 50K-example dataset on a 70B base. Storage: each fine-tuned LoRA adapter is 100-500MB, can be swapped in/out at inference time. Inference: minimal overhead vs base model (LoRA adapters merge into the base weights at inference, or attach as a low-cost compute layer with vLLM's LoRA support). What LoRA does well: domain adaptation (legal, medical, financial vocabulary), instruction tuning to your task style, format/output adaptation. What LoRA is mediocre at: tasks requiring large reasoning shifts beyond the base model's capability.
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
QLoRA: When You're Bound by GPU Memory
QLoRA adds 4-bit quantization of the base model during training — instead of holding the base model in FP16 (which is what LoRA needs), you hold it in INT4, freeing GPU memory for the gradients and optimizer state. The math: a 70B base model in FP16 takes ~140GB. In INT4, ~35GB. That makes 70B fine-tuning fit on a single A100 80GB (versus needing two A100s for FP16 LoRA). Quality cost: minimal in our eval data — typically within 1 point of full LoRA, sometimes indistinguishable. Use QLoRA when: GPU memory is the limiting factor (1-2 A100s available, want to train a 70B model), training cost is a major concern (4-bit math is faster), the workload doesn't require the marginal quality of FP16 LoRA. Don't use QLoRA when: training on smaller models (7B-13B) where FP16 LoRA already fits comfortably, the eval data shows the workload is sensitive to quantization artifacts (rare but possible — surfaces as flat loss curve that plateaus higher than FP16 LoRA's). For most teams, QLoRA is the right default for 70B+ models with limited GPU access. We use it across most large-model engagements.
Full Fine-Tuning: The 10% Where It Actually Wins
Full fine-tuning updates all base model parameters. Computational cost: 5-10x of LoRA. Storage: a full 70B fine-tune is 140GB per variant. Deployment complexity: can't share base weights across multiple fine-tuned variants; each is its own model. So when is it worth all that? Three scenarios in our experience. (1) Fundamental task shift: the base model is a general-purpose chat assistant, you need it to be a code completion model. The reasoning patterns are different enough that LoRA can't adapt them with only ~1% of parameters. (2) Domain-specific reasoning: medical diagnosis requires reasoning patterns that aren't present in the base model's training distribution. Full fine-tuning lets the deeper layers adapt. (3) Extensive new vocabulary: training the model on a domain with significant vocabulary not in the base model's tokenizer's frequent terms (e.g., chemistry, specialized engineering). Full fine-tuning allows the embedding layer to fully adapt. Real example from a NKKTech client: medical-coding agent for a Southeast Asian healthcare client. LoRA achieved 78% F1 on the eval suite; full fine-tuning achieved 86%. The 8-point lift justified the 10x compute cost because the client's deployment had a clear minimum quality bar (75%) that LoRA was hitting but with no headroom. Full fine-tuning gave the safety margin.
The Decision Framework: Eval-Driven Method Selection
Don't pick the method upfront — let the eval data tell you. Practical workflow we use. Step 1: build the eval set FIRST (200-500 examples of (input, expected_output, quality_rubric)). Step 2: baseline the base model with no fine-tuning, just prompt engineering. Note the eval score. Step 3: train LoRA with the default config above. Note the eval score. Step 4: if LoRA hits your quality target, ship it. Done. Step 5: if LoRA doesn't hit target, ask why — is it (a) consistently wrong on the same kind of example (suggests fundamental reasoning gap → full fine-tuning), or (b) hitting a noise floor near target (suggests the task is at the edge of what's possible → either accept the floor or generate more training data). Step 6: only if (a), escalate to full fine-tuning. Most engagements stop at Step 4 — LoRA is sufficient. This eval-driven workflow saves ~80% of compute cost compared to defaulting to full fine-tuning. For the broader decision framework on when to fine-tune vs when to use RAG or prompt engineering, see our Fine-tuning vs RAG vs Prompt Engineering Guide.
Production Deployment Patterns Per Method
How you ship the fine-tuned model depends on the method. LoRA / QLoRA: ship the base model + LoRA adapter separately. At inference, either merge the adapter into the base weights (one-time cost, then runs as base model), or use a runtime like vLLM with LoRA support that swaps adapters per request (allows multiple fine-tuned variants to share base weights, useful for multi-tenant deployments). Cost: marginal increase over base model inference. Full fine-tuning: ship the full fine-tuned model. Each variant is its own deployment. Cost: same as base model inference, but if you have multiple variants you're paying for separate model deployments rather than sharing. Hosting options: AWS Bedrock supports both LoRA and full fine-tuned model deployment via its custom model import feature. Azure OpenAI fine-tuning supports full fine-tuning on GPT-3.5 and GPT-4 via their API. Together AI supports both LoRA and full fine-tuned model hosting for open-source models at reasonable cost. Self-hosted with vLLM: most flexible for LoRA (multi-adapter serving), reasonable for full fine-tuning, requires more ops effort. We default to AWS Bedrock for client engagements that need enterprise compliance, Together AI for cost-optimized engagements, self-hosted vLLM when the client wants full infrastructure control.
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.
Want to build this with NKKTech?
Planning a fine-tuning engagement and want to scope the method selection? Book a free 30-minute call. We'll look at your eval criteria, training data, and quality targets and recommend whether LoRA, QLoRA, or full fine-tuning is the right starting point.
Book a Free Call