Choosing the Right LLM Stack for Your Project

Building an LLM application today is not about picking a single model. It's about assembling a technology stack—a layered collection of components that handle everything from the user prompt to the final response, and from development to production operations.

You'll need to decide whether to use a hosted API or self-host an open-source model. You'll need to choose whether to ground responses with retrieval-augmented generation (RAG), and if so, which vector database, embedding model, and chunking strategy to adopt. You'll need an orchestration layer to manage prompts, tool calls, and workflows. And you'll need deployment, monitoring, and security infrastructure to run reliably in production.

This article walks through the major architectural decisions behind each of these choices. The goal is not to prescribe a single "best" stack, but to give you a framework for thinking about trade-offs—so that your stack fits your use case, your team, and your budget.

What Is an LLM Stack?

An LLM stack is the complete collection of technologies required to build and operate a production AI system. It spans from the user interface down to the infrastructure:

Frontend / API: The interface users or other services interact with.
Application Logic: Orchestrates prompts, retrieval, tool calls, and response handling.
Prompt Engineering: The templates and strategies that control model behavior.
RAG: Retrieves external knowledge to ground responses.
Vector Database: Stores and searches embeddings for RAG.
LLM / Model Provider: The foundation model itself—hosted API or self-hosted.
Inference Engine: The runtime that serves a self‑hosted model (vLLM, TensorRT‑LLM, TGI); with a hosted API, the provider runs this layer for you.
Infrastructure: Compute, networking, and storage.
LLMOps: Deployment, monitoring, observability, and cost management.
Monitoring & Security: Cross‑cutting concerns that protect the system and its users.

Every layer involves architectural choices. Let's walk through them step by step.

Step 1 — Choose Your Foundation Model

The foundation model is the engine of your application. Your first major decision is where it runs and who manages it.

Option	Description	Strengths	Weaknesses
Hosted API	Pay‑per‑token access to models like GPT‑4o, Claude, Gemini.	Zero infrastructure, always up‑to‑date, simple to start.	Data leaves your network, per‑token cost scales, limited control.
Open‑source (self‑hosted)	Deploy Llama, Mistral, DeepSeek, or Qwen on your own hardware.	Data privacy, fixed cost, full control over model and inference.	Requires GPU infrastructure, operational complexity, model updates are your responsibility.
Private cloud / managed open‑source	Cloud services that host open‑source models for you (e.g., Azure AI, Amazon Bedrock).	Balance of privacy and managed operations.	Vendor lock‑in risk, cost varies.

Consider these factors when choosing:

Performance & reasoning capability: Does the model need to handle complex, multi‑step reasoning, or are simple completions sufficient?
Latency: Hosted APIs usually give good time‑to‑first‑token with no tuning, though it varies with provider load and network distance; self‑hosted models can be tuned for latency or throughput but require GPU provisioning.
Privacy & compliance: Does your data need to stay on‑premises or within a specific region?
Context window: Do you need 128K+ tokens for long documents?
Cost: Hosted API costs scale with token volume; self‑hosted costs are largely fixed (GPU rental or hardware), so you pay for capacity even when it sits idle.
Licensing: Some open‑source models restrict commercial use. Check the license.
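
To make the cost factor concrete, the following sketch computes the monthly token volume at which a fixed GPU bill matches a per‑token bill. The function names and both dollar figures are hypothetical placeholders, not real prices, and the calculation deliberately ignores engineering time, idle capacity, and redundancy.

```python
def hosted_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Pay-per-token cost: scales linearly with volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(gpu_monthly_cost: float, price_per_million_tokens: float) -> float:
    """Monthly token volume at which a fixed GPU bill equals the hosted API bill."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return gpu_monthly_cost / price_per_million_tokens * 1_000_000


# Hypothetical figures for illustration only; substitute your own quotes.
GPU_MONTHLY_COST = 2_000.0       # assumed fixed cost of a rented GPU node, USD
PRICE_PER_MILLION_TOKENS = 5.0   # assumed blended hosted price, USD

threshold = break_even_tokens(GPU_MONTHLY_COST, PRICE_PER_MILLION_TOKENS)
print(f"Break-even at about {threshold:,.0f} tokens per month")
```

Below the threshold the hosted API is cheaper under these assumptions; above it, the fixed‑cost option starts to win, provided the GPU can actually serve that volume.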

Typical use cases:

Prototypes and startups: Start with a hosted API to minimize time‑to‑market.
Enterprises with sensitive data: Self‑host or use a managed private cloud.
High‑volume, cost‑sensitive applications: Self‑host to cap costs as volume grows.

Step 2 — Decide Whether You Need RAG

Prompt engineering alone works when:

The task relies only on general knowledge the model already has.
You can fit all necessary context into the prompt within the context window.
You don't need frequent knowledge updates.

But many production applications require information the model doesn't have: internal documents, recent events, or proprietary data. That's where RAG becomes essential.

Scenarios that demand RAG:

Enterprise knowledge bases and internal wikis.
Customer support systems answering from documentation.
Legal and financial analysis requiring source citations.
Any application where factual accuracy is critical and the knowledge base changes frequently.

RAG is preferred over fine‑tuning for knowledge updates because you can add, modify, or remove documents in the vector database and see the change as soon as they are re‑indexed, with no retraining required.
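
The prompt‑only conditions listed earlier in this step can be turned into a quick check. The sketch below is one possible reading of them, not a rule from any library: the `KnowledgeProfile` fields, the 4‑characters‑per‑token estimate, and the 25% reserve for instructions and output are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeProfile:
    needs_external_knowledge: bool  # answers depend on documents the model was not trained on
    knowledge_changes_often: bool   # the source material is updated frequently
    context_chars: int              # total size of the material the model must see
    context_window_tokens: int      # context window of the chosen model


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return int(num_chars / chars_per_token)


def needs_rag(profile: KnowledgeProfile, reserve_fraction: float = 0.25) -> bool:
    """Return True when prompt engineering alone is unlikely to be enough."""
    if not profile.needs_external_knowledge:
        return False  # general knowledge only: prompting suffices
    # Keep part of the window free for instructions and the model's answer.
    budget = int(profile.context_window_tokens * (1 - reserve_fraction))
    fits_in_window = estimate_tokens(profile.context_chars) <= budget
    return profile.knowledge_changes_often or not fits_in_window
```

A small, stable policy document that fits in the window returns `False`; a large or frequently edited knowledge base returns `True`.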

Step 3 — Select a Retrieval Stack

If RAG is part of your architecture, you'll need to choose a retrieval stack. Each component affects retrieval quality:

Embedding Models: Convert text to vectors. Choose based on retrieval quality (MTEB benchmarks), language support, and cost. Hosted general‑purpose models such as OpenAI's text-embedding-3 family work broadly; open‑source models (BGE, E5) can be self‑hosted, and models tuned on domain data may perform better on specialized text.
Vector Database: Stores and searches embeddings. Options range from managed cloud services (Pinecone) to open‑source (Milvus, Qdrant, Weaviate) to PostgreSQL extensions (pgvector). Choose based on scale, latency targets, and whether you need built‑in hybrid search.
Chunking Strategy: How you split documents affects everything downstream. Start with recursive chunking at 512 tokens; iterate based on retrieval evaluation.
Dense vs Sparse vs Hybrid Retrieval: Dense (vector) search captures semantics; sparse (BM25) catches exact keywords and identifiers. Hybrid combines both for robustness. Many production systems eventually adopt hybrid search.
Reranking: A second‑stage cross‑encoder model re‑orders retrieved candidates to improve precision. Strongly recommended for production RAG.
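
As an illustration of the chunking strategy mentioned in the list above, here is a minimal sketch of recursive chunking. It assumes a whitespace word count as a stand‑in for a real tokenizer and a hypothetical separator order (paragraphs, lines, sentences, words); production splitters usually add overlap between chunks, which is omitted here.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", ". ", " ")  # coarse to fine; an assumed ordering


def count_tokens(text: str) -> int:
    """Whitespace word count as a stand-in for a real tokenizer (an assumption)."""
    return len(text.split())


def recursive_chunk(text: str, max_tokens: int = 512,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separator left: hard-split on words as a last resort.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_chunk(piece, max_tokens, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Separators that fall on a chunk boundary are dropped, so paragraphs stay together whenever they fit and are only broken down further when they exceed the limit.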

The RAG Handbook covers each of these components in detail.
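
Returning to hybrid retrieval: the dense and sparse result lists have to be merged somehow. The article does not prescribe a method, so the sketch below uses reciprocal rank fusion (RRF), one common choice, with the conventional constant `k = 60`. The document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is why fusion tends to be more robust than either retriever alone. The fused list is also a natural input for a reranker.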

Step 4 — Choose an Orchestration Framework

The orchestration layer manages prompts, retrieval, tool calling, memory, and evaluation. Instead of writing everything from scratch, most teams adopt a framework.

Selection criteria:

Simplicity vs flexibility: Lightweight libraries give you control; heavy frameworks provide abstractions but can be harder to debug.
Ecosystem: Compatibility with your chosen vector database, LLM provider, and evaluation tools matters.
Production readiness: Does the framework support streaming, caching, retries, and observability out of the box?

Evaluate frameworks based on your team's familiarity and the complexity of your application. A simple prototype might need only direct API calls; a multi‑step agent system likely needs a robust orchestration layer.
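
To show what the orchestration layer does at its simplest, here is a framework‑free sketch of the retrieve, prompt, generate sequence. Every name in it (`answer`, `build_prompt`, the prompt template, and the two stand‑in functions) is hypothetical; in a real system `retrieve` would query your vector database and `generate` would call your model provider.

```python
from typing import Callable, List

RetrieveFn = Callable[[str], List[str]]
GenerateFn = Callable[[str], str]

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question: str, passages: List[str]) -> str:
    """Fill the template with numbered passages so answers can cite them."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(question: str, retrieve: RetrieveFn, generate: GenerateFn) -> str:
    """Retrieve passages, build the prompt, then call the model."""
    passages = retrieve(question)
    prompt = build_prompt(question, passages)
    return generate(prompt)


# Stand-ins so the sketch runs without any external service.
def fake_retrieve(question: str) -> List[str]:
    return ["The office is closed on public holidays."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-character prompt)"


print(answer("Is the office open on New Year's Day?", fake_retrieve, fake_generate))
```

Passing the retriever and generator in as plain callables keeps each step easy to swap and test. Once you need streaming, caching, tool calls, or tracing across many such steps, a framework starts to earn its abstractions.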

Step 5 — Decide Whether Fine‑Tuning Is Necessary

Fine‑tuning is the most resource‑intensive adaptation technique. It's often not the right first choice.

Technique	Modifies Model?	Updates Knowledge?	Cost	Typical Use Case
Prompt Engineering	No	No	Minimal	Simple behavior adjustments, formatting
RAG	No	Yes (external)	Moderate	Knowledge grounding, enterprise search
Fine‑Tuning	Yes	Partially (internalized, fixed at training time)	High	Deep domain adaptation, consistent behavior

Most applications start with prompting and RAG. Consider fine‑tuning when:

You need consistent output formatting that prompting can't reliably achieve.
The model must deeply understand specialized terminology (medical, legal).
You've exhausted prompting and RAG improvements and need a step‑change in quality.
You can afford the training cost and have a high‑quality dataset.

For many teams, fine‑tuning is a Phase 2 optimization, not a launch requirement.

Step 6 — Select a Deployment Strategy

Where your application runs affects latency, privacy, and operations:

Strategy	Latency	Privacy	Complexity	Cost Model
Hosted API	Low‑moderate	Data leaves network	Low	Per‑token
Private Cloud (managed)	Low	Data stays in VPC	Moderate	Subscription + usage
On‑Premises	Very low (local)	Maximum	High	Fixed (hardware)
Edge Deployment	Ultra‑low	Maximum	High	Fixed (device)

Use hosted APIs for prototypes, low‑volume internal tools, or when your team has no GPU expertise.
Use private cloud for enterprise applications with compliance requirements.
Use on‑premises or edge when latency must be minimal (real‑time applications) or data must never leave the device.

Step 7 — Plan for Production Operations

A deployed AI system needs continuous operations to remain reliable:

Evaluation: Measure retrieval quality (context precision/recall) and generation quality (faithfulness, relevancy) before every deployment.
Monitoring: Track latency, token usage, error rates, and hallucination frequency in real time.
Observability: Trace every request through prompting, retrieval, and generation to debug failures.
Reliability: Implement retries, fallbacks, caching, and rate limiting.
Cost optimization: Use semantic caching, prompt compression, and model routing to control expenses.
Security: Guard against prompt injection, jailbreaks, and data leakage at every layer.
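
For the evaluation item above, retrieval quality can be scored once you have a labeled set of relevant chunks per query. The sketch below uses a simple set‑based definition of context precision and recall; this is an assumption for illustration, and evaluation libraries often use rank‑aware or model‑judged variants instead. The chunk IDs are hypothetical.

```python
from typing import Iterable, Set, Tuple


def context_precision_recall(retrieved: Iterable[str],
                             relevant: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall over chunk IDs for a single query."""
    retrieved_set: Set[str] = set(retrieved)
    relevant_set: Set[str] = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved chunks that were actually relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant chunks that the retriever found.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


precision, recall = context_precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],  # hypothetical retriever output
    relevant=["c2", "c4", "c9"],         # hypothetical ground-truth labels
)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Averaging these scores over a fixed query set gives a baseline to compare against whenever the chunking, embedding model, or retriever changes.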

Operational excellence is what separates a working demo from a trusted production service.
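
The reliability practices above (retries and fallbacks) can be sketched in a few lines. This is a minimal illustration with hypothetical names: each "model" is just a callable, and `make_flaky` is a test double that simulates transient failures. Real code would catch only the transient error types raised by its client library.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallback(
    prompt: str,
    models: Sequence[Callable[[str], str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order; retry failures with exponential backoff."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return model(prompt)
            except Exception as exc:  # in real code, catch only transient error types
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("all models failed") from last_error


def make_flaky(failures: int, result: str) -> Callable[[str], str]:
    """Test double: fails `failures` times, then returns `result`."""
    remaining = {"n": failures}

    def model(prompt: str) -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TimeoutError("simulated transient failure")
        return result

    return model
```

Injecting `sleep` keeps the function testable without real delays. Listing a cheaper or secondary provider as the second entry in `models` gives a simple fallback path.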

Example Technology Stacks

Use Case	Foundation Model	Retrieval	Deployment	Key Considerations
Personal AI Assistant	Hosted API (GPT‑4o, Claude)	None or simple RAG	Hosted API	Fast setup, low maintenance
Enterprise Knowledge Assistant	Open‑source (Llama 3) + self‑hosted or managed cloud	Hybrid search + reranking + vector DB	Private cloud	Data privacy, citation accuracy
Customer Support Bot	Hosted API	RAG with internal docs, hybrid search	Hosted API + managed vector DB	Freshness of knowledge base, cost at scale
Code Assistant	Fine‑tuned open‑source (DeepSeek‑Coder)	Code‑specific RAG with AST chunking	Self‑hosted GPU	Latency, code syntax awareness
Internal Search Platform	Open‑source (Llama 3)	Hybrid RAG + metadata filtering	On‑premises or private cloud	Security, compliance, scalability

These are starting points. Your own requirements will shape the final stack.

Common Architecture Patterns

Simple LLM Application

Prompt → LLM

For straightforward tasks that need only general knowledge. Fast to build, but limited in accuracy and freshness.

Knowledge Assistant

Prompt → RAG → LLM

Adds retrieval for grounding. The most common pattern for enterprise applications.

Enterprise AI Platform

Users → API → Orchestration → RAG → LLM → Monitoring

Adds orchestration for multi‑step workflows, tool calling, and production observability.

Private AI Deployment

Enterprise Data → Private Infrastructure → Local Models → Internal Applications

All data stays within the organization's network. Suitable for highly regulated industries.

Common Mistakes When Choosing an LLM Stack

Selecting tools before understanding requirements. Start with the problem, not the technology. What data do you need? What latency is acceptable? What's your volume?
Choosing the largest model unnecessarily. Larger models are more capable but significantly more expensive and slower. Right‑size your model: pick the smallest one that meets your quality bar on your own evaluation set.
Ignoring latency. Users accept a response that arrives in about 200 ms; they rarely accept a 5‑second wait, however correct the answer. Measure end‑to‑end latency from day one.
Ignoring operational costs. Token costs can spiral; GPU instances are expensive when idle. Model hosting, vector databases, and monitoring all add to the bill.
Overusing fine‑tuning. Fine‑tuning is powerful but hard to maintain. Use it when you've exhausted simpler options.
Neglecting evaluation. Without systematic metrics, you can't know if a stack change improved or degraded your system.
Overlooking security. Prompt injection and data leakage are real threats. Design security in from the start, not as an afterthought.
Underestimating monitoring needs. LLM systems degrade silently. You need monitoring to know when outputs become less faithful, costs spike, or latency degrades.
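
Measuring end‑to‑end latency, as the list above recommends, takes very little code. The sketch below is a minimal in‑process tracker with hypothetical names (`LatencyTracker`, `time_call`); a production system would normally export these samples to its monitoring stack rather than keep them in memory.

```python
import time
from statistics import quantiles
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class LatencyTracker:
    """Records wall-clock latency per call and reports percentiles."""

    def __init__(self) -> None:
        self.samples_ms: List[float] = []

    def time_call(self, fn: Callable[..., T], *args, **kwargs) -> T:
        """Run fn, recording its duration even if it raises."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def summary(self) -> Dict[str, float]:
        """Return p50, p95, and max latency in milliseconds."""
        if len(self.samples_ms) < 2:
            raise ValueError("need at least two samples")
        # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
        cuts = quantiles(self.samples_ms, n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "max": max(self.samples_ms)}
```

Wrap the outermost request handler with `time_call` so the figure includes retrieval, prompt assembly, and generation. Tail percentiles such as p95 usually say more about user experience than the average does.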

Recommended Decision Framework

Use this checklist when starting a new project:

What problem are you solving? (Generation, search, classification, chat?)
Do you need private or proprietary data to answer questions?
Is real‑time or frequently updated knowledge required?
What latency is acceptable for your users?
What is the expected query volume?
What compliance requirements exist (GDPR, HIPAA, SOC 2)?
How much operational complexity can your team manage?
What is your budget for tokens, infrastructure, and operations?

Let these answers guide your stack choices. Technology trends are interesting; business requirements are binding.
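
Some teams find it useful to write the checklist answers down as data and derive a first deployment suggestion from them. The sketch below encodes the guidance from Step 6 as simple rules; the `Requirements` fields and the rule order are assumptions made for illustration, and the output is a starting point for discussion, not a verdict.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Requirements:
    data_must_stay_on_device: bool = False
    needs_minimal_latency: bool = False   # e.g. real-time interaction
    compliance_regimes: List[str] = field(default_factory=list)  # e.g. ["GDPR", "HIPAA"]
    has_gpu_expertise: bool = False
    high_volume: bool = False


def suggest_deployment(req: Requirements) -> str:
    """Map checklist answers to a first-pass deployment strategy."""
    if req.data_must_stay_on_device or req.needs_minimal_latency:
        return "on-premises or edge"
    if req.compliance_regimes:
        return "private cloud"
    if req.high_volume and req.has_gpu_expertise:
        return "self-hosted"
    return "hosted API"  # prototypes, low volume, or no GPU expertise


print(suggest_deployment(Requirements(compliance_regimes=["HIPAA"])))  # private cloud
```

Keeping the answers in a structure like this also makes it easy to revisit the decision when volume, compliance scope, or team skills change.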

Relationship to the LLM Handbook

This article provides an architectural overview. Each layer is explored in depth elsewhere:

Foundations: How LLMs work—the prerequisite for informed stack decisions.
Prompt Engineering: The first and cheapest adaptation layer.
RAG: The retrieval stack in full detail, from chunking to reranking.
Fine‑Tuning: When and how to adapt models with additional training.
LLMOps: Operating your stack in production—deployment, monitoring, evaluation, cost.
Security: Protecting every layer of your AI stack.

What You'll Learn

By understanding the decisions outlined here, you'll be equipped to:

Understand the layers of the modern LLM technology stack.
Choose the right architecture for different types of projects.
Decide when RAG is necessary and what retrieval stack to build.
Compare deployment strategies and their implications.
Understand the role of orchestration frameworks.
Avoid common architectural mistakes that plague early AI projects.
Design scalable, secure, and cost‑effective production AI systems.

Start with the Foundations section to build a deep understanding of how LLMs work. From there, explore each layer of the stack as your project demands. The right technology choices are rarely obvious at the start—they emerge from understanding trade‑offs, running experiments, and measuring results.
