Choosing the Right LLM Stack for Your Project

Building an LLM application today is not about picking a single model. It's about assembling a technology stack—a layered collection of components that handle everything from the user prompt to the final response, and from development to production operations.

You'll need to decide whether to use a hosted API or self-host an open-source model. You'll need to choose whether to ground responses with retrieval-augmented generation (RAG), and if so, which vector database, embedding model, and chunking strategy to adopt. You'll need an orchestration layer to manage prompts, tool calls, and workflows. And you'll need deployment, monitoring, and security infrastructure to run reliably in production.

This article walks through the major architectural decisions behind each of these choices. The goal is not to prescribe a single "best" stack, but to give you a framework for thinking about trade-offs—so that your stack fits your use case, your team, and your budget.

What Is an LLM Stack?

An LLM stack is the complete collection of technologies required to build and operate a production AI system. It spans from the user interface down to the infrastructure:

  • Frontend / API: The interface users or other services interact with.
  • Application Logic: Orchestrates prompts, retrieval, tool calls, and response handling.
  • Prompt Engineering: The templates and strategies that control model behavior.
  • RAG: Retrieves external knowledge to ground responses.
  • Vector Database: Stores and searches embeddings for RAG.
  • LLM / Model Provider: The foundation model itself—hosted API or self-hosted.
  • Inference Engine: The runtime that serves the model (vLLM, TensorRT‑LLM, TGI).
  • Infrastructure: Compute, networking, and storage.
  • LLMOps: Deployment, monitoring, observability, and cost management.
  • Monitoring & Security: Cross‑cutting concerns that protect the system and its users.

Every layer involves architectural choices. Let's walk through them step by step.

Step 1: Choose Your Foundation Model

The foundation model is the engine of your application. Your first major decision is where it runs and who manages it.

| Option | Description | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Hosted API | Pay‑per‑token access to models like GPT‑4o, Claude, Gemini. | Zero infrastructure, always up‑to‑date, simple to start. | Data leaves your network, per‑token cost scales, limited control. |
| Open‑source (self‑hosted) | Deploy Llama, Mistral, DeepSeek, or Qwen on your own hardware. | Data privacy, fixed cost, full control over model and inference. | Requires GPU infrastructure, operational complexity, model updates are your responsibility. |
| Private cloud / managed open‑source | Cloud services that host open‑source models for you (e.g., Azure AI, AWS Bedrock). | Balance of privacy and managed operations. | Vendor lock‑in risk, cost varies. |

Consider these factors when choosing:

  • Performance & reasoning capability: Does the model need to handle complex, multi‑step reasoning, or are simple completions sufficient?
  • Latency: Hosted APIs run on heavily optimized serving infrastructure but add a network round trip and can queue requests under load; self‑hosted models can be tuned for your own latency or throughput targets but require GPU provisioning.
  • Privacy & compliance: Does your data need to stay on‑premises or within a specific region?
  • Context window: Do you need 128K+ tokens for long documents?
  • Cost: Hosted API costs scale with token volume; self‑hosted costs are largely fixed (GPU rental or hardware) and are incurred even when the GPUs sit idle.
  • Licensing: Some open‑source models restrict commercial use. Check the license.

Typical use cases:

  • Prototypes and startups: Start with a hosted API to minimize time‑to‑market.
  • Enterprises with sensitive data: Self‑host or use a managed private cloud.
  • High‑volume, cost‑sensitive applications: Self‑host to cap costs as volume grows.
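The cost trade-off between per-token pricing and a fixed GPU bill can be made concrete with a break-even estimate. The sketch below is a rough illustration only: the price and GPU figures are hypothetical placeholders rather than real quotes, and it ignores engineering time, redundancy, and utilization.

```python
def monthly_api_cost(tokens_per_month: int, price_per_million_tokens: float) -> float:
    """Hosted API cost grows linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(gpu_cost_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly token volume at which a fixed GPU bill equals the API bill."""
    return gpu_cost_per_month / price_per_million_tokens * 1_000_000


# Hypothetical figures for illustration only; substitute your own quotes.
PRICE_PER_MILLION = 5.00       # USD per 1M tokens (assumed blended price)
GPU_COST_PER_MONTH = 2_000.0   # USD per month (assumed rental, excludes ops time)

threshold = break_even_tokens(GPU_COST_PER_MONTH, PRICE_PER_MILLION)
print(f"Break-even at about {threshold:,.0f} tokens per month")
```

With these assumed numbers the break-even lands at 400 million tokens per month. Below that volume the hosted API is cheaper on paper; above it, self-hosting starts to pay off, provided the GPU can actually serve that load.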

Step 2: Decide Whether You Need RAG

Prompt engineering alone works when:

  • The task relies only on general knowledge the model already has.
  • You can fit all necessary context into the prompt within the context window.
  • You don't need frequent knowledge updates.

But many production applications require information the model doesn't have: internal documents, recent events, or proprietary data. That's where RAG becomes essential.

Scenarios that demand RAG:

  • Enterprise knowledge bases and internal wikis.
  • Customer support systems answering from documentation.
  • Legal and financial analysis requiring source citations.
  • Any application where factual accuracy is critical and the knowledge base changes frequently.

RAG is preferred over fine‑tuning for knowledge updates because you can add, modify, or remove documents in the vector database and have the change take effect as soon as the index updates, with no retraining required.
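One of the conditions above, whether all necessary context fits in the prompt, can be checked with a quick estimate before committing to a retrieval stack. The sketch below is a rough heuristic: the four-characters-per-token ratio is an assumption that only approximates English text, and the function names are illustrative. Use your model's real tokenizer for anything precise.

```python
import math
from typing import Iterable


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the ratio is an assumption, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: Iterable[str],
    context_window: int,
    reserved_for_output: int = 1024,
) -> bool:
    """True if all documents plausibly fit alongside the reserved output budget."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= context_window - reserved_for_output


small_corpus = ["a" * 4_000]    # roughly 1,000 tokens
large_corpus = ["a" * 40_000]   # roughly 10,000 tokens
print(fits_in_context(small_corpus, context_window=8_192))
print(fits_in_context(large_corpus, context_window=8_192))
```

If the corpus fits comfortably and rarely changes, prompting alone may be enough; if it does not fit, that is one signal that retrieval is worth the added complexity.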

Step 3: Select a Retrieval Stack

If RAG is part of your architecture, you'll need to choose a retrieval stack. Each component affects retrieval quality:

  • Embedding Models: Convert text to vectors. Choose based on retrieval quality (MTEB benchmarks), language support, and cost. Hosted general‑purpose models such as OpenAI's text-embedding-3 family work broadly; open‑source models (BGE, E5) can be self‑hosted, and a domain‑specific or fine‑tuned model may perform better on specialized text.
  • Vector Database: Stores and searches embeddings. Options range from managed cloud services (Pinecone) to open‑source (Milvus, Qdrant, Weaviate) to PostgreSQL extensions (pgvector). Choose based on scale, latency targets, and whether you need built‑in hybrid search.
  • Chunking Strategy: How you split documents affects everything downstream. Start with recursive chunking at around 512 tokens as a baseline, then iterate based on retrieval evaluation.
  • Dense vs Sparse vs Hybrid Retrieval: Dense (vector) search captures semantics; sparse (BM25) catches exact keywords and identifiers. Hybrid combines both for robustness. Most production systems eventually adopt hybrid search.
  • Reranking: A second‑stage cross‑encoder model re‑orders retrieved candidates to improve precision. Strongly recommended for production RAG.
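To make the chunking bullet concrete, here is a minimal sketch of recursive chunking: split on the coarsest separator first (paragraphs), and fall back to finer ones (lines, sentences, words) only for pieces that are still too large. It counts whitespace-separated words as a stand-in for tokens, which is an assumption; a real pipeline would count with the embedding model's tokenizer. The separator list and function names are illustrative, not taken from any particular library.

```python
from typing import List, Sequence

# Coarse-to-fine separators (an illustrative choice).
SEPARATORS = ("\n\n", "\n", ". ", " ")


def count_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words (an assumption)."""
    return len(text.split())


def recursive_chunk(
    text: str, max_tokens: int = 512, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separators left: hard split on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:
        # This separator did not divide the text; try a finer one.
        return recursive_chunk(text, max_tokens, finer)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current.strip())
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_chunk(piece, max_tokens, finer))
            current = ""
    if current:
        chunks.append(current.strip())
    return chunks
```

Called as `recursive_chunk(document_text, max_tokens=512)`, this keeps paragraphs together where possible. It has no overlap between chunks; adding overlap is a common refinement to evaluate once you have retrieval metrics.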

For a deep dive into each of these, the RAG Handbook covers them in detail.
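Hybrid retrieval needs a way to merge a dense ranking and a sparse ranking into one list. One widely used option, chosen here for illustration rather than prescribed by this article, is reciprocal rank fusion (RRF): each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the commonly cited default; the document IDs below are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_results = ["a", "b", "c"]    # hypothetical vector-search ranking
sparse_results = ["c", "a", "d"]   # hypothetical BM25 ranking
print(reciprocal_rank_fusion([dense_results, sparse_results]))
```

For these inputs the fused order is a, c, b, d: documents found by both retrievers rise above those found by only one. RRF uses ranks rather than raw scores, so it avoids having to normalize BM25 and cosine-similarity scores onto a common scale.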

Step 4: Choose an Orchestration Framework

The orchestration layer manages prompts, retrieval, tool calling, memory, and evaluation. Instead of writing everything from scratch, most teams adopt a framework.

Selection criteria:

  • Simplicity vs flexibility: Lightweight libraries give you control; heavy frameworks provide abstractions but can be harder to debug.
  • Ecosystem: Compatibility with your chosen vector database, LLM provider, and evaluation tools matters.
  • Production readiness: Does the framework support streaming, caching, retries, and observability out of the box?

Evaluate frameworks based on your team's familiarity and the complexity of your application. A simple prototype might need only direct API calls; a multi‑step agent system likely needs a robust orchestration layer.
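For the simple end of that spectrum, the orchestration layer can be a few plain functions. The sketch below wires retrieval, prompt construction, and generation together without a framework. The retriever and model are passed in as callables so any vector database client or LLM SDK can be plugged in; the names, the prompt template, and the stubs in the usage example are all hypothetical.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]   # question -> passages
Model = Callable[[str], str]             # prompt -> completion

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, passages: List[str]) -> str:
    """Number the passages so the model can cite them."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(question: str, retrieve: Retriever, generate: Model) -> str:
    """Minimal pipeline: retrieve, build the prompt, call the model."""
    passages = retrieve(question)
    return generate(build_prompt(question, passages))


# Usage with stand-in components (replace with real clients).
def fake_retrieve(question: str) -> List[str]:
    return ["Paris is the capital of France."]


def echo_model(prompt: str) -> str:
    return prompt  # a real model would return a completion instead


print(answer("What is the capital of France?", fake_retrieve, echo_model))
```

When the flow grows beyond this (branching, tool calls, memory, streaming), that is usually the point at which a framework's abstractions start to earn their cost.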

Step 5: Decide Whether Fine‑Tuning Is Necessary

Fine‑tuning is the most resource‑intensive adaptation technique. It's often not the right first choice.

| Technique | Modifies Model? | Updates Knowledge? | Cost | Typical Use Case |
| --- | --- | --- | --- | --- |
| Prompt Engineering | No | No | Minimal | Simple behavior adjustments, formatting |
| RAG | No | Yes (external) | Moderate | Knowledge grounding, enterprise search |
| Fine‑Tuning | Yes | Yes (internalized) | High | Deep domain adaptation, consistent behavior |

Most applications start with prompting and RAG. Consider fine‑tuning when:

  • You need consistent output formatting that prompting can't reliably achieve.
  • The model must deeply understand specialized terminology (medical, legal).
  • You've exhausted prompting and RAG improvements and need a step‑change in quality.
  • You can afford the training cost and have a high‑quality dataset.

For many teams, fine‑tuning is a Phase 2 optimization, not a launch requirement.

Step 6: Select a Deployment Strategy

Where your application runs affects latency, privacy, and operations:

| Strategy | Latency | Privacy | Complexity | Cost Model |
| --- | --- | --- | --- | --- |
| Hosted API | Low‑moderate | Data leaves network | Low | Per‑token |
| Private Cloud (managed) | Low | Data stays in VPC | Moderate | Subscription + usage |
| On‑Premises | Very low (local) | Maximum | High | Fixed (hardware) |
| Edge Deployment | Ultra‑low | Maximum | High | Fixed (device) |

  • Use hosted APIs for prototypes, low‑volume internal tools, or when your team has no GPU expertise.
  • Use private cloud for enterprise applications with compliance requirements.
  • Use on‑premises or edge when latency must be minimal (real‑time applications) or data must never leave the device.
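The three rules of thumb above can be written down as a small decision function, which is handy for recording the reasoning in an architecture review. This is a sketch under assumptions: the field names are invented for illustration, and the order in which the rules are checked (strictest constraint first) is a design choice, not something the guidance mandates.

```python
from dataclasses import dataclass


@dataclass
class DeploymentNeeds:
    """Hypothetical requirement flags; extend to match your own checklist."""
    data_must_stay_on_device: bool = False
    needs_minimal_latency: bool = False
    has_compliance_requirements: bool = False


def suggest_deployment(needs: DeploymentNeeds) -> str:
    # Strictest constraints first: they rule out the simpler options.
    if needs.data_must_stay_on_device or needs.needs_minimal_latency:
        return "on-premises or edge"
    if needs.has_compliance_requirements:
        return "private cloud"
    # Default for prototypes, low-volume tools, or teams without GPU expertise.
    return "hosted API"


print(suggest_deployment(DeploymentNeeds(has_compliance_requirements=True)))
```

A real decision involves more than three flags, but encoding even a crude version forces the team to state which constraint actually drives the choice.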

Step 7: Plan for Production Operations

A deployed AI system needs continuous operations to remain reliable:

  • Evaluation: Measure retrieval quality (context precision/recall) and generation quality (faithfulness, relevancy) before every deployment.
  • Monitoring: Track latency, token usage, error rates, and hallucination frequency in real time.
  • Observability: Trace every request through prompting, retrieval, and generation to debug failures.
  • Reliability: Implement retries, fallbacks, caching, and rate limiting.
  • Cost optimization: Use semantic caching, prompt compression, and model routing to control expenses.
  • Security: Guard against prompt injection, jailbreaks, and data leakage at every layer.
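The retrieval metrics named in the evaluation bullet can be approximated with simple set arithmetic once you have a labeled set of relevant documents per query. The sketch below uses the plain set-based definitions; evaluation libraries often use rank-aware or LLM-judged variants, so treat these functions as a simplified baseline. The document IDs are made up.

```python
from typing import List, Set


def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    unique = set(retrieved)
    if not unique:
        return 0.0
    return len(unique & relevant) / len(unique)


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


retrieved_ids = ["d1", "d2", "d3", "d4"]   # hypothetical retriever output
relevant_ids = {"d1", "d3", "d9"}          # hypothetical ground-truth labels
print(context_precision(retrieved_ids, relevant_ids))
print(context_recall(retrieved_ids, relevant_ids))
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall of about 0.67). Tracking both per release makes it visible when a chunking or embedding change trades one for the other.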

Operational excellence is what separates a working demo from a trusted production service.
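As one illustration of the reliability practices listed above, the sketch below retries a model call with exponential backoff and then falls back to the next model in a list. The models are plain callables, so nothing here depends on a specific provider SDK; the function name, delays, and attempt counts are assumptions to tune. A production version would also catch only retryable errors rather than every exception.

```python
import time
from typing import Callable, Optional, Sequence

Model = Callable[[str], str]


def call_with_fallback(
    prompt: str,
    models: Sequence[Model],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying each with exponential backoff."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return model(prompt)
            except Exception as exc:  # simplification: narrow this in real code
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("all models failed") from last_error
```

A call such as `call_with_fallback(prompt, [primary_model, cheaper_backup_model])` keeps the service answering during a provider outage, at the cost of possibly lower quality from the backup. The `sleep` parameter exists so tests can skip the real delays.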

Example Technology Stacks

| Use Case | Foundation Model | Retrieval | Deployment | Key Considerations |
| --- | --- | --- | --- | --- |
| Personal AI Assistant | Hosted API (GPT‑4o, Claude) | None or simple RAG | Hosted API | Fast setup, low maintenance |
| Enterprise Knowledge Assistant | Open‑source (Llama 3) + self‑hosted or managed cloud | Hybrid search + reranking + vector DB | Private cloud | Data privacy, citation accuracy |
| Customer Support Bot | Hosted API | RAG with internal docs, hybrid search | Hosted API + managed vector DB | Freshness of knowledge base, cost at scale |
| Code Assistant | Fine‑tuned open‑source (DeepSeek‑Coder) | Code‑specific RAG with AST chunking | Self‑hosted GPU | Latency, code syntax awareness |
| Internal Search Platform | Open‑source (Llama 3) | Hybrid RAG + metadata filtering | On‑premises or private cloud | Security, compliance, scalability |

These are starting points. Your own requirements will shape the final stack.

Common Architecture Patterns

Simple LLM Application

Prompt → LLM

For straightforward tasks that need only general knowledge. Fast to build, but limited in accuracy and freshness.

Knowledge Assistant

Prompt → RAG → LLM

Adds retrieval for grounding. The most common pattern for enterprise applications.

Enterprise AI Platform

Users → API → Orchestration → RAG → LLM → Monitoring

Adds orchestration for multi‑step workflows, tool calling, and production observability.
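The observability part of this pattern amounts to recording what happened at each stage of a request. The sketch below times each stage with a context manager and keeps the results on a per-request trace object. It is a bare-bones stand-in for a real tracing system; the class name, stage names, and placeholder retrieval and generation steps are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Iterator, List, Tuple


class RequestTrace:
    """Collects (stage name, seconds elapsed) pairs for one request."""

    def __init__(self) -> None:
        self.spans: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the stage even if it raised, so failures stay visible.
            self.spans.append((name, time.perf_counter() - start))


def handle_request(question: str, trace: RequestTrace) -> str:
    with trace.span("retrieval"):
        passages = ["placeholder passage"]  # stand-in for a real retriever
    with trace.span("generation"):
        # Stand-in for a real LLM call.
        reply = f"answer to {question!r} from {len(passages)} passage(s)"
    return reply


trace = RequestTrace()
handle_request("How do I reset my password?", trace)
for stage, seconds in trace.spans:
    print(f"{stage}: {seconds * 1000:.3f} ms")
```

In production the spans would be exported to a tracing backend along with token counts and model identifiers, but the shape stays the same: one trace per request, one span per stage.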

Private AI Deployment

Enterprise Data → Private Infrastructure → Local Models → Internal Applications

All data stays within the organization's network. Suitable for highly regulated industries.

Common Mistakes When Choosing an LLM Stack

  • Selecting tools before understanding requirements. Start with the problem, not the technology. What data do you need? What latency is acceptable? What's your volume?
  • Choosing the largest model unnecessarily. Larger models are generally more capable but also more expensive and slower per request. Right‑size your model: pick the smallest one that meets your quality bar.
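  • Ignoring latency. A correct answer that arrives after several seconds can still feel broken to users. Set a latency budget for your use case and measure end‑to‑end latency, including time to first token if you stream responses, from day one.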
  • Ignoring operational costs. Token costs can spiral; GPU instances are expensive when idle. Model hosting, vector databases, and monitoring all add to the bill.
  • Overusing fine‑tuning. Fine‑tuning is powerful but hard to maintain. Use it when you've exhausted simpler options.
  • Neglecting evaluation. Without systematic metrics, you can't know if a stack change improved or degraded your system.
  • Overlooking security. Prompt injection and data leakage are real threats. Design security in from the start, not as an afterthought.
  • Underestimating monitoring needs. LLM systems degrade silently. You need monitoring to know when outputs become less faithful, costs spike, or latency degrades.

Use this checklist when starting a new project:

  • What problem are you solving? (Generation, search, classification, chat?)
  • Do you need private or proprietary data to answer questions?
  • Is real‑time or frequently updated knowledge required?
  • What latency is acceptable for your users?
  • What is the expected query volume?
  • What compliance requirements exist (GDPR, HIPAA, SOC 2)?
  • How much operational complexity can your team manage?
  • What is your budget for tokens, infrastructure, and operations?

Let these answers guide your stack choices. Technology trends are interesting; business requirements are binding.

Relationship to the LLM Handbook

This article provides an architectural overview. Each layer is explored in depth elsewhere:

  • Foundations: How LLMs work—the prerequisite for informed stack decisions.
  • Prompt Engineering: The first and cheapest adaptation layer.
  • RAG: The retrieval stack in full detail, from chunking to reranking.
  • Fine‑Tuning: When and how to adapt models with additional training.
  • LLMOps: Operating your stack in production—deployment, monitoring, evaluation, cost.
  • Security: Protecting every layer of your AI stack.

What You'll Learn

By understanding the decisions outlined here, you'll be equipped to:

  • Understand the layers of the modern LLM technology stack.
  • Choose the right architecture for different types of projects.
  • Decide when RAG is necessary and what retrieval stack to build.
  • Compare deployment strategies and their implications.
  • Understand the role of orchestration frameworks.
  • Avoid common architectural mistakes that plague early AI projects.
  • Design scalable, secure, and cost‑effective production AI systems.

Start with the Foundations section to build a deep understanding of how LLMs work. From there, explore each layer of the stack as your project demands. The right technology choices are rarely obvious at the start—they emerge from understanding trade‑offs, running experiments, and measuring results.