The LLM Lifecycle

In Lesson 1, you learned what LLMOps is and why proprietary intelligence matters. Now you need to understand how custom models come to life—the complete journey from raw data to production deployment.

The LLM lifecycle isn't a one-time process. It's a continuous loop where each stage feeds the next, and production monitoring triggers new iterations. Understanding this lifecycle is the foundation for everything else in Part 8.

The Five Stages

Every custom model project moves through five stages:

┌──────────┐     ┌──────────┐     ┌────────────┐     ┌────────────┐
│   DATA   │────▶│ TRAINING │────▶│ EVALUATION │────▶│ DEPLOYMENT │
│ CURATION │     │          │     │            │     │            │
└──────────┘     └──────────┘     └────────────┘     └────────────┘
      ▲                                                      │
      │                          ┌────────────┐              │
      └──────────────────────────│ MONITORING │◀─────────────┘
          feedback loop          └────────────┘

Let's examine each stage in detail.

Stage 1: Data Curation

The question: What data will teach the model what we need it to know?

Data curation is where most LLMOps projects succeed or fail. The quality of your custom model is directly determined by the quality of your training data.

What Happens in Data Curation

| Activity | Purpose | Key Considerations |
|---|---|---|
| Collection | Gather raw data from sources | What sources represent the target behavior? |
| Cleaning | Remove noise, errors, duplicates | What quality threshold is acceptable? |
| Formatting | Convert to training-ready format | What format does the training framework expect? |
| Annotation | Add labels, preferences, feedback | Who annotates? How do we ensure consistency? |
| Safety Review | Check for harmful, biased, or sensitive content | What could go wrong if this data is learned? |

The Data Format Question

Different training approaches require different data formats:

Supervised Fine-Tuning (SFT):

{
  "messages": [
    {"role": "user", "content": "How do I prioritize tasks?"},
    {"role": "assistant", "content": "The Eisenhower Matrix works well..."}
  ]
}

Preference Tuning (DPO):

{
  "prompt": "How do I prioritize tasks?",
  "chosen": "The Eisenhower Matrix categorizes by urgency and importance...",
  "rejected": "Just do whatever feels most important to you."
}

Instruction Tuning:

{
  "instruction": "Explain task prioritization",
  "input": "",
  "output": "Task prioritization involves..."
}

The format you choose depends on what you're trying to teach the model. We'll dive deep into data preparation in Chapter 63.

Data Curation Pitfalls

| Pitfall | Consequence | Prevention |
|---|---|---|
| Too little data | Model doesn't learn pattern | Minimum 1,000 examples for simple tasks |
| Biased data | Model reproduces biases | Audit data for representation issues |
| Low quality data | Model learns errors | Human review of sample |
| Data leakage | Evaluation metrics inflated | Strict train/test separation |
| PII in data | Privacy/compliance violations | Automated PII detection and removal |

Task API Example: Data Curation

For our Task Management Assistant, data curation might involve:

  1. Collection: Export 10,000 user conversations with the Task API
  2. Cleaning: Remove duplicates, fix malformed messages, filter spam
  3. Formatting: Convert to conversation format with user/assistant turns
  4. Annotation: Mark which responses best exemplify target behavior
  5. Safety Review: Remove any personal information, check for harmful patterns

Output: A clean dataset of 5,000 high-quality task management conversations.
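
The cleaning and formatting steps above are straightforward to script. Below is a minimal sketch in Python; the input structure (conversations as lists of speaker/text turns) is an assumption about what a Task API export might look like, not a real schema:

import json

def to_sft_example(conversation: list[dict]) -> dict:
    """Convert [{'speaker': ..., 'text': ...}, ...] turns into the SFT messages format."""
    return {
        "messages": [
            {"role": turn["speaker"], "content": turn["text"].strip()}
            for turn in conversation
            if turn["text"].strip()  # drop empty or malformed turns
        ]
    }

def deduplicate(examples: list[dict]) -> list[dict]:
    """Remove exact duplicates by comparing canonical JSON serializations."""
    seen, unique = set(), []
    for example in examples:
        key = json.dumps(example, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

# Hypothetical raw export: one conversation with two turns.
raw_conversations = [
    [{"speaker": "user", "text": "How do I prioritize tasks?"},
     {"speaker": "assistant", "text": "Start with the Eisenhower Matrix..."}],
]
dataset = deduplicate([to_sft_example(c) for c in raw_conversations])
print(dataset[0])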

Stage 2: Training

The question: How do we update the model's weights to encode our knowledge?

Training is where the magic happens—but it's also where things can go expensively wrong. Understanding training approaches helps you choose the right method for your use case.

Training Approaches

| Approach | What It Does | When to Use | Compute Cost |
|---|---|---|---|
| Full Fine-Tuning | Updates all model weights | Fundamental behavior changes | Very High |
| LoRA/QLoRA | Updates small adapter layers | Most use cases | Low |
| Prompt Tuning | Learns soft prompt tokens | Minor adjustments | Very Low |
| Continued Pretraining | Extends base knowledge | New domain vocabulary | High |

For most LLMOps projects, LoRA (Low-Rank Adaptation) hits the sweet spot. It achieves 90-95% of full fine-tuning quality at 1-5% of the compute cost.
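
To make this concrete, here is a minimal LoRA setup using the PEFT library. The base model, rank, alpha, and target modules are illustrative starting points (common choices for Llama/Qwen-style architectures), not prescriptions:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Example base model; any causal LM checkpoint works the same way.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights

Only the small adapter matrices are trained; the base weights stay frozen, which is where the compute and memory savings come from.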

Training Configuration Decisions

| Decision | Options | Tradeoff |
|---|---|---|
| Base Model | Llama, Mistral, Qwen, Gemma | Capability vs. size vs. license |
| Model Size | 7B, 13B, 70B parameters | Quality vs. inference cost |
| Training Method | Full, LoRA, QLoRA | Quality vs. compute cost |
| Learning Rate | 1e-5 to 1e-4 typical | Fast learning vs. stability |
| Epochs | 1-5 typical | Underfitting vs. overfitting |

The Overfitting Problem

A critical concept: overfitting means the model memorizes training data instead of learning generalizable patterns.

Signs of overfitting:

  • Training loss keeps decreasing
  • Validation loss starts increasing
  • Model outputs training examples verbatim
  • Poor performance on new, unseen inputs

Prevention:

  • Regularization (dropout, weight decay)
  • Early stopping (stop when validation loss rises; see the sketch after this list)
  • Sufficient data diversity
  • Proper train/validation/test splits
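
The early-stopping check is simple enough to sketch directly. This framework-agnostic helper (an illustration, not any library's API) stops training once validation loss has failed to improve for a set number of evaluations:

def should_stop(val_losses: list[float], patience: int = 2) -> bool:
    """Return True if the last `patience` evaluations show no improvement."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return all(loss >= best_so_far for loss in val_losses[-patience:])

# Validation loss improved, then stalled for two evaluations -> stop.
print(should_stop([2.10, 1.80, 1.70, 1.72, 1.75]))  # True

In practice you would rely on your training framework's built-in version (for example, the transformers library ships an EarlyStoppingCallback) rather than writing your own.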

Task API Example: Training

For our Task Assistant, training might involve:

  1. Base Model Selection: Qwen2.5-7B (good instruction following, permissive license)
  2. Method: QLoRA (4-bit quantization, runs on 16GB GPU)
  3. Configuration: Learning rate 2e-4, 3 epochs, batch size 4
  4. Duration: ~2 hours on consumer GPU

Output: A fine-tuned model checkpoint with task management knowledge.
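
A minimal sketch of that configuration using Hugging Face objects; assume these would be handed to a trainer such as TRL's SFTTrainer, whose exact wiring varies by library version:

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# QLoRA: load the base model's weights in 4-bit to fit a 16GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Matches the configuration above: LR 2e-4, 3 epochs, batch size 4.
training_args = TrainingArguments(
    output_dir="task-assistant-qlora",
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    logging_steps=10,
    save_strategy="epoch",
)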

Stage 3: Evaluation

The question: Does this model meet our quality requirements?

Evaluation is where you determine if training succeeded. Unlike traditional ML with clear accuracy metrics, LLM evaluation is multidimensional.

Evaluation Dimensions

| Dimension | What It Measures | How to Measure |
|---|---|---|
| Task Performance | Does it solve the target task? | Task-specific benchmarks, human eval |
| Instruction Following | Does it follow instructions correctly? | Instruction benchmarks (IFEval) |
| Safety | Does it avoid harmful outputs? | Safety benchmarks, red teaming |
| Truthfulness | Does it avoid hallucinations? | TruthfulQA, factual accuracy tests |
| Coherence | Does output make sense? | Perplexity, human rating |

The Evaluation Pipeline

EVALUATION PIPELINE

 Trained     ┌───────────┐   ┌───────────┐   ┌───────────┐
 Model ─────▶│ AUTOMATED │──▶│   HUMAN   │──▶│  SAFETY   │──▶ ?
             │  METRICS  │   │   EVAL    │   │  CHECKS   │
             └───────────┘   └───────────┘   └───────────┘
                   │               │               │
                   ▼               ▼               ▼
              Perplexity      Preference      Red team
              Accuracy        ratings         results
              BLEU/ROUGE      Quality         Safety
                              scores          score

GATE: All metrics must pass thresholds

Automated vs. Human Evaluation

| Evaluation Type | Strengths | Weaknesses |
|---|---|---|
| Automated Metrics | Fast, consistent, cheap | May not correlate with quality |
| LLM-as-Judge | Scalable, nuanced | Can inherit evaluator biases |
| Human Evaluation | Ground truth for quality | Expensive, slow, inconsistent |

Best Practice: Use automated metrics as gates (must pass to proceed), LLM-as-Judge to approximate human judgment at scale, and human evaluation for the final quality assessment.

Evaluation Metrics Explained

| Metric | What It Measures | Typical Threshold |
|---|---|---|
| Perplexity | How "surprised" the model is by text | Lower is better; compare to baseline |
| Task Accuracy | Correct answers on domain tasks | 80%+ for most applications |
| Win Rate | Human prefers new model vs. baseline | 60%+ to deploy |
| Safety Score | Passes safety evaluations | 95%+ required |
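
Perplexity is simply the exponential of the average token-level loss, so it can be computed directly from any causal language model. A minimal sketch, assuming a Hugging Face checkpoint (the path is a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/your-finetuned-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

def perplexity(text: str) -> float:
    """exp(mean cross-entropy loss) over the tokens of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return its language-modeling loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("Prioritize tasks by urgency and importance."))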

Task API Example: Evaluation

For our Task Assistant, evaluation might include:

  1. Automated Metrics:

    • Task completion accuracy on 500 held-out examples
    • Perplexity on task management corpus
  2. LLM-as-Judge:

    • Compare responses to baseline (Claude API)
    • Rate helpfulness, accuracy, tone
  3. Human Evaluation:

    • 50 random responses rated by domain experts
    • Check for hallucinated task features
  4. Safety Checks:

    • Red team prompts for task-related edge cases
    • PII leakage testing

Gate: Must achieve 85%+ task accuracy, 60%+ win rate vs. baseline, 95%+ safety score.
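
Writing the gate down as code keeps the deploy/no-deploy decision explicit and repeatable. A minimal sketch, assuming the evaluation pipeline reports its metrics as a dictionary of fractions:

# Thresholds from the gate above.
GATES = {
    "task_accuracy": 0.85,  # accuracy on held-out task examples
    "win_rate": 0.60,       # preferred over the baseline by judges
    "safety_score": 0.95,   # fraction of safety checks passed
}

def passes_gates(results: dict[str, float]) -> bool:
    """Return True only if every metric meets its threshold."""
    failures = [
        f"{name}: {results.get(name, 0.0):.2f} < {threshold:.2f}"
        for name, threshold in GATES.items()
        if results.get(name, 0.0) < threshold
    ]
    for failure in failures:
        print("GATE FAILED:", failure)
    return not failures

# Example: this candidate fails the win-rate gate and must not ship.
print(passes_gates({"task_accuracy": 0.88, "win_rate": 0.57, "safety_score": 0.97}))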

Stage 4: Deployment

The question: How do we serve this model to users reliably?

Deployment transforms a model checkpoint into a production service. This stage involves infrastructure, scaling, and operational concerns.

Deployment Options

| Option | Description | Best For |
|---|---|---|
| Self-Hosted | Run on your own GPU servers | Full control, data sovereignty |
| Managed Platforms | Turing, Replicate, Modal | Minimal ops overhead |
| Serverless | On-demand inference | Variable traffic patterns |
| Edge Deployment | On-device inference | Latency-critical, offline needs |

Deployment Considerations

| Concern | Question | Typical Requirement |
|---|---|---|
| Latency | How fast must responses be? | <500ms for interactive, <2s for batch |
| Throughput | How many requests per second? | Size infrastructure accordingly |
| Availability | What uptime is required? | 99.9% for production services |
| Cost | What's the budget per request? | Balance model size vs. cost |
| Scaling | How does traffic vary? | Auto-scaling for variable loads |

Model Serving Architecture

PRODUCTION ARCHITECTURE

 Clients       Load           Model           GPU
               Balancer       Servers         Cluster
     │            │              │               │
     ▼            ▼              ▼               ▼
 ┌───────┐     ┌────────┐     ┌─────────┐     ┌─────────┐
 │  API  │────▶│   LB   │────▶│ vLLM /  │────▶│ A100s / │
 │ Calls │     │        │     │  TGI    │     │  L40s   │
 └───────┘     └────────┘     └─────────┘     └─────────┘
                   │              │               │
                   ▼              ▼               ▼
                Request       Inference       GPU memory
                routing       optimization    management

Task API Example: Deployment

For our Task Assistant:

  1. Platform: Managed inference on Turing (start simple)
  2. Endpoint: REST API with OpenAI-compatible format (see the client sketch after this list)
  3. Scaling: Auto-scale 1-4 replicas based on traffic
  4. Latency Target: <1 second per response
  5. Integration: Connect to Task API as agent backend
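
Because the endpoint speaks the OpenAI-compatible format, any standard OpenAI client can call it by overriding the base URL. A minimal sketch; the URL, API key, and model name are placeholders for whatever your platform assigns:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-inference-endpoint.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                                     # placeholder credential
)

response = client.chat.completions.create(
    model="task-assistant-qwen2.5-7b",  # placeholder deployed model name
    messages=[{"role": "user", "content": "How should I prioritize today's tasks?"}],
    max_tokens=300,
)
print(response.choices[0].message.content)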

Stage 5: Monitoring

The question: Is the model still performing as expected?

Models degrade over time. User behavior shifts. Edge cases emerge. Monitoring catches problems before users complain.

What to Monitor

| Category | Metrics | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 response time | P95 > 2x baseline |
| Errors | Error rate, error types | >1% error rate |
| Quality | LLM-as-Judge scores on sample | <80% quality score |
| Safety | Flagged response rate | Any safety flags |
| Cost | Tokens per request, $/request | > budget threshold |
| Drift | Distribution of inputs, outputs | Significant shift |

The Feedback Loop

Monitoring isn't just about alerts—it's about continuous improvement:

FEEDBACK LOOP

 Monitor ──▶ Detect Issue ──▶ Diagnose ──▶ Action
    ▲                                         │
    │        ┌─────────────────────────────┐  │
    │        │ Data Problem? → Curate more │  │
    │        │ Training Gap? → Retrain     │  │
    └────────│ Eval Miss? → Add test cases │  │
             │ Deployment? → Fix infra     │◀─┘
             └─────────────────────────────┘

Example Feedback Scenarios:

| Observation | Diagnosis | Action |
|---|---|---|
| "Model hallucinates new feature" | Missing from training data | Add examples to dataset, retrain |
| "Latency spiked to 3 seconds" | Traffic exceeded capacity | Scale up, optimize serving |
| "Users report wrong tone" | Persona not strong enough | Add persona examples, retrain |
| "Safety flags increased" | Edge case discovered | Add safety data, retrain |

Task API Example: Monitoring

For our Task Assistant:

  1. Latency: Track P95, alert if >1.5 seconds (see the sketch after this list)
  2. Quality: Sample 100 responses/day, run LLM-as-Judge
  3. Safety: Flag any responses mentioning personal data
  4. Feedback: Log user corrections, feed into next training iteration
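
The latency check from item 1 reduces to computing percentiles over logged response times and comparing them against the target. A minimal sketch; the sample latencies are made up:

import numpy as np

# In production these would come from request logs or a metrics store.
latencies_seconds = [0.42, 0.77, 0.51, 1.80, 0.63, 0.58, 2.10, 0.49]

p50, p95, p99 = np.percentile(latencies_seconds, [50, 95, 99])
print(f"P50={p50:.2f}s  P95={p95:.2f}s  P99={p99:.2f}s")

if p95 > 1.5:
    # In a real system this would page on-call or post to an alert channel.
    print("ALERT: P95 latency exceeds the 1.5-second target")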

Stage Interdependencies

The lifecycle isn't linear—stages affect each other:

Upstream Dependencies

| If This Stage Has Problems... | These Stages Are Affected |
|---|---|
| Data Curation | Training learns wrong patterns, evaluation misleading, deployment serves bad model |
| Training | Evaluation results poor, deployment won't meet requirements |
| Evaluation | Deployment of subpar model, monitoring catches issues too late |
| Deployment | Monitoring has nothing to observe, no production feedback |

Downstream Signals

| Monitoring Reveals... | Trigger This Stage |
|---|---|
| Quality degradation | Data curation (gather new examples) |
| Missing capabilities | Training (add training data, retrain) |
| Incorrect behavior | Evaluation (add test cases, re-evaluate) |
| Performance issues | Deployment (optimize serving) |

The Continuous Loop

Production LLMOps is never "done." The lifecycle continues:

  1. v1: Initial deployment based on historical data
  2. v2: Retrain after 3 months of production feedback
  3. v3: Add new capabilities based on user requests
  4. v4: Safety improvements after edge case discovery

Each version moves through all five stages, incorporating lessons from previous iterations.

Putting It Together: The LLMOps Project Plan

When planning a custom model project, map activities to lifecycle stages:

| Stage | Activities | Duration | Dependencies |
|---|---|---|---|
| Data Curation | Collection, cleaning, formatting, review | 1-4 weeks | Raw data access |
| Training | Configuration, training runs, checkpoints | 1-3 days | Curated data |
| Evaluation | Automated tests, human eval, safety checks | 1-2 weeks | Trained model |
| Deployment | Infra setup, endpoint creation, integration | 1-2 weeks | Passing evaluation |
| Monitoring | Dashboard setup, alerting, feedback collection | Ongoing | Deployed model |

Total timeline: 4-8 weeks for initial deployment, then continuous iteration.

Try With AI

Use your AI companion to internalize lifecycle thinking.

Prompt 1: Trace a Problem Through Stages

I'm building a custom model for [your domain]. Help me understand stage
interdependencies by walking through a scenario:

"The model occasionally gives advice that contradicts company policy."

Ask me questions to diagnose:
1. Which lifecycle stage is the root cause?
2. How did this problem escape previous stages?
3. What changes in each stage would prevent this?

Help me think through the full feedback loop.

What you're learning: Root cause analysis across the lifecycle. This is essential for debugging production issues.

Prompt 2: Plan a Custom Model Project

I want to create a custom model for: [describe your use case]

Help me create a rough project plan by asking about each lifecycle stage:
1. Data Curation: What data sources do I have? What quality challenges?
2. Training: What base model and method makes sense?
3. Evaluation: What metrics define success?
4. Deployment: What are my latency/cost constraints?
5. Monitoring: What signals would indicate problems?

Challenge my assumptions—where am I likely to underestimate effort?

What you're learning: Project planning with lifecycle awareness. You're building the skill to scope LLMOps work realistically.

Prompt 3: Identify Lifecycle Stage Questions

For each lifecycle stage, help me develop a checklist of questions I should
ask before moving to the next stage.

Format as:
- Stage: [name]
- Gate Questions: [what must be true to proceed]
- Warning Signs: [what indicates problems]
- Artifacts Produced: [deliverables from this stage]

Make this specific to my domain: [your industry/use case]

What you're learning: Quality gate thinking—the discipline of explicit checkpoints that prevent problems from cascading through the lifecycle.

Safety Note

As you plan LLMOps projects, remember that each stage has safety implications. Data curation determines what the model can learn (including biases). Training amplifies patterns in data (including harmful ones). Evaluation must explicitly test for safety failures. Deployment exposes the model to adversarial inputs. Monitoring must detect safety violations before harm occurs. We'll address safety systematically in Chapter 68, but the foundation starts with recognizing that safety is a lifecycle concern, not a single stage.