Deployment Automation: From Training to Serving

Your pipeline produces a trained model. Now what? Manual deployment—converting formats, writing configs, testing endpoints—is error-prone and doesn't scale.

Deployment automation transforms your trained artifacts into running services without human intervention. This lesson shows you how to build the bridge from training completion to production serving.

The Deployment Pipeline

┌──────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT AUTOMATION PIPELINE                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐        │
│  │   TRAINED    │      │    EXPORT    │      │   QUANTIZE   │        │
│  │    MODEL     │─────▶│    (GGUF)    │─────▶│   (Q4/Q8)    │        │
│  └──────────────┘      └──────────────┘      └──────────────┘        │
│                                                      │               │
│                                                      ▼               │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                       DEPLOYMENT STAGES                        │  │
│  ├────────────────────────────────────────────────────────────────┤  │
│  │                                                                │  │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │  │
│  │  │   GENERATE   │    │    DEPLOY    │    │   VALIDATE   │      │  │
│  │  │  MODELFILE   │───▶│  TO OLLAMA   │───▶│   ENDPOINT   │      │  │
│  │  └──────────────┘    └──────────────┘    └──────┬───────┘      │  │
│  │                                                 │              │  │
│  │                               ┌─────────────────┤              │  │
│  │                               ▼                 ▼              │  │
│  │                       ┌──────────────┐  ┌──────────────┐       │  │
│  │                       │   PROMOTE    │  │   ROLLBACK   │       │  │
│  │                       │  (Success)   │  │  (Failure)   │       │  │
│  │                       └──────────────┘  └──────────────┘       │  │
│  │                                                                │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Project Structure

deployment/
├── __init__.py
├── config.py                 # Deployment configuration
├── export/
│   ├── __init__.py
│   ├── gguf_exporter.py      # GGUF format export
│   └── quantizer.py          # Quantization options
├── ollama/
│   ├── __init__.py
│   ├── modelfile.py          # Modelfile generation
│   └── deployer.py           # Ollama deployment
├── validation/
│   ├── __init__.py
│   ├── health_check.py       # Endpoint health
│   └── smoke_test.py         # Basic functionality
├── automation/
│   ├── __init__.py
│   └── pipeline.py           # Complete automation
└── scripts/
    └── deploy.py             # CLI entry point

Export Configuration

Create deployment/config.py:

from dataclasses import dataclass, field
from typing import List, Optional
from pathlib import Path
import yaml


@dataclass
class ExportConfig:
    output_format: str = "gguf"    # gguf, safetensors
    quantization: str = "Q4_K_M"   # Q4_K_M, Q8_0, F16
    output_dir: Path = field(default_factory=lambda: Path("./exports"))


@dataclass
class OllamaConfig:
    model_name: str = "task-api-model"
    base_url: str = "http://localhost:11434"
    system_prompt: Optional[str] = None
    temperature: float = 0.7
    context_length: int = 4096


@dataclass
class ValidationConfig:
    health_check_timeout: int = 30
    smoke_test_prompts: List[str] = field(default_factory=list)
    min_response_length: int = 10
    max_latency_ms: float = 5000


@dataclass
class DeploymentConfig:
    export: ExportConfig
    ollama: OllamaConfig
    validation: ValidationConfig
    enable_rollback: bool = True
    previous_model_backup: Optional[str] = None


def load_deployment_config(config_path: str) -> DeploymentConfig:
    """Load deployment configuration from YAML."""
    with open(config_path) as f:
        data = yaml.safe_load(f)

    # YAML parses output_dir as a plain string; convert it to Path so
    # downstream code can use the / operator and .mkdir().
    export_data = dict(data.get("export", {}))
    if "output_dir" in export_data:
        export_data["output_dir"] = Path(export_data["output_dir"])

    return DeploymentConfig(
        export=ExportConfig(**export_data),
        ollama=OllamaConfig(**data.get("ollama", {})),
        validation=ValidationConfig(**data.get("validation", {})),
        enable_rollback=data.get("enable_rollback", True),
    )

Create configs/deployment_config.yaml:

export:
  output_format: gguf
  quantization: Q4_K_M
  output_dir: ./exports

ollama:
  model_name: task-api-model
  base_url: http://localhost:11434
  system_prompt: |
    You are TaskMaster, an intelligent task management assistant.
    Help users create, manage, and organize their tasks efficiently.
  temperature: 0.7
  context_length: 4096

validation:
  health_check_timeout: 30
  smoke_test_prompts:
    - "Create a task for project review"
    - "List my pending tasks"
    - "What's my highest priority task?"
  min_response_length: 10
  max_latency_ms: 5000

enable_rollback: true
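
Before wiring the config into the pipeline, it's worth loading it once and inspecting the parsed dataclasses. A minimal sketch, assuming you saved the YAML at the path shown above:

from deployment.config import load_deployment_config

# Load and inspect the parsed configuration
config = load_deployment_config("configs/deployment_config.yaml")
print(config.ollama.model_name)            # task-api-model
print(config.export.quantization)          # Q4_K_M
print(config.validation.smoke_test_prompts)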

Model Export Automation

Create deployment/export/gguf_exporter.py:

from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import subprocess
import logging
import time

logger = logging.getLogger(__name__)


@dataclass
class ExportResult:
    success: bool
    output_path: Optional[Path]
    file_size_gb: float
    export_time_seconds: float
    error: Optional[str] = None


class GGUFExporter:
    """Export models to GGUF format for Ollama."""

    def __init__(self, llama_cpp_path: Optional[str] = None):
        self.llama_cpp_path = llama_cpp_path or self._find_llama_cpp()

    def _find_llama_cpp(self) -> str:
        """Find llama.cpp installation."""
        # Check common locations
        locations = [
            Path.home() / "llama.cpp",
            Path("/opt/llama.cpp"),
            Path("./llama.cpp"),
        ]

        for loc in locations:
            if (loc / "convert_hf_to_gguf.py").exists():
                return str(loc)

        raise RuntimeError(
            "llama.cpp not found. Install from https://github.com/ggerganov/llama.cpp"
        )

    def export(
        self,
        model_path: Path,
        output_dir: Path,
        quantization: str = "Q4_K_M",
    ) -> ExportResult:
        """Export HuggingFace model to GGUF format."""
        start_time = time.time()
        output_dir.mkdir(parents=True, exist_ok=True)

        model_name = model_path.name

        try:
            # Step 1: Convert to GGUF (F16)
            logger.info("Converting to GGUF format...")
            f16_output = output_dir / f"{model_name}-f16.gguf"

            convert_cmd = [
                "python",
                f"{self.llama_cpp_path}/convert_hf_to_gguf.py",
                str(model_path),
                "--outfile", str(f16_output),
                "--outtype", "f16",
            ]

            result = subprocess.run(
                convert_cmd,
                capture_output=True,
                text=True,
                timeout=1800,  # 30 minutes
            )

            if result.returncode != 0:
                return ExportResult(
                    success=False,
                    output_path=None,
                    file_size_gb=0,
                    export_time_seconds=time.time() - start_time,
                    error=f"Conversion failed: {result.stderr}",
                )

            # Step 2: Quantize
            logger.info(f"Quantizing to {quantization}...")
            quantized_output = output_dir / f"{model_name}-{quantization}.gguf"

            quantize_cmd = [
                f"{self.llama_cpp_path}/llama-quantize",
                str(f16_output),
                str(quantized_output),
                quantization,
            ]

            result = subprocess.run(
                quantize_cmd,
                capture_output=True,
                text=True,
                timeout=1800,
            )

            if result.returncode != 0:
                return ExportResult(
                    success=False,
                    output_path=None,
                    file_size_gb=0,
                    export_time_seconds=time.time() - start_time,
                    error=f"Quantization failed: {result.stderr}",
                )

            # Clean up the intermediate F16 file
            f16_output.unlink()

            # File size in GB
            file_size = quantized_output.stat().st_size / (1024**3)

            return ExportResult(
                success=True,
                output_path=quantized_output,
                file_size_gb=file_size,
                export_time_seconds=time.time() - start_time,
            )

        except subprocess.TimeoutExpired:
            return ExportResult(
                success=False,
                output_path=None,
                file_size_gb=0,
                export_time_seconds=time.time() - start_time,
                error="Export timed out after 30 minutes",
            )
        except Exception as e:
            return ExportResult(
                success=False,
                output_path=None,
                file_size_gb=0,
                export_time_seconds=time.time() - start_time,
                error=str(e),
            )
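
You can exercise the exporter on its own before handing it to the pipeline. A quick sketch, using the training-output path from the example run later in this lesson:

from pathlib import Path
from deployment.export.gguf_exporter import GGUFExporter

exporter = GGUFExporter()  # auto-discovers llama.cpp, or pass a path explicitly

result = exporter.export(
    model_path=Path("./artifacts/dpo/aligned"),
    output_dir=Path("./exports"),
    quantization="Q4_K_M",
)

if result.success:
    print(f"{result.output_path} ({result.file_size_gb:.1f} GB)")
else:
    print(f"Export failed: {result.error}")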

Modelfile Generation

Create deployment/ollama/modelfile.py:

from pathlib import Path
from typing import List, Optional
from dataclasses import dataclass


@dataclass
class ModelfileConfig:
    model_name: str
    gguf_path: Path
    system_prompt: Optional[str] = None
    temperature: float = 0.7
    context_length: int = 4096
    stop_tokens: Optional[List[str]] = None


class ModelfileGenerator:
    """Generate Ollama Modelfiles programmatically."""

    def generate(self, config: ModelfileConfig) -> str:
        """Generate Modelfile content."""
        lines = [
            f"# Modelfile for {config.model_name}",
            "# Generated automatically by deployment pipeline",
            "",
            f"FROM {config.gguf_path}",
            "",
        ]

        # Add parameters
        lines.append(f"PARAMETER temperature {config.temperature}")
        lines.append(f"PARAMETER num_ctx {config.context_length}")

        if config.stop_tokens:
            for token in config.stop_tokens:
                lines.append(f'PARAMETER stop "{token}"')

        # Add system prompt
        if config.system_prompt:
            lines.append("")
            lines.append('SYSTEM """')
            lines.append(config.system_prompt.strip())
            lines.append('"""')

        return "\n".join(lines)

    def write(self, config: ModelfileConfig, output_path: Path) -> Path:
        """Generate and write Modelfile."""
        content = self.generate(config)
        output_path.write_text(content)
        return output_path

Output (example Modelfile):

# Modelfile for task-api-model
# Generated automatically by deployment pipeline

FROM ./exports/task-api-model-Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM """
You are TaskMaster, an intelligent task management assistant.
Help users create, manage, and organize their tasks efficiently.
"""

Ollama Deployer

Create deployment/ollama/deployer.py:

from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import subprocess
import requests
import time
import logging

from .modelfile import ModelfileGenerator, ModelfileConfig
from ..config import OllamaConfig

logger = logging.getLogger(__name__)


@dataclass
class DeployResult:
    success: bool
    model_name: str
    deployment_time_seconds: float
    error: Optional[str] = None


class OllamaDeployer:
    """Deploy models to Ollama."""

    def __init__(self, config: OllamaConfig):
        self.config = config
        self.modelfile_generator = ModelfileGenerator()

    def deploy(
        self,
        gguf_path: Path,
        modelfile_dir: Path,
    ) -> DeployResult:
        """Deploy GGUF model to Ollama."""
        start_time = time.time()

        try:
            # Check Ollama is running
            if not self._check_ollama_running():
                return DeployResult(
                    success=False,
                    model_name=self.config.model_name,
                    deployment_time_seconds=time.time() - start_time,
                    error="Ollama server not running",
                )

            # Generate Modelfile
            modelfile_config = ModelfileConfig(
                model_name=self.config.model_name,
                gguf_path=gguf_path,
                system_prompt=self.config.system_prompt,
                temperature=self.config.temperature,
                context_length=self.config.context_length,
            )

            modelfile_path = modelfile_dir / "Modelfile"
            self.modelfile_generator.write(modelfile_config, modelfile_path)
            logger.info(f"Generated Modelfile at {modelfile_path}")

            # Create model in Ollama
            logger.info(f"Creating model {self.config.model_name} in Ollama...")
            result = subprocess.run(
                ["ollama", "create", self.config.model_name, "-f", str(modelfile_path)],
                capture_output=True,
                text=True,
                timeout=600,  # 10 minutes
            )

            if result.returncode != 0:
                return DeployResult(
                    success=False,
                    model_name=self.config.model_name,
                    deployment_time_seconds=time.time() - start_time,
                    error=f"ollama create failed: {result.stderr}",
                )

            logger.info(f"Model {self.config.model_name} created successfully")

            return DeployResult(
                success=True,
                model_name=self.config.model_name,
                deployment_time_seconds=time.time() - start_time,
            )

        except subprocess.TimeoutExpired:
            return DeployResult(
                success=False,
                model_name=self.config.model_name,
                deployment_time_seconds=time.time() - start_time,
                error="Model creation timed out",
            )
        except Exception as e:
            return DeployResult(
                success=False,
                model_name=self.config.model_name,
                deployment_time_seconds=time.time() - start_time,
                error=str(e),
            )

    def _check_ollama_running(self) -> bool:
        """Check if Ollama server is running."""
        try:
            response = requests.get(
                f"{self.config.base_url}/api/tags",
                timeout=5,
            )
            return response.status_code == 200
        except requests.RequestException:
            return False

    def rollback(self, previous_model_name: str) -> bool:
        """Roll back to the previous model version."""
        try:
            # Delete the failed current model
            subprocess.run(
                ["ollama", "rm", self.config.model_name],
                capture_output=True,
                timeout=60,
            )

            # Copy the backup back to the current model name
            subprocess.run(
                ["ollama", "cp", previous_model_name, self.config.model_name],
                capture_output=True,
                timeout=60,
            )

            logger.info(f"Rolled back to {previous_model_name}")
            return True

        except Exception as e:
            logger.error(f"Rollback failed: {e}")
            return False

    def backup_current(self) -> Optional[str]:
        """Backup the current model before deployment."""
        backup_name = f"{self.config.model_name}-backup-{int(time.time())}"

        try:
            result = subprocess.run(
                ["ollama", "cp", self.config.model_name, backup_name],
                capture_output=True,
                text=True,
                timeout=120,
            )

            if result.returncode == 0:
                logger.info(f"Backed up current model to {backup_name}")
                return backup_name
            else:
                logger.warning("No existing model to backup")
                return None

        except Exception as e:
            logger.warning(f"Backup failed (model may not exist): {e}")
            return None
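
The safe order of operations is backup first, then deploy, then roll back only on failure. A sketch of that flow in isolation (the GGUF path is illustrative):

from pathlib import Path
from deployment.config import OllamaConfig
from deployment.ollama.deployer import OllamaDeployer

deployer = OllamaDeployer(OllamaConfig(model_name="task-api-model"))

backup = deployer.backup_current()  # None on a first deployment
result = deployer.deploy(
    gguf_path=Path("./exports/task-api-model-Q4_K_M.gguf"),
    modelfile_dir=Path("./exports"),
)

if not result.success and backup:
    deployer.rollback(backup)  # restore the previous version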

Deployment Validation

Create deployment/validation/health_check.py:

from dataclasses import dataclass
from typing import Optional
import requests
import time
import logging

logger = logging.getLogger(__name__)


@dataclass
class HealthCheckResult:
    healthy: bool
    response_time_ms: float
    error: Optional[str] = None


def check_model_health(
    base_url: str,
    model_name: str,
    timeout: int = 30,
) -> HealthCheckResult:
    """Check if the model is responding correctly."""
    start_time = time.time()

    try:
        # Test the generation endpoint with a trivial prompt
        response = requests.post(
            f"{base_url}/api/generate",
            json={
                "model": model_name,
                "prompt": "Hello",
                "stream": False,
            },
            timeout=timeout,
        )

        response_time = (time.time() - start_time) * 1000

        if response.status_code == 200:
            data = response.json()
            if "response" in data and len(data["response"]) > 0:
                return HealthCheckResult(
                    healthy=True,
                    response_time_ms=response_time,
                )
            else:
                return HealthCheckResult(
                    healthy=False,
                    response_time_ms=response_time,
                    error="Empty response",
                )
        else:
            return HealthCheckResult(
                healthy=False,
                response_time_ms=response_time,
                error=f"HTTP {response.status_code}",
            )

    except requests.Timeout:
        return HealthCheckResult(
            healthy=False,
            response_time_ms=timeout * 1000,
            error="Request timed out",
        )
    except requests.RequestException as e:
        return HealthCheckResult(
            healthy=False,
            response_time_ms=(time.time() - start_time) * 1000,
            error=str(e),
        )
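
Called on its own, the health check gives a quick go/no-go signal against a running Ollama instance. For example:

from deployment.validation.health_check import check_model_health

result = check_model_health(
    base_url="http://localhost:11434",
    model_name="task-api-model",
    timeout=30,
)
print("healthy" if result.healthy else f"unhealthy: {result.error}")
print(f"{result.response_time_ms:.0f}ms")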

Create deployment/validation/smoke_test.py:

from dataclasses import dataclass
from typing import List
import requests
import time
import logging

logger = logging.getLogger(__name__)


@dataclass
class SmokeTestResult:
    passed: bool
    prompts_tested: int
    prompts_passed: int
    avg_latency_ms: float
    failures: List[dict]


def run_smoke_tests(
    base_url: str,
    model_name: str,
    prompts: List[str],
    min_response_length: int = 10,
    max_latency_ms: float = 5000,
) -> SmokeTestResult:
    """Run smoke tests against the deployed model."""
    failures = []
    latencies = []
    passed_count = 0

    for prompt in prompts:
        start_time = time.time()

        try:
            response = requests.post(
                f"{base_url}/api/generate",
                json={
                    "model": model_name,
                    "prompt": prompt,
                    "stream": False,
                },
                timeout=30,
            )

            latency = (time.time() - start_time) * 1000
            latencies.append(latency)

            if response.status_code != 200:
                failures.append({
                    "prompt": prompt,
                    "error": f"HTTP {response.status_code}",
                })
                continue

            data = response.json()
            response_text = data.get("response", "")

            # Check response length
            if len(response_text) < min_response_length:
                failures.append({
                    "prompt": prompt,
                    "error": f"Response too short: {len(response_text)} chars",
                })
                continue

            # Check latency
            if latency > max_latency_ms:
                failures.append({
                    "prompt": prompt,
                    "error": f"Latency too high: {latency:.0f}ms",
                })
                continue

            passed_count += 1

        except Exception as e:
            failures.append({
                "prompt": prompt,
                "error": str(e),
            })

    avg_latency = sum(latencies) / len(latencies) if latencies else 0

    return SmokeTestResult(
        passed=len(failures) == 0,
        prompts_tested=len(prompts),
        prompts_passed=passed_count,
        avg_latency_ms=avg_latency,
        failures=failures,
    )
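
Run standalone, the smoke tests report per-prompt failures so you can see exactly which check tripped. A sketch using two of the prompts from the YAML config:

from deployment.validation.smoke_test import run_smoke_tests

result = run_smoke_tests(
    base_url="http://localhost:11434",
    model_name="task-api-model",
    prompts=[
        "Create a task for project review",
        "List my pending tasks",
    ],
)
print(f"{result.prompts_passed}/{result.prompts_tested} passed, "
      f"avg latency {result.avg_latency_ms:.0f}ms")
for failure in result.failures:
    print(f"  {failure['prompt']!r}: {failure['error']}")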

Complete Deployment Pipeline

Create deployment/automation/pipeline.py:

from pathlib import Path
from dataclasses import dataclass
from typing import Optional, Dict, Any
import json
import time
import logging
from datetime import datetime

from ..config import DeploymentConfig
from ..export.gguf_exporter import GGUFExporter
from ..ollama.deployer import OllamaDeployer
from ..validation.health_check import check_model_health
from ..validation.smoke_test import run_smoke_tests

logger = logging.getLogger(__name__)


@dataclass
class DeploymentPipelineResult:
    success: bool
    stages_completed: list
    model_name: str
    total_time_seconds: float
    stage_results: Dict[str, Any]
    errors: list


class DeploymentPipeline:
    """Complete deployment automation pipeline."""

    def __init__(self, config: DeploymentConfig):
        self.config = config
        self.exporter = GGUFExporter()
        self.deployer = OllamaDeployer(config.ollama)
        self.backup_model: Optional[str] = None

    def run(self, model_path: Path) -> DeploymentPipelineResult:
        """Execute the complete deployment pipeline."""
        start_time = time.time()
        stages_completed = []
        stage_results = {}
        errors = []

        logger.info(f"Starting deployment pipeline for {model_path}")

        # Stage 1: Backup current model (if one exists)
        if self.config.enable_rollback:
            self.backup_model = self.deployer.backup_current()
            if self.backup_model:
                stages_completed.append("backup")
                stage_results["backup"] = {"model": self.backup_model}

        # Stage 2: Export to GGUF
        logger.info("Exporting model to GGUF...")
        export_result = self.exporter.export(
            model_path=model_path,
            output_dir=self.config.export.output_dir,
            quantization=self.config.export.quantization,
        )

        stage_results["export"] = {
            "success": export_result.success,
            "output_path": str(export_result.output_path) if export_result.output_path else None,
            "size_gb": export_result.file_size_gb,
            "time_seconds": export_result.export_time_seconds,
        }

        if not export_result.success:
            errors.append(f"Export failed: {export_result.error}")
            return self._create_result(stages_completed, stage_results, errors, time.time() - start_time)

        stages_completed.append("export")
        logger.info(f"Export complete: {export_result.output_path}")

        # Stage 3: Deploy to Ollama
        logger.info("Deploying to Ollama...")
        deploy_result = self.deployer.deploy(
            gguf_path=export_result.output_path,
            modelfile_dir=self.config.export.output_dir,
        )

        stage_results["deploy"] = {
            "success": deploy_result.success,
            "model_name": deploy_result.model_name,
            "time_seconds": deploy_result.deployment_time_seconds,
        }

        if not deploy_result.success:
            errors.append(f"Deploy failed: {deploy_result.error}")
            self._rollback_if_enabled(errors)
            return self._create_result(stages_completed, stage_results, errors, time.time() - start_time)

        stages_completed.append("deploy")
        logger.info(f"Deployment complete: {deploy_result.model_name}")

        # Stage 4: Health check
        logger.info("Running health check...")
        health_result = check_model_health(
            base_url=self.config.ollama.base_url,
            model_name=self.config.ollama.model_name,
            timeout=self.config.validation.health_check_timeout,
        )

        stage_results["health_check"] = {
            "healthy": health_result.healthy,
            "response_time_ms": health_result.response_time_ms,
            "error": health_result.error,
        }

        if not health_result.healthy:
            errors.append(f"Health check failed: {health_result.error}")
            self._rollback_if_enabled(errors)
            return self._create_result(stages_completed, stage_results, errors, time.time() - start_time)

        stages_completed.append("health_check")
        logger.info(f"Health check passed: {health_result.response_time_ms:.0f}ms")

        # Stage 5: Smoke tests
        if self.config.validation.smoke_test_prompts:
            logger.info("Running smoke tests...")
            smoke_result = run_smoke_tests(
                base_url=self.config.ollama.base_url,
                model_name=self.config.ollama.model_name,
                prompts=self.config.validation.smoke_test_prompts,
                min_response_length=self.config.validation.min_response_length,
                max_latency_ms=self.config.validation.max_latency_ms,
            )

            stage_results["smoke_test"] = {
                "passed": smoke_result.passed,
                "prompts_tested": smoke_result.prompts_tested,
                "prompts_passed": smoke_result.prompts_passed,
                "avg_latency_ms": smoke_result.avg_latency_ms,
                "failures": smoke_result.failures,
            }

            if not smoke_result.passed:
                errors.append(f"Smoke tests failed: {smoke_result.failures}")
                self._rollback_if_enabled(errors)
                return self._create_result(stages_completed, stage_results, errors, time.time() - start_time)

            stages_completed.append("smoke_test")
            logger.info(f"Smoke tests passed: {smoke_result.prompts_passed}/{smoke_result.prompts_tested}")

        # Success - keep the backup around in case a manual rollback is needed
        if self.backup_model:
            logger.info(f"Deployment successful. Backup available at: {self.backup_model}")

        return self._create_result(stages_completed, stage_results, errors, time.time() - start_time)

    def _rollback_if_enabled(self, errors: list):
        """Attempt rollback if enabled and a backup exists."""
        if self.config.enable_rollback and self.backup_model:
            logger.info("Attempting rollback...")
            if self.deployer.rollback(self.backup_model):
                errors.append(f"Rolled back to {self.backup_model}")
            else:
                errors.append("Rollback failed")

    def _create_result(
        self,
        stages_completed: list,
        stage_results: Dict[str, Any],
        errors: list,
        total_time: float,
    ) -> DeploymentPipelineResult:
        """Create and persist the pipeline result."""
        result = DeploymentPipelineResult(
            success=len(errors) == 0,
            stages_completed=stages_completed,
            model_name=self.config.ollama.model_name,
            total_time_seconds=total_time,
            stage_results=stage_results,
            errors=errors,
        )

        # Save result alongside the exported artifacts
        result_path = self.config.export.output_dir / "deployment_result.json"
        with open(result_path, "w") as f:
            json.dump({
                "success": result.success,
                "stages_completed": result.stages_completed,
                "model_name": result.model_name,
                "total_time_seconds": result.total_time_seconds,
                "stage_results": result.stage_results,
                "errors": result.errors,
                "timestamp": datetime.now().isoformat(),
            }, f, indent=2)

        return result

CLI Entry Point

Create deployment/scripts/deploy.py:

#!/usr/bin/env python
import argparse
import logging
from pathlib import Path

from deployment.config import load_deployment_config
from deployment.automation.pipeline import DeploymentPipeline

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)


def main():
    parser = argparse.ArgumentParser(description="Deploy model to Ollama")
    parser.add_argument("model_path", type=Path, help="Path to trained model")
    parser.add_argument("--config", default="configs/deployment_config.yaml")
    args = parser.parse_args()

    config = load_deployment_config(args.config)
    pipeline = DeploymentPipeline(config)

    print(f"Deploying {args.model_path}")
    print(f"Target: {config.ollama.model_name}")
    print()

    result = pipeline.run(args.model_path)

    print("\n" + "=" * 50)
    print("DEPLOYMENT RESULT")
    print("=" * 50)
    print(f"Success: {result.success}")
    print(f"Stages: {', '.join(result.stages_completed)}")
    print(f"Time: {result.total_time_seconds:.1f}s")

    if result.errors:
        print("\nErrors:")
        for error in result.errors:
            print(f"  - {error}")


if __name__ == "__main__":
    main()

Output (log fields abridged for readability):

Deploying ./artifacts/dpo/aligned
Target: task-api-model

2026-01-02 15:30:00 - Exporting model to GGUF...
2026-01-02 15:32:45 - Export complete: ./exports/task-api-model-Q4_K_M.gguf
2026-01-02 15:32:45 - Deploying to Ollama...
2026-01-02 15:33:10 - Deployment complete: task-api-model
2026-01-02 15:33:10 - Running health check...
2026-01-02 15:33:12 - Health check passed: 1823ms
2026-01-02 15:33:12 - Running smoke tests...
2026-01-02 15:33:25 - Smoke tests passed: 3/3

==================================================
DEPLOYMENT RESULT
==================================================
Success: True
Stages: backup, export, deploy, health_check, smoke_test
Time: 205.3s

Try With AI

Prompt 1: Add Multi-Environment Deployment

I want to deploy to multiple environments (dev, staging, prod) with:

1. Different quantization levels per environment (dev=F16, staging=Q8, prod=Q4)
2. Different validation thresholds (stricter for prod)
3. Promotion workflow (dev → staging → prod)
4. Environment-specific Ollama configurations

Design the multi-environment deployment system and show how to modify
my deployment pipeline to support this.

What you're learning: Designing multi-environment deployment workflows.

Prompt 2: Implement Canary Deployment

I want to add canary deployment capability:

1. Deploy new model alongside current
2. Route 10% of traffic to new model
3. Monitor error rates and latency
4. Gradually increase traffic if metrics are good
5. Auto-rollback if metrics degrade

Show how to implement this with Ollama and a simple load balancer.

What you're learning: Implementing gradual rollout strategies for model deployments.

Prompt 3: Add Deployment Notifications

I want to be notified about deployment status. Add integrations for:

1. Slack notifications (start, progress, success/failure)
2. Email summary on completion
3. PagerDuty alerts on failure
4. Webhook for custom integrations

Show the notification system design and integration with my pipeline.

What you're learning: Building observability into deployment workflows.