Lab: Inference on T4

This is a hands-on lab. By the end, you'll have a working Colab notebook that loads Llama-3-8B-Instruct with 4-bit quantization and generates responses to Task API prompts.

No theory in this lesson—just code, outputs, and troubleshooting. Keep your previous lessons open for reference.

Prerequisites

Before starting:

  • Google account for Colab access
  • Hugging Face account (free) for model access
  • Completed Lessons 4-6 (VRAM budgeting, quantization, optimization)

Part 1: Colab Setup

Step 1: Create a New Notebook

  1. Go to colab.research.google.com
  2. Click File -> New Notebook
  3. Rename to ch62-inference-lab.ipynb

Step 2: Enable GPU Runtime

  1. Click Runtime -> Change runtime type
  2. Select T4 GPU from the dropdown
  3. Click Save

Step 3: Verify GPU Access

Create your first cell and run it:

# Cell 1: GPU Verification
import torch

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("ERROR: No GPU detected. Check Runtime -> Change runtime type")

Output:

CUDA available: True
GPU: Tesla T4
VRAM: 15.0 GB
CUDA version: 12.1

If you see "No GPU detected," go back to Runtime settings.
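
As a quick cross-check, you can also inspect the assigned GPU from the shell (the driver and CUDA versions shown in the header vary by runtime):

# Optional: shell-level view of the GPU Colab assigned to this runtime
!nvidia-smi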

Part 2: Install Dependencies

Step 4: Install Required Packages

# Cell 2: Install Dependencies
# This takes 2-3 minutes

!pip install -q transformers accelerate bitsandbytes
!pip install -q huggingface_hub

print("Installation complete!")

Output:

Installation complete!

If you see deprecation warnings, that's normal—they don't affect functionality.
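
If you later need to compare outputs or debug a warning, it helps to record the exact versions you installed. A small optional cell, covering the packages installed above:

# Optional: record installed versions for reproducibility
import transformers, accelerate, bitsandbytes

print(f"transformers: {transformers.__version__}")
print(f"accelerate: {accelerate.__version__}")
print(f"bitsandbytes: {bitsandbytes.__version__}")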

Step 5: Hugging Face Authentication

You need to authenticate to access Llama-3-8B (gated model).

# Cell 3: Hugging Face Login
from huggingface_hub import login

# Option 1: Interactive login (recommended for first time)
login()

# Option 2: Use token directly (if you have it saved)
# login(token="hf_your_token_here")

What happens:

  1. Colab shows a text input box
  2. Go to huggingface.co/settings/tokens
  3. Create a new token with "Read" permissions
  4. Paste the token and press Enter

Output:

Token has been saved to /root/.huggingface/token
Login successful
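
Tip: if you'd rather not paste the token every session, you can store it in Colab's Secrets panel (the key icon in the left sidebar). A sketch, assuming you saved a secret named HF_TOKEN with notebook access enabled:

# Optional: non-interactive login via Colab Secrets
# (assumes a secret named HF_TOKEN exists in this notebook's Secrets panel)
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get("HF_TOKEN"))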

Step 6: Verify Model Access

Before loading the large model, verify your access:

# Cell 4: Verify Access to Llama-3
from huggingface_hub import model_info

try:
    info = model_info("meta-llama/Meta-Llama-3-8B-Instruct")
    print(f"Model access verified: {info.id}")
    print(f"Downloads: {info.downloads:,}")
except Exception as e:
    print(f"Access error: {e}")
    print("\nTo fix: Go to https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct")
    print("and accept the license agreement.")

Output:

Model access verified: meta-llama/Meta-Llama-3-8B-Instruct
Downloads: 12,345,678

If you get an access error, click the link and accept the Llama 3 license on Hugging Face.
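
If the error persists after accepting the license, make sure the token you logged in with belongs to the same account that accepted it. A quick check using huggingface_hub's whoami:

# Confirm which Hugging Face account this session is authenticated as
from huggingface_hub import whoami

print(whoami()["name"])  # should match the account that accepted the license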

Part 3: Load the Quantized Model

Step 7: Configure Quantization

# Cell 5: Quantization Configuration
from transformers import BitsAndBytesConfig
import torch

# QLoRA-ready 4-bit configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 doesn't support bfloat16
    bnb_4bit_use_double_quant=True,
)

print("Quantization config ready")
print(" - Precision: 4-bit NF4")
print(" - Compute dtype: float16")
print(" - Double quantization: enabled")

Output:

Quantization config ready
- Precision: 4-bit NF4
- Compute dtype: float16
- Double quantization: enabled
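
Double quantization compresses the quantization constants themselves (one 32-bit constant per 64-weight block is reduced to 8-bit), which the QLoRA paper reports as saving roughly 0.37 bits per parameter on average. A back-of-the-envelope sketch of what that buys on an 8B model:

# Rough double-quantization savings (per-parameter figure from the QLoRA paper)
params = 8.03e9                  # approximate Llama-3-8B parameter count
savings_bits_per_param = 0.373   # 32-bit -> 8-bit block constants, block size 64
savings_gb = params * savings_bits_per_param / 8 / 1024**3
print(f"Double-quant savings: ~{savings_gb:.2f} GB")  # ~0.35 GB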

Step 8: Load Model and Tokenizer

This is the memory-intensive step. Watch the progress carefully.

# Cell 6: Load Model (takes 3-5 minutes)
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

print(f"Loading {model_name}...")
print("This takes 3-5 minutes on first run.\n")

start_time = time.time()

# Load tokenizer (fast)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
print(f"Tokenizer loaded in {time.time() - start_time:.1f}s")

# Load model with quantization (slow)
model_start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
print(f"Model loaded in {time.time() - model_start:.1f}s")

# Report memory usage
vram_used = torch.cuda.memory_allocated() / 1024**3
vram_reserved = torch.cuda.memory_reserved() / 1024**3
print(f"\nVRAM allocated: {vram_used:.2f} GB")
print(f"VRAM reserved: {vram_reserved:.2f} GB")
print(f"Headroom: {15.0 - vram_reserved:.2f} GB")

Output:

Loading meta-llama/Meta-Llama-3-8B-Instruct...
This takes 3-5 minutes on first run.

Tokenizer loaded in 2.3s
Model loaded in 187.4s

VRAM allocated: 4.89 GB
VRAM reserved: 5.12 GB
Headroom: 9.88 GB

Key insight: The 8B model uses only ~5GB with 4-bit quantization, matching our calculation from Lesson 4 (8B x 0.5 bytes = 4GB base + overhead).
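
You can reproduce that estimate in a couple of lines (a back-of-the-envelope sketch; 8.03B is the approximate published parameter count):

# Sanity-check the VRAM estimate from Lesson 4
params = 8.03e9              # approximate Llama-3-8B parameter count
bytes_per_param = 0.5        # 4-bit weights = 0.5 bytes each
weights_gb = params * bytes_per_param / 1024**3
print(f"Estimated weight memory: {weights_gb:.2f} GB")  # ~3.74 GB
# The rest of the observed 4.89 GB is quantization constants,
# layers kept in float16, and framework overhead.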

Step 9: Verify Model Works

Quick sanity check before moving to Task API prompts:

# Cell 7: Quick Test
test_prompt = "The capital of France is"

inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,  # Deterministic for testing
    )

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {test_prompt}")
print(f"Response: {result}")

# Verify output is reasonable
if "Paris" in result:
    print("\nModel verification PASSED")
else:
    print("\nModel verification: Check output quality")

Output:

Prompt: The capital of France is
Response: The capital of France is Paris. Paris is the largest city in France and the

Model verification PASSED
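
Generation needs VRAM beyond the weights (activations plus the KV cache), so it's worth glancing at the peak usage after this first generate call. A quick sketch using PyTorch's built-in counter:

# Peak VRAM allocated since the session started
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.2f} GB")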

Part 4: Task API Inference

Now let's use the model for our running example: Task API.

Step 10: Define the Task Assistant Prompt

# Cell 8: Task API Prompt Template
def create_task_prompt(user_message: str) -> str:
    """Format prompt for Llama-3-8B-Instruct."""
    system_message = """You are a helpful task management assistant.
When users describe tasks, you:
1. Create a structured task with title, description, due date, and priority
2. Ask clarifying questions if information is missing
3. Suggest relevant tags or categories"""

    # Llama 3 Instruct format
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    return prompt

# Test the template
test_message = "I need to review the Q4 budget by Friday"
print(create_task_prompt(test_message))

Output:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful task management assistant.
When users describe tasks, you:
1. Create a structured task with title, description, due date, and priority
2. Ask clarifying questions if information is missing
3. Suggest relevant tags or categories<|eot_id|><|start_header_id|>user<|end_header_id|>

I need to review the Q4 budget by Friday<|eot_id|><|start_header_id|>assistant<|end_header_id|>
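
Hand-writing the header tokens works, but one typo in a special token can silently degrade output quality. As an alternative, the tokenizer you already loaded ships with Llama 3's chat template, and apply_chat_template should produce an equivalent string:

# Alternative: let the tokenizer build the prompt from a messages list
messages = [
    {"role": "system", "content": "You are a helpful task management assistant."},
    {"role": "user", "content": "I need to review the Q4 budget by Friday"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # should match the hand-built format above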

Step 11: Create Inference Function

# Cell 9: Inference Function
def generate_task_response(user_message: str, max_tokens: int = 200) -> str:
    """Generate a task management response."""
    prompt = create_task_prompt(user_message)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the new tokens (skip the prompt)
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )

    return response.strip()

# Test the function
print("Testing inference function...")
response = generate_task_response("I need to review the Q4 budget by Friday")
print(f"\nResponse:\n{response}")

Output:

Testing inference function...

Response:
I'd be happy to help you create a task for reviewing the Q4 budget.

**Task Created:**
- **Title:** Review Q4 Budget
- **Description:** Conduct a thorough review of the quarterly budget for Q4
- **Due Date:** Friday (this week)
- **Priority:** High

A few questions to refine this task:
1. Is this a personal review or will you be presenting findings to a team?
2. Are there specific areas of the budget you want to focus on?
3. Should I set a reminder for Thursday to give you prep time?
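
Before running a batch of examples, it helps to know roughly how fast the quantized model generates; 4-bit inference on a shared T4 is not quick, and your numbers will vary from run to run. A minimal timing sketch:

# Rough throughput check (greedy decoding so timing is comparable across runs)
import time

prompt = create_task_prompt("I need to review the Q4 budget by Friday")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")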

Step 12: Run Multiple Task Examples

# Cell 10: Test Multiple Scenarios
test_cases = [
    "Schedule a team standup for every Monday at 9am",
    "Remind me to call the client about the proposal",
    "I need to finish the presentation but I'm not sure when it's due",
]

print("=" * 60)
print("Task API Inference Examples")
print("=" * 60)

for i, task in enumerate(test_cases, 1):
    print(f"\n--- Example {i} ---")
    print(f"User: {task}")
    print("\nAssistant:")
    response = generate_task_response(task)
    print(response)
    print()

Output:

============================================================
Task API Inference Examples
============================================================

--- Example 1 ---
User: Schedule a team standup for every Monday at 9am