Updated Feb 23, 2026

Production Kafka with Strimzi

Your development cluster works perfectly on Docker Desktop. One broker, ephemeral storage, no authentication. But production is a different world.

In production, Kafka clusters must survive broker failures without data loss. They must encrypt traffic to prevent eavesdropping. They must authenticate clients to prevent unauthorized access. And they must have enough resources to handle peak load without throttling.

The gap between your development setup and production readiness is significant. Lesson 4 got Kafka running quickly. This lesson makes it production-grade.

Development vs Production: The Gap

Before diving into configuration, understand what changes between environments:

| Aspect | Development | Production | Why It Matters |
|---|---|---|---|
| Node Architecture | Single dual-role node | Separate controller (3) + broker (3+) pools | Controller failures don't affect message processing; scale independently |
| Storage | Ephemeral | Persistent (SSD-backed PVCs) | Data survives pod restarts and node failures |
| Replication | Factor 1 | Factor 3, min.insync.replicas 2 | Survive broker failures without data loss |
| Encryption | None (plain listener) | TLS everywhere | Prevent network-level eavesdropping |
| Authentication | None | SCRAM-SHA-512 or mTLS | Prevent unauthorized client access |
| Authorization | None | Topic-level ACLs | Least-privilege access control |
| Resources | Default (minimal) | Explicit CPU/memory limits | Predictable performance; prevent noisy neighbors |

Every difference addresses a specific production failure mode. Let's configure each one.

Separate Controller and Broker Node Pools

In Lesson 4, you deployed a single node running both controller and broker roles. This works for development but creates problems in production:

  1. Blast radius: A controller failure takes down message processing
  2. Scaling constraints: Controllers don't need to scale like brokers
  3. Resource contention: Controller metadata operations compete with broker I/O

Production clusters separate these roles into dedicated node pools.

Controller Node Pool

Controllers manage cluster metadata through Raft consensus. They don't handle client traffic.

```yaml
# kafka-controller-pool.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: persistent-claim
    size: 10Gi
    class: standard  # Use your cluster's storage class
  resources:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  jvmOptions:
    "-Xms": 512m
    "-Xmx": 1g
```

Key production settings:

| Field | Value | Rationale |
|---|---|---|
| replicas: 3 | Odd number for Raft quorum | 3 controllers tolerate 1 failure; 5 tolerate 2 |
| roles: [controller] | Controller only | Dedicated to metadata, not message handling |
| storage.type: persistent-claim | Durable storage | Metadata survives pod restarts |
| storage.size: 10Gi | Modest size | Controllers store metadata, not messages |
| resources.limits | Explicit bounds | Prevent runaway memory; enable capacity planning |
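The quorum sizing rule behind replicas: 3 is simple majority arithmetic: a Raft quorum of n nodes survives floor((n - 1) / 2) failures. A minimal sketch (the function name is illustrative):

```python
def raft_failure_tolerance(controllers: int) -> int:
    """Number of controller failures a Raft quorum of this size survives.

    A quorum needs a strict majority, so n nodes tolerate (n - 1) // 2
    failures. Note that an even count adds cost without adding tolerance.
    """
    if controllers < 1:
        raise ValueError("need at least one controller")
    return (controllers - 1) // 2

for n in (1, 3, 4, 5):
    print(f"{n} controllers tolerate {raft_failure_tolerance(n)} failure(s)")
```

This is why 4 controllers are no better than 3: both tolerate exactly one failure.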

Broker Node Pool

Brokers handle producer/consumer traffic and store message data. They scale based on throughput requirements.

```yaml
# kafka-broker-pool.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
    class: standard
  resources:
    requests:
      memory: 4Gi
      cpu: 1000m
    limits:
      memory: 8Gi
      cpu: 2000m
  jvmOptions:
    "-Xms": 2g
    "-Xmx": 4g
```

Key production settings:

| Field | Value | Rationale |
|---|---|---|
| replicas: 3 | Minimum for RF=3 | Each partition has 3 copies across brokers |
| roles: [broker] | Broker only | Dedicated to message handling |
| storage.size: 100Gi | Production sizing | Sized for retention period and throughput |
| resources.limits.memory: 8Gi | Generous memory | JVM heap + page cache for read performance |
| jvmOptions.-Xmx: 4g | Half of limit | Leave memory for OS page cache |

Why Separate Pools?

The separation creates independent failure domains:

```
┌─────────────────────────────────────────────────────────────┐
│ Controller Pool (Metadata)                                  │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐              │
│  │  ctrl-0   │   │  ctrl-1   │   │  ctrl-2   │              │
│  │ (leader)  │   │ (follower)│   │ (follower)│              │
│  └───────────┘   └───────────┘   └───────────┘              │
│  Raft consensus for cluster metadata                        │
├─────────────────────────────────────────────────────────────┤
│ Broker Pool (Messages)                                      │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐  ┌─────────┐ │
│  │ broker-0  │   │ broker-1  │   │ broker-2  │  │ broker-3│ │
│  │ 100Gi SSD │   │ 100Gi SSD │   │ 100Gi SSD │  │ 100Gi   │ │
│  └───────────┘   └───────────┘   └───────────┘  └─────────┘ │
│  Partition replicas distributed                             │
└─────────────────────────────────────────────────────────────┘
```

Benefits:

  • Scale brokers independently (add broker-3, broker-4 for more throughput)
  • Controller failure doesn't stop message processing (existing brokers continue)
  • Different resource profiles (controllers need less memory, brokers need more)

TLS Encryption

In development, you used the plain listener on port 9092. Production traffic should be encrypted.

Update Kafka CRD for TLS

```yaml
# kafka-cluster-production.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: task-events
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.1.1
    metadataVersion: 4.1-IV0
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512
      - name: external
        port: 9094
        type: nodeport
        tls: true
        authentication:
          type: scram-sha-512
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      auto.create.topics.enable: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

Key security settings:

| Field | Value | Purpose |
|---|---|---|
| listeners[].tls: true | Enabled | Encrypt traffic with TLS 1.2+ |
| listeners[].authentication.type | scram-sha-512 | Require username/password auth |
| min.insync.replicas: 2 | 2 of 3 | With acks=all, writes need 2 in-sync replicas for durability |
| auto.create.topics.enable: false | Disabled | Prevent accidental topic creation |

How Strimzi Handles Certificates

Strimzi automatically manages TLS certificates:

  1. Cluster CA: Signs broker certificates; clients verify server identity
  2. Clients CA: Signs client certificates (for mTLS); brokers verify client identity
  3. Auto-renewal: Strimzi rotates certificates before expiration

You don't need to manually create certificates. Strimzi's Cluster Operator handles the PKI lifecycle.

To extract the CA certificate for clients:

```shell
# Get cluster CA certificate
kubectl get secret task-events-cluster-ca-cert -n kafka \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
```

Clients use this CA certificate to verify they're connecting to the real Kafka cluster, not an impersonator.

SCRAM-SHA-512 Authentication

TLS encrypts traffic but doesn't identify clients. Add authentication so only authorized clients can connect.

Create KafkaUser with ACLs

```yaml
# kafka-user-task-api.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: task-api
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Produce to task-* topics
      - resource:
          type: topic
          name: task-
          patternType: prefix
        operations:
          - Write
          - Describe
        host: "*"
      # Consume from task-* topics (for testing)
      - resource:
          type: topic
          name: task-
          patternType: prefix
        operations:
          - Read
          - Describe
        host: "*"
      # Consumer group for task-api
      - resource:
          type: group
          name: task-api-
          patternType: prefix
        operations:
          - Read
        host: "*"
```

ACL breakdown:

| Resource | Pattern | Operations | Purpose |
|---|---|---|---|
| topic: task- | prefix | Write, Describe | Produce to any topic starting with "task-" |
| topic: task- | prefix | Read, Describe | Consume from any topic starting with "task-" |
| group: task-api- | prefix | Read | Use consumer groups starting with "task-api-" |
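Prefix ACL matching is plain string-prefix comparison against the resource name. A minimal sketch of the semantics (the function is illustrative, not a Strimzi or Kafka API):

```python
def acl_allows(acl_name: str, pattern_type: str, resource_name: str) -> bool:
    """Return True if an ACL entry matches the requested resource name.

    Mirrors Kafka's literal vs prefix patternType matching semantics.
    """
    if pattern_type == "literal":
        return resource_name == acl_name
    if pattern_type == "prefix":
        return resource_name.startswith(acl_name)
    raise ValueError(f"unknown patternType: {pattern_type}")

# The task-api user may write to task-created but not to billing-events.
print(acl_allows("task-", "prefix", "task-created"))    # True
print(acl_allows("task-", "prefix", "billing-events"))  # False
```

This is why consistent topic naming (a shared "task-" prefix) matters: it lets one ACL cover every topic a service owns.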

Apply the user:

```shell
kubectl apply -f kafka-user-task-api.yaml
```

Output:

```
kafkauser.kafka.strimzi.io/task-api created
```

Retrieve Credentials

Strimzi stores the generated password in a Kubernetes Secret:

```shell
# Get the password
kubectl get secret task-api -n kafka \
  -o jsonpath='{.data.password}' | base64 -d
```

For a Python producer, you'd configure authentication like this:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'task-events-kafka-bootstrap:9093',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'SCRAM-SHA-512',
    'sasl.username': 'task-api',
    'sasl.password': '<password-from-secret>',
    'ssl.ca.location': '/path/to/ca.crt'
})
```

Critical settings for authenticated connections:

| Config | Value | Purpose |
|---|---|---|
| security.protocol | SASL_SSL | TLS encryption + SASL authentication |
| sasl.mechanism | SCRAM-SHA-512 | Password-based auth (not plaintext) |
| ssl.ca.location | Path to ca.crt | Verify server certificate |
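The consumer side uses the same SASL_SSL settings, plus a group id that must match the task-api- group ACL prefix. A hedged sketch: a small helper (illustrative, not a library API) that builds the dict you would pass to confluent_kafka.Consumer:

```python
def consumer_config(username: str, password: str, ca_path: str,
                    group_id: str) -> dict:
    """Build a consumer config for the SCRAM-over-TLS cluster above.

    The group id must start with 'task-api-', or the group ACL created
    earlier will reject the JoinGroup request.
    """
    if not group_id.startswith("task-api-"):
        raise ValueError("group id must match the task-api- ACL prefix")
    return {
        'bootstrap.servers': 'task-events-kafka-bootstrap:9093',
        'security.protocol': 'SASL_SSL',
        'sasl.mechanism': 'SCRAM-SHA-512',
        'sasl.username': username,
        'sasl.password': password,
        'ssl.ca.location': ca_path,
        'group.id': group_id,
        'auto.offset.reset': 'earliest',
    }

cfg = consumer_config('task-api', '<password-from-secret>',
                      '/path/to/ca.crt', 'task-api-workers')
# Usage (requires a reachable cluster):
#   from confluent_kafka import Consumer
#   consumer = Consumer(cfg)
#   consumer.subscribe(['task-created'])
```

Keeping the config in one helper makes the ACL constraint explicit in code instead of failing at runtime with an authorization error.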

Resource Limits and Requests

Production clusters need explicit resource boundaries. Without them:

  • Pods get OOMKilled during traffic spikes
  • Other workloads starve when Kafka uses all available resources
  • Capacity planning becomes guesswork

Sizing Guidelines

| Component | Memory Request | Memory Limit | CPU Request | CPU Limit |
|---|---|---|---|---|
| Controller | 1Gi | 2Gi | 500m | 1000m |
| Broker (small) | 4Gi | 8Gi | 1000m | 2000m |
| Broker (medium) | 8Gi | 16Gi | 2000m | 4000m |
| Broker (large) | 16Gi | 32Gi | 4000m | 8000m |

JVM heap sizing rules:

  • Set -Xmx to half the memory limit
  • Leave the other half for OS page cache (critical for read performance)
  • Set -Xms equal to -Xmx for predictable performance

Example for a broker with 8Gi memory limit:

```yaml
jvmOptions:
  "-Xms": 4g
  "-Xmx": 4g
  gcLoggingEnabled: true
```
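The half-the-limit rule is easy to encode; a tiny sketch (the function name is illustrative):

```python
def heap_for_limit(memory_limit_gi: int) -> str:
    """JVM heap size (-Xms = -Xmx) for a given broker memory limit.

    Half the container limit goes to heap; the other half is left for the
    OS page cache, which Kafka depends on for fast reads.
    """
    return f"{memory_limit_gi // 2}g"

print(heap_for_limit(8))   # 4g
print(heap_for_limit(16))  # 8g
```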

Persistent Storage Configuration

Production data must survive pod restarts. Configure storage classes that match your cloud provider:

AWS EKS

```yaml
storage:
  type: persistent-claim
  size: 500Gi
  class: gp3
  deleteClaim: false
```

GKE

```yaml
storage:
  type: persistent-claim
  size: 500Gi
  class: premium-rwo
  deleteClaim: false
```

Azure AKS

```yaml
storage:
  type: persistent-claim
  size: 500Gi
  class: managed-premium
  deleteClaim: false
```

Key settings:

| Field | Value | Purpose |
|---|---|---|
| deleteClaim: false | Preserve PVCs | Data survives accidental Kafka CRD deletion |
| size | Based on retention | Calculate: throughput × retention period × replication |

Storage Sizing Formula

```
Required Storage per Broker = (Messages/sec × Avg Message Size × Retention Seconds × Replication Factor) / Broker Count
```

Example:

- 10,000 messages/second
- 1 KB average message size
- 7 days (604,800 seconds) retention
- Replication factor 3
- 3 brokers

```
Storage = (10,000 × 1 KB × 604,800 × 3) / 3
        = 18,144,000,000 KB / 3 brokers
        = 6,048,000,000 KB ≈ 6,048 GB (~6 TB) per broker
```

Round up and add 20% headroom for operational flexibility.
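The same calculation as a small script, using the worked numbers above (the function name and headroom default are illustrative):

```python
def storage_per_broker_gb(msgs_per_sec: float, msg_size_kb: float,
                          retention_sec: int, replication: int,
                          brokers: int, headroom: float = 0.2) -> float:
    """Required disk per broker in GB, including operational headroom."""
    total_kb = msgs_per_sec * msg_size_kb * retention_sec * replication
    per_broker_gb = total_kb / brokers / 1_000_000  # KB -> GB
    return per_broker_gb * (1 + headroom)

# 10,000 msg/s, 1 KB messages, 7-day retention, RF=3, 3 brokers
need = storage_per_broker_gb(10_000, 1, 604_800, 3, 3)
print(f"{need:,.0f} GB per broker")  # 7,258 GB (6,048 GB + 20% headroom)
```

Rerun it whenever throughput or retention assumptions change; storage requirements scale linearly with both.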

Complete Production Configuration

Here's the full production deployment combining all security and reliability settings:

```yaml
# production-kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: persistent-claim
    size: 20Gi
    class: standard
    deleteClaim: false
  resources:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  jvmOptions:
    "-Xms": 512m
    "-Xmx": 1g
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
    class: standard
    deleteClaim: false
  resources:
    requests:
      memory: 4Gi
      cpu: 1000m
    limits:
      memory: 8Gi
      cpu: 2000m
  jvmOptions:
    "-Xms": 2g
    "-Xmx": 4g
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: task-events
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.1.1
    metadataVersion: 4.1-IV0
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      auto.create.topics.enable: false
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      num.partitions: 6
  entityOperator:
    topicOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
    userOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
```

Verifying Production Readiness

After applying the production configuration:

```shell
# Check all pods are running
kubectl get pods -n kafka

# Verify node pools
kubectl get kafkanodepools -n kafka

# Check Kafka status
kubectl get kafka task-events -n kafka -o yaml | grep -A 20 status:
```

Expected output:

```
NAME                              READY   STATUS    RESTARTS   AGE
task-events-controllers-0         1/1     Running   0          5m
task-events-controllers-1         1/1     Running   0          5m
task-events-controllers-2         1/1     Running   0          5m
task-events-brokers-0             1/1     Running   0          4m
task-events-brokers-1             1/1     Running   0          4m
task-events-brokers-2             1/1     Running   0          4m
task-events-entity-operator-...   2/2     Running   0          3m
```

Migration Path: Development to Production

If you have an existing development cluster, here's the migration approach:

  1. Create new production node pools alongside existing dual-role pool
  2. Apply updated Kafka CRD with new listeners and replication settings
  3. Wait for Strimzi to redistribute partitions to new brokers
  4. Create KafkaUser resources for all clients
  5. Update client configurations to use TLS + authentication
  6. Remove the development node pool once all traffic has migrated

Strimzi handles partition redistribution automatically when you add brokers. The migration can be done with zero downtime.


Reflect on Your Skill

You built a kafka-events skill in Lesson 0. Test and improve it based on what you learned.

Test Your Skill

Using my kafka-events skill, configure a production Kafka cluster with TLS encryption, SCRAM authentication, and node pools.
Does my skill generate Strimzi CRDs with proper security and resource allocation?

Identify Gaps

Ask yourself:

  • Did my skill include TLS listener configuration and certificate management?
  • Did it cover SCRAM user authentication and ACL setup?

Improve Your Skill

If you found gaps:

My kafka-events skill is missing production Kafka configuration (TLS, SCRAM, node pools, resource quotas).
Update it to include how to secure and scale Kafka clusters in production.

Try With AI

You've configured production Kafka with security and reliability features. Now explore how to validate and optimize your configuration.

Prompt 1: Security Audit

I've configured production Kafka with:
- Separate controller (3 nodes) and broker (3 nodes) pools
- TLS on port 9093
- SCRAM-SHA-512 authentication
- KafkaUser with topic-prefix ACLs

Review my security configuration:
1. What attack vectors am I still exposed to?
2. How would you improve the ACL configuration for least privilege?
3. What monitoring should I add to detect unauthorized access attempts?

Start by asking about my specific security requirements (compliance,
multi-tenancy, external access) so you can give targeted recommendations.

What you're learning: Security configuration involves tradeoffs between usability and protection. AI can help you understand your threat model and prioritize hardening efforts.

Prompt 2: Capacity Planning

I'm planning a production Kafka cluster for an event-driven Task API with:
- Expected load: 5,000 events/second at peak
- Average event size: 2KB
- Retention: 30 days
- Must survive 1 broker failure without data loss

Help me size the cluster:
1. How many brokers do I need?
2. What storage per broker?
3. What memory and CPU?

Walk me through your calculations so I can adjust them as our
requirements change.

What you're learning: Capacity planning requires understanding the relationship between throughput, retention, replication, and resources. AI can teach you the formulas while applying them to your specific scenario.

Prompt 3: Troubleshooting Production Issues

My production Kafka cluster is experiencing issues:
- Producers getting NotEnoughReplicasException intermittently
- Consumer lag increasing on some partitions
- One broker showing higher disk usage than others

Here's my configuration:
- 3 brokers, replication factor 3, min.insync.replicas 2
- Storage: 100Gi per broker, 60% used on broker-2

Help me diagnose:
1. What's likely causing each symptom?
2. What commands should I run to investigate?
3. What's the priority order for fixing these issues?

What you're learning: Production debugging requires correlating symptoms with root causes. AI can help you develop a systematic troubleshooting methodology for distributed systems.

Safety note: Always test configuration changes in a staging environment before production. Incorrect authentication or replication settings can cause client failures or data loss. Keep your development cluster configuration separate so you can iterate quickly without risking production.