I discovered I was wasting 31 CPU cores and 63GB of RAM only after implementing proper observability. Before SigNoz, I was flying blind: guessing at performance issues and finding problems only when things broke. This article covers deploying a complete observability stack with SigNoz, OpenTelemetry, and Fluent Bit, giving you the visibility needed to run a production cluster.
Why SigNoz for Kubernetes Observability
After evaluating Prometheus + Grafana, Elastic Stack, and SigNoz, here's why SigNoz won:
Single Platform for Everything
- Metrics, traces, and logs in one UI
- No need to correlate across multiple tools
- Multiple query interfaces: PromQL, visual Query Builder, and ClickHouse SQL
OpenTelemetry Native
- Built on open standards, not proprietary formats
- Easy migration if needed
- Huge ecosystem support
Cost Effective
- Open source with no per-node licensing
- Efficient columnar storage (ClickHouse)
- Better compression than Elasticsearch
Kubernetes Optimized
- Auto-discovery of services
- Pre-built dashboards for Kubernetes
- Native support for container metrics
Pre-Installation Planning
Resource Requirements
Based on my experience (Linear issue COM-184) monitoring a 7-node cluster with multiple applications:
SigNoz Components (Actual Usage):
ClickHouse: 2 CPU, 8GB RAM # Database for metrics/traces/logs
SigNoz (unified): 600m CPU, 1.5GB RAM # Query Service + Frontend + Alertmanager (v0.76+)
OTEL Collector: 1 CPU, 2GB RAM # Collects and processes telemetry
Total: ~4 CPU, 12GB RAM for monitoring 7 nodes
Additional: ~2 CPU, 4GB RAM total for DaemonSet collectors across all 7 nodes (~0.3 CPU, 0.6GB per node)
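The totals above are plain sums of the per-component figures. A minimal Python sketch of that arithmetic, using only the numbers listed (the dictionary and variable names are illustrative, not part of any SigNoz tooling):

```python
# Per-component usage copied from the list above: (CPU cores, RAM in GB).
components = {
    "clickhouse": (2.0, 8.0),
    "signoz": (0.6, 1.5),
    "otel_collector": (1.0, 2.0),
}
NODES = 7
PER_NODE_DAEMONSET = (0.3, 0.6)  # approximate per-node collector usage

core_cpu = sum(cpu for cpu, _ in components.values())
core_ram = sum(ram for _, ram in components.values())
daemonset_cpu = NODES * PER_NODE_DAEMONSET[0]
daemonset_ram = NODES * PER_NODE_DAEMONSET[1]

print(f"core stack: {core_cpu:.1f} CPU, {core_ram:.1f} GB")
print(f"daemonset:  {daemonset_cpu:.1f} CPU, {daemonset_ram:.1f} GB")
```

The core stack comes to 3.6 CPU and 11.5GB, which is where the rounded "~4 CPU, 12GB" figure comes from.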
Day-2 Operations: Resource Monitoring
# Monitor SigNoz resource usage over time
kubectl top pods -n signoz --sort-by=memory
kubectl top pods -n signoz --sort-by=cpu
# Check storage usage (ClickHouse grows quickly)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
clickhouse-client --query "SELECT
database,
formatReadableSize(sum(data_compressed_bytes)) as compressed,
formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed
FROM system.parts
GROUP BY database ORDER BY sum(data_compressed_bytes) DESC"
# Expected output showing storage by database:
# signoz_metrics 2.1 GiB 8.3 GiB
# signoz_traces 1.4 GiB 4.2 GiB
# signoz_logs 3.2 GiB 12.1 GiB
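Dividing the uncompressed size by the compressed size gives a rough compression ratio per database, which is useful when estimating disk needs from raw ingest volume. A small sketch using the sample figures above (the function name is illustrative):

```python
def compression_ratio(compressed_gib: float, uncompressed_gib: float) -> float:
    """How many bytes of raw data each stored byte represents."""
    return uncompressed_gib / compressed_gib

# (compressed GiB, uncompressed GiB) from the sample output above
sample = {
    "signoz_metrics": (2.1, 8.3),
    "signoz_traces": (1.4, 4.2),
    "signoz_logs": (3.2, 12.1),
}

ratios = {db: round(compression_ratio(c, u), 2) for db, (c, u) in sample.items()}
for db, ratio in ratios.items():
    print(f"{db}: {ratio}x")
```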
Storage Planning
Data Retention Planning:
Metrics: 30 days @ 10GB/day = 300GB
# Reason: Need month-over-month comparisons for capacity planning
Traces: 7 days @ 5GB/day = 35GB
# Reason: Traces are for immediate debugging, not long-term analysis
Logs: 14 days @ 20GB/day = 280GB
# Reason: Logs needed for security analysis and compliance
Total Storage: ~615GB (provision 1TB for growth)
Storage Performance Requirements:
Read IOPS: 5000+ (queries scan historical data)
Write IOPS: 2000+ (constant ingestion)
Sequential Read: 200+ MB/s (large query responses)
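The retention plan is just days multiplied by daily volume, summed across signals. A minimal sketch of the calculation, so you can plug in your own ingest rates (names are illustrative; the figures are the ones from the plan above):

```python
# (retention days, GB ingested per day) from the plan above
retention_plan = {
    "metrics": (30, 10),
    "traces": (7, 5),
    "logs": (14, 20),
}

per_signal_gb = {name: days * gb for name, (days, gb) in retention_plan.items()}
total_gb = sum(per_signal_gb.values())

provisioned_gb = 1000  # the 1TB figure from the plan
headroom = provisioned_gb / total_gb - 1

print(per_signal_gb)
print(f"total: {total_gb} GB, headroom at 1TB: {headroom:.0%}")
```

Changing a single retention window or ingest rate shows immediately which signal dominates the disk budget; in this plan, metrics and logs account for almost all of it.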
Installing SigNoz
Deploy with Helm
# signoz-values.yaml - Updated for SigNoz v0.76+ unified binary
global:
  storageClass: ceph-block-fast
  clusterName: homelab-cluster

clickhouse:
  enabled: true
  replicaCount: 1 # Single instance for homelab
  persistence:
    enabled: true
    size: 500Gi # Below the ~615GB retention plan above; raise it if you keep those retention windows
  resources:
    requests:
      cpu: 1000m
      memory: 4Gi
    limits:
      cpu: 4000m
      memory: 8Gi
  configuration:
    settings:
      # Performance tuning for homelab
      max_memory_usage: "6000000000" # 6GB
      max_memory_usage_for_user: "5000000000" # 5GB
      max_bytes_before_external_group_by: "3000000000" # 3GB
      distributed_aggregation_memory_efficient: "1"
      # Compression settings (Operator format)
      compression/case/method: "zstd"

# Unified signoz service (v0.76+: Query Service + Frontend + Alertmanager)
signoz:
  enabled: true
  replicaCount: 2 # HA for queries
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 2Gi
  # Note: Retention is now configured in the UI (Settings → General)
  # New retention settings apply only to newly ingested data
  # Configuration via environment variables (unified binary)
  additionalEnvs:
    # Enable dot metrics support for OTel metric names (required for dashboard queries)
    - name: DOT_METRICS_ENABLED
      value: "true"
    # Alerting configuration
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_EXTERNAL__URL
      value: "https://signoz.homelab.example"
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_GLOBAL_SMTP__SMARTHOST
      value: "smtp.homelab.example:587"
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_GLOBAL_SMTP__FROM
      value: "[email protected]"
    # Webhook receiver for alerts
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_RECEIVERS__0__WEBHOOK__CONFIGS__0__URL
      value: "http://webhook-receiver.monitoring.svc.cluster.local/alerts"

otelCollector:
  enabled: true
  replicaCount: 1 # Must be 1 when k8s_cluster receiver is enabled (prevents duplicate metrics)
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  config:
    receivers:
      # Kubernetes metrics (MUST run on exactly one replica to prevent duplicates)
      k8s_cluster:
        collection_interval: 30s
        node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
      # Note: Host metrics are collected by the DaemonSet, not here
      # This prevents duplicate collection of the same metrics
      # Prometheus receiver for scraping
      prometheus:
        config:
          scrape_configs:
            - job_name: kubernetes-pods
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  regex: true
                - action: replace
                  source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  target_label: __metrics_path__
                  regex: (.+)
                - action: replace
                  source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  # $$ escapes $ so the Collector does not expand ${1} as an environment variable
                  replacement: $${1}:$${2}
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
    processors:
      # Add cluster metadata
      resource:
        attributes:
          - key: cluster.name
            value: homelab-cluster
            action: upsert
          - key: environment
            value: production
            action: upsert
      # Memory limiter to prevent OOM (must be listed first in every pipeline's processors)
      memory_limiter:
        check_interval: 1s
        limit_mib: 3072
        spike_limit_mib: 512
      # Batch for efficiency
      batch:
        timeout: 10s
        send_batch_size: 10000
SigNoz is a single binary since v0.76; this chart deploys one signoz deployment (UI+API+Alertmanager).
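The `__address__` relabel rule in the `kubernetes-pods` job is the least obvious part of the values file. Prometheus joins the source labels with `;`, matches the fully anchored regex, and rewrites the address to use the port from the `prometheus.io/port` annotation. A minimal Python sketch of that behavior (the function name and sample addresses are illustrative; IPv6 addresses are not handled):

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ';', annotated port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address: str, port_annotation: str) -> str:
    """Mimic the 'replace' relabel action on __address__."""
    joined = f"{address};{port_annotation}"  # source labels are joined with ';'
    match = ADDRESS_RULE.fullmatch(joined)   # relabel regexes are fully anchored
    if match is None:
        return address                        # no match: the label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"

print(rewrite_address("10.42.0.15:8080", "9102"))  # existing port replaced
print(rewrite_address("10.42.0.15", "9102"))       # port appended
print(rewrite_address("10.42.0.15:8080", ""))      # no annotation: unchanged
```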
Install SigNoz:
# Create namespace
kubectl create namespace signoz
# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update
# Install SigNoz (get latest chart version)
SIGNOZ_CHART_VERSION=$(helm search repo signoz/signoz --versions -o json | jq -r '.[0].version')
helm show chart signoz/signoz --version "$SIGNOZ_CHART_VERSION" | grep appVersion
# appVersion: 0.94.x ← verify before installing
# Note: v0.94.x requires ClickHouse 25.5.6 - verify your CH image/tag
helm install signoz signoz/signoz \
--namespace signoz \
--version "$SIGNOZ_CHART_VERSION" \
--values signoz-values.yaml
# Wait for rollout
kubectl -n signoz rollout status deployment/signoz
kubectl -n signoz rollout status statefulset/chi-signoz-clickhouse-cluster-0-0
Configuring OpenTelemetry Collection
Deploy OTEL Collector DaemonSet
# otel-collector-daemonset.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: signoz
data:
  config.yaml: |
    receivers:
      # Collect host metrics (requires hostfs mount for accurate node metrics)
      # Note: Without hostfs mount + root_path, you'd get container stats instead of actual node stats
      hostmetrics:
        root_path: /hostfs
        collection_interval: 30s
        scrapers:
          cpu:
            metrics:
              # Optional metric, disabled by default; the dashboards and alerts below query it
              system.cpu.utilization:
                enabled: true
          memory:
            metrics:
              # Optional metric, disabled by default; the dashboards and alerts below query it
              system.memory.utilization:
                enabled: true
          disk: {}
          network: {}
          filesystem: {}
          load: {}
      # Kubernetes events (using k8sobjects - future-proof replacement for deprecated k8s_events)
      k8sobjects:
        objects:
          - name: events
            mode: watch
      # Note: Container logs are handled by Fluent Bit → OTLP/HTTP on port 4318
      # This prevents duplicate log ingestion. Remove this comment if using OTel-only setup.
      # Collect Prometheus metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: kubelet
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                # Keep only this pod's own node; otherwise every DaemonSet pod scrapes every node
                - action: keep
                  source_labels: [__meta_kubernetes_node_name]
                  regex: ${env:KUBE_NODE_NAME}
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            # Container metrics via cAdvisor (for resource waste analysis)
            - job_name: kubelet-cadvisor
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                # Keep only this pod's own node (same reason as the kubelet job)
                - action: keep
                  source_labels: [__meta_kubernetes_node_name]
                  regex: ${env:KUBE_NODE_NAME}
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
    processors:
      # Add Kubernetes attributes
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        filter:
          node_from_env_var: KUBE_NODE_NAME
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.statefulset.name
            - k8s.daemonset.name
            - k8s.cronjob.name
            - k8s.job.name
            - k8s.node.name
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.pod.start_time
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
          - sources:
              - from: resource_attribute
                name: k8s.pod.uid
          - sources:
              - from: connection
      # Resource detection
      resourcedetection/system:
        detectors: [system, env, docker]
        timeout: 2s
        override: false
      # Memory limiting (must be listed first in every pipeline's processors)
      # Kept below the 1Gi container limit so the limiter acts before the pod is OOM-killed
      memory_limiter:
        check_interval: 1s
        limit_mib: 800
        spike_limit_mib: 200
      # Batch processing
      batch:
        timeout: 10s
        send_batch_size: 10000
    exporters:
      # Send to SigNoz
      otlp/signoz:
        endpoint: signoz-otel-collector.signoz.svc.cluster.local:4317
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      pipelines:
        metrics:
          receivers: [hostmetrics, prometheus]
          processors: [memory_limiter, k8sattributes, resourcedetection/system, batch]
          exporters: [otlp/signoz]
        logs:
          receivers: [k8sobjects] # Using k8sobjects for events (filelog removed to prevent duplicate ingestion with Fluent Bit)
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/signoz]
      telemetry:
        logs:
          level: info
        metrics:
          level: detailed
          address: 0.0.0.0:8888
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: signoz
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.132.0
          command: ["/otelcol-contrib", "--config=/conf/config.yaml"]
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config
              mountPath: /conf
            - name: hostfs
              mountPath: /hostfs
              readOnly: true
              mountPropagation: HostToContainer
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
        - name: hostfs
          hostPath:
            path: /
      tolerations:
        - operator: Exists
Deploy the collector:
kubectl apply -f otel-collector-daemonset.yaml
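A `memory_limiter` only helps if its hard limit sits below the container's memory limit; otherwise the pod is OOM-killed before the limiter starts refusing data. A minimal sketch of that sanity check follows. The 80% threshold is a rule-of-thumb assumption of mine, not a SigNoz or OpenTelemetry requirement, and the function name is illustrative.

```python
def limiter_has_headroom(limit_mib: int, container_limit_mib: int,
                         max_fraction: float = 0.8) -> bool:
    """True when limit_mib leaves room below the container memory limit.

    max_fraction is an assumed rule of thumb, not an official value.
    """
    return limit_mib / container_limit_mib <= max_fraction

# 3072 MiB limiter inside a 4Gi (4096 MiB) container: 75% of the limit
print(limiter_has_headroom(3072, 4096))
# A limiter set equal to the container limit leaves no room at all
print(limiter_has_headroom(1024, 1024))
```

Running a check like this against every collector's `limit_mib` and container limit pair is a cheap way to catch a limiter that can never trigger.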
Instrumenting Applications
Auto-Instrumentation with OTEL Operator
# otel-instrumentation.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: apps
spec:
  # Java auto-instrumentation
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://signoz-otel-collector.signoz.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: "service.namespace=apps,environment=production"
  # Python auto-instrumentation
  # The operator's Python image exports OTLP over HTTP (http/protobuf), so it targets port 4318
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://signoz-otel-collector.signoz.svc.cluster.local:4318
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: http/protobuf
  # Node.js auto-instrumentation
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://signoz-otel-collector.signoz.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
Apply to deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        # If Instrumentation CR is in different namespace, use: "apps/auto-instrumentation"
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: my-app:latest
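A common instrumentation mistake is pairing an OTLP protocol with the wrong port: by default 4317 carries gRPC and 4318 carries HTTP. A minimal sketch of a check you could run over your `Instrumentation` env values before applying them (the helper name is illustrative and it only knows the default ports):

```python
from urllib.parse import urlparse

# Default OTLP ports per protocol
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318}

def endpoint_matches_protocol(endpoint: str, protocol: str) -> bool:
    """True when the endpoint's port is the default port for the protocol."""
    return urlparse(endpoint).port == DEFAULT_PORTS.get(protocol)

collector = "http://signoz-otel-collector.signoz.svc.cluster.local"
print(endpoint_matches_protocol(f"{collector}:4317", "grpc"))           # consistent pair
print(endpoint_matches_protocol(f"{collector}:4317", "http/protobuf"))  # mismatched pair
```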
Creating Custom Dashboards
Infrastructure Dashboard
{
"title": "Kubernetes Cluster Overview",
"panels": [
{
"title": "Node CPU Usage",
"query": "100 - (avg by (\"k8s.node.name\") ({\"system.cpu.utilization\",\"state\"=\"idle\"}) * 100)",
"type": "timeseries"
},
{
"title": "Node Memory Usage",
"query": "avg by (\"k8s.node.name\") ({\"system.memory.utilization\"}) * 100",
"type": "gauge"
},
{
"title": "Running Pods by Namespace",
"query": "sum by (\"k8s.namespace.name\") ({\"k8s.pod.phase\",\"phase\"=\"Running\"})",
"type": "bar"
},
{
"title": "Ready Nodes",
"query": "sum ({\"k8s.node.condition\",\"condition\"=\"Ready\",\"status\"=\"true\"})",
"type": "stat"
}
],
"note": "For metrics like kube_pod_info and kubelet_volume_stats_*, you need kube-state-metrics and additional kubelet scrape configs"
}
Application Performance Dashboard
{
"title": "Application Performance",
"panels": [
{
"title": "Request Rate",
"query": "sum by (\"service.name\") (rate({\"http.server.request.duration.count\"}[5m]))",
"type": "timeseries"
},
{
"title": "Error Rate",
"query": "sum by (\"service.name\") (rate({\"http.server.request.duration.count\",\"http.response.status_code\"=~\"5..\"}[5m]))",
"type": "timeseries"
},
{
"title": "P95 Latency (seconds)",
"query": "histogram_quantile(0.95, sum by (le,\"service.name\") (rate({\"http.server.request.duration.bucket\"}[5m])))",
"type": "timeseries",
"note": "Uses histogram buckets for quantile calculation, unit is seconds"
},
{
"title": "Service Health",
"query": "ClickHouse SQL: SELECT service_name, count() FROM signoz_traces.distributed_signoz_index_v3 WHERE timestamp >= now() - INTERVAL 1 HOUR GROUP BY service_name",
"type": "table",
"note": "Use Trace Explorer UI or ClickHouse SQL for trace analysis"
}
]
}
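The P95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. A simplified Python sketch of that idea follows; it is not the Prometheus implementation, and the bucket values are invented for illustration.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper_bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # rank falls in +Inf: report the highest finite bound
            if count == lower_count:
                return lower_bound  # empty bucket, nothing to interpolate
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

# Hypothetical request-duration buckets in seconds
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"p95 = {p95:.4f}s")
```

With these buckets the 95th request falls in the 0.5s to 1.0s bucket, so the estimate lands inside that range. The result is only as precise as the bucket boundaries, which is why bucket layout matters for latency alerts.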
Log Aggregation with Fluent Bit
Deploy Fluent Bit
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: signoz
data:
  fluent-bit.conf: |
    [SERVICE]
        Daemon        Off
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri
        Tag               kube.*
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                 kubernetes
        Match                kube.*
        Kube_URL             https://kubernetes.default.svc:443
        Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix      kube.var.log.containers.
        Merge_Log            On
        Keep_Log             Off
        K8S-Logging.Parser   On
        K8S-Logging.Exclude  On
        Labels               Off
        Annotations          Off

    [FILTER]
        Name   modify
        Match  kube.*
        Add    cluster.name homelab-cluster
        Add    environment production

    [OUTPUT]
        Name                  opentelemetry
        Match                 kube.*
        Host                  signoz-otel-collector.signoz.svc.cluster.local
        Port                  4318
        Logs_uri              /v1/logs
        Log_response_payload  True
  parsers.conf: |
    [PARSER]
        Name             docker
        Format           json
        Time_Key         time
        Time_Format      %Y-%m-%dT%H:%M:%S.%LZ
        Decode_Field_As  escaped_utf8 log
Alert Rules
Critical Alerts
# alert-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-rules
  namespace: signoz
data:
  rules.yaml: |
    groups:
      - name: cluster-health
        interval: 30s
        rules:
          - alert: NodeNotReady
            expr: 'sum({"k8s.node.condition","condition"="Ready","status"="false"}) > 0'
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "One or more nodes are NotReady"
          - alert: HighMemoryUsage
            expr: 'avg by ("k8s.node.name") ({"system.memory.utilization"}) > 0.85'
            for: 5m
            labels:
              severity: warning
            annotations:
              # Label names containing dots need index; $labels.name only works for plain identifiers
              summary: 'Node {{ index $labels "k8s.node.name" }} memory usage above 85%'
          - alert: PodCrashLooping
            expr: 'increase({"k8s.container.restarts"}[5m]) > 0'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Container {{ index $labels "k8s.container.name" }} in pod {{ index $labels "k8s.namespace.name" }}/{{ index $labels "k8s.pod.name" }} is restarting'
          - alert: HighCPUUsage
            expr: 'avg by ("k8s.node.name") (100 - {"system.cpu.utilization","state"="idle"} * 100) > 80'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Node {{ index $labels "k8s.node.name" }} CPU usage above 80%'
      - name: application-health
        interval: 30s
        rules:
          - alert: HighErrorRate
            expr: |
              (
                sum by ("service.name") (rate({"http.server.request.duration.count","http.response.status_code"=~"5.."}[5m]))
                /
                clamp_min(sum by ("service.name") (rate({"http.server.request.duration.count"}[5m])), 1)
              ) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: 'Service {{ index $labels "service.name" }} error rate above 5%'
          - alert: HighLatency
            expr: 'histogram_quantile(0.95, sum by (le,"service.name") (rate({"http.server.request.duration.bucket"}[5m]))) > 1'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Service {{ index $labels "service.name" }} P95 latency above 1s'
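The `clamp_min(..., 1)` in the HighErrorRate rule deserves a closer look: it floors the denominator at 1 request per second, which avoids division by zero but also mutes the alert for low-traffic services. A small sketch of the effect, with invented request rates:

```python
def error_ratio(error_rate: float, total_rate: float,
                min_denominator: float = 1.0) -> float:
    """Error ratio with the denominator floored, like clamp_min(total, 1)."""
    return error_rate / max(total_rate, min_denominator)

THRESHOLD = 0.05

# Busy service: 10 errors/s out of 100 requests/s
busy = error_ratio(10.0, 100.0)
# Quiet service: 0.02 errors/s out of 0.1 requests/s (20% of its requests fail)
quiet = error_ratio(0.02, 0.1)

print(f"busy:  {busy:.2f} fires={busy > THRESHOLD}")
print(f"quiet: {quiet:.2f} fires={quiet > THRESHOLD}")
```

The quiet service fails one request in five yet stays under the threshold, because its denominator is treated as 1 request per second. Whether that trade-off is acceptable depends on how much you care about rarely-called services.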
Resource Optimization Discovery
Here's how I discovered the 31 wasted CPU cores:
# Query to find over-provisioned resources (requires kube-state-metrics)
sum by (namespace, pod) (
sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
-
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
) > 0.5
# Results showed:
# Tetragon: Requested 14 CPU, using 0.07 CPU
# Fluent-bit: Requested 3 CPU, using 0.018 CPU
# Rook-Ceph: Requested 15 CPU, using 0.2 CPU
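The query boils down to "requested minus used" per workload, filtered by a threshold. The same arithmetic in Python, using the three results above (the names are illustrative):

```python
# (requested CPU cores, observed CPU cores) from the results above
workloads = {
    "tetragon": (14.0, 0.07),
    "fluent-bit": (3.0, 0.018),
    "rook-ceph": (15.0, 0.2),
}

waste = {name: requested - used for name, (requested, used) in workloads.items()}
total_waste = sum(waste.values())
over_threshold = [name for name, cores in waste.items() if cores > 0.5]

for name, cores in waste.items():
    print(f"{name}: {cores:.2f} cores unused")
print(f"total: {total_waste:.1f} cores")
```

These three workloads alone account for about 31.7 unused cores, which is where the headline figure of 31 comes from.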
Performance Impact
After proper observability implementation:
Before Observability:
Resource waste: 31 CPU cores, 63GB RAM
MTTR: 2-4 hours
Debugging: Random kubectl commands
Cost: ~$200/month in wasted resources
After Observability:
Resource optimization: Reclaimed 25 CPU, 45GB RAM
MTTR: 15-30 minutes
Debugging: Direct trace to root cause
Cost: Saved ~$150/month
Troubleshooting
High Cardinality Issues
# Check cardinality
curl http://signoz-clickhouse:8123/ -d "
SELECT
count(DISTINCT labels) as cardinality
FROM signoz_metrics.distributed_time_series_v4
WHERE metric_name = 'problematic_metric'
"
# Fix by adding drop rules
processors:
  filter/drop_high_cardinality:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - .*_bucket_.*
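Cardinality is the number of distinct label sets a metric produces, so a single unbounded label can multiply the series count. A minimal sketch of that counting, with a hypothetical `user.id` label as the culprit (all names and values here are invented for illustration):

```python
def cardinality(series: list[dict[str, str]],
                ignore: frozenset[str] = frozenset()) -> int:
    """Count distinct label sets, optionally ignoring some label keys."""
    return len({
        frozenset((key, value) for key, value in labels.items() if key not in ignore)
        for labels in series
    })

# One service and route, but a per-user label on every data point
series = [
    {"service.name": "cart", "http.route": "/cart", "user.id": str(uid)}
    for uid in range(1000)
]

before = cardinality(series)
after = cardinality(series, ignore=frozenset({"user.id"}))
print(f"with user.id: {before} series, without: {after}")
```

Dropping or aggregating away the unbounded label is usually a better fix than dropping the whole metric, since the remaining series stay queryable.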
Storage Growth
# Check storage usage
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
clickhouse-client --query "SELECT
database,
table,
formatReadableSize(sum(bytes)) as size
FROM system.parts
GROUP BY database, table
ORDER BY sum(bytes) DESC"
# Adjust retention (check table version first)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
clickhouse-client --query "SHOW TABLES FROM signoz_traces LIKE 'signoz_index_%'"
# Set TTL on local table on all shards (not distributed table)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
clickhouse-client --query "
ALTER TABLE signoz_traces.signoz_index_v3
ON CLUSTER '{cluster}'
MODIFY TTL timestamp + INTERVAL 3 DAY"
# Optionally apply TTL to existing parts (resource-intensive operation)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
clickhouse-client --query "
ALTER TABLE signoz_traces.signoz_index_v3
ON CLUSTER '{cluster}'
MATERIALIZE TTL"
Security Considerations
RBAC for OpenTelemetry Collector
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: signoz
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-read
rules:
  - apiGroups: [""]
    # nodes/proxy and nodes/metrics are needed because the kubelet scrape jobs
    # go through the API server proxy path (/api/v1/nodes/<node>/proxy/metrics)
    resources: ["pods", "namespaces", "nodes", "nodes/proxy", "nodes/metrics", "events", "endpoints", "services"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "daemonsets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-read-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-read
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: signoz
60-Second Smoke Test
Verify your observability stack is working end-to-end:
1. Generate Test Signals
Deploy the OpenTelemetry demo application:
# Deploy otel-demo for signal generation
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-demo open-telemetry/opentelemetry-demo \
--namespace otel-demo \
--create-namespace \
--set opentelemetry-collector.enabled=false \
--set default.envOverrides[0].name=OTEL_EXPORTER_OTLP_ENDPOINT \
--set default.envOverrides[0].value=http://signoz-otel-collector.signoz.svc.cluster.local:4317 \
--set default.envOverrides[1].name=OTEL_EXPORTER_OTLP_PROTOCOL \
--set default.envOverrides[1].value=grpc
# Wait for deployment (frontend-proxy is the correct service name in the chart)
kubectl -n otel-demo rollout status deployment/otel-demo-frontend-proxy
# Generate some load
kubectl -n otel-demo port-forward svc/otel-demo-frontend-proxy 8080:8080 &
curl -s http://localhost:8080 > /dev/null
curl -s http://localhost:8080/product/OLJCESPC7Z > /dev/null
curl -s http://localhost:8080/cart > /dev/null
2. Verify Metrics (PromQL)
Check request rate per service in SigNoz UI:
sum by ("service.name") (rate({"http.server.request.duration.count"}[5m]))
Check for errors:
sum by ("service.name") (
rate({"http.server.request.duration.count","http.response.status_code"=~"5.."}[5m])
)
3. Verify Traces (ClickHouse SQL)
Services seen in the last hour:
SELECT service_name, count()
FROM signoz_traces.distributed_signoz_index_v3
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service_name
ORDER BY count() DESC;
4. Verify Logs
Search for "otel-demo" in the SigNoz Logs UI. You should see container logs from the demo application flowing through Fluent Bit → OTLP.
5. Test Alerts
Trigger a test alert:
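# Create CPU load to trigger the alert. One busy loop saturates a single core,
# so on multi-core nodes you may need several of these pods to push a node past 80%.
kubectl run cpu-load --image=busybox --restart=Never -- sh -c "while true; do :; done"
# If node CPU stays above 80%, the HighCPUUsage rule should fire after its 5-minute "for" window
# Clean up once the alert has fired, then confirm that it resolves
kubectl delete pod cpu-load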
Expected Results:
- Metrics: >0 QPS for otel-demo services
- Traces: Multiple service names visible
- Logs: Container logs from otel-demo namespace
- Alerts: CPU alert fires and resolves
What's Next
With complete observability in place, you have the visibility needed to optimize and secure your cluster. In Part 8, we'll implement comprehensive security hardening with Kyverno policies, Tetragon runtime protection, and disaster recovery with Velero.
Key Takeaways
- Observability reveals hidden resource waste - I found 31 wasted CPU cores and reclaimed 25 of them
- OpenTelemetry provides vendor-neutral instrumentation - No lock-in
- Unified platform beats multiple tools - Correlating is easier
- Cardinality management is critical - High cardinality kills performance
- Proper dashboards reduce MTTR - From hours to minutes