Part 7: Complete Observability Stack

I discovered I was wasting 31 CPU cores and 63GB of RAM only after implementing proper observability. Before SigNoz, I was flying blind - guessing at performance issues and discovering problems only when things broke. This article covers deploying a complete observability stack with SigNoz, OpenTelemetry, and Fluent Bit, giving you the visibility needed to run a production cluster.

Why SigNoz for Kubernetes Observability

After evaluating Prometheus + Grafana, Elastic Stack, and SigNoz, here's why SigNoz won:

Single Platform for Everything

Metrics, traces, and logs in one UI
No need to correlate across multiple tools
Multiple query interfaces: PromQL, visual Query Builder, and ClickHouse SQL

OpenTelemetry Native

Built on open standards, not proprietary formats
Easy migration if needed
Huge ecosystem support

Cost Effective

Open source with no per-node licensing
Efficient columnar storage (ClickHouse)
Better compression than Elasticsearch

Kubernetes Optimized

Auto-discovery of services
Pre-built dashboards for Kubernetes
Native support for container metrics

Pre-Installation Planning

Resource Requirements

Based on my experience (Linear issue COM-184) monitoring a 7-node cluster with multiple applications:

SigNoz Components (Actual Usage):
  ClickHouse: 2 CPU, 8GB RAM           # Database for metrics/traces/logs
  SigNoz (unified): 600m CPU, 1.5GB RAM # Query Service + Frontend + Alertmanager (v0.76+)
  OTEL Collector: 1 CPU, 2GB RAM       # Collects and processes telemetry

Total: ~4 CPU, 12GB RAM for monitoring 7 nodes
Additional: ~2 CPU, 4GB RAM total for DaemonSet collectors across all 7 nodes (~0.3 CPU, 0.6GB per node)

Day-2 Operations: Resource Monitoring

# Monitor SigNoz resource usage over time
kubectl top pods -n signoz --sort-by=memory
kubectl top pods -n signoz --sort-by=cpu

# Check storage usage (ClickHouse grows quickly)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
  clickhouse-client --query "SELECT
    database,
    formatReadableSize(sum(data_compressed_bytes)) as compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed
  FROM system.parts
  WHERE active
  GROUP BY database ORDER BY sum(data_compressed_bytes) DESC"

# Expected output showing storage by database:
# signoz_metrics    2.1 GiB    8.3 GiB
# signoz_traces     1.4 GiB    4.2 GiB
# signoz_logs       3.2 GiB    12.1 GiB

Storage Planning

Data Retention Planning:
  Metrics: 30 days @ 10GB/day = 300GB
    # Reason: Need month-over-month comparisons for capacity planning
  Traces: 7 days @ 5GB/day = 35GB
    # Reason: Traces are for immediate debugging, not long-term analysis
  Logs: 14 days @ 20GB/day = 280GB
    # Reason: Logs needed for security analysis and compliance

Total Storage: ~615GB (provision 1TB for growth)

Storage Performance Requirements:
  Read IOPS: 5000+ (queries scan historical data)
  Write IOPS: 2000+ (constant ingestion)
  Sequential Read: 200+ MB/s (large query responses)
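
The retention arithmetic above is easy to re-run when ingestion rates change. The following is a minimal sketch in Python that only restates the numbers from the plan; the names RETENTION_PLAN, storage_per_signal, and total_storage_gb are made up for illustration.

```python
# Retention plan from the section above: (days kept, GB ingested per day).
RETENTION_PLAN = {
    "metrics": (30, 10),
    "traces": (7, 5),
    "logs": (14, 20),
}


def storage_per_signal(plan):
    """Steady-state size per signal: retention days x daily volume."""
    return {signal: days * gb_per_day for signal, (days, gb_per_day) in plan.items()}


def total_storage_gb(plan):
    """Sum of the steady-state sizes across all signals."""
    return sum(storage_per_signal(plan).values())


if __name__ == "__main__":
    for signal, size in storage_per_signal(RETENTION_PLAN).items():
        print(f"{signal}: {size} GB")
    print(f"total: {total_storage_gb(RETENTION_PLAN)} GB")
```

Changing a retention period or a daily rate in the dictionary shows immediately how much of the provisioned volume the plan would consume.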

Installing SigNoz

Deploy with Helm

# signoz-values.yaml - Updated for SigNoz v0.76+ unified binary
global:
  storageClass: ceph-block-fast
  clusterName: homelab-cluster

clickhouse:
  enabled: true
  replicaCount: 1  # Single instance for homelab
  persistence:
    enabled: true
    size: 500Gi  # below the ~615GB retention estimate above; raise this if you keep the full retention plan
  resources:
    requests:
      cpu: 1000m
      memory: 4Gi
    limits:
      cpu: 4000m
      memory: 8Gi
  configuration:
    settings:
      # Performance tuning for homelab
      max_memory_usage: "6000000000"  # 6GB
      max_memory_usage_for_user: "5000000000"  # 5GB
      max_bytes_before_external_group_by: "3000000000"  # 3GB
      distributed_aggregation_memory_efficient: "1"
      # Compression settings (Operator format)
      compression/case/method: "zstd"

# Unified signoz service (v0.76+: Query Service + Frontend + Alertmanager)
signoz:
  enabled: true
  replicaCount: 2  # HA for queries
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 2Gi
  # Note: Retention is now configured in the UI (Settings → General)
  # New retention settings apply only to newly ingested data

  # Configuration via environment variables (unified binary)
  additionalEnvs:
    # Enable dot metrics support for OTel metric names (required for dashboard queries)
    - name: DOT_METRICS_ENABLED
      value: "true"
    # Alerting configuration
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_EXTERNAL__URL
      value: "https://signoz.homelab.example"
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_GLOBAL_SMTP__SMARTHOST
      value: "smtp.homelab.example:587"
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_GLOBAL_SMTP__FROM
      value: "[email protected]"
    # Webhook receiver for alerts
    - name: SIGNOZ_ALERTMANAGER_SIGNOZ_RECEIVERS__0__WEBHOOK__CONFIGS__0__URL
      value: "http://webhook-receiver.monitoring.svc.cluster.local/alerts"

otelCollector:
  enabled: true
  replicaCount: 1  # Must be 1 when k8s_cluster receiver is enabled (prevents duplicate metrics)
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  config:
    receivers:
      # Kubernetes metrics (MUST run on exactly one replica to prevent duplicates)
      k8s_cluster:
        collection_interval: 30s
        node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]

      # Note: Host metrics are collected by the DaemonSet, not here
      # This prevents duplicate collection of the same metrics

      # Prometheus receiver for scraping
      prometheus:
        config:
          scrape_configs:
            - job_name: kubernetes-pods
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  regex: true
                - action: replace
                  source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  target_label: __metrics_path__
                  regex: (.+)
                - action: replace
                  source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  # "$$" escapes the dollar sign so the Collector does not treat it as env var expansion
                  replacement: $${1}:$${2}
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)

    processors:
      # Add cluster metadata
      resource:
        attributes:
          - key: cluster.name
            value: homelab-cluster
            action: upsert
          - key: environment
            value: production
            action: upsert

      # Memory limiter to prevent OOM (must be the first processor in each pipeline)
      memory_limiter:
        check_interval: 1s
        limit_mib: 3072
        spike_limit_mib: 512

      # Batch for efficiency
      batch:
        timeout: 10s
        send_batch_size: 10000

SigNoz is a single binary since v0.76; this chart deploys one signoz deployment (UI+API+Alertmanager).
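
The __address__ relabel rule in the Prometheus receiver is the least obvious part of that config. Prometheus joins the source label values with ";", matches the regex against the whole joined string, and writes the two capture groups back as host:port. The sketch below mimics that behaviour in Python so the regex can be tried in isolation; rewrite_address is an illustrative helper, not part of any Collector API, and the sample addresses are invented.

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it at both ends,
# which is why fullmatch() is used below.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the __address__ rewrite for one discovered pod.

    address:          value of __address__ (host or host:port)
    port_annotation:  value of the prometheus.io/port annotation ("" if unset)
    """
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match (for example, no port annotation): the label is left untouched.
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.42.0.15:8080", "9102"))
    print(rewrite_address("10.42.0.15:8080", ""))
```

The optional non-capturing group is what discards the discovered container port, so the annotated port always wins when it is present.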

Install SigNoz:

# Create namespace
kubectl create namespace signoz

# Add Helm repository
helm repo add signoz https://charts.signoz.io
helm repo update

# Install SigNoz (get latest chart version)
SIGNOZ_CHART_VERSION=$(helm search repo signoz/signoz --versions -o json | jq -r '.[0].version')
helm show chart signoz/signoz --version "$SIGNOZ_CHART_VERSION" | grep appVersion
# appVersion: 0.94.x  ← verify before installing
# Note: v0.94.x requires ClickHouse 25.5.6 - verify your CH image/tag
helm install signoz signoz/signoz \
  --namespace signoz \
  --version "$SIGNOZ_CHART_VERSION" \
  --values signoz-values.yaml

# Wait for rollout
kubectl -n signoz rollout status deployment/signoz
kubectl -n signoz rollout status statefulset/chi-signoz-clickhouse-cluster-0-0

Configuring OpenTelemetry Collection

Deploy OTEL Collector DaemonSet

# otel-collector-daemonset.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: signoz
data:
  config.yaml: |
    receivers:
      # Collect host metrics (requires hostfs mount for accurate node metrics)
      # Note: Without hostfs mount + root_path, you'd get container stats instead of actual node stats
      hostmetrics:
        root_path: /hostfs
        collection_interval: 30s
        scrapers:
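          cpu:
            metrics:
              system.cpu.utilization:
                enabled: true  # optional metric, disabled by default; the CPU dashboards and alerts query it
          memory:
            metrics:
              system.memory.utilization:
                enabled: true  # optional metric, disabled by default; the memory dashboards and alerts query it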
          disk: {}
          network: {}
          filesystem: {}
          load: {}

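      # Kubernetes events (using k8sobjects - future-proof replacement for deprecated k8s_events)
      # Caution: in a DaemonSet every pod watches cluster-wide events, so each event is ingested once per node.
      # Move this receiver to the single-replica collector if duplicates matter.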
      k8sobjects:
        objects:
          - name: events
            mode: watch

      # Note: Container logs are handled by Fluent Bit → OTLP/HTTP on port 4318
      # This prevents duplicate log ingestion. Remove this comment if using OTel-only setup.

      # Collect Prometheus metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: kubelet
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
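              relabel_configs:
                # Keep only this pod's own node; without this every DaemonSet pod scrapes every node
                - action: keep
                  source_labels: [__meta_kubernetes_node_name]
                  regex: ${env:KUBE_NODE_NAME}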
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__

            # Container metrics via cAdvisor (for resource waste analysis)
            - job_name: kubelet-cadvisor
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
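              relabel_configs:
                # Keep only this pod's own node; without this every DaemonSet pod scrapes every node
                - action: keep
                  source_labels: [__meta_kubernetes_node_name]
                  regex: ${env:KUBE_NODE_NAME}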
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__

    processors:
      # Add Kubernetes attributes
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        filter:
          node_from_env_var: KUBE_NODE_NAME
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.statefulset.name
            - k8s.daemonset.name
            - k8s.cronjob.name
            - k8s.job.name
            - k8s.node.name
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.pod.start_time
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
          - sources:
              - from: resource_attribute
                name: k8s.pod.uid
          - sources:
              - from: connection

      # Resource detection
      resourcedetection/system:
        detectors: [system, env, docker]
        timeout: 2s
        override: false

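      # Memory limiting (must be the first processor in each pipeline)
      # Kept below the 1Gi container limit so the limiter acts before the kernel OOM-kills the pod
      memory_limiter:
        check_interval: 1s
        limit_mib: 800
        spike_limit_mib: 200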

      # Batch processing
      batch:
        timeout: 10s
        send_batch_size: 10000

    exporters:
      # Send to SigNoz
      otlp/signoz:
        endpoint: signoz-otel-collector.signoz.svc.cluster.local:4317
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s

    service:
      pipelines:
        metrics:
          receivers: [hostmetrics, prometheus]
          processors: [memory_limiter, k8sattributes, resourcedetection/system, batch]
          exporters: [otlp/signoz]

        logs:
          receivers: [k8sobjects]  # Using k8sobjects for events (filelog removed to prevent duplicate ingestion with Fluent Bit)
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/signoz]

      telemetry:
        logs:
          level: info
        metrics:
          level: detailed
          address: 0.0.0.0:8888
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: signoz
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.132.0
        command: ["/otelcol-contrib", "--config=/conf/config.yaml"]
        env:
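        - name: KUBE_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # Read by the "env" resource detector so host metrics carry k8s.node.name
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: k8s.node.name=$(KUBE_NODE_NAME)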
        volumeMounts:
        - name: config
          mountPath: /conf
        - name: hostfs
          mountPath: /hostfs
          readOnly: true
          mountPropagation: HostToContainer
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 1000m
            memory: 1Gi
      volumes:
      - name: config
        configMap:
          name: otel-collector-config
      - name: hostfs
        hostPath:
          path: /
      tolerations:
      - operator: Exists

Deploy the collector:

kubectl apply -f otel-collector-daemonset.yaml
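
Both collectors rely on memory_limiter, and its settings only help if they leave room under the container memory limit. The limiter starts refusing data at a soft limit of limit_mib minus spike_limit_mib and forces garbage collection at limit_mib. The sketch below is a small sanity check under that reading of the processor's documented behaviour; limiter_headroom is a hypothetical helper, and the sample values are the ones from the Helm collector above (4Gi container limit, 3072 MiB limiter, 512 MiB spike).

```python
def limiter_headroom(container_limit_mib, limit_mib, spike_limit_mib):
    """Relate memory_limiter settings to the container memory limit.

    Returns the soft limit (where the collector starts refusing data),
    the hard limit as a percentage of the container limit, and the
    memory left between the hard limit and an OOM kill.
    """
    return {
        "soft_limit_mib": limit_mib - spike_limit_mib,
        "hard_limit_pct": round(100 * limit_mib / container_limit_mib, 1),
        "headroom_mib": container_limit_mib - limit_mib,
    }


if __name__ == "__main__":
    # Values from the Helm-managed collector: 4Gi limit, 3072 MiB limiter, 512 MiB spike.
    print(limiter_headroom(4096, 3072, 512))
```

A headroom of zero means the kernel can kill the pod at the same point the limiter would have reacted, which defeats the purpose of the processor.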

Instrumenting Applications

Auto-Instrumentation with OTEL Operator

# otel-instrumentation.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: apps
spec:
  # Java auto-instrumentation
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://signoz-otel-collector.signoz.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: "service.namespace=apps,environment=production"

  # Python auto-instrumentation
  # (the operator's Python image exports OTLP over HTTP by default, so target port 4318)
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://signoz-otel-collector.signoz.svc.cluster.local:4318
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: http/protobuf

  # Node.js auto-instrumentation
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://signoz-otel-collector.signoz.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc

Apply to deployments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        # If Instrumentation CR is in different namespace, use: "apps/auto-instrumentation"
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
      - name: app
        image: my-app:latest
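
If a Deployment already exists, the same annotation can be added with a patch instead of editing the manifest. The sketch below builds the JSON merge patch in Python; injection_patch and SUPPORTED_LANGUAGES are illustrative names, and the language list simply mirrors the three runtimes configured in the Instrumentation resource above.

```python
import json

# Runtimes configured in the Instrumentation resource above.
SUPPORTED_LANGUAGES = {"java", "python", "nodejs"}


def injection_patch(language, instrumentation="true"):
    """Build a merge patch that adds the operator's inject annotation.

    instrumentation: "true" to use the Instrumentation in the pod's own
    namespace, or "<namespace>/<name>" to reference one elsewhere.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    key = f"instrumentation.opentelemetry.io/inject-{language}"
    return {"spec": {"template": {"metadata": {"annotations": {key: instrumentation}}}}}


if __name__ == "__main__":
    print(json.dumps(injection_patch("python")))
```

The printed JSON can be passed to kubectl patch deployment my-app --type merge -p '<json>'. Because the annotation lives on the pod template, applying it triggers a rollout, which is when the operator injects the agent.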

Creating Custom Dashboards

Infrastructure Dashboard

{
  "title": "Kubernetes Cluster Overview",
  "panels": [
    {
      "title": "Node CPU Usage",
      "query": "100 - (avg by (\"k8s.node.name\") ({\"system.cpu.utilization\",\"state\"=\"idle\"}) * 100)",
      "type": "timeseries"
    },
    {
      "title": "Node Memory Usage",
      "query": "avg by (\"k8s.node.name\") ({\"system.memory.utilization\",\"state\"=\"used\"}) * 100",
      "type": "gauge"
    },
    {
      "title": "Running Pods by Namespace",
      "query": "count by (\"k8s.namespace.name\") ({\"k8s.pod.phase\"} == 2)",
      "type": "bar",
      "note": "k8s.pod.phase is a numeric gauge (1 Pending, 2 Running, 3 Succeeded, 4 Failed, 5 Unknown)"
    },
    {
      "title": "Ready Nodes",
      "query": "sum ({\"k8s.node.condition_ready\"} == 1)",
      "type": "stat",
      "note": "k8s.node.condition_ready is 1 when the node is Ready, 0 when it is not, -1 when unknown"
    }
  ],
  "note": "For metrics like kube_pod_info and kubelet_volume_stats_*, you need kube-state-metrics and additional kubelet scrape configs"
}

Application Performance Dashboard

{
  "title": "Application Performance",
  "panels": [
    {
      "title": "Request Rate",
      "query": "sum by (\"service.name\") (rate({\"http.server.request.duration.count\"}[5m]))",
      "type": "timeseries"
    },
    {
      "title": "Error Rate",
      "query": "sum by (\"service.name\") (rate({\"http.server.request.duration.count\",\"http.response.status_code\"=~\"5..\"}[5m]))",
      "type": "timeseries"
    },
    {
      "title": "P95 Latency (seconds)",
      "query": "histogram_quantile(0.95, sum by (le,\"service.name\") (rate({\"http.server.request.duration.bucket\"}[5m])))",
      "type": "timeseries",
      "note": "Uses histogram buckets for quantile calculation, unit is seconds"
    },
    {
      "title": "Service Health",
      "query": "ClickHouse SQL: SELECT service_name, count() FROM signoz_traces.distributed_signoz_index_v3 WHERE timestamp >= now() - INTERVAL 1 HOUR GROUP BY service_name",
      "type": "table",
      "note": "Use Trace Explorer UI or ClickHouse SQL for trace analysis"
    }
  ]
}
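
The P95 panel depends on how histogram_quantile turns cumulative buckets into a single number. The sketch below reproduces the Prometheus-style linear interpolation in Python, on the assumption that SigNoz's PromQL engine follows the same semantics; the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs with at least one
    finite bound plus the (math.inf, total) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls past the last finite bound; report that bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request durations in seconds: 50 under 0.1s, 90 under 0.5s, 98 under 1s, 100 total.
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, SAMPLE_BUCKETS))
```

The result is an estimate bounded by the bucket edges, so latency thresholds are only as precise as the bucket boundaries the instrumentation exports.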

Log Aggregation with Fluent Bit

Deploy Fluent Bit

# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: signoz
data:
  fluent-bit.conf: |
    [SERVICE]
        Daemon Off
        Flush 5
        Log_Level info
        Parsers_File parsers.conf

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        Refresh_Interval 10

    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix kube.var.log.containers.
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Labels Off
        Annotations Off

    [FILTER]
        Name modify
        Match kube.*
        Add cluster.name homelab-cluster
        Add environment production

    [OUTPUT]
        Name           opentelemetry
        Match          kube.*
        Host           signoz-otel-collector.signoz.svc.cluster.local
        Port           4318
        Logs_uri       /v1/logs
        Log_response_payload  True

  parsers.conf: |
    [PARSER]
        Name docker
        Format json
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ
        Decode_Field_As escaped_utf8 log
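
The kubernetes filter can enrich records because the tail input encodes the log file path into the tag, and Kube_Tag_Prefix tells the filter which part to strip before parsing pod, namespace, and container names. The sketch below shows the idea in Python with a simplified pattern; it is not Fluent Bit's actual regex, and the sample tag is invented.

```python
import re

TAG_PREFIX = "kube.var.log.containers."

# Simplified form of the file name layout:
# <pod>_<namespace>_<container>-<64 hex char container id>.log
TAG_RE = re.compile(
    r"(?P<pod_name>[^_]+)_(?P<namespace_name>[^_]+)_"
    r"(?P<container_name>.+)-(?P<container_id>[a-f0-9]{64})\.log"
)


def parse_tag(tag):
    """Return pod metadata encoded in a tail tag, or None if it does not fit."""
    if not tag.startswith(TAG_PREFIX):
        return None
    match = TAG_RE.fullmatch(tag[len(TAG_PREFIX):])
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = TAG_PREFIX + "frontend-7d9c_otel-demo_frontend-" + "ab12" * 16 + ".log"
    print(parse_tag(sample))
```

If the Tag or Path in the input changes, Kube_Tag_Prefix has to change with it, otherwise the filter cannot find the pod and records arrive without Kubernetes metadata.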

Alert Rules

Critical Alerts

# alert-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-rules
  namespace: signoz
data:
  rules.yaml: |
    groups:
      - name: cluster-health
        interval: 30s
        rules:
          - alert: NodeNotReady
            expr: 'sum({"k8s.node.condition_ready"} == 0) > 0'
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "One or more nodes are NotReady"

          - alert: HighMemoryUsage
            expr: 'avg by ("k8s.node.name") ({"system.memory.utilization","state"="used"}) > 0.85'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Node {{ index $labels "k8s.node.name" }} memory usage above 85%'

          - alert: PodCrashLooping
            expr: 'increase({"k8s.container.restarts"}[5m]) > 0'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Container {{ index $labels "k8s.container.name" }} in pod {{ index $labels "k8s.namespace.name" }}/{{ index $labels "k8s.pod.name" }} is restarting'

          - alert: HighCPUUsage
            expr: 'avg by ("k8s.node.name") (100 - {"system.cpu.utilization","state"="idle"} * 100) > 80'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Node {{ index $labels "k8s.node.name" }} CPU usage above 80%'

      - name: application-health
        interval: 30s
        rules:
          - alert: HighErrorRate
            expr: |
              (
                sum by ("service.name") (rate({"http.server.request.duration.count","http.response.status_code"=~"5.."}[5m]))
              /
                clamp_min(sum by ("service.name") (rate({"http.server.request.duration.count"}[5m])), 1)
              ) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: 'Service {{ index $labels "service.name" }} error rate above 5%'

          - alert: HighLatency
            expr: 'histogram_quantile(0.95, sum by (le,"service.name") (rate({"http.server.request.duration.bucket"}[5m]))) > 1'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Service {{ index $labels "service.name" }} P95 latency above 1s'
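
The HighErrorRate rule divides by clamp_min(total, 1), which has a side effect worth knowing: any service handling less than one request per second is measured against a denominator of 1. The sketch below shows the arithmetic in Python; error_ratio and fires are illustrative helpers and the request rates are invented.

```python
def error_ratio(error_rate, total_rate, min_denominator=1.0):
    """error_rate / clamp_min(total_rate, min_denominator), as in HighErrorRate."""
    return error_rate / max(total_rate, min_denominator)


def fires(error_rate, total_rate, threshold=0.05):
    """True when the clamped ratio crosses the alert threshold."""
    return error_ratio(error_rate, total_rate) > threshold


if __name__ == "__main__":
    # Busy service: 6 errors/s out of 100 requests/s.
    print(fires(6.0, 100.0))
    # Quiet service: 0.02 errors/s out of 0.2 requests/s (a true ratio of 10%).
    print(error_ratio(0.02, 0.2), fires(0.02, 0.2))
```

The clamp keeps idle services from paging on a single failed request, at the cost of hiding real failure ratios on low-traffic services. Lowering the clamp value trades one for the other.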

Resource Optimization Discovery

Here's how I discovered the 31 wasted CPU cores:

# Query to find over-provisioned resources (requires kube-state-metrics)
sum by (namespace, pod) (
  sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
-
  sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
) > 0.5

# Results showed:
# Tetragon: Requested 14 CPU, using 0.07 CPU
# Fluent-bit: Requested 3 CPU, using 0.018 CPU
# Rook-Ceph: Requested 15 CPU, using 0.2 CPU
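
The gap between requested and used CPU is the quantity the query filters on. The sketch below redoes that subtraction in Python for the three workloads listed above; WORKLOADS and cpu_waste are illustrative names, and the numbers are copied from the results.

```python
# (requested cores, used cores) from the query results above.
WORKLOADS = {
    "Tetragon": (14, 0.07),
    "Fluent-bit": (3, 0.018),
    "Rook-Ceph": (15, 0.2),
}


def cpu_waste(workloads, threshold=0.5):
    """Requested minus used cores, keeping only gaps above the threshold.

    The threshold mirrors the "> 0.5" filter at the end of the query.
    """
    return {
        name: round(requested - used, 3)
        for name, (requested, used) in workloads.items()
        if requested - used > threshold
    }


if __name__ == "__main__":
    waste = cpu_waste(WORKLOADS)
    for name, cores in waste.items():
        print(f"{name}: {cores} cores idle")
    print(f"total: {round(sum(waste.values()), 3)} cores")
```

These three workloads alone account for just under 32 requested-but-idle cores, which is where the headline figure comes from.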

Performance Impact

After proper observability implementation:

Before Observability:
  Resource waste: 31 CPU cores, 63GB RAM
  MTTR: 2-4 hours
  Debugging: Random kubectl commands
  Cost: ~$200/month in wasted resources

After Observability:
  Resource optimization: Reclaimed 25 CPU, 45GB RAM
  MTTR: 15-30 minutes
  Debugging: Direct trace to root cause
  Cost: Saved ~$150/month

Troubleshooting

High Cardinality Issues

# Check cardinality
curl http://signoz-clickhouse:8123/ -d "
  SELECT
    count(DISTINCT labels) as cardinality
  FROM signoz_metrics.distributed_time_series_v4
  WHERE metric_name = 'problematic_metric'
"

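# Fix by dropping the offending metric in the Collector
# (histograms are a single metric inside the Collector, so a "_bucket" pattern matches nothing there)
processors:
  filter/drop_high_cardinality:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^problematic_metric$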
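
Cardinality here means the number of distinct label combinations a metric has, since each combination is stored as its own time series. The sketch below illustrates the count and its worst-case growth in Python; the function names and label values are invented for illustration.

```python
def series_cardinality(samples):
    """Number of distinct label sets (time series) seen for one metric."""
    return len({frozenset(labels.items()) for labels in samples})


def worst_case_cardinality(label_values):
    """Upper bound on series: the product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total


if __name__ == "__main__":
    samples = [
        {"pod": "a", "path": "/cart"},
        {"pod": "a", "path": "/cart"},
        {"pod": "b", "path": "/cart"},
    ]
    print(series_cardinality(samples))
    # 50 pods x 200 distinct request paths.
    print(worst_case_cardinality({"pod": range(50), "path": range(200)}))
```

Because the bound is a product, one unbounded label such as a raw URL path or user ID multiplies every other label, which is usually what turns a harmless metric into a problematic one.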

Storage Growth

# Check storage usage
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
  clickhouse-client --query "SELECT
    database,
    table,
    formatReadableSize(sum(bytes)) as size
  FROM system.parts
  WHERE active
  GROUP BY database, table
  ORDER BY sum(bytes) DESC"

# Adjust retention (check table version first)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
  clickhouse-client --query "SHOW TABLES FROM signoz_traces LIKE 'signoz_index_%'"

# Set TTL on local table on all shards (not distributed table)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
  clickhouse-client --query "
    ALTER TABLE signoz_traces.signoz_index_v3
    ON CLUSTER '{cluster}'
    MODIFY TTL toDateTime(timestamp) + INTERVAL 3 DAY"

# Optionally apply TTL to existing parts (resource-intensive operation)
kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- \
  clickhouse-client --query "
    ALTER TABLE signoz_traces.signoz_index_v3
    ON CLUSTER '{cluster}'
    MATERIALIZE TTL"

Security Considerations

RBAC for OpenTelemetry Collector

apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: signoz
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-read
rules:
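  - apiGroups: [""]
    # nodes/proxy is required for the kubelet and cAdvisor scrapes that go through the API server proxy
    resources: ["pods", "namespaces", "nodes", "nodes/proxy", "events", "endpoints", "services"]
    verbs: ["get", "list", "watch"]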
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "daemonsets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-read-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-read
subjects:
- kind: ServiceAccount
  name: otel-collector
  namespace: signoz

60-Second Smoke Test

Verify your observability stack is working end-to-end:

1. Generate Test Signals

Deploy the OpenTelemetry demo application:

# Deploy otel-demo for signal generation
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-demo open-telemetry/opentelemetry-demo \
  --namespace otel-demo \
  --create-namespace \
  --set opentelemetry-collector.enabled=false \
  --set 'default.envOverrides[0].name=OTEL_EXPORTER_OTLP_ENDPOINT' \
  --set 'default.envOverrides[0].value=http://signoz-otel-collector.signoz.svc.cluster.local:4317' \
  --set 'default.envOverrides[1].name=OTEL_EXPORTER_OTLP_PROTOCOL' \
  --set 'default.envOverrides[1].value=grpc'

# Wait for deployment (frontend-proxy is the correct service name in the chart)
kubectl -n otel-demo rollout status deployment/otel-demo-frontend-proxy

# Generate some load
kubectl -n otel-demo port-forward svc/otel-demo-frontend-proxy 8080:8080 &
PF_PID=$!
sleep 3  # give the port-forward time to start listening
curl -s http://localhost:8080 > /dev/null
curl -s http://localhost:8080/product/OLJCESPC7Z > /dev/null
curl -s http://localhost:8080/cart > /dev/null
kill "$PF_PID"  # stop the background port-forward

2. Verify Metrics (PromQL)

Check request rate per service in SigNoz UI:

sum by ("service.name") (rate({"http.server.request.duration.count"}[5m]))

Check for errors:

sum by ("service.name") (
  rate({"http.server.request.duration.count","http.response.status_code"=~"5.."}[5m])
)

3. Verify Traces (ClickHouse SQL)

Services seen in the last hour:

SELECT service_name, count()
FROM signoz_traces.distributed_signoz_index_v3
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service_name
ORDER BY count() DESC;

4. Verify Logs

Search for "otel-demo" in the SigNoz Logs UI. You should see container logs from the demo application flowing through Fluent Bit → OTLP.

5. Test Alerts

Trigger a test alert:

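# Create CPU load to exercise the HighCPUUsage rule
# (one busy loop saturates a single core; run several copies if that does not push a node above 80%)
kubectl run cpu-load --image=busybox --restart=Never -- sh -c "while true; do :; done"
# The rule uses "for: 5m", so leave the pod running for more than 5 minutes before expecting the alert
kubectl delete pod cpu-load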

Expected Results:

Metrics: >0 QPS for otel-demo services
Traces: Multiple service names visible
Logs: Container logs from otel-demo namespace
Alerts: CPU alert fires and resolves

What's Next

With complete observability in place, you have the visibility needed to optimize and secure your cluster. In Part 8, we'll implement comprehensive security hardening with Kyverno policies, Tetragon runtime protection, and disaster recovery with Velero.

Key Takeaways

Observability reveals hidden resource waste - I found 31 wasted CPU cores and reclaimed 25 of them
OpenTelemetry provides vendor-neutral instrumentation - No lock-in
Unified platform beats multiple tools - Correlating is easier
Cardinality management is critical - High cardinality kills performance
Proper dashboards reduce MTTR - From hours to minutes

References

SigNoz Documentation: https://signoz.io/docs/
OpenTelemetry Documentation: https://opentelemetry.io/docs/
ClickHouse Operations: https://clickhouse.com/docs/en/operations/
Fluent Bit Documentation: https://docs.fluentbit.io/manual/
Kubernetes Metrics: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/

Continue to Part 8: Security Hardening and Production Readiness →