11 – Expert Tips, Best Practices & Troubleshooting

Learning Objectives

  • Apply instrumentation and naming best practices.
  • Diagnose common Collector → Tempo issues.
  • Reduce noise and control costs while preserving signal.

Instrumentation Best Practices

  • Name spans consistently: service.operation (e.g., orders.create).
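  • Add semantic attributes: http.method, http.status_code, db.system, db.statement (sanitized), peer.service. Newer semantic-convention releases rename several of these (for example http.request.method, http.response.status_code, db.query.text), so match the names to the convention version your SDKs emit.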
  • Propagate context through message queues by carrying the W3C Trace Context headers (traceparent, tracestate) in message metadata.
  • Limit span events; prefer concise, meaningful checkpoints.
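
To make the message-queue point concrete, here is a minimal sketch of writing and reading a W3C traceparent value in message metadata. It is plain Python with no SDK; the function names and the metadata dict are hypothetical, and it only handles the version 00 layout (version-traceid-parentid-flags).

```python
import re

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def inject_traceparent(metadata, trace_id, span_id, sampled=True):
    """Return a copy of the message metadata with a traceparent entry added."""
    flags = "01" if sampled else "00"
    out = dict(metadata)
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def extract_traceparent(metadata):
    """Return (trace_id, parent_span_id, sampled), or None if absent or malformed."""
    match = _TRACEPARENT_RE.fullmatch(metadata.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the Trace Context format.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Producer side: attach the current span's identity to the outgoing message.
headers = inject_traceparent(
    {"content-type": "application/json"},
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
# Consumer side: recover the parent so the consumer span joins the same trace.
parent = extract_traceparent(headers)
```

In a real service you would normally let the OpenTelemetry propagator do this for you (in Python, the inject and extract helpers in opentelemetry.propagate) rather than formatting the header by hand.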

Common Issues (Collector → Tempo)

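  • OTLP endpoint mismatch: the exporter protocol must match the receiver port (by default 4317 for gRPC, 4318 for HTTP).
  • TLS settings or auth headers misconfigured (for example, plaintext sent to a TLS listener, or a missing authorization or tenant header).
  • Backpressure/drops: raise queue sizes, enable retries, and tune batch sizes so bursts are buffered rather than dropped.
  • Time skew: ensure NTP is healthy across nodes; skewed clocks make spans from different hosts appear out of order.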
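
The endpoint mismatch is easy to catch before deployment. The sketch below is a heuristic check only: check_otlp_endpoint is a hypothetical helper, and it assumes the conventional default ports, which any deployment may override.

```python
from urllib.parse import urlparse

# Conventional OTLP defaults; a deployment may use different ports.
DEFAULT_PORTS = {"grpc": 4317, "http": 4318}


def check_otlp_endpoint(endpoint, protocol):
    """Return warnings for a likely port/protocol mismatch (empty list if none)."""
    if protocol not in DEFAULT_PORTS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    # urlparse only fills in host and port when a "//" separator is present.
    parsed = urlparse(endpoint if "://" in endpoint else f"//{endpoint}")
    other = "http" if protocol == "grpc" else "grpc"
    warnings = []
    if parsed.port == DEFAULT_PORTS[other]:
        warnings.append(
            f"port {parsed.port} is the default for OTLP/{other}, "
            f"but the exporter is configured for {protocol}"
        )
    if protocol == "grpc" and parsed.path not in ("", "/"):
        warnings.append("gRPC endpoints normally carry no URL path such as /v1/traces")
    return warnings


for problem in check_otlp_endpoint("http://collector:4317/v1/traces", "http"):
    print("WARN:", problem)
```

A check like this fits well in CI or a pre-deploy hook, since a wrong port usually surfaces at runtime only as connection resets or opaque protocol errors.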

Reducing Trace Noise

  • Sample by trace ID ratio at the edge; use tail-based sampling to keep errors and outliers.
  • Drop health checks and static asset requests.
  • Use feature flags to temporarily enable deep spans in hot code paths.
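
As an illustration of the first two points, the sketch below shows a deterministic trace-ID ratio decision and a simple drop filter. It is similar in spirit to ratio-based samplers in the OpenTelemetry SDKs but is not their implementation; the path prefixes, attribute keys checked, and function names are assumptions.

```python
# Hypothetical paths to treat as noise; adjust to your services.
DROP_PATH_PREFIXES = ("/healthz", "/readyz", "/static/")


def should_drop(span_attributes):
    """True for health-check and static-asset requests, judged by URL path."""
    path = span_attributes.get("url.path") or span_attributes.get("http.target") or ""
    return path.startswith(DROP_PATH_PREFIXES)


def ratio_sampled(trace_id_hex, ratio):
    """Deterministic head decision: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Compare the low 64 bits of the trace ID against a threshold.
    low64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low64 < round(ratio * 2**64)


attrs = {"url.path": "/orders/42"}
if not should_drop(attrs) and ratio_sampled("4bf92f3577b34da6a3ce929d0e0e4736", 0.25):
    pass  # record and export the trace
```

Deriving the decision from the trace ID, rather than from a random draw per service, means every service that sees the same trace reaches the same verdict, so traces are kept or dropped whole.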

Cost Optimization

  • Tune retention by environment and tenant.
  • Compact aggressively off-peak; consider cold storage archiving.
  • Avoid high-cardinality attributes that don’t aid debugging.
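
One way to find high-cardinality attributes is to count distinct values per key over a sample of spans. This is a rough sketch: attribute_cardinality and its threshold are hypothetical, and the input is assumed to be an iterable of plain attribute dicts exported from your backend.

```python
from collections import defaultdict


def attribute_cardinality(spans, threshold=100):
    """Return {attribute_key: distinct_value_count} for keys above the threshold."""
    seen = defaultdict(set)
    for attrs in spans:
        for key, value in attrs.items():
            # Stringify so unhashable values such as lists can still be counted.
            seen[key].add(str(value))
    return {key: len(values) for key, values in seen.items() if len(values) > threshold}


sample = [{"http.request.method": "GET", "user.id": str(i)} for i in range(500)]
print(attribute_cardinality(sample))  # flags user.id, not http.request.method
```

A flagged key is not automatically a problem; the question from the list above still applies, namely whether the attribute actually helps someone debug.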

Checklist (Ops Readiness)

  • Consistent resource attributes across services (service.name, service.namespace, deployment.environment; newer semantic conventions name the last one deployment.environment.name).
  • Tail-based sampling rules defined and tested.
  • Grafana dashboards and alerts with exemplars linking to Tempo.
  • Runbooks for ingestion failures and slow queries.

Hands-on Lab

  1. Intentionally misconfigure the exporter endpoint; identify and fix the issue.
  2. Add a sampling rule: keep 100% of error traces and 10% of successful traces.

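Before writing the rule for step 2, it can help to state the decision logic precisely. The sketch below models it in plain Python; it is not Collector configuration, and the span dict shape (trace_id, status) and the keep_trace name are assumptions made for illustration.

```python
def keep_trace(spans, success_ratio=0.10):
    """Tail decision made once all spans of a trace have arrived."""
    if not spans:
        return False
    # Rule 1: any span with an error status keeps the whole trace.
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Rule 2: keep a deterministic fraction of the remaining traces.
    low64 = int(spans[0]["trace_id"], 16) & 0xFFFFFFFFFFFFFFFF
    return low64 < round(success_ratio * 2**64)


failed = [
    {"trace_id": "f" * 32, "status": "OK"},
    {"trace_id": "f" * 32, "status": "ERROR"},
]
assert keep_trace(failed)  # error traces are always kept
```

In the Collector, the equivalent is usually expressed with the tail_sampling processor by combining a status_code policy with a probabilistic policy; check the processor documentation for the exact fields in your version.
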
Quiz (Self-check)

  • When should you choose tail-based sampling over head-based?
  • How do you correlate metrics panels to specific traces?

Resources

  • OTel Semantic Conventions
  • Grafana Incident Response templates (community)

Visual: Troubleshooting Flow