11 – Expert Tips, Best Practices & Troubleshooting
Learning Objectives
- Apply instrumentation and naming best practices.
- Diagnose common Collector → Tempo issues.
- Reduce noise and control costs while preserving signal.
Instrumentation Best Practices
- Name spans consistently: service.operation (e.g., orders.create).
- Add semantic attributes: http.method, http.status_code, db.system, db.statement (sanitized), peer.service.
- Propagate context through message queues by carrying the W3C Trace Context headers (traceparent, tracestate) in the message metadata.
- Limit span events; prefer concise, meaningful checkpoints.
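To make the message-queue bullet concrete, here is a minimal, stdlib-only sketch of writing and reading a W3C traceparent value in message metadata. The helper names (inject, extract, new_span_id) and the metadata dictionary are illustrative assumptions, not part of any library; in a real service the OpenTelemetry SDK's propagators would normally handle this.

```python
import re
import secrets

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_span_id() -> str:
    """Return a random 16-hex-character span ID (hypothetical helper)."""
    return secrets.token_hex(8)


def inject(metadata: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Producer side: return a copy of the metadata with a traceparent entry."""
    flags = "01" if sampled else "00"
    out = dict(metadata)
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def extract(metadata: dict):
    """Consumer side: return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = _TRACEPARENT.match(metadata.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


if __name__ == "__main__":
    trace_id = secrets.token_hex(16)
    message_metadata = inject({"content-type": "application/json"}, trace_id, new_span_id())
    context = extract(message_metadata)
    # The consumer starts its span (e.g., orders.create) as a child of the producer span.
    print(context)
```

The consumer uses the extracted trace ID and parent span ID when it starts its own span, so both sides of the queue end up in the same trace.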
Common Issues (Collector → Tempo)
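- OTLP endpoint mismatch: the exporter targets port 4317 (gRPC) while the receiver expects 4318 (HTTP), or vice versa.
- TLS settings or auth headers misconfigured (e.g., a plaintext exporter pointed at a TLS-only endpoint, or a missing tenant header).
- Backpressure/drops: increase the exporter queue size, enable retries, and tune batch sizes.
- Time skew: ensure NTP is healthy across nodes; skewed clocks can make child spans appear to start before their parents.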
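The endpoint mismatch in the first bullet is easy to catch before deployment. The sketch below is a hypothetical pre-flight check, not a Collector or Tempo feature: the function name check_otlp_endpoint and the warning wording are assumptions, and it relies only on the conventional default ports (4317 for gRPC, 4318 for HTTP).

```python
from urllib.parse import urlparse

# Conventional OTLP default ports.
DEFAULT_PORTS = {"grpc": 4317, "http": 4318}


def check_otlp_endpoint(endpoint: str, protocol: str) -> list[str]:
    """Return warnings for a likely endpoint/protocol mismatch (empty list if none)."""
    if protocol not in DEFAULT_PORTS:
        return [f"unknown protocol {protocol!r}; expected 'grpc' or 'http'"]

    # urlparse only recognises host:port when a scheme or '//' is present.
    parsed = urlparse(endpoint if "//" in endpoint else f"//{endpoint}")
    other = "http" if protocol == "grpc" else "grpc"
    warnings = []

    if parsed.port == DEFAULT_PORTS[other]:
        warnings.append(
            f"port {parsed.port} is the default for OTLP/{other}, "
            f"but the exporter is configured for {protocol}"
        )
    if protocol == "grpc" and parsed.path not in ("", "/"):
        warnings.append("gRPC endpoints do not take a URL path such as /v1/traces")
    return warnings


if __name__ == "__main__":
    for warning in check_otlp_endpoint("tempo:4318", "grpc"):
        print("WARN:", warning)
```

A check like this only flags the conventional defaults; deployments that remap ports deliberately would need the table adjusted.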
Reducing Trace Noise
- Sample by trace ID ratio at the edge; use tail-based sampling to keep errors and outliers.
- Drop health checks and static asset requests.
- Use feature flags to temporarily enable deep spans in hot code paths.
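As a rough illustration of the first two bullets, the sketch below combines a deterministic trace-ID ratio decision with a filter for health checks and static assets. The path list, suffix list, and function names are assumptions chosen for the example; OpenTelemetry SDKs ship their own ratio-based samplers, which work in a similar spirit but are not reproduced here.

```python
# Hypothetical noise definitions; adjust to the routes a service really exposes.
HEALTH_PATHS = ("/health", "/healthz", "/ready")
STATIC_SUFFIXES = (".css", ".js", ".png", ".ico")


def is_noise(path: str) -> bool:
    """True for health checks and static asset requests."""
    return path in HEALTH_PATHS or path.endswith(STATIC_SUFFIXES)


def ratio_sampled(trace_id: str, ratio: float) -> bool:
    """Deterministic decision: the same trace ID gives the same answer on every service."""
    # Compare the low 64 bits of the 32-hex-character trace ID with a threshold.
    threshold = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < threshold


def keep_at_edge(trace_id: str, path: str, ratio: float = 0.10) -> bool:
    """Head-based decision made when the request enters the system."""
    return not is_noise(path) and ratio_sampled(trace_id, ratio)


if __name__ == "__main__":
    print(keep_at_edge("0" * 32, "/orders"))    # low IDs fall under the threshold
    print(keep_at_edge("0" * 32, "/healthz"))   # dropped as noise regardless of ID
```

Because the decision depends only on the trace ID, every service that applies the same ratio keeps or drops the same traces, which avoids partial traces.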
Cost Optimization
- Tune retention by environment and tenant.
- Compact aggressively off-peak; consider cold storage archiving.
- Avoid high-cardinality attributes that don’t aid debugging.
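One way to find the high-cardinality attributes mentioned in the last bullet is to count distinct values per attribute key over a sample of spans. This is a minimal sketch that assumes spans are available as plain dictionaries with an "attributes" mapping; that shape, the function name, and the threshold are illustrative choices, not an export format of any tool.

```python
from collections import defaultdict


def high_cardinality_keys(spans: list[dict], max_distinct: int = 100) -> dict[str, int]:
    """Return {attribute_key: distinct_value_count} for keys above max_distinct."""
    values = defaultdict(set)
    for span in spans:
        for key, value in span.get("attributes", {}).items():
            values[key].add(str(value))
    return {key: len(seen) for key, seen in values.items() if len(seen) > max_distinct}


if __name__ == "__main__":
    sample = [
        {"attributes": {"http.method": "GET", "request.id": str(i)}}
        for i in range(500)
    ]
    # request.id takes a new value on every span; http.method does not.
    print(high_cardinality_keys(sample))
```

Keys that show up in the result are candidates to drop or to replace with a bounded value, unless they are genuinely needed for debugging.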
Checklist (Ops Readiness)
Hands-on Lab
- Intentionally misconfigure the exporter endpoint; identify and fix the issue.
- Add a sampling rule: keep 100% of error traces and 10% of successful traces.
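As one possible starting point for the sampling exercise, the rule can be written as a plain function that runs once a trace is complete. The span dictionaries, the "status" and "trace_id" field names, and the function name are assumptions for illustration; in a Collector deployment the same rule would usually be expressed as tail-sampling configuration rather than code.

```python
def keep_trace(spans: list[dict], success_ratio: float = 0.10) -> bool:
    """Tail-based decision: keep every trace with an error, plus a fixed share of the rest."""
    if not spans:
        return False
    if any(span.get("status") == "error" for span in spans):
        return True
    # Deterministic share of successful traces, keyed on the low 64 bits of the trace ID.
    trace_id = spans[0]["trace_id"]
    return int(trace_id[-16:], 16) < int(success_ratio * (1 << 64))


if __name__ == "__main__":
    failed = [
        {"trace_id": "f" * 32, "status": "ok"},
        {"trace_id": "f" * 32, "status": "error"},
    ]
    print(keep_trace(failed))  # kept because one span reported an error
```

The decision needs all spans of the trace, which is why it cannot be made at the edge when the request first arrives.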
Quiz (Self-check)
- When should you choose tail-based sampling over head-based?
- How do you correlate metrics panels to specific traces?
Resources
- OTel Semantic Conventions
- Grafana Incident Response templates (community)
Visual: Troubleshooting Flow