OpenTelemetry for GenAI Observability Trial

Overview

OpenTelemetry for GenAI observability applies vendor-neutral telemetry to model calls, agent steps, tool invocations, token usage, exceptions, latency, and system behavior. The OpenTelemetry GenAI semantic conventions define events for GenAI inputs and outputs, along with exceptions, operation metrics, model-operation spans, and agent-operation spans (OpenTelemetry).

This matters because AI incidents rarely live only inside the LLM layer. Teams need to correlate prompts, retrieval, tool calls, downstream services, infrastructure metrics, and application errors in the same observability workflow.

Keep this in Adopt for production AI services. Specialized LLMOps tools remain useful for evaluation and debugging, but OpenTelemetry should be the portability layer that connects AI traces to existing observability pipelines.

Adoption Signals

  • OpenTelemetry has dedicated GenAI semantic conventions for events, exceptions, metrics, model spans, and agent spans (OpenTelemetry).
  • The conventions include technology-specific guidance for Anthropic, Azure AI Inference, AWS Bedrock, OpenAI, and Model Context Protocol, which reflects broad provider coverage (OpenTelemetry).
  • LangSmith offers native tracing for popular agent frameworks, supports OpenTelemetry, and can both send its trace data to existing tools and ingest OTel data from them (LangSmith).
  • Langfuse states that its tracing is based on OpenTelemetry to increase compatibility and reduce vendor lock-in, and supports traces across LLM calls, retrieval, embedding, API calls, sessions, and agent graphs (Langfuse).
  • Production dashboards increasingly track token usage, latency, error rates, cost breakdowns, feedback scores, and online evals as first-class AI operations metrics (LangSmith, Langfuse).
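
As a rough illustration of how the dashboard metrics in the last bullet could be derived from per-call telemetry, the sketch below aggregates a handful of call records into an error rate, token totals, a median latency, and an estimated cost. The `ModelCall` record, the `summarize` helper, the model name, and the prices are all hypothetical; none of them come from LangSmith, Langfuse, or any vendor price list.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ModelCall:
    """One model invocation as it might be reconstructed from a span (hypothetical shape)."""
    model: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    failed: bool = False


# Hypothetical prices in USD per 1,000 tokens as (input, output); not real vendor rates.
PRICE_PER_1K = {"model-a": (0.50, 1.50)}


def summarize(calls):
    """Aggregate call records into the metrics a dashboard would typically chart."""
    cost = 0.0
    for call in calls:
        price_in, price_out = PRICE_PER_1K[call.model]
        cost += call.input_tokens / 1000 * price_in
        cost += call.output_tokens / 1000 * price_out
    return {
        "calls": len(calls),
        "error_rate": sum(1 for c in calls if c.failed) / len(calls),
        "input_tokens": sum(c.input_tokens for c in calls),
        "output_tokens": sum(c.output_tokens for c in calls),
        "median_latency_ms": median(c.latency_ms for c in calls),
        "estimated_cost_usd": cost,
    }


calls = [
    ModelCall("model-a", latency_ms=100, input_tokens=1000, output_tokens=2000),
    ModelCall("model-a", latency_ms=300, input_tokens=3000, output_tokens=0, failed=True),
]
summary = summarize(calls)
```

In practice these aggregates would more likely be computed by the observability backend from exported spans and metrics; the point here is only that each dashboard figure reduces to simple arithmetic over per-call fields.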

Risks

The GenAI conventions are still marked Development, so teams should expect changes to names, attributes, and stability levels as the standard matures (OpenTelemetry).

Prompt and response telemetry can contain sensitive data. Instrumentation must include redaction, sampling, retention policies, access controls, and clear rules for whether full content, hashes, metadata, or externalized payload references are stored.
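
One way to make the storage rule explicit is a single helper that decides what form of the content reaches a span. The sketch below is a minimal illustration under stated assumptions: the `app.content.*` attribute names, the `content_attributes` helper, the mode names, and the email-only redaction pattern are all hypothetical, and a real deployment would need a much broader redaction policy.

```python
import hashlib
import re

# Deliberately narrow pattern for illustration; real redaction needs more than emails.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def content_attributes(text, mode="metadata", payload_ref=None):
    """Return span attributes for `text` under one storage mode.

    Attribute names under `app.content.*` are hypothetical, not part of any standard.
    """
    if mode == "metadata":
        # Safest default: record only the size, never the content.
        return {"app.content.length": len(text)}
    if mode == "hash":
        # Lets identical prompts be grouped without storing them.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return {"app.content.sha256": digest}
    if mode == "reference":
        # Content lives in a separately access-controlled store.
        if payload_ref is None:
            raise ValueError("reference mode needs a payload_ref")
        return {"app.content.ref": payload_ref}
    if mode == "full":
        # Full content is stored only after redaction.
        return {"app.content.text": EMAIL.sub("[REDACTED_EMAIL]", text)}
    raise ValueError(f"unknown mode: {mode}")


prompt = "Summarize the ticket from bob@example.com please"
default_attrs = content_attributes(prompt)
full_attrs = content_attributes(prompt, mode="full")
```

Defaulting to `metadata` means a caller has to opt in to anything more revealing, which matches a redact-by-default policy.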

Telemetry cost can grow quickly. Agent traces can include deeply nested runs, large payloads, many tool calls, and repeated model invocations, so teams should set sampling and retention policies early (LangSmith).
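
A sampling policy can be as small as a deterministic keep-or-drop decision per trace plus a cap on payload size. The sketch below assumes a ratio-based decision on the low 64 bits of the trace ID, similar in spirit to ratio-based samplers in OpenTelemetry SDKs, with failed traces always kept. The function names and thresholds are hypothetical, and keeping failed traces implies the decision is made after the trace completes (tail sampling), not at its start.

```python
def keep_trace(trace_id: int, ratio: float, has_error: bool = False) -> bool:
    """Decide whether to keep a trace.

    The decision depends only on the trace ID, so every service that sees
    the same trace reaches the same answer for a given ratio.
    """
    if has_error:
        return True  # failed traces are usually the ones worth the storage
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


def truncate_payload(text: str, max_chars: int):
    """Cap a payload and report the original length so the truncation is visible."""
    return text[:max_chars], len(text)


kept_low = keep_trace(trace_id=0, ratio=0.1)
kept_high = keep_trace(trace_id=(1 << 64) - 1, ratio=0.1)
kept_error = keep_trace(trace_id=(1 << 64) - 1, ratio=0.1, has_error=True)
body, original_length = truncate_payload("x" * 5000, max_chars=1024)
```

Recording the original length alongside the truncated payload keeps size distributions available for capacity planning even when the content itself is cut.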

Observability does not replace evaluation. Traces show what happened; teams still need quality checks, safety checks, and regression suites to decide whether behavior is acceptable.

Pros & Cons

Advantages

  • Extends familiar telemetry practices to prompts, model calls, tokens, latency, and tool usage.
  • Helps correlate AI behavior with application traces and incidents.
  • Avoids isolated vendor dashboards by using an open observability standard.

Disadvantages

  • Sensitive prompt and response data require careful redaction and retention policies.
  • Semantic conventions for GenAI are still maturing.
  • Telemetry volume and cost can grow quickly for high-traffic AI systems.

Recommendation

Adopt OTel-compatible instrumentation for production AI services and agents. Capture model spans, tool spans, retrieval spans, token metrics, latency, errors, cache events, and structured correlation IDs, while redacting sensitive content by default.
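
To make the capture list concrete, the sketch below builds the name and attributes for one model span as plain Python data, with no prompt or response content included. The `gen_ai.*` and `error.type` keys follow the OpenTelemetry semantic conventions as understood at the time of writing; because the GenAI conventions are still in Development, they should be checked against the current specification. The `model_span` helper, the `app.correlation_id` attribute, and the model name are hypothetical.

```python
def model_span(operation, model, input_tokens, output_tokens,
               duration_s, correlation_id, error_type=None):
    """Describe one model call as a span name plus attributes.

    No prompt or response text is included: content is left out by default.
    """
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical application-level attribute for joining with other systems.
        "app.correlation_id": correlation_id,
    }
    if error_type is not None:
        attributes["error.type"] = error_type
    return {
        "name": f"{operation} {model}",
        "duration_s": duration_s,
        "attributes": attributes,
    }


ok_span = model_span("chat", "example-model", 12, 34, 0.8, "req-1")
failed_span = model_span("chat", "example-model", 12, 0, 30.0, "req-2",
                         error_type="timeout")
```

With an OpenTelemetry SDK installed, a dictionary like `attributes` would typically be attached to a real span created by a tracer; it is kept as plain data here so the example runs without any dependencies.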

Use LLMOps tools for prompt/version debugging, online evals, and trace review, but keep the telemetry foundation portable. Prefer platforms that can export to or ingest from OpenTelemetry so AI operations data does not end up in a disconnected dashboard.

Sources