UKG — AI/MLOps.
Building the observability and feedback layer underneath UKG's AI pillar — so engineers can debug production incidents, data scientists can evaluate agent behavior, and product analysts can act on customer signal in hours instead of weeks.
- −40% MTTD
- −74% pipeline latency
- −53% query turnaround
- Span loss eliminated
AI microservices are hard to observe the way you observe regular backends.
Standard APM tells you which endpoint was slow. It doesn't tell you what the agent decided, which tools it called, what the user was asking for, or whether the model's behavior drifted. When an AI feature misfires in production, the person on-call needs all of those — and the data scientist fixing the model needs the same data laid out completely differently. The work here is turning that one messy request path into a single, connected, queryable record — then splitting it cleanly for each audience (engineers, data scientists, product analysts) so everyone can answer the question they actually have.
Five shipped pieces.
OpenTelemetry across the AI pillar
Instrumented every AI microservice with OTel so a single request — user → gateway → agent → tool call → downstream service — shows up as one connected trace. Unified trace IDs, consistent attribute names, auto-instrumented client libs. Cut mean time to debug production incidents by 40% because on-call engineers stopped grepping logs across services.
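The "consistent attribute names" part is the piece that makes cross-service queries possible. A stdlib-only sketch of what that convention looks like — the key names and the `agentSpanAttrs` helper are illustrative, not UKG's actual internal schema, and in the real services these values feed `trace.Span.SetAttributes` rather than a plain map:

```go
package main

import "fmt"

// Canonical attribute keys shared by every AI microservice, so one
// trace query works across the whole request path.
// (Key names here are illustrative, not the production convention.)
const (
	AttrAgentIntent = "agent.intent"
	AttrAgentTool   = "agent.tool.name"
	AttrTokensIn    = "agent.tokens.input"
	AttrTokensOut   = "agent.tokens.output"
)

// agentSpanAttrs builds the attribute set stamped onto an agent span.
// Kept as a plain map so the sketch stays dependency-free.
func agentSpanAttrs(intent, tool string, tokensIn, tokensOut int) map[string]any {
	return map[string]any{
		AttrAgentIntent: intent,
		AttrAgentTool:   tool,
		AttrTokensIn:    tokensIn,
		AttrTokensOut:   tokensOut,
	}
}

func main() {
	attrs := agentSpanAttrs("pto_balance_lookup", "hr_api.get_balance", 412, 96)
	fmt.Println(attrs[AttrAgentIntent]) // pto_balance_lookup
}
```

Because every service stamps the same keys, "show me all tool calls in this trace" is one query instead of N per-service log greps.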
Go DLP pipeline + OTel Collector tuning
Rewrote the DLP (Data Loss Prevention) processing pipeline and the OTel Collector config in Go, switching to Google Cloud DLP's table format. Cut end-to-end latency by 74% and eliminated span loss under load — the old pipeline dropped spans during bursts, which was breaking trace completeness exactly when incidents happened.
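The span-loss fix is mostly Collector configuration: bound memory before the process OOMs, batch aggressively, and queue-plus-retry at the exporter so bursts back-pressure instead of dropping. A sketch of that shape — the endpoint and all numeric values are illustrative, not the production tuning:

```yaml
# OTel Collector: survive bursts instead of dropping spans.
# (Values are illustrative; tune against your own load profile.)
processors:
  memory_limiter:          # refuse data before the Collector OOMs
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:         # buffer bursts instead of dropping
      enabled: true
      num_consumers: 8
      queue_size: 5000
    retry_on_failure:      # transient backend errors don't lose spans
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 60s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]  # memory_limiter must run first
      exporters: [otlp]
```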
Go trace processor → BigQuery
Built a Go processor that pulls agent-specific metrics out of OTel spans — agent inputs / outputs, detected intent, tool calls, token usage — and loads them into BigQuery. Data scientists used to wait on ad-hoc queries against raw traces; now the agent evaluation surface is its own warehouse table. Cut their query turnaround by 53%.
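The core of a processor like this is the mapping from span attributes to a warehouse row. A stdlib-only sketch of that step — the struct fields and attribute keys are assumptions, not the production schema, and the real processor reads OTLP protos and writes via the BigQuery client rather than printing JSON:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentEvalRow mirrors the warehouse table data scientists query.
// (Field names are illustrative, not the production schema.)
type AgentEvalRow struct {
	TraceID      string   `json:"trace_id"`
	Intent       string   `json:"intent"`
	ToolCalls    []string `json:"tool_calls"`
	InputTokens  int      `json:"input_tokens"`
	OutputTokens int      `json:"output_tokens"`
}

// rowFromSpan pulls agent-specific fields out of a span's attributes,
// tolerating missing or mistyped keys (zero values instead of panics).
func rowFromSpan(traceID string, attrs map[string]any) AgentEvalRow {
	row := AgentEvalRow{TraceID: traceID}
	if v, ok := attrs["agent.intent"].(string); ok {
		row.Intent = v
	}
	if v, ok := attrs["agent.tool.calls"].([]string); ok {
		row.ToolCalls = v
	}
	if v, ok := attrs["agent.tokens.input"].(int); ok {
		row.InputTokens = v
	}
	if v, ok := attrs["agent.tokens.output"].(int); ok {
		row.OutputTokens = v
	}
	return row
}

func main() {
	row := rowFromSpan("abc123", map[string]any{
		"agent.intent":        "pto_balance_lookup",
		"agent.tool.calls":    []string{"hr_api.get_balance"},
		"agent.tokens.input":  412,
		"agent.tokens.output": 96,
	})
	b, _ := json.Marshal(row)
	fmt.Println(string(b))
}
```

Once the mapping is a typed row instead of raw span JSON, "tokens per intent last week" becomes a one-line SQL query instead of an ad-hoc trace dig.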
Pub/Sub feedback loop
Architected a Pub/Sub pipeline that ingests real-time customer feedback from product surfaces (thumbs, regen, correction, abandonment) and fans it out to analytics, model evaluation, and dashboards. Product analysts no longer depend on weekly data pulls to act on customer signal.
Grafana + Arize integration
Stood up shared Grafana dashboards for service-level observability and Arize for model-level evaluation across the AI microservices. One URL to check if a service is healthy; another to check if the model behind it is still behaving. Both are exporters off the same OTel pipeline, so the data never forks.
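"The data never forks" means one traces pipeline with two exporters in the Collector, rather than two instrumentation paths. A sketch of that shape — backends, endpoints, and header names here are illustrative stand-ins, not the production config:

```yaml
# One pipeline, two destinations: the service view (Grafana's trace
# backend) and the model view (Arize) read the same spans.
# (Endpoints and exporter choices are illustrative.)
exporters:
  otlp/traces-backend:          # queried from Grafana dashboards
    endpoint: trace-backend:4317
  otlphttp/arize:               # model-level evaluation
    endpoint: https://example-otlp-endpoint
    headers:
      api-key: ${env:ARIZE_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces-backend, otlphttp/arize]
```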
Observability is a product surface.
- Engineers, data scientists, and product analysts need the same underlying telemetry presented three different ways. Build the pipeline once, branch it at the exporter — don't duplicate collection.
- For LLM features, the most valuable telemetry is usually the non-technical part: detected intent, tool calls, prompt/response pairs. APM vendors don't capture this by default — you build it.
- Writing the Collector in Go was worth the investment — when it matters most (incidents, load spikes) you don't want your observability pipeline to be the thing dropping data.
- A feedback loop from customers → product data is a distribution problem, not a model problem. Pub/Sub, not smarter ML, closes the gap between "customer is mad" and "team knows about it."