UKG — AI/MLOps.
Building the observability and feedback layer underneath UKG's AI pillar — so engineers can debug production incidents, data scientists can evaluate agent behavior, and product analysts can act on customer signal in hours instead of weeks.
- −40% MTTD
- −74% pipeline latency
- −53% query turnaround
- Span loss eliminated
AI microservices are hard to observe the way you observe regular backends.
Standard APM tells you which endpoint was slow. It doesn't tell you what the agent decided, which tools it called, what the user was asking for, or whether the model's behavior drifted. When an AI feature misfires in production, the person on-call needs all of those — and the data scientist fixing the model needs the same data laid out completely differently. The work here is turning that one messy request path into a single, connected, queryable record — then splitting it cleanly for each audience (engineers, data scientists, product analysts) so everyone can answer the question they actually have.
Five shipped pieces.
OpenTelemetry across the AI pillar
Instrumented every AI microservice with OTel so a single request — user → gateway → agent → tool call → downstream service — shows up as one connected trace. Unified trace IDs, consistent attribute names, auto-instrumented client libs. Cut mean time to debug production incidents by 40% because on-call engineers stopped grepping logs across services.
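The "consistent attribute names" part is the piece that makes cross-service queries possible. A stdlib-only sketch of what that convention looks like — the key names and the `agentSpanAttrs` helper are illustrative, not UKG's actual internal schema, and in the real services these values feed `trace.Span.SetAttributes` rather than a plain map:

```go
package main

import "fmt"

// Canonical attribute keys shared by every AI microservice, so one
// trace query works across the whole request path.
// (Key names here are illustrative, not the production convention.)
const (
	AttrAgentIntent = "agent.intent"
	AttrAgentTool   = "agent.tool.name"
	AttrTokensIn    = "agent.tokens.input"
	AttrTokensOut   = "agent.tokens.output"
)

// agentSpanAttrs builds the attribute set stamped onto an agent span.
// Kept as a plain map so the sketch stays dependency-free.
func agentSpanAttrs(intent, tool string, tokensIn, tokensOut int) map[string]any {
	return map[string]any{
		AttrAgentIntent: intent,
		AttrAgentTool:   tool,
		AttrTokensIn:    tokensIn,
		AttrTokensOut:   tokensOut,
	}
}

func main() {
	attrs := agentSpanAttrs("pto_balance_lookup", "hr_api.get_balance", 412, 96)
	fmt.Println(attrs[AttrAgentIntent]) // pto_balance_lookup
}
```

Because every service stamps the same keys, "show me all tool calls in this trace" is one query instead of N per-service log greps.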
Go DLP pipeline + OTel Collector tuning
Rewrote the DLP (Data Loss Prevention) processing pipeline and the OTel Collector config in Go, switching to Google Cloud DLP's table format. Cut end-to-end latency by 74% and eliminated span loss under load — the old pipeline dropped spans during bursts, which was breaking trace completeness exactly when incidents happened.
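The span-loss fix is mostly Collector configuration: bound memory before the process OOMs, batch aggressively, and queue-plus-retry at the exporter so bursts back-pressure instead of dropping. A sketch of that shape — the endpoint and all numeric values are illustrative, not the production tuning:

```yaml
# OTel Collector: survive bursts instead of dropping spans.
# (Values are illustrative; tune against your own load profile.)
processors:
  memory_limiter:          # refuse data before the Collector OOMs
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:         # buffer bursts instead of dropping
      enabled: true
      num_consumers: 8
      queue_size: 5000
    retry_on_failure:      # transient backend errors don't lose spans
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 60s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]  # memory_limiter must run first
      exporters: [otlp]
```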
Go trace processor → BigQuery
Built a Go processor that pulls agent-specific metrics out of OTel spans — agent inputs / outputs, detected intent, tool calls, token usage — and loads them into BigQuery. Data scientists used to wait on ad-hoc queries against raw traces; now the agent evaluation surface is its own warehouse table. Cut their query turnaround by 53%.
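The core of a processor like this is the mapping from span attributes to a warehouse row. A stdlib-only sketch of that step — the struct fields and attribute keys are assumptions, not the production schema, and the real processor reads OTLP protos and writes via the BigQuery client rather than printing JSON:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentEvalRow mirrors the warehouse table data scientists query.
// (Field names are illustrative, not the production schema.)
type AgentEvalRow struct {
	TraceID      string   `json:"trace_id"`
	Intent       string   `json:"intent"`
	ToolCalls    []string `json:"tool_calls"`
	InputTokens  int      `json:"input_tokens"`
	OutputTokens int      `json:"output_tokens"`
}

// rowFromSpan pulls agent-specific fields out of a span's attributes,
// tolerating missing or mistyped keys (zero values instead of panics).
func rowFromSpan(traceID string, attrs map[string]any) AgentEvalRow {
	row := AgentEvalRow{TraceID: traceID}
	if v, ok := attrs["agent.intent"].(string); ok {
		row.Intent = v
	}
	if v, ok := attrs["agent.tool.calls"].([]string); ok {
		row.ToolCalls = v
	}
	if v, ok := attrs["agent.tokens.input"].(int); ok {
		row.InputTokens = v
	}
	if v, ok := attrs["agent.tokens.output"].(int); ok {
		row.OutputTokens = v
	}
	return row
}

func main() {
	row := rowFromSpan("abc123", map[string]any{
		"agent.intent":        "pto_balance_lookup",
		"agent.tool.calls":    []string{"hr_api.get_balance"},
		"agent.tokens.input":  412,
		"agent.tokens.output": 96,
	})
	b, _ := json.Marshal(row)
	fmt.Println(string(b))
}
```

Once the mapping is a typed row instead of raw span JSON, "tokens per intent last week" becomes a one-line SQL query instead of an ad-hoc trace dig.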
Pub/Sub feedback loop
Architected a Pub/Sub pipeline that ingests real-time customer feedback from product surfaces (thumbs, regen, correction, abandonment) and fans it out to analytics, model evaluation, and dashboards. Product analysts no longer depend on weekly data pulls to act on customer signal.
Grafana + Arize integration
Stood up shared Grafana dashboards for service-level observability and Arize for model-level evaluation across the AI microservices. One URL to check if a service is healthy; another to check if the model behind it is still behaving. Both are exporters off the same OTel pipeline, so the data never forks.
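"The data never forks" means one traces pipeline with two exporters in the Collector, rather than two instrumentation paths. A sketch of that shape — backends, endpoints, and header names here are illustrative stand-ins, not the production config:

```yaml
# One pipeline, two destinations: the service view (Grafana's trace
# backend) and the model view (Arize) read the same spans.
# (Endpoints and exporter choices are illustrative.)
exporters:
  otlp/traces-backend:          # queried from Grafana dashboards
    endpoint: trace-backend:4317
  otlphttp/arize:               # model-level evaluation
    endpoint: https://example-otlp-endpoint
    headers:
      api-key: ${env:ARIZE_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces-backend, otlphttp/arize]
```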
Observability is a product surface.
- Engineers, data scientists, and product analysts need the same underlying telemetry presented three different ways. Build the pipeline once, branch it at the exporter — don't duplicate collection.
- For LLM features, the most valuable telemetry is usually the non-technical part: detected intent, tool calls, prompt/response pairs. APM vendors don't capture this by default — you build it.
- Writing the Collector in Go was worth the investment — when it matters most (incidents, load spikes) you don't want your observability pipeline to be the thing dropping data.
- A feedback loop from customers → product data is a distribution problem, not a model problem. Pub/Sub, not smarter ML, closes the gap between "customer is mad" and "team knows about it."