Debugging HTTP 503 UC Errors in Istio Service Mesh

Background To obtain more relevant metrics and a fuller picture of traces when external API requests hit our systems, we opted to enable tracing for the mesh components that sit above the Kubernetes service. This covers the entire request flow from its initiation: the ingress gateway receives the request and routes it, according to the defined rules, to a particular K8s service; the Istio sidecar proxy containers, running alongside our application containers in the same pod, then receive the request and proxy it to the main containers, where our Node.js server process ultimately handles it and generates the appropriate response. Before we enabled tracing for the Istio mesh, tracing for an HTTP request began only when the request reached our application container at the pod level, with HTTP auto-instrumentation serving as the trace root. ...
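To make the "HTTP auto-instrumentation as the trace root" setup concrete, here is a minimal sketch of a pod-level OpenTelemetry Node.js setup of that kind. The package names are the standard OpenTelemetry ones, but the file name and collector URL are placeholders, not the post's actual configuration.

```typescript
// tracing.ts — hypothetical pod-level tracing setup (not the post's real code).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // Placeholder endpoint; in practice this points at your collector/agent.
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // With only app-level HTTP instrumentation, the span created for the
  // incoming request becomes the trace root — the ingress gateway and the
  // Envoy sidecar don't appear until mesh tracing is enabled.
  instrumentations: [new HttpInstrumentation()],
});

sdk.start();
```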

January 30, 2026 · 10 min

Debugging Redis Latency

This article is about how, at work, we solved the issue of high response times when executing Redis commands from our Node.js server against a Redis-compatible database called Dragonfly. Background After introducing metrics to our Node.js service, we started recording the overall response time whenever a Redis command was executed. We had a wrapper service around ioredis, the Redis driver we use to talk to our Redis-compatible database. Once we set up Grafana dashboards for metrics like cache latency, we saw unusually high p99 latency numbers, close to 200 ms. That is a very large number, especially considering that the underlying database query itself typically takes less than 10 ms to complete. To understand why the latency was so high, we needed more detailed insight than metrics alone could provide. As part of a broader effort to set up our observability stack, I had been exploring various tracing solutions: options ranged from open-source SDKs (the OpenTelemetry Node.js SDK) with a self-deployed trace backend to third-party managed solutions (Datadog, Middleware, etc.). For this investigation, we decided to proceed with a self-hosted Grafana Tempo instance to test the setup and feasibility. (So far, the setup is working great, and I'm planning a detailed blog post on our observability architecture soon.) With tracing set up, we could get a waterfall view of the path taken by the service while responding to things like HTTP requests or event processing, which we hoped would pinpoint the source of the delay in our Redis command execution. ...
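The post describes a wrapper around ioredis that records command latency. A rough sketch of that pattern, assuming Prometheus-style metrics via prom-client, might look like the following; the class, metric, and host names are illustrative, not the post's actual code.

```typescript
// cache.ts — hypothetical ioredis wrapper that times each command.
import Redis from 'ioredis';
import { Histogram } from 'prom-client';

// Histogram backing a Grafana "cache latency" panel (p99, etc.).
const cacheLatency = new Histogram({
  name: 'cache_command_duration_seconds',
  help: 'Latency of Redis commands issued by the cache wrapper',
  labelNames: ['command'],
});

export class CacheClient {
  // Dragonfly speaks the Redis protocol, so ioredis connects to it directly.
  private readonly redis = new Redis({ host: 'dragonfly', port: 6379 });

  async get(key: string): Promise<string | null> {
    const end = cacheLatency.startTimer({ command: 'get' });
    try {
      return await this.redis.get(key);
    } finally {
      // Record elapsed time whether the command succeeds or throws.
      end();
    }
  }
}
```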

April 6, 2025 · 11 min