Core Concepts
Proactive monitoring and detailed analysis are crucial for identifying and resolving performance issues in critical services, ensuring high-quality user experiences.
Abstract
This article describes how the Typeform team investigated a performance issue in their Renderer service, which displays forms to end users. After deploying a new version of the service, the team noticed a curious "Sawtooth" pattern in the response times: sudden drops followed by gradual increases.
To tackle the issue, the team leveraged various tools and techniques:
Metrics and Tracing: The team used Datadog Metrics and distributed tracing to detect the problem and narrow down the issue, identifying a discrepancy between the time spent in the Renderer service and the downstream Forms API service.
Runtime Metrics: The team added Node.js runtime metrics to their toolkit, which provided insights into CPU usage, memory usage, and event loop delays, suggesting the presence of expensive object creation in the code.
Profiler: The Datadog Profiler allowed the team to compare flame graphs between different time frames, highlighting the trackPrivateForm function as the potential source of the problem.
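The tracing discrepancy described above can be expressed as simple arithmetic: the time the caller waits minus the time the callee reports spending. The sketch below is illustrative only. The SpanTiming shape and the unaccountedMs helper are hypothetical, not part of Datadog's API, and the 4 ms figure is an assumed stand-in for "a few milliseconds".

```typescript
// Hypothetical, simplified view of a span; real traces carry far more fields.
interface SpanTiming {
  service: string;
  durationMs: number;
}

// Time the caller spent waiting that the callee's own span does not explain.
// A large gap points at the caller side or the network, not the callee.
function unaccountedMs(clientCall: SpanTiming, serverHandler: SpanTiming): number {
  return Math.max(0, clientCall.durationMs - serverHandler.durationMs);
}

// Assumed example values: ~100 ms seen by the Renderer, a few ms inside forms-api.
const rendererCall: SpanTiming = { service: "renderer", durationMs: 100 };
const formsApiHandler: SpanTiming = { service: "forms-api", durationMs: 4 };

const gapMs = unaccountedMs(rendererCall, formsApiHandler);
console.log(`unaccounted time: ${gapMs} ms`);
```

A gap this large relative to the callee's processing time is what would direct attention toward the calling service, which is consistent with the team then turning to runtime metrics and profiling of the Renderer.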
After identifying the root cause, the team optimized the code by moving the creation of the RudderstackAnalytics instance outside the trackPrivateForm function, ensuring it is created only once. This optimization resolved the latency issue, and the Sawtooth pattern in the response times was no longer observed.
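A minimal sketch of that change, under stated assumptions: AnalyticsClient is a hypothetical stand-in for RudderstackAnalytics (its real constructor and methods are not shown in the source), and the write key and event name are made up. The instance counter exists only to make the difference between the two versions visible.

```typescript
// Hypothetical stand-in for RudderstackAnalytics, not the real client.
class AnalyticsClient {
  static instancesCreated = 0;
  readonly writeKey: string;
  readonly events: string[] = [];

  constructor(writeKey: string) {
    this.writeKey = writeKey;
    AnalyticsClient.instancesCreated += 1;
  }

  track(event: string): void {
    this.events.push(event);
  }
}

// Before: every call constructs a new client, so the construction cost
// is paid on each tracked request.
function trackPrivateFormPerCall(formId: string): void {
  const analytics = new AnalyticsClient("hypothetical-write-key");
  analytics.track(`private_form_viewed:${formId}`);
}

// After: the client is created once at module load and reused by every call.
const sharedAnalytics = new AnalyticsClient("hypothetical-write-key");

function trackPrivateForm(formId: string): void {
  sharedAnalytics.track(`private_form_viewed:${formId}`);
}
```

With this layout, calling trackPrivateForm any number of times leaves the instance count unchanged, whereas trackPrivateFormPerCall adds one instance per call.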
The key takeaways from this investigation are the importance of proactive monitoring, detailed analysis, and prompt issue resolution to maintain robust and efficient systems, even under heavy traffic.
Stats
The Renderer took about 100 ms to call forms-api, the service that stores form definitions, while forms-api itself took only a few milliseconds to process the request.
While response latency was building up, the team observed increases in CPU usage and event loop delay, but no increase in memory usage.
Quotes
"Tools like Datadog Metrics are really great for detecting that there is a problem and even pinpointing the time it first occurred. We could even confirm and narrow down the issue with distributed tracing."
"The Datadog Profiler allows us to choose different time frames and compare their flame graphs. It even conveniently highlights functions in which more time is spent in a given sample compared to another."