Incident Spotlight: The Challenge of Rising SQL Query Latency

Qonto is a financial technology company that delivers banking solutions to businesses across Europe. As their user base expanded, they faced an escalating challenge: rising SQL query latency on their AWS RDS instances. This latency threatened the performance of their applications, risking user dissatisfaction and potential financial loss. The incident highlighted a critical need for enhanced monitoring and predictive alerting mechanisms to ensure their databases could handle the increasing load without compromising speed or reliability.

Unveiling the Root Cause: Maximum Provisioned IOPS Saturation

As the Qonto engineering team delved into the issue, they discovered that the root cause of the SQL query latency was the saturation of maximum provisioned IOPS (Input/Output Operations Per Second). The IOPS cap, though initially sufficient, had been maxed out due to the surge in database transactions. This bottleneck was not immediately apparent, as standard AWS monitoring tools provided limited insights into the underlying performance metrics that directly impacted their RDS instances. The team realized that a more granular and proactive approach to monitoring was necessary to prevent such incidents from recurring.
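To make the notion of saturation concrete, here is a minimal sketch, not Qonto's actual code. It assumes read and write operations per second are available (CloudWatch publishes these for RDS as ReadIOPS and WriteIOPS) and compares their sum with the provisioned IOPS cap. The function name and the sample numbers are hypothetical.

```python
def iops_utilization(read_iops: float, write_iops: float, provisioned_iops: float) -> float:
    """Return the fraction of provisioned IOPS in use (1.0 means saturated)."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    return (read_iops + write_iops) / provisioned_iops


# Hypothetical sample: 2,100 reads/s and 750 writes/s against a 3,000 IOPS cap.
utilization = iops_utilization(2100, 750, 3000)
print(f"{utilization:.0%}")  # prints "95%"
```

A value that stays close to 1.0 means requests are queuing at the storage layer, which shows up to the application as slower queries.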

Introducing Qonto’s Database Monitoring Framework (DMF)

In response to this challenge, Qonto’s engineering team developed an in-house solution: Qonto’s Database Monitoring Framework (DMF). This open-source framework was designed to provide a comprehensive and unified view of database performance metrics tailored specifically to AWS RDS instances. The DMF enables real-time monitoring of key performance indicators (KPIs) such as IOPS usage, CPU utilization, and SQL query performance, all consolidated into a single dashboard. By leveraging DMF, Qonto could proactively detect potential performance bottlenecks before they escalated into critical incidents.

Crafting the Metrics Exporter: Consolidating Data from Multiple AWS APIs

At the heart of Qonto’s DMF is a robust metrics exporter meticulously crafted to gather and consolidate data from multiple AWS APIs. This exporter pulls data from CloudWatch, RDS Enhanced Monitoring, and other AWS services, providing a holistic view of the database environment. The data is then normalized and structured for easy consumption by monitoring tools like Prometheus and Grafana. This approach enhances visibility into RDS performance and simplifies setting up custom alerts and dashboards, empowering engineers to make informed decisions swiftly.
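The consolidation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the exporter's real implementation: the Sample type, the function names, and the rds_* metric names are hypothetical, and the inputs are plain dictionaries standing in for API responses. The field names ReadIOPS, WriteIOPS, CPUUtilization, DBInstanceIdentifier, and Iops follow the CloudWatch and RDS APIs; the output uses the Prometheus text exposition format.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One normalized measurement, independent of the AWS API it came from."""
    name: str
    value: float
    labels: dict[str, str] = field(default_factory=dict)


# Hypothetical mapping from CloudWatch metric names to exported metric names.
CLOUDWATCH_NAMES = {
    "ReadIOPS": "rds_read_iops",
    "WriteIOPS": "rds_write_iops",
    "CPUUtilization": "rds_cpu_percent",
}


def from_cloudwatch(instance: str, datapoints: dict[str, float]) -> list[Sample]:
    """Normalize CloudWatch datapoints, skipping metrics we do not export."""
    return [
        Sample(CLOUDWATCH_NAMES[metric], float(value), {"instance": instance})
        for metric, value in datapoints.items()
        if metric in CLOUDWATCH_NAMES
    ]


def from_instance_description(desc: dict) -> list[Sample]:
    """Extract the provisioned IOPS cap from one DescribeDBInstances entry."""
    labels = {"instance": desc["DBInstanceIdentifier"]}
    return [Sample("rds_provisioned_iops", float(desc["Iops"]), labels)]


def render(samples: list[Sample]) -> str:
    """Render samples in the Prometheus text exposition format."""
    lines = []
    for s in samples:
        labels = ",".join(f'{k}="{v}"' for k, v in sorted(s.labels.items()))
        lines.append(f"{s.name}{{{labels}}} {s.value}")
    return "\n".join(lines) + "\n"


samples = from_cloudwatch("db-1", {"ReadIOPS": 2100.0, "WriteIOPS": 750.0})
samples += from_instance_description({"DBInstanceIdentifier": "db-1", "Iops": 3000})
print(render(samples), end="")
```

Putting usage and the provisioned cap side by side under the same instance label is what makes a ratio such as "IOPS used / IOPS provisioned" easy to compute in a dashboard or alert rule.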

Advanced Alerting Strategies for Early Detection

To complement the DMF, Qonto implemented advanced alerting strategies designed for early detection of performance issues. By analyzing historical data and identifying trends, the system can predict potential performance degradation before it impacts users. Alerts are configured to trigger based on thresholds for IOPS usage, CPU spikes, and slow query logs, among other critical metrics. These alerts are integrated with Qonto’s incident management system, ensuring that the right teams are notified immediately and allowing them to take preventive rather than reactive measures.
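One simple way to turn a trend into an early warning is linear extrapolation: fit a line through recent samples and estimate when it will cross a threshold. The sketch below illustrates that idea only; the article does not specify which prediction method Qonto uses, and the function name and sample values are hypothetical. PromQL's predict_linear function is based on the same kind of linear regression.

```python
from typing import Optional


def seconds_until_threshold(
    samples: list[tuple[float, float]], threshold: float
) -> Optional[float]:
    """Estimate seconds from the last sample until the trend hits the threshold.

    samples are (timestamp_seconds, value) pairs in chronological order.
    Returns None when there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope and intercept of value as a function of time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - samples[-1][0])


# Hypothetical IOPS utilization rising by 5 points per minute.
history = [(0, 0.70), (60, 0.75), (120, 0.80)]
eta = seconds_until_threshold(history, threshold=0.95)
print(round(eta))  # prints 180
```

An alert built on such an estimate can fire when the projected time to saturation drops below some lead time, rather than waiting for the threshold itself to be breached.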

Navigating Alerts with Confidence: The Role of Runbooks

Recognizing the importance of clear and actionable responses to alerts, Qonto incorporated detailed runbooks into their monitoring framework. These runbooks provide step-by-step guidance on addressing specific alerts, including instructions on scaling IOPS, optimizing queries, and adjusting RDS configurations. The goal is to minimize downtime and ensure the engineering team can confidently navigate alerts, reducing the mean time to resolution (MTTR) and maintaining optimal database performance.

Engaging the Community: Open Source Contributions and Future Development

In the spirit of collaboration and continuous improvement, Qonto decided to open-source its Database Monitoring Framework. By sharing their solution with the broader tech community, they aim to empower other organizations facing similar challenges with RDS monitoring. Qonto actively encourages contributions from developers and database administrators, inviting them to enhance the framework with new features, integrations, and optimizations. The future development roadmap includes plans for more advanced predictive analytics, support for additional AWS services, and deeper integrations with popular DevOps tools.

Conclusion

Qonto’s journey from facing SQL query latency challenges to developing an open-source monitoring framework offers a useful example for other tech teams. Their Database Monitoring Framework not only addresses the immediate needs of their infrastructure but also contributes to the larger community by providing a powerful tool for enhancing AWS RDS performance and alerting capabilities.
