Issue
Database IO Issue
Date
August 1, 2018
Location
US
Length of Event
August 1, 2018 06:57 am - 07:20 am UTC
Summary
On August 1, 2018, at 6:57 am UTC, some users of the US Harmony cloud experienced issues with user logins, scheduled jobs on multi-agent configurations, APIs, and/or overall degraded performance. The Jitterbit operations team detected reduced IO on the database tier. The issue lasted from 6:57 am to 7:20 am UTC.
Root Cause
The issue was caused by disk input/output (IO) contention in the main database tier. The contention came from a higher-than-usual backlog of long-running queries associated with month-end reporting and of repetitive search queries, all running while the full nightly backups were in progress. Data services were fully restored once all of these tasks completed. During the event, some agents were unable to update their status or retrieve jobs, and delayed agent message routing prevented some real-time API processing from occurring. Our team will continue to investigate the types of queries that lead to additional strain on the system and has monitors in place to assess these situations more quickly. In the meantime, we are actively working on changes to how many of these queries are handled to further optimize performance, and we expect these changes to be released over the next few months.
Steps for Remediation
- Adjust the trigger time of the nightly portion of the backup job to avoid conflicts with month-end reporting jobs.
- Investigate the nature of long-running, intensive queries.
- Perform further tuning and optimization of the database performance.
- Segregate and isolate non-transactional data and related queries.
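As an illustration of the first remediation step, the following is a minimal sketch of how a backup trigger time could be moved when it would overlap a month-end reporting window. Every value and name here is an assumption made for illustration: the backup hour, the backup duration, the reporting window, the rule that reporting runs on the first day of the month, and the function names are all hypothetical and do not describe the actual Harmony schedules or tooling.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical schedule values for illustration only; the real
# backup and reporting schedules are not stated in this report.
DEFAULT_BACKUP_HOUR_UTC = 6
BACKUP_DURATION = timedelta(hours=1)
REPORTING_START_HOUR_UTC = 5
REPORTING_DURATION = timedelta(hours=4)


def is_month_end_reporting_day(day):
    """Assumption: month-end reporting runs on the first day of each month."""
    return day.day == 1


def windows_overlap(start_a, end_a, start_b, end_b):
    """Return True if the half-open intervals [start_a, end_a) and [start_b, end_b) overlap."""
    return start_a < end_b and start_b < end_a


def backup_start_for(day):
    """Pick the backup start time (UTC) for the given calendar day.

    On ordinary days the default hour is used. On a reporting day,
    if the default backup window would overlap the reporting window,
    the backup is deferred until the reporting window ends.
    """
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    backup_start = midnight + timedelta(hours=DEFAULT_BACKUP_HOUR_UTC)
    if not is_month_end_reporting_day(day):
        return backup_start

    report_start = midnight + timedelta(hours=REPORTING_START_HOUR_UTC)
    report_end = report_start + REPORTING_DURATION
    backup_end = backup_start + BACKUP_DURATION
    if windows_overlap(backup_start, backup_end, report_start, report_end):
        return report_end
    return backup_start
```

With these assumed values, calling `backup_start_for(date(2018, 8, 1))` defers the backup to the end of the assumed reporting window, while `backup_start_for(date(2018, 8, 2))` keeps the default hour. Deferring rather than skipping the backup is a design choice of this sketch; running the backup before the reporting window would work equally well if the window boundaries are known in advance.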