Requests to the clinical engine were failing due to high load on the database. The load was caused by a Postgres upgrade that performed well on the test and acceptance environments but degraded backend performance in production.
| Date/Time | Description |
|---|---|
| 05/02/2024 03:16 | Postgres upgrade on the database cluster |
| 05/02/2024 08:49 | Investigation started in response to AWS write-to-disk alarms |
| 05/02/2024 10:00 | Service became unavailable; caregivers also started experiencing issues |
| 05/02/2024 10:22 | Issue identified as high load on the database cluster |
| 05/02/2024 10:45 | Scaled up the database cluster to handle the increased load |
| 05/02/2024 11:12 | CE services restarted to try to reduce load |
| 05/02/2024 11:27 | Errors resolved; system stable again |
| 05/02/2024 12:24 | Jobs that had failed on retry were manually re-pushed onto the queue |
Lead time: 0.5h
Work-around time: N/A
Correction time: 2.5h
SLA was met.
Retrospectively, the impact is considered high. The risk to patient safety was negligible because the issue was resolved quickly, no data was lost, and measurements were manually sent into the clinical engine once the issue was resolved. The inconvenience to customers, however, was significant.
N/A
The reader database showed excessive CPU utilization after the database upgrade.
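The report does not record how the heavy statements were identified; one way to confirm which queries were driving CPU on the reader is to inspect pg_stat_statements, as in the minimal sketch below. It assumes the pg_stat_statements extension is enabled and uses a hypothetical connection string; on Postgres versions before 13 the timing columns are named total_time and mean_time instead.

```python
# Illustrative only: list the statements consuming the most time on the reader.
# Assumes pg_stat_statements is installed; the DSN below is a placeholder.
import psycopg2

READER_DSN = "host=reader.example.internal dbname=clinical user=ops"  # hypothetical

TOP_QUERIES_SQL = """
    SELECT calls,
           total_exec_time,   -- total_time on Postgres < 13
           mean_exec_time,    -- mean_time  on Postgres < 13
           left(query, 120) AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
"""

with psycopg2.connect(READER_DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES_SQL)
        for calls, total_ms, mean_ms, query in cur.fetchall():
            print(f"{calls:>8} calls  {total_ms:>12.1f} ms total  "
                  f"{mean_ms:>10.2f} ms avg  {query}")
```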
Scaling up the database cluster resolved the issue, and manually retrying the failed jobs ensured that no data was lost.
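The exact scaling and re-queueing commands are not part of the report. The sketch below is a rough illustration of both remediation steps, assuming an RDS/Aurora-style cluster and an SQS dead-letter queue for failed jobs; the instance identifier, instance class, and queue URLs are hypothetical, and the actual queue technology is not named in the report.

```python
# Rough sketch of the two remediation steps, with hypothetical identifiers.
import boto3

rds = boto3.client("rds")
sqs = boto3.client("sqs")

# 1. Scale up the reader instance that was saturating its CPU.
#    'ce-db-reader-1' and the target instance class are placeholders.
rds.modify_db_instance(
    DBInstanceIdentifier="ce-db-reader-1",
    DBInstanceClass="db.r6g.2xlarge",
    ApplyImmediately=True,  # resize now instead of waiting for the maintenance window
)

# 2. Re-push jobs that exhausted their retries back onto the work queue.
#    Assumes failed jobs land on an SQS dead-letter queue, purely for illustration.
DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ce-jobs-dlq"     # hypothetical
WORK_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ce-jobs"  # hypothetical

while True:
    messages = sqs.receive_message(
        QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
    ).get("Messages", [])
    if not messages:
        break
    for msg in messages:
        # Re-enqueue the original job body, then remove it from the DLQ.
        sqs.send_message(QueueUrl=WORK_QUEUE_URL, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Re-driving from a dead-letter queue avoids reconstructing job payloads by hand, but whether that matches the actual queue setup used during the incident is not stated in the report.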
External
Internal