Issue description
Requests to the clinical engine failed under high load. The load was caused by an upgrade of the Postgres database, which had worked well on test and acceptance but changed backend performance on production.
Timeline of events
Date Time | Description |
---|---|
05/02/2024 03:16 | Postgres upgrade on the database cluster |
05/02/2024 08:49 | Investigation started because of AWS write-to-disk alarms |
05/02/2024 10:00 | Service unavailable; caregivers also experienced issues |
05/02/2024 10:22 | Issue identified as load on the database cluster |
05/02/2024 10:45 | Scaled up the database cluster to deal with the increased load |
05/02/2024 11:12 | CE services restarted to try to minimize load |
05/02/2024 11:27 | Errors resolved, system stable again |
05/02/2024 12:24 | Jobs that had failed on retry were manually pushed back onto the queue to be retried |
Lead time: 0.5h
Work-around time: N/A
Correction time: 2.5h
SLA was met.
Impact
Impact is considered high in retrospect, although there was negligible risk to patient safety since the issue was resolved quickly. No data was lost. Measurements were manually sent into the clinical engine once the issue was resolved. There was significant inconvenience to customers, though.
Workaround
N/A
Cause
CPU utilization on the reader database was too high after the Postgres upgrade.
Solution
Scaling up the database cluster resolved the issue. Manually retrying the failed jobs made sure that no data was lost.
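The manual retry step can be illustrated with a minimal sketch. It assumes an in-memory queue and a hypothetical `Job` record; the real queue, job format, and retry policy are not described in this report, so every name below is illustrative only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; the real job schema is not part of this report.
    job_id: str
    payload: dict
    attempts: int = 0


def requeue_failed(failed_jobs, queue, already_queued_ids):
    """Push each failed job back onto the queue once, skipping duplicates by id."""
    requeued = []
    for job in failed_jobs:
        if job.job_id in already_queued_ids:
            continue  # avoid sending the same measurement twice
        job.attempts = 0  # reset so the normal retry policy applies again
        queue.append(job)
        already_queued_ids.add(job.job_id)
        requeued.append(job.job_id)
    return requeued


if __name__ == "__main__":
    queue, seen = deque(), set()
    failed = [Job("m-1", {"value": 72}, attempts=3), Job("m-2", {"value": 80}, attempts=3)]
    print(requeue_failed(failed, queue, seen))
```

Tracking the ids that were already pushed back keeps a repeated manual run from producing duplicate measurements.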
Communication and documentation
External
- InStatus updates
- Manually answered all incoming support requests
Internal
- Regular communication channels worked well.
- A post mortem was written. You are reading it.
Improvements
- Set up more conservative load monitoring alarms (CAPA-1)
- Add DB CPU utilization to dashboard (CAPA-3)
- Enable Performance Insights in Terraform (CAPA-3)
- Increase DB instance size - vertical scaling (CAPA-2)
- Automated DB scaling (CAPA-4)
- Plan for performance testing for complex new features (CAPA-3)
- Rate limit per endpoint in the Clinical Engine (CAPA-3)
- Enable Sentry profiling for Clinical Engine (CAPA-4)
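
As an illustration of the per-endpoint rate limit item, the following is a minimal token-bucket sketch. It is not the Clinical Engine implementation: the endpoint path, the rates, and the class names are assumptions chosen for the example, and a production limiter would normally live in shared storage or at the gateway rather than in process memory.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EndpointRateLimiter:
    """Keeps one bucket per endpoint; `limits` maps endpoint -> (rate, capacity)."""

    def __init__(self, limits, clock=time.monotonic):
        self._limits = limits
        self._clock = clock
        self._buckets = {}

    def allow(self, endpoint):
        if endpoint not in self._limits:
            return True  # endpoints without a configured limit are not throttled here
        now = self._clock()
        bucket = self._buckets.get(endpoint)
        if bucket is None:
            rate, capacity = self._limits[endpoint]
            bucket = self._buckets[endpoint] = TokenBucket(rate, capacity, now)
        return bucket.allow(now)


if __name__ == "__main__":
    # Hypothetical endpoint and limit: 5 requests/second with a burst of 10.
    limiter = EndpointRateLimiter({"/measurements": (5.0, 10)})
    print([limiter.allow("/measurements") for _ in range(12)])
```

Separate buckets per endpoint mean that a burst on one expensive endpoint is rejected early instead of adding load for every other request.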