Post-Mortems on production issues

Prio 3: Alerts dashboard crashing for all caregivers

Written by Daan Klomp | 15-May-2024 8:05:42

Issue description

Requests to the Clinical Engine were failing due to high load on the database cluster. The load was caused by a Postgres upgrade, which had run without problems on test and acceptance but changed backend performance on production.

Timeline of events

Date        Time   Description
05/02/2024  03:16  Postgres upgrade on the database cluster
05/02/2024  08:49  Investigation started because of AWS write-to-disk alarms
05/02/2024  10:00  Service unavailable; caregivers also start experiencing issues
05/02/2024  10:22  Issue identified as load on the database cluster
05/02/2024  10:45  Database cluster scaled up to deal with the increased load
05/02/2024  11:12  CE services restarted to try to minimize load
05/02/2024  11:27  Errors resolved, system stable again
05/02/2024  12:24  Jobs that had failed on retry manually re-pushed to the queue to be retried

Lead time: 0.5h

Work-around time: N/A

Correction time: 2.5h

SLA was met.

Impact

In retrospect, the impact is considered low: the issue was resolved quickly, so the risk to patient safety was negligible, and no data was lost. Measurements were manually sent to the Clinical Engine once the issue was resolved. Customers did, however, experience significant inconvenience.

Workaround

N/A

Cause

The reader database ran at excessive CPU utilization after the database upgrade.
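
To illustrate the kind of signal involved, the following is a minimal sketch of a sustained-CPU check that alarms when enough recent samples exceed a threshold. It is not the team's actual alarm configuration: the class name, the 70% threshold and the "3 out of 5" window are assumptions chosen for illustration, loosely modelled on "M out of N datapoints" alarm evaluation.

```python
from collections import deque


class SustainedThresholdAlarm:
    """Alarm when at least `datapoints_to_alarm` of the last
    `evaluation_periods` samples are above `threshold`.
    All names and values here are hypothetical."""

    def __init__(self, threshold: float, evaluation_periods: int,
                 datapoints_to_alarm: int) -> None:
        if not 1 <= datapoints_to_alarm <= evaluation_periods:
            raise ValueError("datapoints_to_alarm must be in 1..evaluation_periods")
        self.threshold = threshold
        self.datapoints_to_alarm = datapoints_to_alarm
        # Only the most recent `evaluation_periods` samples are kept.
        self.window = deque(maxlen=evaluation_periods)

    def observe(self, cpu_percent: float) -> bool:
        """Record one sample and return True if the alarm is firing."""
        self.window.append(cpu_percent)
        breaches = sum(1 for value in self.window if value > self.threshold)
        return breaches >= self.datapoints_to_alarm


# Assumed example values: alarm when 3 of the last 5 samples exceed 70% CPU.
alarm = SustainedThresholdAlarm(threshold=70.0, evaluation_periods=5,
                                datapoints_to_alarm=3)
states = [alarm.observe(sample) for sample in [40.0, 75.0, 80.0, 60.0, 90.0]]
```

Lowering the threshold or the number of breaching datapoints makes the alarm fire earlier at the cost of more false positives, which is the trade-off to weigh when tuning such an alarm.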

Solution

Scaling up the database cluster resolved the issue. Manually retrying the failed jobs made sure that no data was lost.
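
The manual retry step can be pictured with a small sketch. This is not the tooling that was used during the incident; `FailedJob`, `repush_failed_jobs` and the in-memory queue are hypothetical stand-ins. The point it shows is that re-pushing should skip jobs that were already re-queued, so that running the script twice does not enqueue a job more than once.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class FailedJob:
    """Hypothetical representation of a job that exhausted its retries."""
    job_id: str
    payload: dict = field(default_factory=dict)


def repush_failed_jobs(failed_jobs, queue: Queue, already_requeued: set) -> list:
    """Put each failed job back on the queue once; return the IDs re-queued."""
    requeued = []
    for job in failed_jobs:
        if job.job_id in already_requeued:
            continue  # skip jobs that were re-queued earlier
        queue.put(job)
        already_requeued.add(job.job_id)
        requeued.append(job.job_id)
    return requeued


# Assumed example data: one job ID appears twice in the failed list.
work_queue: Queue = Queue()
seen: set = set()
jobs = [FailedJob("m-1"), FailedJob("m-2"), FailedJob("m-1")]
first_run = repush_failed_jobs(jobs, work_queue, seen)
second_run = repush_failed_jobs(jobs, work_queue, seen)
```

In a real system the set of already re-queued IDs would have to live in durable storage rather than in memory.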

Communication and documentation

External

  • InStatus updates
  • Manually answered all incoming support requests

Internal

  • Regular communication channels worked well.
  • A post mortem was written. You are reading it.

Improvements

  • Set up more conservative load monitoring alarms (CAPA-1)
  • Add DB CPU utilization to dashboard (CAPA-3)
  • Enable Performance Insights in Terraform (CAPA-3)
  • Increase DB instance size - vertical scaling (CAPA-2)
  • Automated DB scaling (CAPA-4)
  • Plan for performance testing for complex new features (CAPA-3)
  • Rate limit per endpoint in the Clinical Engine (CAPA-3)
  • Enable Sentry profiling for Clinical Engine (CAPA-4)
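
As an illustration of the per-endpoint rate limiting item, the sketch below uses a token bucket per endpoint. It is one possible approach, not the design chosen for the Clinical Engine; the endpoint path, rates and class names are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EndpointRateLimiter:
    """Keeps one bucket per configured endpoint; unconfigured endpoints are not limited."""

    def __init__(self, limits: dict, clock=time.monotonic) -> None:
        # `limits` maps endpoint path -> (tokens per second, burst capacity).
        self._buckets = {
            endpoint: TokenBucket(rate, capacity, clock)
            for endpoint, (rate, capacity) in limits.items()
        }

    def allow(self, endpoint: str) -> bool:
        bucket = self._buckets.get(endpoint)
        return True if bucket is None else bucket.allow()


# Hypothetical endpoint and limits: 2 requests per second, burst of 2.
limiter = EndpointRateLimiter({"/measurements": (2.0, 2)})
decisions = [limiter.allow("/measurements") for _ in range(3)]
```

A request handler would call `limiter.allow(path)` before doing any database work and answer with HTTP 429 when it returns False, so that a single expensive endpoint cannot saturate the database on its own.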