Post-Mortems on production issues

Prio 1: Vitals API Down Incidents

Written by Daan Klomp | 15 May 2024 7:57:17

Production Issue Summary

There were two related incidents in which the Luscii platform went down temporarily (for 30 minutes and for 20 minutes) for all users. Both were caused by a database migration that added a foreign key to the measurements table, which has a very large number of rows. In the first incident the platform was restored by rolling back; in the second it recovered by itself. No alerts were missed. Users may have received errors when sending in measurements, and they could send the measurements in again once the platform was restored.

Timeline of Events

Incident #1 April 10, 2024

Date and time What happened
13:57 April 10 2024 Release of the application code with the problematic migration script
14:15 First reports of people not being able to log in
14:18 First line escalated to 3rd line / PIC
14:22 PIC nudges 3rd line
14:23 3rd line confirms that the problem has been seen and starts investigating
14:31 Rollback to previous stable version started
14:38 Rollback completed, waiting for containers to start
14:45 Vitals API back online

Lead time: 15 min

Work around time: N/A

Correction time: 15 min

 

Incident #2 April 24, 2024

Date and time What happened
10:51, April 24, 2024 Ran problematic SQL query
10:53 Support including third line is informed about the issue
11:05 Instatus update posted
11:09 Servers seem stable again
11:13 Cause found to be a query run by one of the developers

Lead time: 0 min

Work around time: N/A

Correction time: 22 min
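
The intervals between timeline entries can be recomputed from the timestamps above. The following is a minimal sketch, not part of the incident tooling; the helper name `minutes_between` and the dictionary keys are hypothetical. How lead time and correction time are defined is not stated here, so the sketch only computes raw intervals between entries.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps copied from the two timelines above (key names are illustrative).
incident_1 = {"release": "13:57", "first_reports": "14:15",
              "rollback_started": "14:31", "back_online": "14:45"}
incident_2 = {"query_run": "10:51", "support_informed": "10:53",
              "stable_again": "11:09", "cause_found": "11:13"}

# First reports until the Vitals API was back online.
print(minutes_between(incident_1["first_reports"], incident_1["back_online"]))  # 30
# Query run until the servers seemed stable again.
print(minutes_between(incident_2["query_run"], incident_2["stable_again"]))     # 18
```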

Impact

The entire Luscii platform was down for patients and clinicians, first for 30 minutes and later for 20 minutes.

Classification: High

Workaround

N/A

Cause

Deploying a database migration (adding a foreign key column to the measurements table) locked the measurements table, which caused the containers of the Vitals API application to fail. This in turn made the entire platform unavailable to users.

When the migration was retried manually, it did not break the containers of the Vitals API application, but it did lock the measurements table for 20 minutes, which again made the platform unavailable to users.
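
A migration of this kind could in principle be flagged before release by a simple check on the migration SQL. The following is a minimal sketch under stated assumptions: the function `risky_statements`, the `LARGE_TABLES` set and the example statements are hypothetical and are not taken from the Vitals API code base. It only looks for `ALTER TABLE` statements on tables known to be large and does not analyse actual locking behaviour.

```python
import re

# Hypothetical list of tables considered too large to alter during a normal release.
LARGE_TABLES = {"measurements"}

# Matches "ALTER TABLE <name>" and captures the table name
# (optionally schema-qualified, optionally quoted).
ALTER_TABLE = re.compile(
    r'\bALTER\s+TABLE\s+(?:IF\s+EXISTS\s+)?[`"]?(?:\w+\.)?(\w+)[`"]?',
    re.IGNORECASE,
)

def risky_statements(migration_sql: str, large_tables=LARGE_TABLES) -> list[str]:
    """Return the statements in a migration that alter one of the large tables."""
    flagged = []
    # Naive split on ';' which is good enough for a sketch,
    # but not for SQL containing ';' inside string literals.
    for statement in migration_sql.split(";"):
        match = ALTER_TABLE.search(statement)
        if match and match.group(1).lower() in large_tables:
            flagged.append(statement.strip())
    return flagged

# Illustrative migration; the column and index names are made up.
example = """
ALTER TABLE measurements ADD COLUMN device_id INT;
CREATE INDEX idx_users_email ON users (email);
"""
print(risky_statements(example))
```

A flagged statement would not need to block the release; it could simply route the migration into the process for complex database changes.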

Solution

The previous version of the Vitals API application code, which did not contain the problematic migration, was redeployed, and the migration was manually marked as succeeded so that it would not be executed again.

After the manual retry, the platform became available again on its own once the update of the measurements table had finished, about 20 minutes later.

Communication and documentation

External

  • Instatus updates were used for both incidents.
  • Users who contacted us directly were answered privately.
  • A post mortem was written. You are reading it.

Internal

  • Internal communication through Slack was done as usual.

Improvements

  • Remove database write access from developers and clarify the process for making database changes (CAPA-3)
  • Communicate complex database migrations to customers upfront via Instatus (DONE)
  • Update the process so that complex database migrations are not executed during working hours (CAPA-3)
  • Investigate why the old containers were stopped and whether we can prevent this in a similar scenario in the future (CAPA-4)
  • Investigate cost of acceptance environment similar to production (CAPA-2)
  • Show the Instatus widget on the login page, especially when there are errors while logging in (CAPA-3)
  • Document how infra monitoring is done so new developers have access to monitoring results (CAPA-3)
  • Compile a list of materials all new developers should learn when onboarding (CAPA-3)
  • Clean up alarms in AWS (CAPA-4)
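
The rule that complex database migrations are not executed during working hours could be enforced with a small guard in the release tooling. The following is a minimal sketch under stated assumptions: the function `may_run_migration` is hypothetical, and the working-hours window (Monday to Friday, 08:00 to 18:00) is an assumption, since the actual window is not stated in this post-mortem.

```python
from datetime import datetime, time

# Assumed working hours; the real window is not stated in this post-mortem.
WORK_START = time(8, 0)
WORK_END = time(18, 0)

def may_run_migration(now: datetime, is_complex: bool) -> bool:
    """Allow simple migrations at any time; allow complex ones only outside working hours."""
    if not is_complex:
        return True
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_working_hours = is_weekday and WORK_START <= now.time() < WORK_END
    return not in_working_hours

# The release time of incident #1 falls inside the assumed window.
print(may_run_migration(datetime(2024, 4, 10, 13, 57), is_complex=True))  # False
```

Whether a migration counts as complex would still be a human decision made in the change process; the guard only checks the clock.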