There have been two related incidents in which the Luscii platform went down temporarily (for 30 minutes and for 20 minutes) for all users because of a database migration that added a foreign key to the measurements table, which has a large number of rows. In the first incident the platform was restored by rolling back; in the second it recovered by itself. No alerts were missed. Users might have received errors when sending in measurements, and they could send in the measurements again once the platform was restored.
Incident #1 April 10, 2024
Date and time | What happened |
---|---|
13:57, April 10, 2024 | Release of the application code with the problematic migration script |
14:15 | First reports of people not being able to log in |
14:18 | First line escalated to 3rd line / PIC |
14:22 | PIC nudges 3rd line |
14:23 | 3rd line confirms that the problem has been seen and starts investigating |
14:31 | Rollback to previous stable version started |
14:38 | Rollback completed, waiting for containers to start |
14:45 | Vitals API back online |
Lead time: 15 min
Work around time: N/A
Correction time: 15 min
Incident #2 April 24, 2024
Date and time | What happened |
---|---|
10:51, April 24, 2024 | Ran problematic SQL query |
10:53 | Support, including third line, is informed about the issue |
11:05 | Instatus update posted |
11:09 | Servers seem stable again |
11:13 | Cause found to be a query run by one of the developers |
Lead time: 0 min
Work around time: N/A
Correction time: 22 min
The entire Luscii platform was down for patients and clinicians, first for 30 minutes, and later for 20 minutes.
Classification: High
N/A
Deploying a database migration that added a foreign key column to the measurements table locked that table, which caused the containers of the Vitals API application to fail. This in turn made the entire platform unavailable to users.
When the migration was retried manually, the containers of the Vitals API application did not fail, but the measurements table was locked for 20 minutes, which again made the platform unavailable to users.
Redeployed the previous version of the Vitals API application code, which did not contain the problematic migration, and manually marked the migration as succeeded so that it would not be executed again.
After the manual retry, the platform became available again automatically once the update of the measurements table had completed, after 20 minutes.
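As an illustration of how the table lock described above might be shortened in a future migration, the sketch below builds a sequence of SQL statements that adds a foreign key in several short steps. It assumes a PostgreSQL database, which this report does not confirm. The function name `low_lock_fk_migration`, the column `device_id`, and the referenced table `devices` are hypothetical and chosen only for the example. This is a minimal sketch, not the migration that was actually run.

```python
from typing import List


def low_lock_fk_migration(
    table: str,
    column: str,
    ref_table: str,
    ref_column: str = "id",
    lock_timeout: str = "5s",
) -> List[str]:
    """Return SQL statements that add a foreign key in short-lock steps.

    Assumption: PostgreSQL syntax. All table and column names are hypothetical.
    """
    constraint = f"fk_{table}_{column}"
    return [
        # Fail fast instead of queueing behind (and blocking) other sessions.
        f"SET lock_timeout = '{lock_timeout}';",
        # A nullable column without a default does not rewrite the table.
        f"ALTER TABLE {table} ADD COLUMN {column} bigint;",
        # NOT VALID skips the scan of existing rows when the constraint is created.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} ({ref_column}) NOT VALID;",
        # Validation scans existing rows under a weaker lock that still allows
        # reads and writes on the table.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint};",
    ]


if __name__ == "__main__":
    # Hypothetical names, used only to show the generated statements.
    for statement in low_lock_fk_migration("measurements", "device_id", "devices"):
        print(statement)
```

Under these assumptions, each statement would be run in its own transaction so that no lock is held longer than a single step needs, and a step that hits the lock timeout could be retried at a quieter moment.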
External
Internal