Production Issue Summary
Two related incidents took the Luscii platform down for all users, for about 30 minutes and about 20 minutes respectively. Both were caused by a database migration that added a foreign key to the measurements table, which holds a large number of rows. In the first incident the platform was restored by rolling back; in the second it recovered by itself. No alerts were missed. Users may have received errors when sending in measurements; they could send them in again once the platform was restored.
Timeline of Events
Incident #1 April 10, 2024
Date and time | What happened |
---|---|
13:57, April 10, 2024 | Release of the application code with the problematic migration script |
14:15 | First reports of people not being able to log in |
14:18 | First line escalated to 3rd line / PIC |
14:22 | PIC nudges 3rd line |
14:23 | 3rd line confirms that the problem has been seen and starts investigating |
14:31 | Rollback to previous stable version started |
14:38 | Rollback completed, waiting for containers to start |
14:45 | Vitals API back online |
Lead time: 15 min
Workaround time: N/A
Correction time: 15 min
Incident #2 April 24, 2024
Date and time | What happened |
---|---|
10:51, April 24, 2024 | Ran problematic SQL query |
10:53 | Support, including third line, is informed about the issue |
11:05 | Instatus update posted |
11:09 | Servers seem stable again |
11:13 | Cause found to be a query run by one of the developers |
Lead time: 0 min
Workaround time: N/A
Correction time: 22 min
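The outage durations quoted in this report can be cross-checked against the two timelines above. The following is a minimal sketch, not an official metric definition. It assumes that each outage runs from the first sign of trouble (first login reports for incident #1, the problematic query for incident #2) until the platform was reported back online or stable. The helper name `minutes_between` is hypothetical.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Incident #1 (assumed window): first login reports until the Vitals API was back online.
incident_1 = minutes_between("14:15", "14:45")

# Incident #2 (assumed window): problematic query run until servers seemed stable again.
incident_2 = minutes_between("10:51", "11:09")

print(incident_1, incident_2)  # prints "30 18"
```

Under these assumptions the first outage lasted 30 minutes and the second 18 minutes, which matches the rounded figures of 30 and 20 minutes used elsewhere in this report.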
Impact
The entire Luscii platform was down for patients and clinicians, first for 30 minutes and later for 20 minutes.
Classification: High
Workaround
N/A
Cause
Deploying a database migration (adding a foreign key column to the measurements table) locked the measurements table, which caused the containers of the Vitals API application to fail. This in turn made the entire platform unavailable to users.
When the migration was retried manually, it did not break the containers of the Vitals API application, but it did lock the measurements table for 20 minutes, again making the platform unavailable to users.
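One common way to shorten the lock held by this kind of migration is to split it into smaller steps. The sketch below is illustrative only and is not the migration that was actually run. It assumes a PostgreSQL database (this report does not state the engine), and the column `device_id`, the referenced table `devices`, and the helper name `staged_foreign_key_migration` are all hypothetical. It only generates the SQL statements; it does not execute them.

```python
from typing import List


def staged_foreign_key_migration(
    table: str,
    column: str,
    ref_table: str,
    ref_column: str = "id",
    lock_timeout: str = "5s",
) -> List[str]:
    """Return PostgreSQL statements (assumed engine) that add a foreign key in stages."""
    constraint = f"fk_{table}_{column}"
    return [
        # Fail fast if a lock cannot be acquired, instead of queueing behind other traffic.
        f"SET lock_timeout = '{lock_timeout}';",
        # A nullable column without a default does not rewrite the table.
        f"ALTER TABLE {table} ADD COLUMN {column} bigint;",
        # NOT VALID skips the check of existing rows; new writes are still checked.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} ({ref_column}) NOT VALID;",
        # Validation scans existing rows under a weaker lock, so normal
        # reads and writes on the table can continue while it runs.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint};",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in staged_foreign_key_migration("measurements", "device_id", "devices"):
        print(stmt)
```

Whether this approach applies depends on the database engine and version actually in use, so it should be tested on an environment with production-like data volumes first.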
Solution
For the first incident, the previous version of the Vitals API application code, which did not contain the problematic migration, was redeployed, and the migration was manually marked as succeeded so that it would not be executed again.
After the manual retry, the platform became available again by itself once the update of the measurements table had completed, after 20 minutes.
Communication and documentation
External
- Instatus updates were posted for both incidents.
- Users who contacted us directly were answered privately.
- A post mortem was written. You are reading it.
Internal
- Internal communication took place through Slack as usual.
Improvements
- Remove database write access from developers and clarify the process for making database changes (CAPA-3)
- Complex database migrations are communicated to customers up front via Instatus (DONE)
- Update the process so that complex database migrations are not executed during working hours (CAPA-3)
- Investigate why the old containers were stopped and whether we can prevent this in a similar scenario in the future (CAPA-4)
- Investigate the cost of an acceptance environment similar to production (CAPA-2)
- Show the Instatus widget on the login page, especially when errors occur while trying to log in (CAPA-3)
- Document how infra monitoring is done so new developers have access to monitoring results (CAPA-3)
- Compile a list of materials all new developers should study during onboarding (CAPA-3)
- Clean up alarms in AWS (CAPA-4)