A bug in the creation of planned actions caused a severe performance problem, which delayed the syncing of alerts from the Clinical Engine to the Vitals API.
Each action planned by a workflow was duplicated, creating two actions instead of one. When there was a workflow on an overdue alert (which is common), the planned actions for the next day doubled. This caused a rapid increase in planned actions, leading to delays in the alerts queue over a few days. Some patients (mostly demo patients) had hundreds of actions per day. No alerts were lost, but they were delayed.
Date and time | What happened |
---|---|
May 21, 2024 9:18 AM | Development team notices the increase in load on the servers and investigation starts. |
May 21, 2024 11:30 AM | Pinpointed the date on which the issues started to increase significantly. |
May 21, 2024 4:31 PM | After several fixes and changes, a possible culprit was found: users with extreme numbers of planned actions, mostly test patients. Manual cleanup of these users was started. |
May 21, 2024 4:59 PM | No new timeout errors were noticed and load seemed to stabilize. Alerts/measurements etc. in the queue were all processed correctly. |
May 22, 2024 9:24 PM | A developer investigated ways to reduce the number of duplicate planned actions that had already been created. |
May 22, 2024 11:31 PM | After noticing that the extreme number of overdue alerts caused workflows to plan the same number of actions for the next day, we limited the maximum number of triggers for workflow actions that plan an action to max=10. |
May 23, 2024 6:12 PM | A release was deployed that removed the changes around workflows, and a recalculation was triggered. |
May 23, 2024 6:42 PM | Not all actions were correctly deduplicated, so work started on a solution to deduplicate the remaining users with problematic data. |
May 24, 2024 12:12 AM | After the frequencies were deduplicated, a developer removed the duplicate planned actions for the coming week. |
Lead time: 12 minutes
Workaround time: 7 hours
Correction time: 68 hours
SLA was met
We did not lose any data, but alerts were shown with a delay, which posed a low patient safety risk. We estimate that on May 21, alerts appeared in the nurse dashboard with an average delay of a couple of hours. Nurses could see that alerts had been generated, but they could not see the details of the specific alerts.
Patients could do measurements normally. Some patients saw a (much) longer list of planned actions, which might have caused some inconvenience.
Classification: Critical
We increased the number of queue workers and closed demo patients with many planned actions. This greatly reduced delays during the day. Only night-time delays remained.
On May 14, we did a release that resulted in actions planned through workflows not showing up in the nurse's calendar. To fix this, we created a hotfix release a day later, and that hotfix introduced the bug that created duplicate planned actions. The duplicated planned actions triggered workflows that in turn created new duplicate planned actions for the next day, leading to a cycle of rapidly increasing planned actions for patients. After a couple of days, some patients had up to 1200 planned actions per day.
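To illustrate why this cycle escalated so quickly, the following is a minimal sketch, not the production logic. It assumes the worst case in which every planned action becomes overdue and every overdue alert triggers a workflow that plans one action for the next day. The function name, its parameters, and the starting values are hypothetical and are not incident data. The optional `max_triggers` argument models one possible reading of the max=10 limit from the timeline.

```python
from typing import List, Optional


def simulate_planned_actions(
    initial: int,
    days: int,
    copies_per_trigger: int = 2,
    max_triggers: Optional[int] = None,
) -> List[int]:
    """Return the number of planned actions per day, starting with day 0.

    Assumption: every planned action goes overdue and triggers a workflow
    that plans `copies_per_trigger` actions for the next day
    (1 = intended behaviour, 2 = the duplication bug).
    """
    counts = [initial]
    for _ in range(days):
        triggers = counts[-1]
        if max_triggers is not None:
            # Hypothetical model of the cap: limit workflow triggers per day.
            triggers = min(triggers, max_triggers)
        counts.append(triggers * copies_per_trigger)
    return counts


if __name__ == "__main__":
    # With the bug: the count doubles every day.
    print(simulate_planned_actions(initial=5, days=8))
    # With the bug and a cap of 10 triggers: the count levels off.
    print(simulate_planned_actions(initial=5, days=8, max_triggers=10))
    # Intended behaviour: the count stays constant.
    print(simulate_planned_actions(initial=5, days=8, copies_per_trigger=1))
```

Under these assumptions the uncapped count grows exponentially, while a cap on triggers bounds the daily count even when the duplication bug is still present.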
When alerts generated by the Clinical Engine are synced to the Vitals API, a recalculation of the planned actions is triggered. Because of the large number of planned actions, this recalculation delayed the syncing of alerts. Eventually, all alerts were synced, but this could take up to 12 hours at the peak of the load.
The original change was not reviewed or tested thoroughly enough. Contributing factors may have been pressure to release at the end of the cycle and the tester also being busy with work for another team.
The hotfix change was tested and the duplication issue was found, but it was deemed harmless enough to ship in order to fix the incomplete nurse's calendar.
Part one of the solution was to roll back the release and remove the generation of duplicate planned actions from the code, so that the situation would not deteriorate further.
Part two of the solution was to identify the incorrectly generated planned actions and frequencies and delete the duplicates. We were able to remove all duplicates for the current week, and we are still deduplicating data scheduled further in the future. We also still need to run the deduplication for the recent past.
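The deduplication in part two can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that was actually run. The `PlannedAction` fields and the rule that two actions are duplicates when they share the same patient, action type, and scheduled time are hypothetical. The real criteria may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PlannedAction:
    # Hypothetical fields, chosen for illustration only.
    id: int
    patient_id: int
    action_type: str
    scheduled_at: datetime


def split_duplicates(
    actions: Iterable[PlannedAction],
) -> Tuple[List[PlannedAction], List[PlannedAction]]:
    """Split actions into (keep, delete).

    For each (patient, action type, scheduled time) combination, the
    action with the lowest id is kept and the rest are marked for deletion.
    """
    seen = set()
    keep: List[PlannedAction] = []
    delete: List[PlannedAction] = []
    for action in sorted(actions, key=lambda a: a.id):
        key = (action.patient_id, action.action_type, action.scheduled_at)
        if key in seen:
            delete.append(action)
        else:
            seen.add(key)
            keep.append(action)
    return keep, delete


if __name__ == "__main__":
    when = datetime(2024, 5, 24, 9, 0)
    sample = [
        PlannedAction(1, 42, "blood_pressure", when),
        PlannedAction(2, 42, "blood_pressure", when),  # duplicate of id 1
        PlannedAction(3, 42, "weight", when),
    ]
    keep, delete = split_duplicates(sample)
    print([a.id for a in keep], [a.id for a in delete])
```

Returning the deletion candidates separately, instead of deleting in place, would allow the list to be reviewed before anything is removed.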
External
Internal