Prio 2: Alerts did not sync to Vitals

Author: Daan Klomp
May 27, 2024

Production Issue Summary

A bug in the creation of planned actions caused severe performance degradation, delaying the sync of alerts from the Clinical Engine to the Vitals API.


Each action planned by a workflow was duplicated, creating two actions instead of one. When a workflow ran on an overdue alert (which is common), the number of planned actions for the next day doubled. This caused a rapid increase in planned actions and, over a few days, growing delays in the alerts queue. Some patients (mostly demo patients) ended up with hundreds of actions per day. No alerts were lost, but they were delayed.
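To illustrate how quickly the duplication compounds, here is a simplified model with an assumed starting number, not actual production data:

```python
# Simplified model of the feedback loop (assumed starting number, not
# actual production data): each overdue action triggers a workflow that
# plans it again for the next day, and the bug creates two copies
# instead of one, so unfinished actions roughly double every day.
planned = 5  # hypothetical patient starting with 5 planned actions
for day in range(1, 9):
    planned *= 2  # the bug duplicates every workflow-planned action
    print(f"day {day}: {planned} planned actions")
# After day 8: 1280 actions, the same order of magnitude as the
# ~1200 actions per day observed for some patients (see Cause below).
```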


Timeline of Events

Date and time: What happened
May 21, 2024, 9:18 AM: The development team noticed the increased load on the servers and started investigating.
May 21, 2024, 11:30 AM: Pinpointed the date on which the issues had started to increase significantly.
May 21, 2024, 4:31 PM: After various fixes and changes, a possible culprit was found: users with extreme numbers of planned actions, mostly test patients. Manual cleanup of these users was started.
May 21, 2024, 4:59 PM: No new timeout errors were observed and the load seemed to stabilize. Alerts, measurements, and other items in the queue were all processed correctly.
May 22, 2024, 9:24 PM: A developer investigated ways to reduce the number of already created duplicate planned actions.
May 22, 2024, 11:31 PM: After noticing that the extreme number of overdue actions caused workflows to plan the same number of new actions for the next day, we limited workflow actions that plan actions to a maximum of 10 triggers (max=10).
May 23, 2024, 6:12 PM: A release was deployed that removed the workflow changes, and a recalculation was triggered.
May 23, 2024, 6:42 PM: Not all actions were correctly deduplicated, so work started on a solution to deduplicate the remaining users with problematic data.
May 24, 2024, 12:12 AM: After deduplicating frequencies, a developer removed duplicate planned actions for the coming week.


Lead time: 12 minutes

Workaround time: 7 hours

Correction time: 68 hours

SLA was met


Impact

We did not lose any data, but alerts were shown with a delay, which posed a low patient safety risk. We estimate that on May 21st, alerts appeared in the nurse dashboard with an average delay of a couple of hours. Nurses could see that alerts had been generated, but they could not see the details of the specific alerts.


Patients could take measurements normally. Some patients saw a (much) longer list of planned actions, which may have caused some inconvenience.

Classification: Critical


Workaround

We increased the number of queue workers and closed demo patients that had many planned actions. This greatly reduced the delays during the day; only night-time delays remained.


Cause

On May 14th, we did a release that caused planned actions created through workflows to stop showing up in the nurse's calendar. To fix this, we created a hotfix release a day later, and that hotfix introduced the bug that created duplicate planned actions. The duplicated planned actions triggered workflows that in turn planned new duplicate actions for the next day, creating a cycle of rapidly growing numbers of planned actions per patient. Some patients had up to 1,200 planned actions per day after a couple of days.


When alerts generated by the Clinical Engine are synced to the Vitals API, a recalculation of the planned actions is triggered. Because of the enormous number of planned actions, this recalculation delayed the alert sync. Eventually, all alerts were synced, but at the peak of the load this could take up to 12 hours.
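One of the improvements listed below is to make this recalculation asynchronous, decoupling it from the sync path. A minimal sketch of that idea, using hypothetical names (sync_alert, recalc_queue) rather than our actual code:

```python
import queue
import threading

# Hypothetical sketch of decoupling planned-action recalculation from
# the alert sync path; names are illustrative, not our actual code.
recalc_queue: "queue.Queue[str]" = queue.Queue()

def sync_alert(alert_id: str, patient_id: str) -> None:
    # Sync the alert to the Vitals API first, so alert visibility
    # never waits on the expensive recalculation.
    print(f"synced alert {alert_id}")
    # Enqueue the recalculation instead of running it inline.
    recalc_queue.put(patient_id)

def recalc_worker() -> None:
    while True:
        patient_id = recalc_queue.get()
        # Recalculate planned actions for this patient in the background.
        print(f"recalculated planned actions for patient {patient_id}")
        recalc_queue.task_done()

threading.Thread(target=recalc_worker, daemon=True).start()
sync_alert("alert-1", "patient-42")
recalc_queue.join()  # wait until the background recalculation finishes
```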


The original change was not reviewed or tested thoroughly enough. Contributing factors may have been pressure to release at the end of the cycle and the tester also being busy with another team.


The hotfix change was tested, and the duplication issue was found during testing but deemed harmless enough to ship in order to fix the incomplete nurse's calendar.


Solution

Part one of the solution was to roll back the release and remove the generation of duplicate planned actions from the code, so the situation would not deteriorate further.
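As an illustration of guarding against this class of bug, here is a minimal sketch of an idempotent create, assuming a hypothetical uniqueness key of patient, action type, and due date (the actual fix may differ):

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical uniqueness key for a planned action; in production this
# would typically be a unique database constraint rather than a set.
@dataclass(frozen=True)
class PlannedAction:
    patient_id: str
    action_type: str
    due_date: date

_existing: set[PlannedAction] = set()

def plan_action(patient_id: str, action_type: str, due_date: date) -> bool:
    """Create a planned action unless an identical one already exists."""
    action = PlannedAction(patient_id, action_type, due_date)
    if action in _existing:
        return False  # a duplicate request is a no-op (idempotent create)
    _existing.add(action)
    return True

# Planning the same action twice yields exactly one planned action.
assert plan_action("p1", "blood_pressure", date(2024, 5, 22)) is True
assert plan_action("p1", "blood_pressure", date(2024, 5, 22)) is False
```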


Part two of the solution was to identify incorrectly generated planned actions and frequencies and delete them. We were able to remove everything for the current week; deduplication for dates further in the future is still in progress, and we still need to run deduplication for the recent past.
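A minimal sketch of the deduplication logic, assuming planned actions can be grouped by a hypothetical (patient, action type, due date) key; the real data model is more involved:

```python
from collections import defaultdict

# Hypothetical representation: a planned action is a tuple of
# (action_id, patient_id, action_type, due_date).
def find_duplicate_ids(actions):
    """Return ids of all but the oldest action in each duplicate group."""
    groups = defaultdict(list)
    for action_id, patient_id, action_type, due_date in actions:
        groups[(patient_id, action_type, due_date)].append(action_id)
    to_delete = []
    for ids in groups.values():
        ids.sort()                 # keep the lowest (oldest) id
        to_delete.extend(ids[1:])  # everything else is a duplicate
    return to_delete

actions = [
    (1, "p1", "blood_pressure", "2024-05-22"),
    (2, "p1", "blood_pressure", "2024-05-22"),  # duplicate of id 1
    (3, "p2", "weight", "2024-05-22"),
]
assert find_duplicate_ids(actions) == [2]
```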


Communication and documentation

External

  • A post mortem was written. You are reading it.
  • InStatus was used to keep users updated about the issue. The updates also prompted more questions, which we could not answer clearly while we were still investigating.
  • Many users contacted us directly (at least 18); all of them received personal updates.

Internal

  • Slack was used as usual.
  • It was hard for the support team to stay up to date; they could have been given more input throughout the issue.

Improvements

  • Improve the cross-team test process for changes by one team that can cause issues at the boundary of another team (CAPA-2).
  • Flag patients with unusual overdue patterns (e.g., patients who consistently skip part of their actions) (CAPA-3).
  • Limit the maximum number of runs for a workflow by default (CAPA-3); see the sketch after this list.
  • Route critical monitoring triggers to support as well when the issue is very specific and serious (CAPA-4).
  • Generally prioritize rollbacks over hotfixes, and adjust the process accordingly (CAPA-3).
  • Review QA availability for the patient team (CAPA-2).
  • Improve product knowledge among developers so they understand the impact of functionality changes (CAPA-3).
  • The Production Issue Coordinator can create more clarity by simplifying Slack communication (e.g., separate threads) and by keeping updates flowing for longer-running issues.
  • Review kill switches (CAPA-3).
  • Make planned action recalculation for new alerts asynchronous (CAPA-3).
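As an illustration of the default run limit mentioned in the list above, here is a small sketch with hypothetical names (Workflow, max_runs); the actual implementation may differ:

```python
class Workflow:
    # Hypothetical default cap on how often a workflow may plan new
    # actions (mirrors the max=10 limit from the timeline).
    DEFAULT_MAX_RUNS = 10

    def __init__(self, max_runs: int | None = None):
        # The cap applies by default; callers must opt in to change it.
        self.max_runs = max_runs if max_runs is not None else self.DEFAULT_MAX_RUNS
        self.runs = 0

    def trigger(self) -> bool:
        """Plan an action unless the run limit has been reached."""
        if self.runs >= self.max_runs:
            return False  # stop runaway workflows from planning more actions
        self.runs += 1
        return True

wf = Workflow()
results = [wf.trigger() for _ in range(12)]
assert results.count(True) == 10  # capped at DEFAULT_MAX_RUNS
```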
