Post-Mortems on production issues

Prio 1: CE not registering newly activated patients

Written by Daan Klomp | 10 Apr 2024 9:46:05

Production Issue Summary

In the clinical engine (CE), an inefficient query was used when associating active patients with a submitter. As a result, CE returned errors whenever a patient was activated, and the activation was not registered in CE. The patient was therefore active in vitals and could take measurements, but because the patient was not linked to the submitter in CE, those measurements were not generated in the clinical engine. Only patients who logged in for the first time after the issue started were affected.

Timeline of Events

Date | Event
February 2, 2021 | Associate and dissociate operations are slow to the point of causing timeouts
February 24, 2021 | Associate and dissociate operations are slow to the point of causing timeouts, status: resolved (dissociate is resolved)
March 27, 2024 14:00 | Alarms in CE go off; the assumption is that automatic scaling will fix the issue
March 28, 2024 9:57 | Developers conclude that CE is down
March 28, 2024 10:16 | Possible issue being investigated
March 28, 2024 10:32 | Issue found in code; work on a fix starts
March 28, 2024 10:36 | PIC was informed and the process started
March 28, 2024 11:42 | Patients affected up to that point identified
March 28, 2024 11:52 | Instatus update published
March 28, 2024 14:42 | CE needs a release instead of only a hotfix, since the release pipeline was not in place after we stopped using Octopus
March 28, 2024 15:40 | Testing CE release on acceptance
March 28, 2024 17:09 | Release deployed to production
March 28, 2024 18:00 | Retry of all failed associations finished
March 28, 2024 18:07 | Due to some format changes in the release, some retries related to scripted observations failed to run and can't be fixed manually
March 28, 2024 19:17 | 23 patients identified as ultimately affected by the failed retries, so some of their calculations were not done correctly; all other cases have been fixed
March 29, 2024 18:04 | Customers of the 23 affected patients have been informed by email; communication was somewhat delayed because of a lack of information from the production issues communicator


Lead time: 18h

Work around time: N/A

Correction time: 9h

 

Impact

Patients activated between March 27th, 14:00 and March 28th, 17:08 were affected: 40 patients in total, from 23 organizations. All measurements that were sent in were received by the clinical engine with a delay. Because this happened on the patients' first day, and there is normally no strict control over when a patient starts, we do not consider this issue a clinical patient safety risk.

Classification: High

Workaround

N/A

 

Cause

The number of active patients was a limiting factor in the submitter association query. The number of active users had grown too large for this query, given the number of parameters it uses.
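
To illustrate the kind of failure described above, here is a minimal sketch. It is not the actual CE code. It assumes, hypothetically, that the association query bound one parameter per active patient, so that the statement grew with the patient count. The table `patient`, the column `submitter_id`, the function `associate_patients` and the constant `CHUNK_SIZE` are all invented for this example, and SQLite stands in for whatever database CE uses. Batching is one possible way to keep the parameter count per statement bounded, regardless of how many patients are active.

```python
import sqlite3
from typing import Iterable, Iterator, List

# Hypothetical batch size; the real limit depends on the database in use.
CHUNK_SIZE = 500


def chunks(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def associate_patients(
    conn: sqlite3.Connection,
    submitter_id: int,
    patient_ids: Iterable[int],
    chunk_size: int = CHUNK_SIZE,
) -> int:
    """Link patients to a submitter in bounded batches.

    Returns the number of rows updated. Table and column names are
    assumptions made for this sketch only.
    """
    updated = 0
    for batch in chunks(list(patient_ids), chunk_size):
        # One placeholder per patient in this batch, never more than chunk_size.
        placeholders = ",".join("?" for _ in batch)
        cur = conn.execute(
            f"UPDATE patient SET submitter_id = ? WHERE id IN ({placeholders})",
            [submitter_id, *batch],
        )
        updated += cur.rowcount
    conn.commit()
    return updated
```

With this shape, associating 1,200 patients issues three statements of at most 501 parameters each, instead of one statement with 1,201 parameters. Whether batching, a join against a table of active patients, or an index change is the right optimization depends on the real query.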

Solution

Optimize the faulty query.

 

Communication and documentation

External

  • Instatus was used, and affected organisations were contacted directly as well

Internal

  • Normal communication in Slack, according to process. Nothing special to report.

Improvements

 

  • Investigate long running queries (CAPA-3)
  • Improve sensitivity and specificity of monitoring of the clinical engine to reduce lead time (CAPA-3)
  • Clinical Engine Hotfix mechanism (CAPA-2)