DataOps Platform Outage May 25, 2022 - Root Cause Analysis

Problem started: Wednesday 25th May, 2022, 01:16 UTC (outage windows: 01:16 - 01:53 UTC and 02:36 - 04:54 UTC)
Logged by: 24/7 support team
Problem solved by: Wednesday 25th May, 2022 04:54 UTC

Problem statement

On Wednesday 25th May, just after 01:00 UTC, our 24/7 support team started to see alerts from our monitoring system that one of our nodes was running low on memory. The node subsequently went offline at 01:16. The support team took the necessary action to bring the node back online at 01:53.

At 02:36 the node went offline again, after which our 24/7 support team contacted the engineering team for further investigation. After the initial assessment, the decision was made to take a backup of the platform for root cause analysis, pause optional backend processes, and restart. This brought the affected node back up at 04:54, and the rest of the system recovered on its own. The engineering team stayed online to monitor performance, and no further issues were observed.

Investigation

The investigation found that several factors, including a number of backend processes, contributed to this issue:

On the morning of the 24th May we deployed a major upgrade to our platform. DataOps.live knew this upgrade would increase the overall memory usage on the node during normal operation. However, all prior testing showed the increase to be within expected levels.
As part of the upgrade process, DataOps.live knew there would be a long-running background process on the platform. Our testing of the upgrade did not show any significant additional memory usage from this process on our test system.
Further, the regular platform backup started at the same time the upgrade background process was completing.
Other background processes running on the affected node also contributed to it requiring more memory than usual.

The combined memory demand of the upgrade itself, the upgrade background process, the regular backup, and the other background processes caused the node to run out of memory.
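To illustrate how loads that each look acceptable in isolation can together exceed a node's capacity, here is a minimal sketch. The node size and every per-process figure are hypothetical values chosen for illustration; they are not measurements from the incident, and the function names are our own.

```python
# Hypothetical figures for illustration only; NOT measurements from the incident.
NODE_MEMORY_GB = 64.0

# Assumed memory demand of each contributor named in the investigation.
contributors_gb = {
    "platform baseline after upgrade": 40.0,
    "upgrade background process": 12.0,
    "regular platform backup": 10.0,
    "other background processes": 6.0,
}


def total_demand(loads):
    """Return the combined memory demand of all loads, in GB."""
    return sum(loads.values())


def exceeds_capacity(loads, capacity_gb):
    """Return True when the combined demand is larger than the node's memory."""
    return total_demand(loads) > capacity_gb


if __name__ == "__main__":
    for name, gb in contributors_gb.items():
        print(f"{name}: {gb} GB")
    print(f"combined: {total_demand(contributors_gb)} GB of {NODE_MEMORY_GB} GB")
    print("out of memory risk:", exceeds_capacity(contributors_gb, NODE_MEMORY_GB))
```

In this sketch no single contributor comes close to the node's capacity, yet the sum does, which is why the overlap in timing mattered.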

Resolving the first incident followed standard practices to bring DataOps.live back up. The repeat incident required further assessment to ensure the node was brought back safely with no further downtime in the future.

Resolution

DataOps.live resized the affected node to the new resource requirements. Continuous monitoring remains in place and is reviewed regularly.

Learnings & next steps

Although DataOps.live extensively tested this major upgrade, we came across some edge cases that were not considered. The procedures for the next major release are being updated to reflect this scenario.
Additional monitoring is being put in place with stricter limits to give forewarning of similar situations in the future. For future upgrades, the DataOps.live 24/7 support team will escalate to the engineering team immediately on any notifications coming from the monitoring system.
In addition, for future major upgrades, the engineering team will proactively review application monitoring for the 24 hours directly following the upgrade, to ensure the platform has run through all its standard processes at least once.
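The escalation rule described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual monitoring configuration: the threshold values, the function name, the return labels, and the example upgrade time are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds, expressed as a fraction of node memory in use.
WARN_THRESHOLD = 0.75
CRITICAL_THRESHOLD = 0.90

# Period after an upgrade during which any notification goes straight to engineering.
POST_UPGRADE_WINDOW = timedelta(hours=24)


def alert_action(memory_used_fraction, now, last_upgrade):
    """Return (severity, route) for a memory reading.

    severity is "none", "warning", or "critical".
    route is "none", "notify_support", or "escalate_to_engineering".
    """
    if memory_used_fraction < WARN_THRESHOLD:
        return ("none", "none")

    severity = "critical" if memory_used_fraction >= CRITICAL_THRESHOLD else "warning"

    # Inside the post-upgrade window, every notification is escalated immediately.
    in_window = timedelta(0) <= now - last_upgrade <= POST_UPGRADE_WINDOW
    route = "escalate_to_engineering" if in_window else "notify_support"
    return (severity, route)


if __name__ == "__main__":
    # Assumed upgrade time; the report only says "the morning of the 24th May".
    upgrade = datetime(2022, 5, 24, 8, 0, tzinfo=timezone.utc)
    reading_time = datetime(2022, 5, 25, 1, 0, tzinfo=timezone.utc)
    print(alert_action(0.80, reading_time, upgrade))
```

The point of the design is that severity and routing are decided separately, so even a low-severity warning reaches the engineering team while an upgrade is still settling.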
