Finding Root Causes in Distributed Software
Distributed software breaks down into multiple contexts, so when a problem arises you have to work out which context it is occurring in before you can solve it. There are four kinds of software problems you may have to solve when orchestrating distributed software:
Simple
Complicated
Complex
Chaotic
Distributed software has simple components, where the rules are known and the solutions to problems are obvious. For example, if your source code fails a unit test, the steps to fix it are straightforward. There are also complicated problems. These have multiple solutions, which an expert can identify in a short time. For example, if unexpected input from a user crashes the system, there are several ways to fix it. Finding the root causes of simple and complicated problems is often intuitive to those with expertise.
Complex problems are where things get interesting. A complex problem in distributed software has a solution, but there is no established practice for finding it, and the solution can only be recognized in hindsight. The range of complex problems you might encounter in software is practically unlimited. In this case, the best approach is to probe the system to identify the pattern that is causing the problem, and to experiment until a solution emerges.
Additionally, you may face chaotic problems, where no solution exists and only imperfect actions can be taken. Examples include a security breach that is leaking sensitive data, or a deliberate attack on your software.
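The four kinds of problems can be read as a small decision procedure. The sketch below is one way to write it down in Python, not an established tool: the three yes/no questions, the `classify` function, and the wording of the suggested responses are assumptions made for illustration, paraphrasing the descriptions above.

```python
from enum import Enum


class ProblemKind(Enum):
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"


# Suggested first response for each kind, paraphrased from the text above.
FIRST_RESPONSE = {
    ProblemKind.SIMPLE: "apply the obvious fix",
    ProblemKind.COMPLICATED: "have an expert choose among the known solutions",
    ProblemKind.COMPLEX: "probe the system and experiment to find a pattern",
    ProblemKind.CHAOTIC: "take the best imperfect action available",
}


def classify(solution_exists: bool,
             established_practice: bool,
             solution_obvious: bool) -> ProblemKind:
    """Map three hypothetical yes/no questions to a problem kind."""
    if not solution_exists:
        return ProblemKind.CHAOTIC
    if not established_practice:
        return ProblemKind.COMPLEX
    if not solution_obvious:
        return ProblemKind.COMPLICATED
    return ProblemKind.SIMPLE


# A failing unit test: a solution exists, the practice is known, the fix is obvious.
kind = classify(solution_exists=True, established_practice=True, solution_obvious=True)
print(kind.value, "->", FIRST_RESPONSE[kind])
```

In practice the answers to these questions are judgment calls, so treat the result as a starting point for discussion and not as a verdict.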
Keep in mind these points when searching for root causes of a software problem:
To identify the root of a problem you need both expertise and information. Once a problem has occurred, you can only work with the information you have available, so set up easy access to test results, metrics, and logs ahead of time. Datadog is one platform for collecting this kind of data.
Causes precede effects. Identify the point in time at which an issue started, and when looking for the cause, ignore everything that happened after that point.
Problems can be simple (sometimes called obvious), complicated, complex, or chaotic. Try the Cynefin sense-making framework to determine what kind of problem you have to solve.
https://thecynefin.co/about-us/about-cynefin-framework/
https://hbr.org/2007/11/a-leaders-framework-for-decision-making
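The point about causes preceding effects can be sketched as a filter over a timeline of events. This is a minimal illustration in Python, assuming you have already collected events (deploys, config changes, alerts) with timestamps; the `Event` type, the `candidate_causes` function, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str   # hypothetical: the system that recorded the event
    message: str


def candidate_causes(events, onset):
    """Return events at or before the onset, most recent first.

    Events after the onset are dropped: they may be effects, not causes.
    """
    earlier = [e for e in events if e.timestamp <= onset]
    return sorted(earlier, key=lambda e: e.timestamp, reverse=True)


# Hypothetical timeline; the onset is when the issue was first observed.
onset = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    Event(onset - timedelta(minutes=45), "deploy", "released checkout v2"),
    Event(onset - timedelta(minutes=5), "config", "changed connection pool size"),
    Event(onset + timedelta(minutes=10), "alerts", "error rate above threshold"),
    Event(onset - timedelta(hours=3), "deploy", "released search v7"),
]

for event in candidate_causes(events, onset):
    print(event.timestamp.isoformat(), event.source, event.message)
```

Sorting the remaining events from newest to oldest puts the changes closest to the onset first, which is usually a reasonable order in which to investigate them.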