Finding Root Causes in Distributed Software

Distributed software is a system that spans multiple contexts, and when a problem arises, you have to work out which context the problem is occurring in before you can solve it. There are four kinds of software problems you may have to solve when orchestrating distributed software:

  1. Simple

  2. Complicated

  3. Complex

  4. Chaotic

Distributed software has simple components, where the rules are known and solutions to problems are obvious. For example, if your source code fails a unit test, then your actions to solve that problem are straightforward. There are also complicated problems. These have multiple solutions, any of which an expert can identify in short order. For example, if an unexpected input from a user causes a system crash, then there are multiple ways to solve the problem. Finding the root causes of simple and complicated problems is often intuitive to those with expertise.

Complex problems are where things get interesting. A complex problem in distributed software is one for which a solution exists, but there is no established practice for finding it, and the solution can only be recognized in hindsight. The sky is the limit as far as the kinds of complex problems you might encounter in software. In this case, the best approach is to probe the system to identify the pattern that is causing the problem. To solve a complex problem, you’ll have to experiment your way to a solution.

Additionally, you may face chaotic problems, where no solution exists and only imperfect actions can be taken. For example, you have a security breach and sensitive data is being leaked, or your software is under deliberate attack.

Keep in mind these points when searching for root causes of a software problem:

  1. To identify the root of a problem you need both expertise and information. Once a problem has occurred, you can only work with the information you have available, so set up easy access to test results, metrics, and logs ahead of time (see the logging sketch after this list). DataDog is one platform for collecting this kind of data.

  2. Causes precede effects. Identify the point in time at which an issue started, and then ignore everything that happened after that point.

  3. Problems can be simple, complicated, complex, or chaotic. Try the Cynefin sense-making framework to determine what kind of problem you have to solve.
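For point 1, here is a minimal sketch of setting up Python’s standard logging ahead of time; the file name, logger name, and log format are assumptions for illustration:

    import logging

    # Configure logging once, at startup, so the information needed to find
    # a root cause later is already being collected.
    logging.basicConfig(
        filename="app.log",  # assumed path; a platform like DataDog can ingest these logs
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

    logger = logging.getLogger("worker")
    try:
        result = 1 / 0  # stand-in for real work that can fail
    except ZeroDivisionError:
        # exc_info=True records the full traceback, preserving the information
        # you will need once the problem has occurred.
        logger.error("job failed", exc_info=True)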

https://thecynefin.co/about-us/about-cynefin-framework/

https://hbr.org/2007/11/a-leaders-framework-for-decision-making

XML vs. Database for Python Configuration Management

Question: I’m a Python Developer, and I’ve developed a web application that identifies the make and model of a car in the image that a user uploads. I’m at the point now where I have to worry about Configuration Management. I’m considering using either a relational database or an XML file to store credentials, blacklisted users, file size limits, etc. Which choice is better?

Answer: Using one or more XML files (or other text-based options like YAML, INI, or JSON) and using a relational database are very different approaches to defining your software configuration.

My suggestion is to make XML (or another text-based option) the first choice, and to use a database only if the XML configuration file grows so large that it hurts performance or readability. XML, YAML, and INI files have the major benefit of being human-readable and of allowing you to include comments in the configuration file. Including comments is not possible if you go with a database.
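As a minimal sketch of that approach, here is a hypothetical configuration file and the Python code to read it with the standard library’s xml.etree.ElementTree; the file name, element names, and values are illustrative:

    <!-- config.xml : a comment like this one has no equivalent in a database table -->
    <config>
        <max_upload_bytes>5242880</max_upload_bytes>
        <blacklist>
            <user>mallory</user>
            <user>trudy</user>
        </blacklist>
    </config>

    # read_config.py
    import xml.etree.ElementTree as ET

    root = ET.parse("config.xml").getroot()
    max_bytes = int(root.find("max_upload_bytes").text)         # file size limit
    blacklist = [user.text for user in root.find("blacklist")]  # blacklisted users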

That said, relational databases have the upper hand when it comes to user access control, and to storing and reading records of how the software was configured in each deployment. Using a relational database may be the better choice if you need to control who can read from or write to your software configuration. A blacklist of users, for example, may also be more intuitive to store in a relational database, because XML has no native array type (lists must be encoded as repeated elements).

By Matthew Hawkins

Why Docker Is Easy to Love

As of June 2022, the headline on Docker’s website is “Developers Love Docker. Businesses Trust It.” Docker empowers developers, and that is the core reason it has been so widely adopted since the company was founded in 2008.

According to Statista, there are 24M software engineers worldwide (1). A tiny (very tiny) fraction of those developers build software from scratch that relies on no existing operating system or external libraries. The vast majority of developers are fleshing out the existing software universe at its very edges. New software projects are more like adding a pool to the top of an existing skyscraper than building a skyscraper from the ground up. Developers create new applications by assembling existing software in a unique way, with a dash of new code. This is great for both developers and consumers.

This way of doing things does have challenges, though, when it comes time to move software from where it is developed to where it will be used. Sticking with the skyscraper analogy: it’s tough to move the pool that you built on one skyscraper to the one down the street. Developers manage this problem by carefully tracking and managing the environment that the software is developed in. In other words, they create a pool that can be packaged up and moved to another location with relative ease.

Docker takes a different approach: just package up the entire skyscraper and clone it to another location. Docker, which provides reusable software “containers”, abstracts away the details of creating the ecosystem that software needs to survive. This approach also makes it simple to host multiple applications with conflicting dependencies on the same operating system.
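As a minimal sketch of what that packaging looks like, here is a hypothetical Dockerfile for a small Python service; the base image, file names, and dependencies are assumptions:

    # Dockerfile: package the application together with its runtime and libraries
    FROM python:3.11-slim

    WORKDIR /app

    # Install this application's pinned dependencies, isolated from any other
    # container's (possibly conflicting) versions.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    COPY app.py .
    CMD ["python", "app.py"]

Two images built this way can run side by side on one host even if their requirements.txt files pin conflicting library versions.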

The most obvious downside to Docker is the duplicated data between containers, since each image carries its own copies of its libraries (shared image layers mitigate this, but only partially). But in a world where you can purchase a petabyte of storage for roughly the cost of a year of software development work, it makes far more sense to duplicate libraries than it does to have developers duplicate work.

By Matthew Hawkins

(1) https://www.statista.com/statistics/627312/worldwide-developer-population/

Develop Distributedly, Deploy Remotely

Remote deployment – deployment of software in the cloud – creates a uniform and improvable experience for end users.

Deploying your software is a moment similar to a new graduate entering the workforce – it’s time for all the preparation to pay off. Developers are the people who have to think about deployment, but deployment is all about putting the software to work for the user.

If you are considering whether remote deployment is the best option for your project, think about this question: why has the software business exploded? There’s no one-line explanation, but there is one fundamental business advantage that is the primary cause: scalability. Software businesses scale because the marginal cost of serving each additional customer is practically $0.

This axiom isn’t true for every piece of software, though. The marginal cost is zero only if the software service or experience requires no customization for each user. While it is possible to minimize the customization required to deploy software locally on each user’s hardware, remote deployment offers the promise of zero customization and therefore zero marginal cost.

The goal of many software products is to provide a service or an experience to an end user, and the uniformity of that experience is an important feature that users want. Remote deployment meshes nicely with that goal because hosting is taken out of the user’s hands.

The second reason that Software as a Service businesses should deploy software remotely is that it enables continuous integration and delivery, i.e. continuous improvement. When a customer buys a software subscription, the subscription is a lot more valuable when the software is being continuously improved and continually updated to work with new hardware. This is easiest to accomplish with remote deployment.

Galileo Would Have Tested His Code

Experimentation is the fastest way to fail. Failing is the fastest path to success.

In my opinion, building something that works requires that you allow feedback into the building process. Writing source code without writing tests is like trying to hit a nail on the head while blindfolded.

Software developers need to be good scientific thinkers, because development requires both theory and practice. Building the source code for an application is like writing the theory for how that application will work. Testing creates the essential feedback that allows you to bring your theory in line with reality.

You can probably guess that I think automated testing is an incredibly valuable tool in software development (yup, I do). Here are a few reasons why:

  • Tests themselves are a form of documentation for the code. 

  • Automated testing makes it possible to accept contributions both large and small.

Automated testing opens the door for individuals who know only a small corner of a software project to make a contribution. The test suite filters out poor contributions, and thus allows a project to accept contributions from anyone. This can be an important success factor for an open source project.

  • Automated testing allows developers to move on to creating new features. 

Once the automated test suite is written for a set of features, there is a safety net in place around those features. At that point, the developer doesn’t need to remember what the code even does – as long as the tests are passing, it is working.
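As a minimal sketch of that safety net, here is a unit test written for pytest; the function is a hypothetical stand-in for real production code:

    # test_pricing.py -- run with: pytest test_pricing.py
    import pytest

    def apply_discount(price: float, percent: float) -> float:
        # Hypothetical production function; normally imported from the codebase.
        return price * (1 - percent / 100)

    def test_apply_discount():
        # As long as this test passes, the feature works, and the developer
        # is free to move on to new features.
        assert apply_discount(price=100.0, percent=10) == pytest.approx(90.0)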

By Matthew Hawkins

Freud Would Appreciate Regression Testing

In Freudian psychology, a common example of “regressive behavior” is a child reverting to earlier behaviors when a sibling is born, as a means of gaining attention. Just like a person, software can also “regress” as it develops. This happens when a feature which was previously functioning breaks when the source code or environment changes – like when a six-year-old “forgets” how to tie their shoes because Mom is now tying someone else’s shoes.

Luckily though, there exists a sort of talk therapy for software that can prevent regression, and it is appropriately called “Regression Testing.”

The point of Regression Testing is to verify that features of a piece of software which were working in prior versions are still working after the source code and software environment have changed. This kind of testing is built on top of unit, integration, and system testing.

The lines that separate different kinds of testing are somewhat blurry, but Regression Testing focuses on business outcomes. For example, imagine a web app that enables users to send group messages to each other: a regression test would evaluate whether adding authentication to the application has broken the group messaging feature.
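Here is a minimal sketch of such a regression test for that hypothetical messaging app, using pytest and the requests library; the URL, endpoints, and credentials are all assumptions, and a test deployment is presumed to be running locally:

    # test_regression_messaging.py -- run with: pytest
    import requests

    BASE_URL = "http://localhost:8000"  # assumed address of the deployment under test

    def test_group_messaging_still_works_after_adding_auth():
        # Log in first, because the new authentication feature now guards the API.
        token = requests.post(
            f"{BASE_URL}/login", json={"user": "alice", "password": "secret"}
        ).json()["token"]

        # The pre-existing business outcome: a group message can still be sent.
        response = requests.post(
            f"{BASE_URL}/groups/42/messages",
            json={"text": "hello, team"},
            headers={"Authorization": f"Bearer {token}"},
        )
        assert response.status_code == 200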

Regression testing is useful for almost every project, but it is particularly useful on projects that are tightly coupled, or monolithic. The more a project is modularized, the less regression-test coverage it may require, because modularized code can be tested thoroughly earlier in the testing process, e.g. during unit testing.

By Matthew Hawkins

Git at the Indy 500

When I think of Git, I get this image in my head of a pit crew at the Indy 500. The speed and precision that a pit crew brings to the task of changing a race car’s tires and fueling it up for another 50 laps is a thing of beauty. If what is possible in the software realm were possible in the hardware realm, it would be truly stunning to behold.

Here’s what that might look like. Imagine that at the very moment the race starts, and drivers put the pedal to the metal, a machine, which looks like a death ray, zaps car #1. When the flash of light recedes, you see that the machine has created a perfect, atom-for-atom clone of car #1. And while car #1 speeds off towards the first turn, the pit crew immediately starts building a bigger engine and better spoiler for the clone. Eventually car #1 comes in for that pit stop, and instead of the pit crew gassing it up, the driver slips out the window and drives off in the souped-up clone for the next 50 laps. 

Then the process starts again, only this time the clone is zapped. And zapped again! Creating two sub-clones. The pit crew gets working on the “develop” clone and the other “feature” clone is whisked off to the wind tunnel lab for body shape experimentation.

This kind of cloning and simultaneous work is what Git makes possible for software development, and it is the envy of those stuck working in the physical realm.

The Git Version Control System allows teams to make any number of clones of a piece of software, and to apply the changes they make on the clones back into the original copy in piece-by-piece fashion. Sticking with the race car analogy, this is like being able to take the spoiler from the “develop” clone and the body shape from the “feature” clone and instantly assemble them into a new production race car.

Further, Git makes it possible to identify places where two teams have changed the same piece of code, through merge conflict detection. It also makes it possible to fast-forward clones that are lagging behind the production system, through “rebasing” and “merging.”
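In plain Git terms, the pit-crew workflow above might look something like this; the branch names and commit messages are illustrative:

    # Clone the production car and start work on a bigger engine
    git switch -c develop             # create the "develop" clone
    git commit -am "Bigger engine"

    # Zap again: a second clone for wind-tunnel experimentation
    git switch -c feature/body-shape
    git commit -am "New body shape"

    # Pick individual changes from the clones into production
    git switch main
    git cherry-pick <commit-hash>     # take just the spoiler commit, say

    # Fast-forward a lagging clone to catch up with production
    git switch develop
    git rebase main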

Ka-chow!

A World Without Version Control

One huge benefit of the widespread adoption of Version Control Systems like Git is that it is possible to use open source libraries even in mission-critical software. Here’s how that looks:

Brandon is a developer, and he is concerned about robustness. And rightly so, because Brandon’s software monitors a hospital’s electrical system -- failure of that software might literally have deadly consequences. So, when he was considering adding a new feature which would send a notification to the hospital administrators through Microsoft Teams when a failure is detected, he had a conundrum.

Here’s the issue: writing that feature from scratch would take a lot of time, because Brandon doesn’t know how to hook into Microsoft’s Teams software. Luckily, there is an already-written Python package on PyPI that would make this feature easy to implement. The catch is that the package was released two years ago, and there are no signs that it is being maintained.

What if the package is poorly written and fails in unexpected ways? Or what if Brandon’s software stops working when the hospital upgrades to the next version of Microsoft Teams?

Should he write this feature from scratch and roll it into his all-in-one application? Or should he hook up to that Microsoft Teams module and let it handle those details?

Well, I wouldn’t recommend a particular strategy without more information. But what I can say is that this situation would look very different if Brandon did not use a Version Control System. The fact that Brandon tracks the changes to his software and manages his libraries carefully gives him lots of options.

Because changes to the hospital’s monitoring software are tracked (and so are changes to Microsoft Teams and to the open-source module), different versions of the software can be built to fit different environments. Version Control makes it possible to have robustness even when your software has many, many dependencies.
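For instance, here is a sketch of what those options look like in practice; the package name, version numbers, and tag are hypothetical:

    # Pin the exact version of the unmaintained Teams package that was tested
    echo "teams-notifier==1.4.2" >> requirements.txt
    git add requirements.txt
    git commit -m "Pin Teams notification dependency"
    git tag v2.1.0

    # Later, if a Teams upgrade breaks the notification feature, the last
    # known-good version of the whole system can be rebuilt from the tag:
    git checkout v2.1.0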

By Matthew Hawkins

The Software Spacesuit

Question: I want to build a new piece of software, and I want to ignore the issue of library management. Can I do it?

Answer: You can do it…but it guarantees a short life for that new software.

Library management is how you ensure your software can live outside of the “laboratory.” When your code moves outside of your development environment, it’s like a man stepping out of the space shuttle airlock into free space. Without a spacesuit, failure is inevitable. Library management is like designing a spacesuit for your code so that it can survive outside of the development environment.

Why is this so? Consider this scenario: you have a bit of code that runs perfectly on your development machine, and you’re ready to deploy it on 10 virtual machines. The only catch is that these virtual machines have an older version of, say, an SSL library. Maybe your code tries to make an HTTPS request, and promptly fails. It could be quite a project to untangle what’s happening and solve that issue.

It gets really tricky once you multiply this problem across more libraries and more virtual machines. Manually tracking all of the dependencies in every software environment is simply not possible.

Therefore, developing your code with library management in mind, i.e. defining the environment that your code needs in order to run, is a must.
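Here is a minimal sketch of that spacesuit using Python’s standard tooling; the example library is an assumption:

    # On the development machine: record the exact environment the code runs in.
    python -m venv .venv
    source .venv/bin/activate
    pip install requests             # ...plus whatever else your code imports
    pip freeze > requirements.txt    # pin exact versions, e.g. requests==2.31.0

    # On each of the 10 virtual machines: recreate that same environment.
    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt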

By Matthew Hawkins

New Software’s First Ingredient is Old Software

Writing a new piece of software is a little bit like coming up with the recipe for a new hamburger. You probably don’t raise the cattle for the beef, and you might not grow the wheat for the bun, but that doesn’t mean you can’t come up with something novel. You might whip up your own barbecue sauce to put on top, et voila! You have a new creation.

Like new burger recipes, new software is mostly a novel assembly of existing ingredients. When we identify a problem that can be solved with software, the new code we write is a way of building existing code libraries into something that solves the problem.

So, let’s think about how we can mix existing libraries into a problem-solving recipe.

Existing libraries can be used statically or dynamically. In other words, a library can either be copied into our new work (static use) or referenced by it (dynamic use).

In the case of Python, external libraries are normally used dynamically. For example, when I want to assemble raw numbers into a table, I download, install, and reference the classes from the "pandas" package in my code. When I run the program, everything functions because Python and pandas are both present on my machine; my new code and pandas are therefore both available to the interpreter.
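Here is a minimal sketch of that dynamic use; the data is illustrative:

    import pandas as pd  # referenced at run time; pandas must be installed on this machine

    # Assemble raw numbers into a table using classes from the pandas package.
    table = pd.DataFrame({"car": ["Model A", "Model B"], "top_speed_mph": [155, 170]})
    print(table)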

However, if I move my code to a new machine, it might not function. If pandas is not installed on that new machine, my code will fail. Furthermore, even if pandas is installed, my code might fail if the exact version I used on my other machine is not present.

Using a library statically can avoid this problem. For example, if I copied the entire pandas package into my new software, I could avoid some of these problems when moving my software to new environments.

There are trade-offs when using external libraries in either fashion. Among the key considerations when making the choice are longevity, portability, and speed.

By Matthew Hawkins

When You Need to Move to the Cloud

It might be contrarian for me to say this…You do not, in every case, get more value by running your software in the cloud. 

Let me explain with a counter-example. In many enterprises there is one software application being crushed by the weight of half a billion semi-pro software developers -- Microsoft Excel.

Let's say that you use a macro-enabled Microsoft Excel workbook to pull data from a company database, and use the data to visualize your metric of choice. Moving that workbook to Microsoft's Office 365 cloud might seem like a savvy move. After all, doing so allows you to share that workbook and access it from anywhere in the world. However, moving the software to the cloud may not actually result in more value.

Why not? There is some activation energy -- some setup and ongoing behavior change -- required to move from an existing process that uses software, to a process where the software is cloud-based. 

To be fair, there is (almost) always a raw performance benefit to running software in the cloud, and often an efficiency benefit as well. But those benefits have to outweigh the activation energy before the move creates net value.

Here's another example: what deployment option is best for a 10-minute web scraping job that accesses 20 webpages and sends out an email? In my opinion, you would be better off utilizing a local virtual machine (or heck, your Windows laptop) than setting up an AWS instance to run that job.

You need to move to the Cloud when your software needs to scale. The need for your software to scale can come from several directions:

 - The need to allow more people to use the software

 - The need to accept a larger input

 - The need to run the software more often or multiple times concurrently.

If you do need your software to scale, then the Cloud is in your future.

By Matthew Hawkins

What do Virtual Machines and 3D Printers have in Common?

Picture in your mind a 3D printer. It sits in front of you, ready to generate one of an infinite number of possible forms. It is a collection of all the hardware, materials, and smarts required to produce whatever you can imagine. All it needs is a set of instructions.

Similarly, a virtual machine is an entire computing architecture, modularized, and ready to execute whatever instructions you can think up. 

Let's say you have a VM that is set up to take in image files, and output a black and white version of that image. You might use this VM by logging in to it, downloading an image from Google, and viewing the black and white image in the VM's photo application.

Wouldn't this VM be a lot more useful if it could receive images from any machine on its network? And be still more useful if it could send the black and white images back to other machines on the network?

Luckily, VMs can do this, and this example is just one of infinite possibilities for distributed computing. The ability to receive instructions (and more raw materials) and to share the results in the form of files is a core function of VMs.
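Here is a minimal sketch of such a service, using the Flask and Pillow packages; the endpoint name and port are assumptions:

    import io

    from flask import Flask, request, send_file
    from PIL import Image

    app = Flask(__name__)

    @app.route("/grayscale", methods=["POST"])
    def grayscale():
        # Receive an image file from any machine on the network...
        image = Image.open(request.files["image"].stream)
        # ...convert it to black and white...
        output = io.BytesIO()
        image.convert("L").save(output, format="PNG")
        output.seek(0)
        # ...and send the result back to the caller.
        return send_file(output, mimetype="image/png")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)  # listen for other machines on the network

Another machine on the network could then use the VM with, say, curl -F image=@photo.jpg http://vm-address:5000/grayscale -o bw.png.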

By Matthew Hawkins

Aren't Virtual Machines Great?

In the era of remote work, Virtual Machines (VMs) should be uber popular. I’m surprised that they aren’t ubiquitous in corporate America. Virtualization makes more efficient use of computing resources–period.

There have been too many times in my life when I’ve used a remote computer and burned an hour or two because of a network issue at the remote facility. At some point, it always seems like an actual person has to flip an on-switch.

Consider this situation: a remote employee connects to her company’s VPN and uses Windows Remote desktop to log in to a Dell workstation in New York City. She works away for several hours and is really in the zone when suddenly the RDP session freezes up. Someone forgot that the computer power cable is connected to an electrical circuit which automatically shuts off if there is no motion in the office for 30 minutes. (If it happened to me, it could happen to you).

This is a great case where a virtual machine has advantages over a physical machine. From the user’s perspective, using Remote Desktop to access a virtual machine is no different than accessing her physical machine.

So why wouldn’t this company make more efficient use of computing resources by hosting many virtual machines on a set of servers rather than on individual workstations?


By Matthew Hawkins

Why your Dad Doesn’t Know How Google Works

Without the concept of an operating system, it's impossible to extrapolate one's everyday experience of computing on a phone or laptop to the computing systems that run modern software.

Most of us experience the power of computing by using software on our phones or laptops, and designers have made the experience of using these devices so seamless that they seem to work like a human body – everything is integrated. I don’t think about using my phone in terms like “my touch screen input is signaling the OS to launch my Gmail application program.” I think “I’ll check my email.”

Without the idea that an operating system is a standalone piece of a four-part computer system – Hardware, OS, Application Programs, and the all-important User – it's hard to imagine how one would scale a laptop up to the system that keeps Google.com continually available everywhere on Earth. Disentangling the OS from hardware, the OS from applications, and even the User from any actual living person is essential to forming a picture of distributed computing.

So, if your dad doesn’t understand what DevOps is, perhaps a breakdown of “What the heck is an operating system?” is lesson #1.

By Matthew Hawkins

Let Linus Tech Tips break it down: