Debugging using the Scientific Method

Across my career, I’ve naturally taken a ‘scientific method’ approach to investigating issues in software: my degree is in Physics rather than Computer Science, and my first job (level 3 support) involved a lot of issue investigation and incident management. Over the years I’ve found this approach indispensable, and really useful for the other people I’ve shown it to, so I thought I’d write about it here in case it’s useful for others.

While what follows is a fairly ‘formal’ definition of using the Scientific Method as a basis for solving software issues, I rarely follow it quite so formally; instead, I strive to follow the principles behind it.

What is it?

There are lots of definitions of the scientific method around on the internet. I particularly like the ones on Wordnik and Wikipedia. If I were to sum them all up, I would probably land on something like this:

An iterative approach to understanding an observation about some phenomenon. Starting with a hypothesis, you test it with a set of experiments, then analyse the data to come to a conclusion about whether or not the hypothesis was correct.

I mentioned here that it’s an iterative approach. At the time of drawing conclusions I often find that my initial hypothesis needs refining, so I end up going around this loop a few times:

Observation/Question → Hypothesis → Experiment → Conclusion

In the experimental sciences, an important accompaniment to this method is keeping a lab notebook.

While in school and during my degree, it was drilled into us how essential keeping a lab notebook is to the scientific process. I didn’t really get it at the time, but after applying the scientific method to debugging, I finally do. Just like in journalling, the act of writing down some of that stream of consciousness can help to clarify and organise thoughts; plus (and maybe more importantly) you’re building up a dataset you can refer back to later.
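As a rough illustration of what a lab notebook for debugging could look like, here is a minimal Python sketch that appends one entry per trip around the loop to a Markdown file. The `NotebookEntry` class, its field names, and `append_entry` are all made up for this example; a plain text file or a ticket comment does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class NotebookEntry:
    """One trip around the loop (a hypothetical structure, not a standard format)."""

    observation: str
    hypothesis: str
    experiment: str
    conclusion: str
    # Record when the entry was made so the notes can be read back in order.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_markdown(self) -> str:
        """Render the entry as a small Markdown section."""
        return (
            f"## {self.timestamp:%Y-%m-%d %H:%M} UTC\n\n"
            f"- Observation: {self.observation}\n"
            f"- Hypothesis: {self.hypothesis}\n"
            f"- Experiment: {self.experiment}\n"
            f"- Conclusion: {self.conclusion}\n"
        )


def append_entry(path: Path, entry: NotebookEntry) -> None:
    """Append the entry to the notebook file, creating the file if needed."""
    with path.open("a", encoding="utf-8") as notebook:
        notebook.write(entry.to_markdown() + "\n")
```

Calling `append_entry(Path("notebook.md"), NotebookEntry(...))` after each experiment builds up the dataset over time, and appending rather than overwriting keeps the disproved hypotheses on record too.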

How do you do it?

Applying the scientific method to software problems simply involves iterating through these 4 steps:

  1. Observe

    First see what you can observe. Try to get as specific as possible about the problem and gather as much info as you can to inform the next step by asking some questions like these:

    • What lets us see that the issue occurs? Is there a particular log message to look out for, or a state that the database finds itself in?
    • What circumstances does the problem happen under? What events lead up to it? What state are things in? Can we get a set of steps together that can reliably reproduce the problem?
    • What do we know about the system? What don’t we know? What architecture and technologies does it use and how do the different bits communicate?

    Gather together as much info as you can, and write it down somewhere. I usually have a big Markdown document or Jira ticket where I can gather this sort of thing, to act as my lab notebook.

  2. Hypothesise

    Next, review your observations and try to come up with a few possible explanations for why the issue is happening.

    Try to come up with 3-4 to start with, and try not to discount things too early. This can be a really good group activity as it’s a creative brainstorming kind of thing.

  3. Experiment

    Now you’ve got a few hypotheses, it’s time to test them. Choose the one that seems most likely (or the quickest to rule out) and design an experiment to prove or disprove it.

    It’s not always trivial to pick which hypothesis to tackle first (it gets easier with experience), but it doesn’t matter that much in the grand scheme of things: this is an iterative approach. If the one you picked doesn’t end up being the root cause, you’ll loop back to your list later, but with more information under your belt!

    Think of an ‘experiment’ here as any change to the system that allows you to understand whether or not the root cause is what you hypothesised. This usually involves adding extra logging to the part of the system you think could be the culprit, or stubbing something out to rule out a side effect. The test of a ‘good’ experiment is that the data it lets you gather can show that your hypothesis is correct or incorrect.

    Don’t forget to add to your notes while you’re designing and running your experiment: How will it show that your hypothesis is correct/incorrect? What did you find surprising? What commands and queries did you run? What did they output? Take screenshots. It’s all about gathering data!

  4. Draw Conclusions

    Look at the results of your experiment: do they prove or disprove your hypothesis, or do you need to adjust your experiment to get a firmer answer?

    Write down your conclusion, backed by the data, and your next steps:

    • If you proved your hypothesis, you now know what you need to do to fix the issue, plus you have something you can use to prove that your fix is effective!
    • If you disproved your hypothesis, don’t despair: you’ve made a step in the right direction! You now know more about what the cause of your issue isn’t, and you have more information in general about it. Loop back around the cycle to step one and go round again.
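To make the experiment step a bit more concrete, here is a minimal Python sketch of ‘stubbing something out to rule out a side effect’, with some extra logging around the suspect call. Everything in it is invented for illustration: `process_order`, `send_confirmation`, and the hypothesis that the notifier changes the order total are assumptions, not taken from a real system.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("experiment")


def send_confirmation(order):
    """Hypothetical suspect: a notifier that accidentally mutates its input."""
    order["total"] = round(order["total"])  # the side effect we suspect


def process_order(order, notify=send_confirmation):
    """Hypothetical code under investigation, with logging added for the experiment."""
    log.info("before notify: total=%s", order["total"])
    notify(order)
    log.info("after notify: total=%s", order["total"])
    return order["total"]


def run_experiment():
    """Hypothesis: the notifier changes the order total.

    Run the same input once with the real notifier and once with a stub
    that does nothing, then compare the results.
    """
    expected = 19.99
    with_real = process_order({"total": expected})
    with_stub = process_order({"total": expected}, notify=lambda order: None)
    return {
        "real": with_real,
        "stub": with_stub,
        # Supported only if the stubbed run is correct and the real run is not.
        "hypothesis_supported": with_stub == expected and with_real != expected,
    }
```

If both runs had returned the same total, that would count against the hypothesis. Either way, the before/after log lines and the comparison go into the notes, and the same comparison can be reused later to check that a fix works.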

Why do I like it?

I think applying the scientific method is one of the most powerful tools in my software debugging toolbox. Hopefully you can already see why it might be worth considering! Here are 5 reasons that I like it:

It keeps me focussed on the problem

If I’m not careful, I can find myself going down massive rabbit holes if I just go spelunking into a tricksy issue. Applying the scientific method helps me to focus on one single likely cause at a time so that I keep making progress.

It keeps me objective & data driven

Supporting my hypotheses with enough evidence to justify them means I don’t make assumptions about what might be going on. Assumptions can lead to red herrings or confirmation bias. Drawing the hypotheses from data helps to combat this.

I also find that it helps me to get to the bottom of a problem: I understand the cause and can apply a fix I’m confident in, rather than just throwing stuff at the wall and seeing what sticks.

It keeps me motivated

Sometimes a tricksy problem can get a bit intimidating or demoralising: there’s such a large set of possible causes that it’s not clear where to start. The structure and iterative nature of the scientific method helps combat that: it’s harder to get overwhelmed when you’re just trying to figure out whether or not your hypothesis is correct.

It’s useful when reporting

Some of the issues I’ve looked into over my career were high stakes, or just very important for someone. I’ve found that applying the structured, iterative approach described above helps to keep people happy (or if not happy, at least less angry!).

This is because you can talk rationally about what you’ve done and why you’ve done it. When you get hauled into a meeting to explain why the issue is still happening, you can talk about what you’ve ruled out, which is still progress! Being able to give useful status updates makes a big difference to stakeholders who are often themselves under pressure to show that progress is being made.

It encourages me to Write Stuff Down

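I’ve found the practice of keeping a ‘lab notebook’ of sorts when investigating issues in software to be valuable for many of the reasons that people find journalling valuable: writing things down helps me to clarify and organise my thoughts, and it builds up a dataset that I can refer back to later.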

Summary

I hope I’ve convinced you of the value of the scientific method’s Observation → Hypothesis → Experiment → Conclusion loop. I’ve found it invaluable over my career, especially when faced with a tricksy problem in a software system. It keeps me focussed on the problem at hand, helps me to be creative in coming up with potential causes, and allows me to keep momentum towards finding and fixing the root cause.

If you haven’t tried it before, perhaps try it out next time you have a problem to debug in some software!