Consulting
November 17, 2020

How to Fix Critical Issues (Without Breaking Your Team)

Josh Linden Client Engagement

Uh-oh. Something suddenly isn’t working in your application. Something critical. Something really critical.

Sound the alarms! Wake everyone up! Make everyone work overtime until this is fixed!

…is what I’d recommend if you want to solve the issue inefficiently and burn everyone out in the process. Even when everything seems broken, keeping a level head and taking care of your team is ultimately the quickest way to resolve issues.

Easier said than done, right? Wrong. All you need is a little process! Luckily, there are 10 easy steps you can take to keep calm and carry on.

Step 1: Stop

…but for real, stop.

I know you don’t want to! I know it feels counterproductive to stop! But this is the number-one most important thing you can do when things go haywire.

Stopping curbs your fight-or-flight response, which will help put you in the mindset to make rational decisions, rather than emotional ones. It also allows you to direct your nervous energy toward productive ends rather than releasing it chaotically toward your teammates and into the world at large.

But how do you pump the brakes quickly, especially when “taking a breath” doesn’t cut it? Here are some exercises that have worked for me:

  • Get up and walk away for 5-10 minutes. Take a quick walk around the block, take a shower, do your dishes: five minutes of moving your body and focusing on another manual task helps stimulate ideas and gain some literal and metaphorical distance from the situation.
  • Do something nice for yourself. You’re about to go into emergency mode, so you might as well make or buy yourself your favorite coffee/tea/snack.
  • Write down everything you’re feeling. Getting the frustration, fear, and anger out of your head and onto paper helps get past the initial emotions and on to productivity. (I recommend using sticky notes so that you can crumple them up and throw them out — it’s really satisfying.)

Step 2: Accept That You Are Where You Are

Acceptance, in this case, is the second step.

It’s important to shift your mindset from catastrophic to challenging. The current stressful situation might ruin your evening plans, but it isn’t going to ruin your life. You’re here, and now it’s time to work.

Also, if you’re anything like me, your instinct is going to be to beat yourself up for not avoiding this situation in the first place. This is me freeing you of that guilt. This kind of thing happens, even when you have best-laid plans.

Step 3: Stop Playing the Blame Game

Unfortunately, some people stop before this step. They point fingers and spend more time assigning blame or protecting their territory than fixing the problem.

Internalizing that everyone has the best intentions (including yourself) allows you to keep a more positive and empathetic mindset as you progress through the next few steps. More often than not, collaboration is the key to getting out of your current predicament, and pointing fingers only makes people less likely to collaborate with you.

I’ve also been in plenty of situations where I had a hunch that the cause of a late-night or weekend critical bug hunt lay in someone else’s work, only to realize that it was something my team (or I personally) had done. The reality is you just don’t know what the cause is until you know, and by assigning blame before that happens you only antagonize others and potentially end up with egg on your face.

I do want to note that there is definitely a time and place for assigning accountability. But save those emotions and examples for later. (We’ll get there!)

Step 4: Make a List

For me, this is the make-or-break step you can take when the chaos is beginning to build. But you can’t do it right until you’ve taken Steps 1-3.

Making a list is critical for two reasons:

  1. It codifies everything that needs to be done. You go from “everything could be broken” to “this is what is wrong right now and the priority order” (which also implies “if it’s not on the list, it doesn’t matter right now”).
  2. It’s shareable and trackable. Everyone can get on the same page of what needs to be done and the status of each item.

I get a pen and paper, take it to a table separate from my workstation, and write down a list of everything that seems to be happening. Then I edit the list and prioritize it.

For others, making a diagram may be better. Some things to try:

  • Make a list of everything you know needs to be done, then prioritize it on a 2×2 grid where one axis is “Important” and the other is “Urgent.” Then do only the items that are both Important and Urgent before moving on.
  • Write down everything you don’t know in one column, then in a second column write down what steps you could take to find it out.
  • Draw a mental model of the issue.

However you go about it, take the time to write your list. 10-30 minutes is enough — anything longer and you’re probably overthinking it.
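If your list ends up in a text file or a script rather than on paper, the 2×2 grid described above can be sketched in a few lines of Python. This is only an illustration: the `Item` class, the sample incident list, and the choice to rank Important-but-not-Urgent ahead of Urgent-but-not-Important are all assumptions made for the example, not part of the process itself.

```python
from dataclasses import dataclass


@dataclass
class Item:
    description: str
    important: bool
    urgent: bool


def quadrant(item: Item) -> int:
    """Rank the four grid cells: 0 = important and urgent, 3 = neither.

    Placing important-only (1) ahead of urgent-only (2) is an assumption.
    """
    if item.important and item.urgent:
        return 0
    if item.important:
        return 1
    if item.urgent:
        return 2
    return 3


def prioritize(items: list[Item]) -> list[Item]:
    """Sort by grid cell; sorted() is stable, so ties keep their written order."""
    return sorted(items, key=quadrant)


def do_now(items: list[Item]) -> list[Item]:
    """Only the Important-and-Urgent cell gets worked before moving on."""
    return [item for item in items if quadrant(item) == 0]


# Hypothetical incident list, for illustration only.
incident_list = [
    Item("Update status page", important=False, urgent=True),
    Item("Checkout returns errors", important=True, urgent=True),
    Item("Add regression test", important=True, urgent=False),
    Item("Tidy log format", important=False, urgent=False),
]

for item in prioritize(incident_list):
    print(quadrant(item), item.description)
```

Anything that `do_now` leaves out is, for the moment, not on the list.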

Step 5: Be Clear About What You Need From Others

Once you have a list, think about who you need to work with to solve each of the issues. Do you need to get access to a system? Does the knowledge you need reside with only one person?

Make sure you can answer these three questions:

  • Who do you need? 
  • When do you need them? 
  • For how long?

Answering those questions allows you to get things done and not stress others out. Remember, if you’re stressed, they’re stressed! Having defined roles and expectations during remediation helps everyone stay calm.

Step 6: Use the Scientific Method

Finding a bug or solving a tricky problem requires discipline. You need to use the scientific method for each issue you’re facing:

    1. Have a hypothesis. Based on the information I know right now, what might be the issue?
    2. Run an experiment with only one variable at a time. It is really easy to misattribute the cause when more than one thing changes at once. Test one thing at a time and test each iteration of it separately. Monotony leads to credibility!
    3. Record the results and iterate if necessary. Write down what you’ve tested and the results so that you can refer back to it later. It may take multiple iterations to find the issue, and then you’ll want to run your solution back through those same tests to make sure it works.
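For teams that prefer to keep their notes in code, the three steps above can be sketched as a small experiment log. Everything here is hypothetical: `ExperimentLog`, `Trial`, the configuration keys, and the stand-in `buggy_system` function are invented for illustration, and a real reproduction step would replace the stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trial:
    hypothesis: str
    changed: str          # the single variable changed in this trial
    value: object
    bug_reproduced: bool


@dataclass
class ExperimentLog:
    baseline: dict                   # configuration in which the bug occurs
    check: Callable[[dict], bool]    # returns True if the bug reproduces
    trials: list = field(default_factory=list)

    def run(self, hypothesis: str, variable: str, value: object) -> Trial:
        """Change exactly one variable from the baseline and record the result."""
        config = {**self.baseline, variable: value}
        trial = Trial(hypothesis, variable, value, self.check(config))
        self.trials.append(trial)
        return trial

    def replay(self, check: Callable[[dict], bool]) -> list:
        """Re-run every recorded trial against another check, e.g. after a fix."""
        return [check({**self.baseline, t.changed: t.value}) for t in self.trials]


def buggy_system(config: dict) -> bool:
    """Stand-in for a real reproduction step; here the bug is tied to the cache."""
    return config["cache_enabled"]


log = ExperimentLog(
    baseline={"cache_enabled": True, "timeout_s": 30},
    check=buggy_system,
)
log.run("The timeout is too long", "timeout_s", 5)
log.run("The cache serves stale entries", "cache_enabled", False)

for t in log.trials:
    print(f"{t.changed}={t.value!r}: bug reproduced = {t.bug_reproduced}")
```

Because `run` always starts from the untouched baseline, each trial differs from it by exactly one variable, and `replay` lets you push a candidate fix back through the same trials later.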

Step 7A: No Rock is Too Small To Pick Up

Okay, I’m cheating a little bit on this step, but these two substeps go hand-in-hand.

Imagine you’re tending to your garden. Suddenly, you notice there’s a rock out of place — how did a rock get into the azaleas? It’s far away, and your knees are tired, but you know you have to get up off your knees and pick it up or else there will be issues down the line.

Bug hunting is like that, except sometimes underneath the rock is a den of snakes. The key is to always pick up the rock, even if you’re tired, and especially if you think there might be snakes. You have to face the snakes.

It’s often the smallest things that cause the biggest problems. Even if the solution is a one-line code change or updating an environment variable, it’s not going to be obvious in the moment. The only way to know is to pick up the rock.

Step 7B: Foster a Culture of Experimentation

Picking up the rock only works when everyone thinks the rock is worth picking up.

In the heat of the moment people can slip into a defensive mindset. The closer you get to the solution, the more likely it is that people will respond, “We’ve already tried that” or “That’s not related to the problem.” Maybe it’s because they are trying to protect themselves and don’t want the problem to be their fault. Maybe it’s because they are just tired, cranky, and/or hungry.

Regardless, a culture that shuts down ideas shuts down solutions. Using positive and descriptive language (e.g., “I don’t think that’s going to end up being relevant for X reason”) allows everyone to hear the thought process. It may be true that it’s not relevant, but at least now everyone knows why. And if it turns out to be relevant after all, you might have just found your solution!

Keeping the spirit of experimentation strong keeps people positive and prevents them from slipping into the blame game again. (We already solved that in Step 3! Onwards!)

Step 8: Once It’s Working, Re-Test Everywhere

Phew! We have a solution! It’s working on our local machines and we’ve successfully diagnosed the issue!

Now it’s time to repeat all your tests from before, and in each environment. Good thing you made a list and followed the scientific method!

There are always goblins and gotchas lurking in different environments, so you can’t know whether something is working until the bug can no longer be reproduced in any environment. You might be delirious or giddy, but I can’t tell you how many solutions have worked everywhere else and then failed on Production for some other reason. The extra hour of testing will give you peace of mind, so take the time to do it.
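As a rough sketch of what "repeat all your tests in each environment" can look like when the checks are scripted, the snippet below runs the same set of checks against every environment and reports where any still fail. The environment names and the two lambda checks are placeholders invented for the example; real checks would call into your own application or test suite.

```python
from typing import Callable

# Hypothetical environment names; substitute whatever your team deploys to.
ENVIRONMENTS = ["local", "staging", "production"]


def retest(
    checks: dict[str, Callable[[str], bool]],
    environments: list[str],
) -> dict[str, list[str]]:
    """Run every check in every environment; return failing checks per environment."""
    failures = {}
    for env in environments:
        failed = [name for name, check in checks.items() if not check(env)]
        if failed:
            failures[env] = failed
    return failures


def is_resolved(failures: dict[str, list[str]]) -> bool:
    """The issue counts as resolved only when no environment reports a failure."""
    return not failures


# Placeholder checks: the second one passes everywhere except production,
# mimicking a fix that works locally but not once deployed.
checks = {
    "checkout succeeds": lambda env: True,
    "emails are sent": lambda env: env != "production",
}

failures = retest(checks, ENVIRONMENTS)
print(failures)
print("resolved:", is_resolved(failures))
```

The point of returning failures per environment, rather than a single pass/fail, is that it tells you exactly which rock still needs picking up.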

Step 9: Go Rest

You and your teammates just expended a lot of emotional and mental energy. You need to step away and rest.

Make a list of next steps and then go do something nice for yourself. If you’re a manager or team lead, give your team the next morning or full day off.

Prioritizing time to rest and not think about work after a stressful remediation is critical to not burning anyone out. The time away also provides perspective that you might not be able to see in the heat of the moment.

Even if it doesn’t feel like you have the luxury to rest, believe me, you’ll lose far more than one day of productivity if everyone has to jump back into the trenches immediately.

Step 10: Do a Learning Review

Some time within a week or two after resolution, have a Learning Review.

I want to give a shout-out to my colleague Stephanie Minn for sharing the term “Learning Review” with me. I’ve always called these “post-mortems” in the past, but upon reflection that feels too negative. Nothing died, nor do we need to grieve. We had a learning moment, so we need to document what happened and how we’re going to change our process to prevent similar issues in the future.

Tandem’s goals for any Learning Review are:

  • Understand what happened, and why
  • Recognize what was helpful and worked well
  • Recognize what we can do to improve our systems, processes, and training
  • Share knowledge across teams

We pursue those goals without blame and by assuming best intentions from individuals, but with an eye to accountability. As a result, we try to fix processes and systems to iterate towards fewer crisis moments in the future.


That’s it! 10 steps that are logically straightforward and sometimes difficult in practice. Following these steps will keep you and your team more sane and more on track.

Hopefully this can serve as a blueprint for the next time the alarm goes off (and it will). And if you want to work with a company that tries their best to prevent critical remediation issues from happening in the first place through automated testing, clear processes, and strong documentation, shoot us a message.
