Lessons I’ve Learned from Pushing Bugs into Production

Anish Krishnan
5 min read · Jun 21, 2021

It doesn’t matter how experienced we are as programmers — we’ve all written bugs that get deployed. Whether it’s something small, like a broken link, or something more serious that results in large financial costs, it’s impossible to maintain a perfect track record in this respect. If you’re a newer developer with impostor syndrome, suffering from guilt after starting a programmatic fire, just remember that the senior developers and subject-matter experts (SMEs) helping you put out that fire are good at doing so in part because they’ve started some of their own. I’ve been in software engineering for about seven years and I’ve deployed my fair share of application-breaking releases — and in the process I’ve learned some important lessons in (a) preventing them from happening, (b) managing them when they do happen, and (c) dealing with the aftermath once they’ve been fixed. Below are just some of the takeaways I’ve had over the course of my career.

  1. Release smaller increments of code more frequently. One of the reasons I prefer a more Agile approach to software development is that smaller deliverables are less likely to have errors. If we have to release a large, comprehensive feature with a two-month deadline, the probability of that code having at least one bug is fairly high. While a thorough code review and testing process can minimize the risk, there are still many variables to deal with. If we instead release a minimum viable product (MVP) in a shorter period of time and build upon it in subsequent iterations, we not only go through multiple code quality checks, but also allow ourselves to focus on smaller pieces of code. It’s easier to spot a bug when it isn’t hidden among a colossal set of code changes.
  2. Keep calm when you’re troubleshooting a production issue. It’s never fun to see production errors in the logs, or, worse, to be notified by a client that their application is suddenly down. But it’s extremely important to keep our cool if and when this happens, and to work with our colleagues to fix the issue. Expressing fear and frustration can hurt morale, and cursing the state of affairs won’t increase productivity. Instead, it’s best to stay focused, consider the variables that could have caused the issue, and weigh the different avenues for addressing the problem (e.g., rolling back, deploying a hotfix, or restarting failing pods in a Kubernetes cluster). We have to use whatever resources we have at hand, which brings me to my next point…
  3. Don’t be afraid to ask for help. As software engineers, we have a tendency to want to solve problems ourselves. Challenges are fun, and there are few things more satisfying than finally cracking a difficult puzzle. The problem is that while this approach is acceptable — perhaps even beneficial — in personal projects, it won’t fly when a company’s service-level objective (SLO) promises 99.99% uptime and its application isn’t working. Now is the time to put aside our egos and contact any SMEs who may be able to offer assistance.
  4. Consider your options. Decision making is a key skill when dealing with production bugs; in particular, it means knowing when to roll back, when to deploy a hotfix, and when to investigate the issue further. There are essentially three variables involved. The first is whether we know the source of the problem. If we do, the second is how simple the fix would be: if it’s a small tweak to the code that can be quickly tested, then perhaps a hotfix is our best choice. The third is the error budget. Put very simply, this is how much downtime we can absorb over a given period before we violate our SLO (the sketch after this list shows a rough calculation). Say, for instance, that a previous outage has already eaten into our guaranteed availability for the month, leaving little downtime in our error budget. In that case, a rollback is most likely the safest decision to avoid any contractual issues with consumers.
    With this in mind, is there ever a time when we can investigate the cause of an issue without rolling back or deploying a hotfix? I would say perhaps, if the bug isn’t a showstopper and is unlikely to inconvenience consumers.
  5. Accept responsibility (but don’t go overboard). Building trust is integral to maintaining the strength of a team, and part of that trust comes from active ownership. While this applies to all engineers on a team, it especially applies to senior engineers and tech leads, who must set an example for the others. And ownership doesn’t simply entail taking the lead in developing deliverables; it also means being open about any bugs we’ve pushed to production. Accepting responsibility is key; it shows humanity, maturity, and good communication. However, there’s a difference between accepting responsibility and apologizing profusely. The former lays the groundwork for a root cause analysis and for identifying next steps; the latter can shatter our self-confidence.
  6. Identify other contributing factors and make plans to address them. The reality is that no single person is wholly responsible for a production bug. Sure, person A may have written the code, but persons B and C were responsible for reviewing that code (and perhaps didn’t do so thoroughly). Persons D and E were responsible for QA and UAT, respectively. Person F may be responsible for managing infrastructure elements that could explain differences between the development, staging, QA, and production environments. The point is that people make mistakes, so it’s important to consider the various elements of the system and how to improve them. For instance, the team could make code reviews and code coverage mandatory, or communicate better with the QA team to ensure that testers have thorough information about test cases and test data. Improving the overall system makes it harder for a single engineer to push a bug through.
  7. Maintain good documentation. It’s important for the team to take good notes throughout the process. As soon as a production issue is detected, we need to maintain an up-to-date timeline of events, recording specific error messages, attempted courses of action, and their results, as well as any auxiliary discussions or suggestions that came up along the way. After the issue has been resolved, it’s vital to hold a blameless post-mortem/root cause analysis meeting to discuss what went wrong, how it was resolved, and what action items will prevent it from happening again — and this meeting should be thoroughly documented as well. Not only does this documentation keep us focused and thinking ahead, but it also discourages finger-pointing (which is terrible for team morale) and serves as valuable information for any members of the team who join after the fact.
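
To make the error-budget math in point 4 concrete, here is a minimal sketch in Python. The 99.99% figure comes from earlier in the post; the 30-day window, the decision thresholds, and the function names are illustrative assumptions rather than part of any particular team’s process.

```python
# A rough sketch of the error-budget arithmetic and the rollback/hotfix/investigate
# decision from point 4. The window and thresholds below are illustrative
# assumptions, not prescriptions.


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed per window while still meeting the SLO."""
    return (1 - slo) * window_days * 24 * 60


def choose_action(cause_known: bool, fix_is_small: bool,
                  budget_left_minutes: float, est_fix_minutes: float) -> str:
    """Mirror the three variables from point 4: known cause, fix size, remaining budget."""
    if not cause_known:
        # We don't know what broke, so restoring the last known-good version is the safe default.
        return "roll back"
    if budget_left_minutes <= 0:
        # A previous outage already spent the budget; don't gamble on fixing in place.
        return "roll back"
    if fix_is_small and est_fix_minutes < budget_left_minutes:
        # The cause is known, the change is tiny, and the remaining budget can absorb it.
        return "deploy a hotfix"
    return "roll back, then investigate on a branch"


if __name__ == "__main__":
    budget = error_budget_minutes(0.9999)  # ~4.3 minutes per 30-day month
    print(f"A 99.99% monthly SLO allows about {budget:.1f} minutes of downtime.")
    print(choose_action(cause_known=True, fix_is_small=True,
                        budget_left_minutes=1.0, est_fix_minutes=5.0))
```

With only a minute of budget left and a fix that would take several minutes to verify, the sketch lands on rolling back, which is the same trade-off described in point 4.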

There’s no shame in admitting we’ve pushed a bug into production — every engineer has done it. What’s important is knowing how to avoid it, how to deal with it as it’s happening, and how to reduce the likelihood that it will happen again. Earlier in my career, I was quick to simply shoulder the blame (or try to avoid it completely) without taking productive next steps. Experience has taught me that the release process is complex; it involves a system of many individuals, each of whom plays an important role and none of whom completely owns a release. Recognizing that is the first step to deploying safely.

