About the author
Will Barrett is a Software Engineer, Technical Lead, and Engineering Manager from the San Francisco Bay Area with over 14 years of experience. He’s a Superset PMC Member of The Apache Software Foundation. He’s held staff software engineer and senior engineer roles at Change.org, Entelo, Sqwiggle and Preset.
Will is the author of On Learning to Program, a blog for new Software Engineers entering the industry. Will is also a certified reviewer on PullRequest, where he’s caught hundreds of bugs, security vulnerabilities, and other critical issues for over 40 teams.
I was 2 weeks into my job at Change.org the first time I took down production at that company. Yes, that’s right, the first time. I stood up in front of the whole engineering team of around 50 people at the end of my second week and gave a presentation about exactly what happened and how I caused a 30-ish minute site outage. I had just moved to the San Francisco Bay Area from the East Coast for this job the week before. My wife and I didn’t even have an apartment yet - we were staying in an Airbnb. She hadn’t landed a job in the area yet, so we were relying on me to make this job work. In short…
Yes, I Was Terrified
This had to have been a land speed record for breaking stuff as a new hire at that company. I was sure that I was in big trouble, especially when I was asked (very politely) to give a “festival of failure” presentation about what went wrong. So I worked on my deck, added a little joke here and there, and waited for the axe to fall. I stood up to give my presentation and, surprisingly to me…
The Whole Team Laughed!
Everyone, from the most junior engineer all the way up to the VP of Engineering, laughed hysterically. They weren’t mad. They thought my presentation was funny, and we had exposed a very easy way to break the site through a slow database migration. We decided that we would address the problem, and I was given the task of fixing it alongside my normal responsibilities. It took me a few weeks, but I came up with a system for running migrations in the background, and it helped the organization avoid most similar issues going forward.
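The article doesn’t spell out the system Will built, but the general pattern for avoiding this kind of outage is well established: keep the schema change itself fast, then move the slow data work into small batches that run in the background so no single transaction holds table locks for long. Here’s a minimal sketch of that batching idea in Python, using the standard library’s sqlite3 so it runs standalone; the table and column names (`users`, `email_normalized`) are hypothetical and not taken from the story.

```python
# A minimal sketch of a batched "background" backfill. Assumes the nullable
# column was already added by a fast, metadata-only schema change, e.g.:
#   ALTER TABLE users ADD COLUMN email_normalized TEXT;
import sqlite3
import time

BATCH_SIZE = 1_000     # keep each transaction small so locks are held briefly
PAUSE_SECONDS = 0.1    # give other writers a chance between batches

def backfill_in_batches(conn: sqlite3.Connection) -> None:
    """Populate users.email_normalized a few rows at a time instead of
    running one long UPDATE that would lock the whole table."""
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                """
                UPDATE users
                   SET email_normalized = lower(trim(email))
                 WHERE id IN (
                       SELECT id FROM users
                        WHERE email_normalized IS NULL
                        LIMIT ?)
                """,
                (BATCH_SIZE,),
            )
            if cur.rowcount == 0:  # nothing left to backfill
                break
        time.sleep(PAUSE_SECONDS)

if __name__ == "__main__":
    backfill_in_batches(sqlite3.connect("app.db"))
```

Each batch commits on its own and the loop pauses briefly between batches, so normal traffic keeps flowing while the backfill grinds along in the background.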
No, I Wasn’t Fired
Taking down production is part of life as a software developer. It happens from time to time and every Engineering organization worth its salt will take every opportunity to learn from a technical failure or honest mistake. Some folks may express frustration, especially if the outage occurs at an inconvenient time, but the frustration isn’t likely to be directed at you. Even if it is, it’s likely to pass quickly.
After a Mistake, You Have the Opportunity to Shine
Acknowledging and embracing the error, leaning into the learnings, and doing your best to ensure the same mistake doesn’t happen again will make you a great contributor to the team, even when the chips are down and you’re responsible. Taking appropriate responsibility will help you gain the respect of your peers. Everyone makes mistakes, and being a stand-up colleague when you do will help you maintain trust with the rest of your team. Taking the initiative to fix the underlying issue, whether through a technical change or a process improvement, can definitely earn you points with your colleagues as well.
Remember, No Error Happens in Isolation
Unless you are the only developer on a project, no error is entirely your fault. Someone built the deployment system that didn’t catch the problem, someone approved the code change, and someone built the underlying system with the vulnerability. Responsibility is shared. Very few failures can truly be attributed to the actions of a single individual.
Even if you feel like it’s entirely your fault, do your best to remind yourself that it’s not. Other people were involved, either in creating the error that just broke production or in building a production system that’s vulnerable to that kind of mistake. Embrace this shared responsibility and let go of any guilt you might be feeling about making a mistake.
Conclusion: Going Offline Will Happen
Most software applications are not so mission critical that they can’t ever go offline unexpectedly. If you work on one that really can’t, you already know that, and the way you do your work looks very different from the rest of the industry. For the rest of us, occasional outages are part of life. If an organization never has outages, that’s often a sign that it’s moving too slowly. Normal development efforts balance risk with reward and occasionally break stuff as a side effect of trying to get good things done. Embrace those moments, learn from those failures, and hug it out when things go wrong. Of course, don’t break stuff on purpose (that’s generally illegal), but allow yourself to be human. Making mistakes is part of the job.
Find this useful? Be sure to check out these additional articles by Will Barrett:
- How Giant Data Leaks Happen - Understanding Cryptographic Failures (OWASP Number 2 for 2021)
- Navigating Time Pressure in the Code Review Process
- What We Can Learn from the Ruby on Rails Project about Code Review
- 5 Ways I Provide Value as a PullRequest Reviewer When I Start Reviewing a New Project
- How to Catch the Top OWASP 2021 Security Category - Broken Access Control - in Code Review Part 1 and Part 2
- Tips for Migrating to a New Computer for Programmers
- The Top 5 Most Common Security Issues I Discover When Reviewing Code
- Reviewing One’s Own Code
- Handling Code Reviews with Empathy
- What We Can Learn About Code Review From the Apache Superset Project
- What We Can Learn About Code Review From the Apache Spark Project