Avoid frostbite: Stop doing code freezes
A code freeze is intentionally halting changes to your codebase and environments in an effort to reduce the risk of an outage. On the surface, pausing on deployments feels like a logical solution to preventing incidents. Unfortunately, this isn't the case.
By Robert Ross on 11/11/2021
As the holiday season aggressively approaches, I want to make a public service announcement for everyone toying with the idea of a code freeze for the holidays: please don't. It’s getting cold outside and the season of peppermint mochas is upon us, which might get you thinking about putting a freeze in place. A word of warning: instituting a code freeze may have unintended consequences.
A code freeze is intentionally halting changes to your codebase and environments in an effort to reduce the risk of an outage. On the surface, pausing deployments feels like a logical way to prevent incidents. Because most incidents are caused by some change, such as a code deployment or a config change, it stands to reason that a code freeze reduces the chances of an outage.
The truth is, incidents are inevitable. Code freezes ultimately do not reduce the number of incidents; they shift the incidents you do have to unexpected places. Sometimes these incidents are so hard to diagnose that they take longer to resolve, and no one likes interruptions during a delicious turkey dinner.
Negative space is harder to comprehend
In the absence of regular deploys, an incident can and will occur. It might be something straightforward, like a memory leak that was obscured by the frequency of your deploys. But there are also incidents where you have no idea why an application with no recent changes has suddenly started to misbehave.
Let’s take a trip with the Ghost of Christmas Past. I was responsible for a Rails application that had stalled and all requests were timing out during the holiday season. The effects were felt by everyone in the whole company and a SEV1 incident was immediately declared. We were in a code freeze, so what on earth happened?
It became apparent that our application was running Postgres queries whose runtime had suddenly ballooned, even for simple SELECT statements. It took hours for us to unravel what was really happening. Was an index missing? Were there connectivity problems to the database? Did we have a noisy neighbor? Our experience pointed us toward those usual suspects (we’d had those incidents in the past, who hasn't?), so we couldn't quickly comprehend that not deploying the application was the final Jenga piece in this complex incident.
I'll make the story short. Our Rails application used a version of ActiveRecord that did not create prepared statements correctly for queries that involved date ranges (WHERE created_at BETWEEN ? AND ?). As a result, our application was creating thousands and thousands of prepared statements across hundreds of database connections. Postgres stores prepared statements per connection in memory, and eventually... the database started swapping because it ran out of memory.
We never encountered this memory leak because we always released the resources by, you guessed it, deploying.
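To make the mechanism concrete, here is a minimal sketch in plain Ruby with no real database involved. It assumes, for illustration only, that the bug caused each distinct date range to produce different statement text, so the statement cache on a connection could never reuse an entry. The FakeConnection class, the orders table, and the timestamps are all hypothetical and are not taken from ActiveRecord or from the incident itself.

```ruby
# Toy model of a database connection that keeps every prepared statement
# it has seen in memory, keyed by SQL text. No real database is involved.
class FakeConnection
  attr_reader :statements

  def initialize
    @statements = {}
  end

  # Prepare the SQL once per distinct text, then pretend to execute it.
  def exec(sql, binds = [])
    @statements[sql] ||= "stmt_#{@statements.size + 1}"
    [@statements[sql], binds]
  end
end

# Assumed buggy path: the date values end up inside the SQL text,
# so every distinct range becomes a brand new prepared statement.
def query_with_inlined_dates(conn, from, to)
  conn.exec("SELECT * FROM orders WHERE created_at BETWEEN #{from} AND #{to}")
end

# Intended path: placeholders keep the SQL text stable, so a single
# prepared statement is reused for every range.
def query_with_binds(conn, from, to)
  conn.exec("SELECT * FROM orders WHERE created_at BETWEEN $1 AND $2", [from, to])
end

leaky = FakeConnection.new
stable = FakeConnection.new

1_000.times do |i|
  from = 1_638_316_800 + i # hypothetical epoch seconds
  to = from + 3_600
  query_with_inlined_dates(leaky, from, to)
  query_with_binds(stable, from, to)
end

leaky_count = leaky.statements.size
stable_count = stable.statements.size

# A deploy restarts the process, which opens fresh connections and
# discards whatever the old ones had accumulated.
redeployed = FakeConnection.new
after_deploy_count = redeployed.statements.size

puts "inlined dates: #{leaky_count} statements held on one connection"
puts "bound params:  #{stable_count} statement held on one connection"
puts "after deploy:  #{after_deploy_count} statements"
```

In this model the leaky connection holds one statement per query while the stable one holds exactly one. Multiply the leaky case by hundreds of long-lived connections and remove the periodic reset that a deploy provides, and the growth has nowhere to go.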
The January deploy frenzy
Just because you’ve implemented a code freeze doesn't mean your team has stopped building and working on projects. Plenty of work is happening and being held on staging or in unmerged pull requests. When the first working day of January rolls around, absolute chaos can ensue as everyone starts the merge fire sale.
The log-jammed deploys going out in rapid succession are even more likely to cause an incident because you now have a substantial amount of new code that has never run in a production environment. Shipping smaller changes on a regular basis yields more stability than shipping massive changesets. Changing your deployment cadence abruptly, first by stopping and then by flooding, puts your system into a state it has never operated in. Will your CI/CD pipeline hold up? What if you need to roll back a change from three merges ago? The sudden onslaught of production changes is far more likely to cause damage than staying the course and never stopping deploys at all.
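The batching argument can be put in rough numbers. The sketch below assumes, purely for illustration, that every change independently carries the same small chance of causing an incident. The 2% figure and the batch sizes are made up, not measured, and real changes are neither independent nor equally risky.

```ruby
# Probability that at least one change in a batch causes an incident,
# assuming each change fails independently with probability p.
def batch_incident_probability(p, batch_size)
  1.0 - (1.0 - p)**batch_size
end

P_PER_CHANGE = 0.02 # hypothetical per-change incident probability

[1, 5, 50].each do |n|
  prob = batch_incident_probability(P_PER_CHANGE, n)
  puts format("%3d changes in one deploy: %.1f%% chance of an incident, %d changes to investigate",
              n, prob * 100, n)
end
```

This simple model does not make any individual change riskier. What it shows is that a large batch concentrates the risk into a single deploy and widens the search when something breaks, which is the situation a post-freeze backlog creates.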
Your current reliability is based on your current process
Everyone is always striving for a more reliable system. But I'm not convinced that any of us are so dissatisfied with our current reliability that stopping deployments is what will solve our problems during our busiest seasons. This holiday season, why not enjoy new deployments and that tasty peppermint mocha?
"Speed has never killed anyone. Suddenly becoming stationary, that's what gets you."