Shellshock and automating infrastructure complexity

September 30, 2014 - Mike Place

From the look of things, Shellshock beats Heartbleed in both severity and system administrator workload. It has been good to see SaltStack play a role in alleviating much of the work needed to diagnose and remediate vulnerable systems. If nothing else, the past few days have been a stark reminder that this stuff is hard work, that to automate data center things you first have to understand those things, and that tools may hide complexity but the complexity still exists.

An article by Mike Loukides titled “Beyond the Stack” sums up the challenge of infrastructure complexity quite nicely: “I’m not arguing that the sky is falling because … cloud. But it is critically important to understand what the cloud means for the systems we deploy and operate. As the number of systems involved in an application grows, the number of failure modes grows combinatorially. An application running over 10 servers isn’t 10 times as complex as an application running on a single server; it’s thousands of times more complex.”
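One rough way to see the growth described in that quote is to count how many different sets of servers could be down at the same time. This is a deliberately simplified model of our own (each server is either up or down, and the function name is made up for illustration), not a calculation from the article:

```python
from math import comb


def failure_combinations(servers: int) -> int:
    """Count the non-empty sets of servers that could be down at once."""
    # Sum C(n, k) for k = 1..n, which equals 2**n - 1.
    return sum(comb(servers, k) for k in range(1, servers + 1))


print(failure_combinations(1))   # 1
print(failure_combinations(10))  # 1023
```

Even this toy model ignores partial failures, slow networks, and the order in which things break, so the real number of failure modes is larger still.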

Enter Shellshock, Heartbleed, etc.
Shellshock was discovered last week and affects untold numbers of servers worldwide. The vulnerability is considered catastrophic by many in the security community because of its potentially widespread impact on everything from shell servers to web servers and nearly everything in between. As long as a Bash shell is available on a system, there is cause for concern.
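For a sense of what checking a single machine involves, here is a minimal sketch that wraps the widely circulated probe for the original bug (CVE-2014-6271) in Python. The function name is ours and purely illustrative, and the probe covers only that first CVE, not related ones such as CVE-2014-7169, so a negative result is not a clean bill of health:

```python
import os
import shutil
import subprocess

# Widely published probe for CVE-2014-6271: a function definition stored in
# an environment variable, followed by an extra command.
PROBE = "() { :;}; echo vulnerable"


def bash_is_vulnerable(bash: str = "bash") -> bool:
    """Return True if this Bash runs the command trailing the function definition."""
    path = shutil.which(bash)
    if path is None:
        return False  # no Bash found, so there is nothing to probe
    env = dict(os.environ, x=PROBE)
    result = subprocess.run(
        [path, "-c", "echo test"],
        env=env, capture_output=True, text=True, timeout=10,
    )
    # A patched Bash prints only "test"; a vulnerable one also prints "vulnerable".
    return "vulnerable" in result.stdout


if __name__ == "__main__":
    print("vulnerable" if bash_is_vulnerable() else "not vulnerable to this probe")
```

Running this on one machine is easy. Running it everywhere, and trusting the answers, is the part the rest of this post is about.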

Naturally, systems engineers everywhere were quick to respond, and SaltStack users were able to diagnose system exposure and remediate the flaw in seconds by simply targeting all of their systems with a command to upgrade the affected package. For most, it was as simple as:

salt '*' pkg.install refresh=True bash

And a few of the results made possible with SaltStack automation:

Those unfamiliar with (or unimpressed by) configuration management and server orchestration software might point out that this isn’t a particularly neat trick.

After all, there are almost innumerable ways to accomplish this goal. Whether it’s a plain SSH loop, a clustered SSH tool like cssh, or a tool like Fabric or Capistrano, automating simple tasks like repeating a single command across a few dozen machines is…simple. For such a trivial need, we’ll be the first to admit that configuration management software of any flavor is simply overkill.
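To be fair to the simple approach, an SSH loop really is only a few lines. The sketch below assumes key-based SSH access to every host and uses placeholder function names of our own choosing:

```python
import subprocess
from typing import Callable, Dict, List


def ssh_runner(host: str, command: str) -> int:
    """Run a command on one host with the system ssh client; return its exit code."""
    # BatchMode=yes makes ssh fail instead of stopping to ask for a password.
    return subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, command]
    ).returncode


def run_everywhere(hosts: List[str], command: str,
                   runner: Callable[[str, str], int] = ssh_runner) -> Dict[str, int]:
    """Run the same command on each host in turn and collect the exit codes."""
    return {host: runner(host, command) for host in hosts}
```

Something like run_everywhere(["web1.example.com", "web2.example.com"], "sudo yum -y update bash") would cover a small, uniform fleet (the host names are hypothetical). Notice what the loop quietly assumes: one command fits every host, every host is reachable, and an exit code is all the evidence you need.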

However, let’s examine a few cases when a powerful management system might save time over more simplistic techniques:

Disparate infrastructure
Wouldn’t it be nice if every single deployment used only a single flavor of your favorite operating system running a single application on standardized architecture?

Of course, this is rarely the case. Your boss decided you needed four Windows servers to run his mission-critical unicorn generator Java app…and half a dozen AIX systems to run the billing system his cousin’s company sold him in 1994…and you’ve somehow managed to inherit the Linux infrastructure from an old team who couldn’t decide which Linux flavour to run so they went with…all of them.

So, even if you could construct an SSH loop to connect to all of these machines, you’d end up running very different commands on each type to perform the upgrade. Perhaps you could modify your script to do this and hope for the best, but it’s certainly time-consuming and potentially error-prone. Oh, and let’s remember that for this to work at all, you first need SSH access to every one of those servers. Not guaranteed.
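Here is roughly what "very different commands on each type" looks like once you write it down. The family labels and the function name are illustrative assumptions, the table covers only a handful of Linux families, and it has nothing to say about the Windows or AIX boxes:

```python
from typing import List

# Package-manager invocations that upgrade Bash on a few common Linux families.
UPGRADE_COMMANDS = {
    "debian": ["apt-get", "install", "--only-upgrade", "-y", "bash"],
    "redhat": ["yum", "-y", "update", "bash"],
    "suse": ["zypper", "--non-interactive", "update", "bash"],
    "arch": ["pacman", "-S", "--noconfirm", "bash"],
}


def upgrade_command(os_family: str) -> List[str]:
    """Return the Bash upgrade command for an OS family, or raise if unknown."""
    key = os_family.strip().lower()
    if key not in UPGRADE_COMMANDS:
        raise ValueError(f"no upgrade command known for {os_family!r}")
    return list(UPGRADE_COMMANDS[key])
```

Every new platform means another entry to write, test, and keep current. That bookkeeping is exactly what a call like pkg.install hides from you.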

Moving Targets
Systems rarely sit still. They come up, they go down. Networks lag and sometimes fail. Sometimes machines are responsive, other times not so much. A scripted rolling update for something like Shellshock becomes more difficult as the number of systems increases, because the need to guarantee that the update was applied also increases. To have a high degree of confidence, results need to be centrally collected and easy to audit. It might be tempting to simply patch systems and then run a second script to check the results, but any machine that was offline during either pass is silently skipped, so auditing offline machines is a critical concern.
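The auditing problem boils down to comparing what came back against what should have come back. A minimal sketch, assuming you keep an inventory of expected hosts (the names and the shape of the results dictionary are made up for illustration):

```python
from typing import Dict, Iterable, List, Optional


def audit(expected_hosts: Iterable[str],
          results: Dict[str, Optional[bool]]) -> Dict[str, List[str]]:
    """Sort every expected host into patched, failed, or unreachable.

    results maps a host to True (patched) or False (upgrade failed);
    hosts that never replied are missing from it or map to None.
    """
    report: Dict[str, List[str]] = {"patched": [], "failed": [], "unreachable": []}
    for host in sorted(expected_hosts):
        outcome = results.get(host)
        if outcome is True:
            report["patched"].append(host)
        elif outcome is False:
            report["failed"].append(host)
        else:
            report["unreachable"].append(host)
    return report
```

The important design choice is that the loop walks the inventory, not the results. A host that never answered still shows up in the report instead of vanishing.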

Coordinated updates
The easiest fixes are those that have little to no chance of affecting production services. Of course, we know that isn’t always possible.

Case in point: the Shellshock bug was believed to be externally exploitable through a number of means, including web servers running certain types of applications. This is a fairly common scenario, since updating system software naturally implies managing the application that runs on the system.

Orchestrate and automate the complexity
These concerns created the need for orchestration software like SaltStack. In the case of a bug like Shellshock, the need might be more complex than simply “run a package upgrade on all my machines.” It could be, “Run a package upgrade on two load balancers at a time, but before you start, take the web tier behind those load balancers offline. Oh, and if the web servers don’t restart, don’t bring them back into the load balancer rotation. What’s more, when you’ve verified that all the web servers are back online, do the same thing but apply it to the database layer. And if the whole thing fails, make sure you can roll it back.”
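To make that wish list concrete, here is a small sketch of the batching logic for a single tier. It is plain Python, not Salt's orchestration API. The four callables (drain, upgrade, healthy, restore) are hypothetical hooks you would have to supply, and rollback is deliberately left out:

```python
from typing import Callable, List, Sequence


def rolling_upgrade(hosts: Sequence[str],
                    batch_size: int,
                    drain: Callable[[str], None],
                    upgrade: Callable[[str], bool],
                    healthy: Callable[[str], bool],
                    restore: Callable[[str], None]) -> List[str]:
    """Upgrade hosts batch by batch; return the hosts put back into rotation.

    Stops after the first batch that contains a failure, so the remaining
    hosts are never touched.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    back_in_rotation: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            drain(host)              # take the host out of rotation first
        batch_ok = True
        for host in batch:
            if upgrade(host) and healthy(host):
                restore(host)        # only healthy hosts rejoin the pool
                back_in_rotation.append(host)
            else:
                batch_ok = False     # leave the failed host drained
        if not batch_ok:
            break                    # do not touch the remaining batches
    return back_in_rotation
```

Even this toy version has to decide what to do with a half-finished batch, and it still knows nothing about the database tier that comes next. For the batching part alone, the salt command has a --batch-size option. Ordering and dependencies across tiers are where orchestration earns its keep.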

Most often, it’s not the single package upgrade itself that causes difficulty. It’s the work of having to stage an upgrade across a complex, interdependent infrastructure, wait for each step to finish, and then verify success or failure before moving on to the next. Any systems engineer can share stories about the upgrade that took all day (or all week), and this is most frequently the reason why.

What’s more, with some bugs like Shellshock the “fix” isn’t a fix the first time. So, after spending a whole day doing a rolling upgrade by hand, even with the help of basic scripting tools, some unlucky engineer comes in the next morning only to discover they need to do it all again. An orchestration solution built for scale quickly becomes appealing.

Even when an upgrade seems simple, be sure to consider all the angles. The right tools will make all the difference in getting the job done quickly and effectively.