How Fast Can You Rebuild Your Most Important Server?

We’ve all been there.  One of your critical servers has a serious problem and you are trying to get it running again.  It could be a hardware failure, an operating system problem, a bad application deployment, or some type of malware incident.  All you know is that your server is down, a critical application is not working, and you need it back.  How long will you work on it?  As long as it takes, right?  This is your baby.  You built it by hand originally and have updated it with care ever since.  So naturally you’ll do everything necessary to bring it back up.  Will you spend hours or days fixing the old server, or will you make a brief attempt at a fix, then kill it off and replace it with an identical server using an automated process?  The answer depends on whether you are treating your servers as pets or as cattle.  Pets are cared for and loved.  Cattle are part of a herd and can be easily replaced.

Pets:

How much of the process to build that server is documented?  Most likely you used a good process to build the original server, perhaps based on an easily replicable build image.  That’s a good start.  But how many manual changes have been made to the server since it was built several years ago?  How many monthly patches?  How many manual tweaks to make the server security compliant and to get the application running correctly?  Is every one of those changes documented as well?  Even if they are, how long will it take you to replicate every change by hand?  If you do not know every precise change that has ever been made, and you really need this server back, then you are treating it as a pet.  You will do everything you can to revive it.  You will spend as much time and money as it takes to make it healthy again.


Cattle:

Imagine instead that you had built and managed your server using “infrastructure as code” techniques.  With this approach, everything it takes to build the server, and every change ever applied to it, is defined in code.  That code is checked into a repository such as Git, so there is a record of every change.  Under this management process, you would try to fix your dead server for a short period, and if you could not recover it quickly, you would simply kill it off and replace it with an identical server in a matter of minutes.  Now you are managing your servers more like cattle: if one is sick, it can easily be replaced by another.
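As a rough illustration, here is a minimal Python sketch of the kill-and-replace step, assuming an AWS environment with credentials already configured and a versioned machine image produced by your build pipeline; the image ID and instance type are illustrative placeholders, not real values.

```python
# Minimal sketch: replace a failed server with an identical one built
# from a versioned machine image.  Assumes AWS credentials are already
# configured; the image ID and instance type below are placeholders.
import boto3

AMI_ID = "ami-0123456789abcdef0"  # illustrative image built by your pipeline
INSTANCE_TYPE = "t3.medium"       # illustrative instance size

ec2 = boto3.client("ec2")

def replace_server(sick_instance_id: str) -> str:
    """Kill the sick member of the herd and launch an identical replacement."""
    ec2.terminate_instances(InstanceIds=[sick_instance_id])

    # Because the replacement comes from the same versioned image, it is
    # identical to the server it replaces: no hand tweaks to remember.
    response = ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]
```

In practice the replacement would also rejoin a load balancer or DNS entry, but the essential point stands: nothing about the new server depends on anyone’s memory.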


The same concept can be used to add capacity: build additional identical servers that can be load balanced when you need to handle extra load, then eliminate them to save money when they are no longer needed.  The idea can also be expanded beyond your servers to your critical network infrastructure, such as switches and firewalls.  Your entire network can be defined and built as code, and can be torn down and replaced at will with an identical one.  Think of the repeatability this kind of infrastructure provides.  You will know with certainty that the next server you build will be identical to the current one.  It will come back up without anyone spending days remembering and reapplying all the custom hand tweaks your “pet” server accumulated.  It will automatically pass all the security compliance checks, because those were also written into the code that built the infrastructure.  And one more benefit: with all this automation and repeatability comes the ability to replicate the exact environment in another location quickly and easily, making your disaster recovery plan far simpler.
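Under the same assumptions as the sketch above (an AWS environment, with illustrative names), adding or removing capacity from a fleet of identical servers can be a one-line change; the hypothetical Auto Scaling group name here stands in for whatever your environment actually defines.

```python
# Sketch: grow or shrink a fleet of identical servers on demand.
# Assumes an existing AWS Auto Scaling group; the group name is illustrative.
import boto3

autoscaling = boto3.client("autoscaling")

def set_fleet_size(count: int) -> None:
    """Resize the herd: scale up for peak load, down to save money."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="critical-app-fleet",  # hypothetical group name
        DesiredCapacity=count,
    )

set_fleet_size(6)  # add identical servers to handle additional load
set_fleet_size(2)  # shrink the fleet when demand drops
```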


What does it take to build an infrastructure as code environment?  The work is usually done in a cloud environment such as Amazon Web Services or Microsoft Azure, but it can also be done in a local VMware farm.  You’ll need a team of infrastructure engineers who understand the systems and networking processes required to build servers and networks, but who are also conversant with coding techniques and writing applications.  In this case the “application” is your network and servers.  Building the infrastructure as code environment is typically the role of the DevSecOps group.  As the name implies, this group combines the roles of Development, Security, and Operations.  We must change our mindset and drop the idea of ever making manual changes to the system, because any manual change will not be captured and replicated when a server is rebuilt.  Instead, all changes need to go back to the DevSecOps team to be implemented in code.  The application developers need to work with the DevSecOps team to make sure the application is written and deployed in a way that requires no manual configuration.  The security team needs to coordinate with the DevSecOps team to ensure that all compliance requirements are integrated into the system-build process and are fully automated.
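To make that last point concrete, here is a minimal sketch of a compliance check that could run as part of an automated server build.  The required settings and file path are illustrative stand-ins for whatever your security baseline actually mandates; a real baseline would define many more checks than this.

```python
# Sketch: an automated compliance check baked into the server build.
# The two required SSH settings below are illustrative; a real baseline
# (for example, a DoD STIG) would mandate far more than this.
REQUIRED_SSH_SETTINGS = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
}

def check_sshd_config(path="/etc/ssh/sshd_config"):
    """Return a list of compliance failures; an empty list means compliant."""
    with open(path) as f:
        entries = [line.split(None, 1) for line in f
                   if line.strip() and not line.lstrip().startswith("#")]
    settings = {e[0]: e[1].strip() for e in entries if len(e) == 2}

    failures = []
    for key, expected in REQUIRED_SSH_SETTINGS.items():
        actual = settings.get(key)
        if actual != expected:
            failures.append(f"{key}: expected {expected!r}, found {actual!r}")
    return failures

if __name__ == "__main__":
    problems = check_sshd_config()
    if problems:
        raise SystemExit("Compliance check failed:\n" + "\n".join(problems))
    print("All checks passed.")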


It takes time and commitment to make the infrastructure as code process work correctly.  After all, it is easier to make a quick manual change on a server than it is to identify the change, figure out where it belongs in the build process, update the code, check it into your source repository, test it in QA, and finally deploy it.  The payoff is in the repeatability of your network and servers.  You will know exactly how your server is built, and you will have a detailed record of every change as part of a well-defined set of code changes in a source control repository.  Your change control process will be exceptionally well documented.  Your security team will love the fact that every server has all the security settings baked in as soon as it is built, and no one has to spend days manually duplicating a set of compliance requirements.


The concept of managing servers as part of a herd is not new.  Amazon and others have been advocating this approach for years, and it has worked well in commercial applications.  It has been slower to take hold in government infrastructure, however, because most government servers have traditionally been hand built inside dedicated data centers, hand cared for, and treated very much like pets by dedicated systems administrators.  As the infrastructure as code process has matured, it has gained acceptance in government circles, in large part because of the extremely robust change control and repeatability it delivers.


ThreeWire Systems has helped build and manage DoD systems that use infrastructure as code techniques.  We have helped guide that infrastructure through the accreditation process to attain an Authority to Operate (ATO).  We have experienced problems with these servers, and after a brief troubleshooting period simply “killed” the problem server and let another server from the “herd” take over with no interruption.  When the environment is set up correctly, each component is completely and rapidly replaceable, and you no longer have to treat your servers as pets.  Let ThreeWire Systems help convert or build your application using infrastructure as code techniques.


References:

http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/

http://www.informit.com/articles/article.aspx?p=2743758