Disclaimer: I felt like writing this post because people constantly ask me about “DevOps” and whether they need it. This is my opinionated view on it, and the scope is my own.
DevOps has influenced far more parts of engineering than the ones I mention, and there are probably as many definitions of DevOps as there are people tired of explaining it. If you have already found your own definition of the most misunderstood term since “cloud”, you may want to scroll to the more practical examples at the bottom of the post.
Over the last couple of years I was lucky enough to work at different startups at different stages of their lifecycle. As you’d expect, DevOps was all over the place as an empty buzzword, and everybody wanted their “DevOps Engineer” or wanted to “do DevOps”. With this post, I am trying to get to the bottom of what DevOps should mean for you, and what might matter to you as a startup that runs production systems as well as doing software engineering. This has been a hot topic in a number of discussions I have had with coworkers and friends in IT operations and in software engineering. I feel quite strongly about the concepts behind “DevOps”, which in my opinion can put you a leap ahead of your competition, though only if you apply them correctly in your organisation.
One thing is for sure: when somebody at a startup says “DevOps”, they most definitely mean that IT operations should not get in their own or their users’ way when it comes to scaling and ultimately being successful. They might also mean that they do not want an IT operations silo/department in their future company, because that department could end up standing in the way of their success. Capacity planning, servers and the like are just getting in their way, and AWS teaches us that in modern IT we can have capacity on demand, experiment, and scale any time we want to. The past offers a good number of examples showing that it’s not that easy, but this is how it’s being sold, and some do manage to solve the puzzle of the cloud and apply it to their infrastructure.
Getting to the bottom of it brings us to the rockstar job title of “DevOps Engineer”. As has been widely discussed, that title is fundamentally at odds with the concept of DevOps, which is a culture of collaboration between software engineering and IT operations, as well as the practice of using software engineering concepts in IT operations where applicable. Given how “DevOps” is perceived at most startups, what they are mostly looking for is a production/site reliability engineer who makes sure IT operations does not get in the way of scaling, and who ensures reliable and fast deployments and ultimately a successful product. What you actually want is for your IT operations work to go hand in hand with your software engineering and, even better, for your software engineering team to think about the day-to-day operations questions. This is an organisational/cultural shift in the mindset of your development or operations teams rather than a position you can just hire for. All of this can be achieved through close collaboration between IT operations and software engineers, and in some cases both hats are worn by the same people (e.g. Site Reliability/Production Engineers), which is totally in line with the widely accepted principles of DevOps. In the end, everyone deploying software in your organization can introduce failures into your system and make it unavailable for your users, so everyone should have an interest in reliable and fast systems. Hence a software engineer whose main focus is reliability might be a good addition.
Until at least 2009 and the “spring of DevOps”, IT operations was most likely represented as a silo / functional team in your org chart. Given that, and given how long it takes for people to adapt their mindset to the new reality of software defined everything (network, storage, compute etc.), it is clear how easily even startups might fall back into the old pattern of Dev and Ops, rather than seeing both mainly as software engineering jobs with different problem scopes.
“SRE is what happens when you ask a software engineer to design and run operations.” (Ben Treynor Sloss, VP 24x7, Google)
In a nutshell, where a software engineer might be more interested in delivering features, an SRE might be more interested in the stability of the delivery pipeline as a whole, although both approach their work with software engineering principles.
Since the whole field of software defined everything, cloud, or whatever you call it, did not exist until about 10 years ago and has only really picked up momentum in the last 5 years, you might find it easier to hire an experienced IT operations engineer in the traditional sense than a software engineer with a focus on reliability/scaling, and let them maintain your production infrastructure and automation. You might also find that software engineers would rather not think about IT operations too much, since they consider it not part of their job and sometimes just something “getting in their way” of rolling out features and ultimately getting to a successful product within their scope.
This thinking is fundamentally flawed to me, though. Without resilience, a plan for how your product scales with your infrastructure, and good baseline performance of your product, you might quickly find yourself, once shit hits the fan, in a spiral of Ops vs. Dev teams playing a game of “blaming ping-pong” rather than working on the shared goal of fixing your product. Even worse, creating Ops and Dev silos might lead to unmanageable debt for the software engineers who might be interested in working on your infrastructure in the future.
For instance, if you ask me, there is no longer any reason for a startup in 2016 to run servers that are not reproducible from a codebase. There is no way you are doing things correctly right now if you cannot reproduce your services with a single click, be it immutable functions, containers, servers or a codebase. Once you have gone down the road of manual IT operations, it’s hard to fight your way back out. Any part of your infrastructure that you created manually in the current age of cloud/software defined everything is in itself technical debt for a software engineer. You can’t reproduce it from a codebase and software engineers can’t refactor it; it’s unmanaged and unmanageable. Even worse, it’s likely to be a core asset of your production infrastructure, and thus a piece that accounts for a big part of your uptime (e.g. a subnet created in your SDN or a DNS record). If those parts of your infrastructure aren’t reproducible and services cannot be created via a one-click deployment, you have basically locked your software engineers out of changing your underlying infrastructure. You have also locked them out of, for example, creating automated tests that make sure the infrastructure still delivers on the promises it makes to your users while it is being iterated on, which should be the core part of your daily IT operations.
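To make “reproducible from a codebase” a bit more concrete, here is a minimal sketch of the idea, using the DNS record example from above. The desired state lives in code, and a small function works out what has to change to get there. The record names, the addresses (taken from the 203.0.113.0/24 documentation range) and the `plan` function are all made up for illustration; this is a toy and not how any particular tool works.

```python
def plan(desired, actual):
    """Return the changes needed to make `actual` match `desired`.

    Both arguments map a record name to its value (hypothetical DNS records).
    """
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(name), actual.get(name)
        if want == have:
            continue  # already as declared, nothing to do
        if have is None:
            changes.append(("create", name, want))
        elif want is None:
            changes.append(("delete", name, have))
        else:
            changes.append(("update", name, want))
    return changes


# Desired state, checked into the codebase (hypothetical values).
desired = {
    "www.example.com": "203.0.113.10",
    "api.example.com": "203.0.113.20",
}
# What is actually out there, including a record someone created by hand.
actual = {
    "www.example.com": "203.0.113.99",
    "old.example.com": "203.0.113.30",
}

for change in plan(desired, actual):
    print(change)
```

Because the desired state is plain data in the repository, it can be reviewed, refactored and tested like any other code, and the hand-made record shows up as drift instead of going unnoticed.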
Really, if you went in the wrong direction here, it might unfortunately soon be a good time to get an experienced IT operations engineer to maintain the “hot” systems in flight. That said, I still think someone with good experience in IT operations should be part of building your product. They just don’t necessarily have to be an IT operations engineer performing manual actions, but rather, say, a platform engineer who is able to express their experience and best practices in code, setting you up for future scaling challenges.
Frankly, there are platform as a service providers that have already done the work for you. With a sensible evaluation you might find one that fits the way you want to develop and run your service, without the need for a dedicated operations engineer taking care of the underlying infrastructure. As with any solution in IT engineering, and with all respect for those platforms and their hard-working teams, they mostly only scale to a certain level, and you get what you pay for. However, the savings in manpower and time might give you a good jump ahead on your roadmap.
So, this might have sounded like a lot of empty words to someone with experience in DevOps culture, and obviously there is a lot of theory about what counts as good “DevOps practices”. To finish this up, I want to share the most important patterns I apply in my daily work to make sure that there are just enough software engineering practices in my infrastructure: a step-by-step guide to more resilient systems.
Generally, CI is considered a must-have for software development projects these days. I don’t understand why this pattern is not applied to the infrastructure, which is what makes your software development project work. If you can make sure your systems bootstrap from your codebase every time you change it, you can make sure they are reproducible, which gives you a head start in disaster situations.
If you can test your infrastructure, you are on a good path to making promises about the substantial parts of your system. This can become important for documentation, since your tests can automatically produce readable test output (see serverspec). It might help you significantly when scaling out your engineering organization, since new engineers can rely on the promises the documentation makes about the substantial parts of your system.
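serverspec itself is a Ruby tool, and the following is not its API. It is just a minimal Python sketch of the same idea, assuming a check is nothing more than a description plus a function returning true or false. `port_is_open` and `run_checks` are names I made up, and the host and port are placeholders.

```python
import socket
from contextlib import closing


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def run_checks(checks):
    """Run (description, callable) pairs; return (all_passed, readable lines)."""
    lines = []
    all_passed = True
    for description, check in checks:
        passed = bool(check())
        all_passed = all_passed and passed
        lines.append(f"[{'ok' if passed else 'FAIL'}] {description}")
    return all_passed, lines


if __name__ == "__main__":
    # Placeholder promise about a hypothetical service on this machine.
    ok, report = run_checks([
        ("web server accepts connections on port 8080",
         lambda: port_is_open("127.0.0.1", 8080)),
    ])
    print("\n".join(report))
```

The report lines double as documentation: each one is a promise about the system, written in plain language, that gets re-verified on every run.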
With tests at hand and maybe also documentation, you are on a really good track. Once your systems are fully bootstrapped from a codebase, and tested every time you change that codebase, it makes sense to package up all those tested instances or containers that have already proven to deliver on their promises. Package them up into immutable images/containers after testing is done, and enjoy not having to think about a reliable provisioning process. The SLA of your provisioning process can never be higher than the SLAs of all your upstreams combined, and those dependencies quickly add up. Since you went through provisioning in testing already, package it up. http://chadfowler.com/2013/06/23/immutable-deployments.html
When you get here, you already have a number of feedback loops at hand. You are now making sure, continuously, that you can always recover your substantial systems from your codebase!
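To put a number on how quickly upstream SLAs add up, here is a back-of-the-envelope sketch. It assumes that provisioning only succeeds if every upstream (package mirrors, registries, APIs and so on) is available, and that the upstreams fail independently of each other. The five upstreams and their 99.9% figures are made up for illustration.

```python
from math import prod

HOURS_PER_30_DAY_MONTH = 30 * 24


def combined_availability(upstream_availabilities):
    """Chance that all upstreams are up at once, assuming independent failures."""
    return prod(upstream_availabilities)


def monthly_downtime_hours(availability):
    """Expected unavailable hours in a 30-day month for a given availability."""
    return (1.0 - availability) * HOURS_PER_30_DAY_MONTH


# Hypothetical example: provisioning depends on five upstreams at 99.9% each.
upstreams = [0.999] * 5
combined = combined_availability(upstreams)

print(f"combined availability: {combined:.4%}")
print(f"expected downtime per month: {monthly_downtime_hours(combined):.2f} h")
```

Under these assumptions, five upstreams at 99.9% each leave you at roughly 99.5%, which is about 3.6 hours a month in which provisioning might fail, compared with about 0.7 hours for a single upstream. An image that was built and tested once does not pay that price again at deploy time.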
You got bits of disaster recovery right; now it’s about time to get feedback not only from your provisioning mechanics, but also from the mechanics keeping production online. Check what happens when that one machine in your cluster goes down (in staging first, in production afterwards), and check what happens when you have to do a database failover. Automate it, execute it regularly, and build feedback loops to make sure you can stay online even when the world is against you.
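The shape of such a drill can be sketched in a few lines. The `Cluster` class below is a toy stand-in that I made up for illustration; in a real setup, `kill`, `restore` and `serve` would talk to your orchestration layer and your health checks, and you would run the drill against staging first.

```python
import random


class Cluster:
    """Toy stand-in for a real cluster: a set of healthy node names."""

    def __init__(self, nodes):
        self.healthy = set(nodes)

    def kill(self, node):
        self.healthy.discard(node)

    def restore(self, node):
        self.healthy.add(node)

    def serve(self):
        """Serve a request from any healthy node, or raise if none is left."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes left")
        return sorted(self.healthy)[0]


def failure_drill(cluster, rng=random):
    """Take one random node down, check the service still answers, restore it."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.kill(victim)
    try:
        cluster.serve()
        survived = True
    except RuntimeError:
        survived = False
    finally:
        # Always bring the node back, whatever the outcome of the check.
        cluster.restore(victim)
    return victim, survived


if __name__ == "__main__":
    staging = Cluster(["node-a", "node-b", "node-c"])  # hypothetical node names
    victim, survived = failure_drill(staging)
    print(f"killed {victim}, service survived: {survived}")
```

Run on a schedule, the return value of the drill becomes one more feedback loop: the day it reports that the service did not survive, you learn about it from a drill rather than from an outage.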
Even with all the practices described, shit might hit the fan. Now you have to make sure you have the right forum in which you can talk about those failures in a productive way. It’s probably a systemic failure. Even if your colleague executed a drop table in production, it’s a problem of the underlying system that they could do so in the first place. By focusing on your system’s weaknesses and turning them into strengths, you will inevitably arrive at a system that gets more and more resilient. https://codeascraft.com/2012/05/22/blameless-postmortems/
Since systems inevitably fail, focus your remediation work on mean time to repair and less on mean time between failures. You will get to a more resilient system that adapts more fluently to changing environments and requirements. https://www.safaribooksonline.com/library/view/designing-delivery/9781491903742/ch04.html
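A quick worked example shows why repair time deserves the attention. It uses the common steady-state approximation availability = MTBF / (MTBF + MTTR), and the numbers (one failure every 30 days, four hours to repair) are made up for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the share of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 720 hours, 4 hours to repair.
baseline = availability(mtbf_hours=720, mttr_hours=4)
# Option 1: repair twice as fast.
faster_repair = availability(mtbf_hours=720, mttr_hours=2)
# Option 2: fail half as often.
rarer_failure = availability(mtbf_hours=1440, mttr_hours=4)

print(f"baseline:       {baseline:.4%}")
print(f"faster repair:  {faster_repair:.4%}")
print(f"rarer failures: {rarer_failure:.4%}")
```

In this model, halving the repair time buys exactly the same availability as doubling the time between failures. Which of the two is cheaper depends on your system, but repair time is the one you can practice with the drills and automation described above.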
There are so many areas that DevOps and agile have influenced, and that have changed my daily work environment for the better, that they will never fit into a post or a book. This has frankly been a long post, one that leaves out a number of points and focuses on the system resilience side of things.
In the end, there is so much work already done on any of the topics one might consider “a DevOps thing to do”. Keep your eyes open, be humble, don’t get stuck in the past, don’t do repetitive tasks. Think about your product, infrastructure and organisation as a system: a team of humans, machines and software. Things will go wrong, people will make mistakes, parts will be missing or break, or you will simply have too many orders to process.
Constantly making sure you can obtain a systemic view, and trying to make the system as resilient as possible while increasing development velocity and innovation, is the essence of DevOps to me.