Devops

Aug 16, 2020

DevOps for (Skeptical) Developers

In this post, I describe my personal journey as a developer skeptical of the seemingly ever-growing, ever more complex array of "ops" tools, and how I moved towards adopting some of these practices, ideas and tools. I write about how this journey has helped me write better software and understand discussions with the ops team at work.

Table of Contents

  • On being skeptical
  • The humble app
  • Disk failures are not that common
  • Auto-deployment is better than manual
  • Backups become worth it
  • Deployment staging
  • Packaging with Docker is good
  • Custodians of multiple processes are useful
  • Kubernetes provides exactly that
  • Declarative is good, vendor lock-in is bad
  • More advanced rollout
  • Relationship between code and deployed state
  • ArgoCD
  • Infra-as-code
  • Where the dev meets the ops
  • What we do

On being skeptical

I would characterise my attitudes to adopting technology in two

stages:

  • Firstly, I am conservative and dismissive, in that I will usually

    disregard any popular new technology as a bandwagon or trend. I'm a

    slow adopter.

  • Secondly, when I actually encounter a situation where I've suffered,

    I'll then circle back to that technology and give it a try, and if I

    can really find the nugget of technical truth in there, then I'll

    adopt it.

Here are some things that I disregarded for a year or more before

trying: Emacs, Haskell, Git, Docker, Kubernetes, Kafka. The whole

NoSQL trend came, wreaked havoc, and went while I had my back turned,

but I am considering using Redis for a cache at the moment.

The humble app

If you're a developer like me, you're probably used to writing your software, spending most of your time developing, and then finally deploying it by creating a machine (dedicated or virtual), uploading a binary of your software (or the source code, if it's interpreted), and running it under a copy-pasted systemd config, or simply inside GNU screen. It's a secret shame that I've done this, but it's the reality.

You might use nginx to reverse-proxy to the service. Maybe you set up

a PostgreSQL database or MySQL database on that machine. And then you

walk away and test out the system, and later you realise you need some

slight changes to the system configuration. So you SSH into the system

and make the small tweaks necessary, such as port settings, encoding

settings, or an additional package you forgot to add. Sound familiar?

But on the whole, your work here is done, and for most services this is pretty much fine. Plenty of the services you have used over the past 30 years have been running exactly like this.

Disk failures are not that common

Rhetoric about processes going down due to hardware failure is

probably overblown. Hard drives don’t crash very often. They don’t

really wear out as quickly as they used to, and you can be running a

system for years before anything even remotely concerning happens.

Auto-deployment is better than manual

When you start to iterate a little bit quicker, you get bored of

manually building and copying and restarting the binary on the

system. This is especially noticeable if you forget the steps later

on.


If you're a little bit more advanced, you might have some special scripts or post-merge git hooks, so that when you push to your repo, a CI machine holding some associated credential (e.g. an SSH key or API key) uploads a new binary to the machine and runs a command to copy it into place and restart the service. Alternatively, you might implement a polling system on the actual production machine, which checks whether any updates have occurred in Git and, if so, pulls down a new binary. This is how we were doing things around 2013.
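
As a concrete sketch of the push-based approach, a CI workflow might look roughly like this (GitHub Actions syntax; the server, paths, build step and DEPLOY_KEY secret are all hypothetical):

    name: deploy
    on:
      push:
        branches: [main]
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Build the binary
            run: make build    # hypothetical build step producing ./bin/my-app
          - name: Copy the binary and restart the service
            env:
              DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}    # SSH private key stored in CI
            run: |
              echo "$DEPLOY_KEY" > key && chmod 600 key
              scp -o StrictHostKeyChecking=no -i key bin/my-app deploy@example.com:/srv/my-app/my-app.new
              ssh -o StrictHostKeyChecking=no -i key deploy@example.com \
                'mv /srv/my-app/my-app.new /srv/my-app/my-app && sudo systemctl restart my-app'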

Backups become worth it

Eventually, if you're lucky, your service starts to become slightly

more important; maybe it’s used in business and people actually are

using it and storing valuable things in the database. You start to

think that backups are a good idea and worth the investment.


You probably also have a script to back up the database, or replicate

it on a separate machine, for redundancy.

Deployment staging

Eventually, you might have a staged deployment strategy: a developer testing machine, a QA machine, a staging machine, and finally a production machine. All of these are

configured in pretty much the same way, but they are deployed at

different times and probably the system administrator is the only one

with access to deploy to production.


It’s clear by this point that I’m describing a continuum from "hobby

project" to "enterprise serious business synergy solutions".

Packaging with Docker is good

Docker effectively collapses all of the system dependencies your binary needs to run into one contained package. This is good, because dependency management is hell. It's also highly wasteful, because its level of granularity is very coarse: you ship a whole filesystem image rather than a single binary. But this is a trade-off we accept for the benefits.

Custodians of multiple processes are useful

Docker doesn’t have much to say about starting and restarting

services. I’ve explored using CoreOS with the hosting provider Digital

Ocean, and simply running a fresh virtual machine, with the given

Docker image.

However, you quickly run into the problem of starting up and tearing

down:

  • When you start the service, you need liveness and health checks, so
    that if the new version fails to start, you don't take down the
    existing service; you keep the existing instances running.

  • If the process fails at any time during running then you should also

    restart the process. I thought about this point a lot, and came to the

    conclusion that it’s better to have your process be restarted than to

    assume that the reason it failed was so dangerous that the process

    shouldn’t start again. Probably it’s more likely that there is an

    exception or memory issue that happened in a pathological case which

    you can investigate in your logging system. But it doesn’t mean that

    your users should suffer by having downtime.

  • The natural progression of this functionality is to support
    different rollout strategies: do you want to switch everything to the
    new system in one go, or do you want it deployed piece by piece?


It’s hard to fully appreciate the added value of ops systems like

Kubernetes, Istio/Linkerd, Argo CD, Prometheus, Terraform, etc. until

you decide to design a complete architecture yourself, from scratch,

the way you want it to work in the long term.

Kubernetes provides exactly that

What system happens to accept Docker images and provide custodianship, rollout strategies, and trivial redeploys? Kubernetes.

It provides the classical monitoring and custodian responsibilities that plenty of other systems have provided in the past. However, unlike simply running a new process, testing whether it's fine, and then turning off the old one, Kubernetes buys into Docker all the way. Processes are isolated from each other, in both the network and the file system, so you can very reliably start and stop services on the same machine. Nothing about a process's machine state is persistent, so you are forced to design your programs such that state is either explicitly ephemeral or stored elsewhere.
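
As a minimal sketch of what this custodianship looks like in practice (all names here are hypothetical), a Kubernetes Deployment declares the health checks, restart behaviour and rollout strategy discussed above:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0    # keep existing pods serving until new ones are healthy
          maxSurge: 1
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: my-app
              image: registry.example.com/my-app:1.4.2    # hypothetical image
              ports:
                - containerPort: 8080
              readinessProbe:      # don't route traffic until the app responds
                httpGet:
                  path: /healthz
                  port: 8080
              livenessProbe:       # restart the container if it stops responding
                httpGet:
                  path: /healthz
                  port: 8080
                initialDelaySeconds: 10

If a container fails its liveness probe, Kubernetes restarts it; if a new version never becomes ready, the existing pods keep serving traffic.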


In the past it might have been a little bit scary to have your database running in such a system: what if it automatically wipes out the database process? With today's cloud-based deployments, it's more common to use a managed database such as those provided by Amazon, Digital Ocean, Google or Azure. The whole problem of updating and

backing up your database can pretty much be put to one

side. Therefore, you are free to mess with the configuration or

topology of your cluster as much as you like without affecting your

database.

Declarative is good, vendor lock-in is bad

A very appealing feature of a deployment system like Kubernetes is

that everything is automatic and declarative. You stick all of your

configuration in simple YAML files (which is also a curse because YAML

has its own warts and it's not common to find formal schemas for it).

This is also known as "infrastructure as code".

Ideally, you should have as much as possible about your infrastructure

in code checked in to a repo so that you can reproduce it and track

it.
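
For example (again with hypothetical names), the Service that fronts the Deployment above is just another small YAML file tracked in the same repo:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      selector:
        app: my-app          # matches the pods created by the Deployment
      ports:
        - port: 80           # port exposed inside the cluster
          targetPort: 8080   # port the container listens on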

There is also a much more straightforward path to migrate from one service provider to another. Kubernetes is supported

on all the major service providers (Google, Amazon, Azure), therefore

you are less vulnerable to vendor lock-in. They also all provide

managed databases that are standard (PostgreSQL, for example) with

their normal wire protocols. If you were using the vendor-specific

APIs to achieve some of this, you'd be stuck on one vendor. I, for

example, am not sure whether to go with Amazon or Azure on a big

personal project right now. If I use Kubernetes, I am mitigating risk.

With something like Terraform you can go one step further and write code that creates your cluster completely from scratch. This further reduces your dependence on any one vendor.

More advanced rollout

Your load balancer and your DNS can also be in code. Typically, a load balancer that does the job is nginx. However, for more advanced deployments such as A/B or blue/green deployments, you may need something more advanced like Istio or Linkerd.

Do I really want to deploy a new feature to all of my users? Maybe,

that might be easier. Do I want to deploy a different way of marketing

my product on the website to all users at once? If I do that, then I

don’t exactly know how effective it is. So, I could perhaps do a

deployment in which half of my users see one page and half of the

users see another page. These kinds of deployments are

straightforwardly achieved with Istio/Linkerd-type service meshes,

without having to change any code in your app.
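
As an illustration, a minimal Istio VirtualService (hypothetical host and service names) can split traffic evenly between two versions of the site without touching the application code:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: website
    spec:
      hosts:
        - website.example.com
      http:
        - route:
            - destination:
                host: website-v1    # Service running the current page
              weight: 50
            - destination:
                host: website-v2    # Service running the experimental page
              weight: 50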

Relationship between code and deployed state

Let's think further than this.

You've set up your cluster with your provider, or Terraform. You've

set up your Kubernetes deployments and services. You've set up your CI

to build your project, produce a Docker image, and upload the images

to your registry. So far so good.

Suddenly, you’re wondering, how do I actually deploy this? How do I

call Kubernetes, with the correct credentials, to apply this new

Docker image to the appropriate deployment?

Actually, this is still an ongoing area of innovation. An obvious way to do it is to put credentials on your CI system that allow it to run kubectl, then set the deployment's image to the new image name, which will trigger a rollout. If the deployment fails, you can look at that result in your CI dashboard.
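
In CI terms, that might be nothing more than a step like the following (the deployment and registry names are hypothetical, and the kubeconfig credential is assumed to be provisioned on the CI machine):

    - name: Deploy the new image
      env:
        KUBECONFIG: /home/ci/.kube/config    # cluster credentials stored on the CI system
      run: |
        kubectl set image deployment/my-app my-app=registry.example.com/my-app:${GITHUB_SHA}
        kubectl rollout status deployment/my-app --timeout=120s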

However, the question comes up: what is currently actually deployed in production? Do we really have infrastructure as code here? It's not as if I edited a file and that update suddenly got reflected. There's no file anywhere in Git that states what the current image is. Head scratcher.

Ideally, you would have a repository somewhere which states exactly which image should be deployed right now. And if you change it in a commit, and then later revert that commit, you should expect that production is also reverted to reflect the code, right?
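
One way to get there (sketched with hypothetical names) is a small config repo whose kustomization.yaml pins the exact image tag, so that changing or reverting that file in a commit is what changes production:

    # kustomization.yaml in the config repo
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - deployment.yaml
      - service.yaml
    images:
      - name: my-app                          # image name referenced in deployment.yaml
        newName: registry.example.com/my-app
        newTag: "1.4.2"                       # bump in a commit to roll forward; revert to roll back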

ArgoCD

One system which attempts to address this is Argo CD. It implements what it calls "GitOps": all of the state of the system is reflected in a Git repo somewhere. In Argo CD, after your GitHub/GitLab/Jenkins/Travis CI

system has pushed your Docker image to the Docker repository, it makes

a gRPC call to Argo, which becomes aware of the new image. As an

admin, you can now trivially look in the UI and click "Refresh" to

redeploy the new version.
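
The cluster-side glue is an Argo CD Application resource, roughly like this (repo URL, path and namespaces are hypothetical); it points at the config repo and keeps the cluster in sync with whatever is committed there:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: my-app
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/example/my-app-config.git
        targetRevision: main
        path: k8s
      destination:
        server: https://kubernetes.default.svc
        namespace: production
      syncPolicy:
        automated:         # optional: sync automatically instead of clicking "Refresh" in the UI
          prune: true      # remove resources that were deleted from the repo
          selfHeal: true   # undo manual drift on the cluster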

Infra-as-code

The common running theme in all of this is infrastructure-as-code. It's immutability. It's being declarative. It's reducing the number of steps that a human has to do or care about. It's about being able to rewind. It's about redundancy. And it's about scaling easily.


When you really try to architect your own system, and your business will lose money in the case of ops mistakes, all of these advantages of infrastructure as code start looking really attractive.

Before you really sit down and think about this stuff, however, it is pretty hard to empathise or sympathise with the kinds of concerns that people using these systems have.


There are some downsides to these tools, as with any:

  • Docker is quite wasteful of time and space

  • Kubernetes is undoubtedly complex, and leans heavily on YAML

  • All abstractions are leaky, therefore tools like this all leak

Where the dev meets the ops

Now that I’ve started looking into these things and appreciating their

use, I interact a lot more with the ops side of our DevOps team at work,

and I can also be way more helpful in assisting them with the

information that they need, and also writing apps which anticipate the

kind of deployment that is going to happen. The most difficult challenge is typically metrics and logging; I'm talking about run-of-the-mill apps here, not high-performance apps.


One way to bridge the gap between your ops team and dev team, therefore, might be an exercise meeting in which a dev person literally sits down and designs an app architecture and infrastructure from the ground up, using the existing tools they are aware of, and then your ops team points out the advantages and disadvantages of the proposed solution. Certainly,

I think I would have benefited from such a mentorship, even for an

hour or two.


It may be that your dev team and your ops team are completely separate

and everybody’s happy. The devs write code, they push it, and then it

magically works in production and nobody has any issues. That’s

completely fine. If anything it would show that you have a very good

process. In fact, that’s pretty much how I’ve worked for the past

eight years at this company.

However, you could derive some benefit from this kind of exercise if your teams are having difficulty communicating.

Finally, the tools in the ops world aren't perfect, and they're made

by us devs. If you have a hunch that you can do better than these

tools, you should learn more about them, and you might be right.

What we do

FP Complete are using a great number of these tools, and we're writing

our own, too. If you'd like to know more, email us at

sales@fpcomplete.com.