Devops

Jan 12, 2017

Containerizing a legacy application: an overview

An overview of what containerization is, the reasons to consider

running a legacy application in Docker containers, the process to

get it there, the issues you may run into, and next steps once you

are deploying with containers. You'll reduce the stress of

deployments, and take your first steps on the path toward no

downtime and horizontal scaling.


Note: This post focuses on simplifying deployment of the

application. It does not cover topics that may require

re-architecting parts of the application, such as high availability

and horizontal scaling.


Concepts

What is a "Legacy" App?

There's no one set of attributes that typifies all legacy apps, but common attributes include:

  • Using the local filesystem for persistent storage, with data files intermingled with application files.

  • Running many services on one server, such as a MySQL database,

    Redis server, Nginx web server, a Ruby on Rails application, and a

    bunch of cron jobs.

  • Installation and upgrades use a hodgepodge of scripts and manual processes (often poorly documented).

  • Configuration is stored in files, often in multiple places and intermingled with application files.

  • Inter-process communication uses the local filesystem (e.g.

    dropping files in one place for another process to pick up) rather

    than TCP/IP.

  • Designed assuming one instance of the application would run on a single server.

Disadvantages of the legacy approach

  • Automating deployments is difficult.

  • If you need multiple customized instances of the application,

    it's hard to "share" a single server between multiple

    instances.

  • If the server goes down, it can take a while to replace due to manual processes.

  • Deploying new versions is a fraught manual or semi-manual process which is hard to roll back.

  • It's possible for test and production environments to drift

    apart, which leads to problems in production that were not detected

    during testing.

  • You cannot easily scale horizontally by adding more instances of the application.

What is "Containerization"?

"Containerizing" an application is the process of making it able

to run and deploy under Docker containers and similar technologies

that encapsulate an application with its operating system

environment (a full system image). Since containers provide the

application with an environment very similar to having full control

of a system, this is a way to begin modernizing the deployment of

the application while making minimal or no changes to the

application itself. This provides a basis for incrementally making

the application's architecture more "cloud-friendly."


Benefits of Containerization

  • Deployment becomes much easier: replacing the whole container image with a new one.

  • It's relatively easy to automate deployments, even having them driven completely from a CI (continuous integration) system.

  • Rolling back a bad deployment is just a matter of switching back to the previous image.

  • It's very easy to automate application updates since there are

    no "intermediate state" steps that can fail (either the whole

    deployment succeeds, or it all fails).

  • The same container image can be tested in a separate test

    environment, and then deployed to the production environment. You

    can be sure that what you tested is exactly the same as what is

    running in production.

  • Recovering a failed system is much easier, since a new

    container with exactly the same application can be automatically

    spun up on new hardware and attached to the same data stores.

  • Developers can also run containers locally to test their work in progress in a realistic environment.

  • Hardware can be used more efficiently, by running multiple

    containerized applications on a single host that ordinarily could

    not easily share a single system.

  • Containerizing is a good first step toward supporting

    no-downtime upgrades, canary deployments, high availability, and

    horizontal scaling.

Alternatives to containerization

  • Configuration management tools like Puppet and Chef help with

    some of the "legacy" issues such as keeping environments

    consistent, but they do not support the "atomic" deployment or

    rollback of the entire environment and application at once. A deployment can still go wrong partway through, with no easy way to roll everything back.


  • Virtual machine images are another way to achieve many of the

    same goals, and there are cases where it makes more sense to do the

    "atomic" deployment operations using entire VMs rather than

    containers running on a host. The main disadvantage is that

    hardware utilization may be less efficient, since VMs need

    dedicated resources (CPU, RAM, disk), whereas containers can share

    a single host's resources between them.


How to containerize

Preparation

Identify filesystem locations where persistent data is written

Since deploying a new version of the application is performed by

replacing the Docker image, any persistent data must be stored

outside of the container. If you're lucky, the application

already writes all its data to a specific path, but many legacy

applications spread their data all over the filesystem and

intermingle it with the application itself. Either way, Docker's volume mounts let us map locations on the host's filesystem into specific locations in the container's filesystem, so that data survives from one container to the next. We must therefore identify all the locations that need to persist.
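One rough way to find these locations (a sketch; the starting path and workload are yours to fill in) is to create a marker file, exercise the application, and then search for files modified since the marker:

# Create a timestamp marker, then exercise the application
touch /tmp/marker

# ... run the application through a typical workload ...

# List files modified since the marker, staying on one filesystem
find / -xdev -type f -newer /tmp/marker 2>/dev/null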


You may at this stage consider modifying the application to

support writing all data within a single tree in the filesystem, as

that will simplify deployment of the containerized version.

However, this is not necessary if modifying the application is

impractical.


Identify configuration files and values that will vary by environment

Since a single image should be usable in multiple environments

(e.g. test and production) to ensure consistency, any configuration

values that will vary by environment must be identified so that the

container can be configured at startup time. These could take the

form of environment variables, or of values within one or more

configuration files.


You may at this stage want to consider modifying the application

to support reading all configuration from environment variables, as

that will simplify containerizing it. However, this is not

necessary if modifying the application is impractical.


Identify services that can be easily externalized

The application may use some services running on the local

machine that are easy to externalize due to being highly

independent and supporting communication by TCP/IP. For example, if

you run a database such as MySQL or PostgreSQL or a cache such as

Redis on the local system, that should be easy to run externally.

You may need to adjust configuration to support specifying a

hostname and port rather than assuming the service can be reached

on localhost.
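For example, a fragment of the startup script described later could fill in the location of an externalized database from environment variables (DBHOST and DBPORT are hypothetical names; adapt to your application's configuration format):

# Point the app at an external database, defaulting to localhost
cat >>/app/config.txt <<END
db_host = "${DBHOST:-localhost}"
db_port = "${DBPORT:-5432}"
END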


Creating the image

Create a Dockerfile that installs the application

If you already have the installation process automated via

scripts or using a configuration management tool such as Chef or

Puppet, this should be relatively easy. Start with an image of your

preferred operating system, install any prerequisites, and then run

the scripts.


If the current setup process is more manual, this will involve

some new scripting. But since the exact state of the image is

known, it's easier to script the process than it would be when you

have to deal with the potentially inconsistent state of a raw

system.


If you identified externalizable services earlier, you should modify the scripts to not install them.

A simple example Dockerfile:

# Start with an official Ubuntu 16.04 Docker image
FROM ubuntu:16.04

# Install prerequisite Ubuntu packages
RUN apt-get update \
 && apt-get install -y <REQUIRED UBUNTU PACKAGES> \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

# Copy the application into the image
COPY . /app

# Run the app setup script
RUN /app/setup.sh

# Switch to the application directory
WORKDIR /app

# Specify the application startup script
CMD /app/start.sh
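With this Dockerfile in the application's source directory, building the image is one command (the image name is illustrative):

docker build -t myappimage:mytag .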

Startup script for configuration

If the application takes all its configuration as environment

variables already, then you don't need to do anything. However, if

you have environment-dependent configuration values in

configuration files, you will need to create an application startup

script that reads these values from environment variables and then

updates the configuration files.


A simple example startup script:

#!/usr/bin/env bash
set -e

# Append to the config file using the $MYAPPCONFIG environment variable.
cat >>/app/config.txt <<END
my_app_config = "${MYAPPCONFIG}"
END

# Run the application using the $MYAPPARG environment variable for an argument.
# ("exec" replaces the shell with the app so it receives signals directly.)
exec /app/bin/my-app --my-arg="${MYAPPARG}"

Push the image

After building the image (using docker build), it

must be pushed to a Docker Registry so that it can be pulled on the

machine where it will be deployed (if you are running on the same

machine as the image was built on, then this is not necessary).


You can use Docker Hub for images (a paid account lets you create private image repositories), and most cloud providers offer their own container registries (e.g. Amazon ECR).


Give the image a tag (e.g. docker tag myimage mycompany/myimage:mytag) and then push it (e.g. docker push mycompany/myimage:mytag). Each image for a version of

the application should have a unique tag, so that you always know

which version you're using and so that images for older versions

are available to roll back to.
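For example, tagging each build with a version or CI build number (the values here are illustrative):

# Tag this build uniquely and push it to the registry
docker tag myimage mycompany/myimage:1.0.42
docker push mycompany/myimage:1.0.42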


How to deploy

Deploying containers is a big topic, and this section just focuses on directly running containers using docker commands. Tools like docker-compose (for simple cases where all containers run on a single server) and Kubernetes (for container orchestration across a cluster) should be considered in real-world usage.

Externalized services

Services you identified for externalization earlier can be run

in separate Docker containers that will be linked to the main

application. Alternatively, it is often easiest to outsource to

managed services. For example, if you are using AWS, using RDS for

a database or ElastiCache for a cache significantly simplifies your

life since they take care of maintenance, high availability, and

backups for you.


An example of running a Postgres database container:

docker run \
    -d \
    --name db \
    -v /usr/local/var/docker/volumes/postgresql/data:/var/lib/postgresql/data \
    postgres

The application

To run the application in a Docker container, you use a command-line such as this:

docker run \
    -d \
    -p 8080:80 \
    --name myapp \
    -v /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata \
    -e MYAPPCONFIG=myvalue \
    -e MYAPPARG=myarg \
    --link db:db \
    myappimage:mytag

The -p argument exposes the container's port 80 on the host's port 8080, the -v argument sets up the volume mount for persistent data (in the hostpath:containerpath format), the -e arguments set configuration environment variables (-v and -e may each be repeated for additional volumes and variables), and the --link argument links the database container so the application can communicate with it. The container will be started with the startup script you specified in the Dockerfile's CMD.
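Equivalently, both containers could be described in a single docker-compose.yml along these lines (a sketch reusing the same illustrative names and paths):

version: "2"
services:
  db:
    image: postgres
    volumes:
      - /usr/local/var/docker/volumes/postgresql/data:/var/lib/postgresql/data
  myapp:
    image: myappimage:mytag
    ports:
      - "8080:80"
    volumes:
      - /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata
    environment:
      MYAPPCONFIG: myvalue
      MYAPPARG: myarg
    links:
      - db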


Upgrades

To upgrade to a new version of the application, stop the old container (e.g. docker rm -f myapp) and start a new one with the new image tag (this will require a brief downtime). Rolling back is similar, except that you use the old image tag.
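Concretely, an upgrade and rollback might look like this (myappimage:newtag is an illustrative new version):

# Upgrade: remove the old container, start a new one from the new image
docker rm -f myapp
docker run \
    -d \
    -p 8080:80 \
    --name myapp \
    -v /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata \
    -e MYAPPCONFIG=myvalue \
    -e MYAPPARG=myarg \
    --link db:db \
    myappimage:newtag

# Rollback: the same procedure, starting from the old tag
# (docker rm -f myapp, then docker run ... myappimage:mytag)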


Additional considerations

"init" process (PID 1)

Legacy applications often run multiple processes, and it's not

uncommon for orphan processes to accumulate if there is no "init"

(PID 1) daemon to clean them up. Docker does not, by default,

provide such a daemon, so it's recommended to add one as the

ENTRYPOINT in your Dockerfile. dumb-init is one example of a lightweight init daemon. phusion/baseimage is a fully-featured base image that includes an init daemon in addition to other services.
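A sketch of adding dumb-init to the example Dockerfile (the download URL follows dumb-init's documented release pattern; check for the current version):

# Install dumb-init to act as the init (PID 1) process
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

# Run the startup script under dumb-init so that orphaned processes
# are reaped and signals are forwarded to the application
ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
CMD ["/app/start.sh"]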


See our blog post dedicated to this topic: Docker demons: PID-1, orphans, zombies, and signals.

Daemons and cron jobs

The usual way to use Docker containers is to have a single

process per container. Ideally, any cron jobs and daemons can be

externalized into separate containers, but this is not always

possible in legacy applications without re-architecting them. There

is no intrinsic reason why containers cannot run many processes,

but it does require some extra setup since standard base images do

not include process managers and schedulers. Minimal process

supervisors, such as runit,

are more appropriate to use in containers than full-fledged systems

like systemd. phusion/baseimage

is a fully-featured base image that includes runit and cron, in

addition to other services.
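A sketch of that setup in the example Dockerfile, using runit from the Ubuntu repositories (myapp-run is a hypothetical service script in your build context):

# Install runit and register the application as a supervised service
RUN apt-get update \
 && apt-get install -y runit \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/* \
 && mkdir -p /etc/service/myapp
COPY myapp-run /etc/service/myapp/run
RUN chmod +x /etc/service/myapp/run

# runsvdir starts and supervises every service under /etc/service
CMD ["runsvdir", "/etc/service"]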


Volume-mount permissions

It's common (though not necessarily recommended) to run all processes in containers as the root user. Legacy

applications often have more complex user requirements, and may

need to run as a different user (or multiple processes as multiple

users). This can present a challenge when using volume mounts,

because Docker makes the mount points owned by root by default, which means non-root processes will not be able to write to them. There are two ways to deal with this.


The first approach is to create the directories on the host

first, owned by the correct UID/GID, before starting the container.

Note that since the container's and the host's users don't necessarily match up, you have to be careful to use the same numeric UID/GID that the container uses, and not merely the same usernames.


The other approach is for the container itself to adjust the

ownership of the mount points during its startup. This has to

happen while running as root, before switching to a non-root user to start the application.
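A minimal sketch of the second approach, assuming the image defines a myappuser account that the application should run as:

#!/usr/bin/env bash
set -e

# Running as root: give the application user ownership of the
# volume-mounted data directory...
chown -R myappuser:myappuser /var/lib/myappdata

# ...then drop privileges and start the application
exec su -s /bin/sh -c '/app/bin/my-app --my-arg="${MYAPPARG}"' myappuser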


Database migrations

Database schema migrations always present a challenge for

deployments, because the database schema can be very tightly

coupled with the application, and that makes controlling the timing

of the migration important, as well as making rolling back to an

older version of the application more difficult since database

migrations can't always be rolled back easily.


A way to mitigate this is to take a staged approach to migrations: when you need to make an incompatible schema change, split that change over two application deployments. For example, if you want to move a piece of data from one location to another, these would be the phases:


  1. Write the data to both the old and new locations, and read it
    from the new location. This means that if you roll the application
    back to the previous version, any new data is still where the old
    version expects to find it.


  2. Stop writing it to the old location.

Note that if you want to have deployments with no downtime, that

means running multiple versions of the application at the same

time, which makes this even more of a challenge.


Backing up data

Backing up from a containerized application is usually easier

than from a non-containerized deployment. Data files can be backed up

from the host and you don't risk any intermingling of data files

with application files because they are strictly separated. If

you've moved databases to managed services such as RDS, those can

take care of backups for you (at least if your needs are relatively

simple).
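For example, with the volumes and containers shown earlier (the backup destination is illustrative):

# Archive the application's data directory from the host
tar czf /backups/myappdata-$(date +%Y%m%d).tar.gz \
    -C /usr/local/var/docker/volumes myappdata

# Dump the database via the running "db" container
docker exec db pg_dump -U postgres postgres > /backups/myapp-db-$(date +%Y%m%d).sql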


Migrating existing data

To transition the production application to the new

containerized version, you will need to migrate the old

deployment's data. How to do this will vary, but usually the simplest approach is to stop the old deployment, back up all the data, and restore it to the new deployment. This should be practiced in advance, and will necessitate some downtime.
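A minimal sketch of that cut-over, assuming the legacy data lives under /var/lib/myapp/data on the old server:

# Stop the legacy application so its data is quiescent
service myapp stop

# Copy the data into the directory that will be volume-mounted
rsync -a /var/lib/myapp/data/ /usr/local/var/docker/volumes/myappdata/

# Start the containerized version against the migrated data
# (plus the -e/--link arguments shown earlier)
docker run \
    -d \
    -p 8080:80 \
    --name myapp \
    -v /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata \
    myappimage:mytag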


Conclusion

While it requires some up-front work, containerizing a legacy

application will help you get control of, automate, and minimize

the stress of deploying it. It sets you on a path toward

modernizing your application and supporting no-downtime

deployments, high availability, and horizontal scaling.


FP Complete has undertaken this process many times in addition

to building containerized applications from the ground up. If you'd

like to get on the path to modern and stress-free deployment of

your applications, you can learn more about our Devops and Consulting

services, or contact us straight away!