Cracking the Container Code: Navigating Kubernetes, Docker, Virtual Machines, Helm, and Secure Orchestration with Code Scanning

Written by

Albert Heinle

Introduction

A lot of misconceptions and buzzword bingo surround a single concept: Linux containers and their orchestration.

At times, it is important to follow the Tai Chi principle of “standing”: going back to first principles and revisiting where it all comes from.

For that, let us describe all the common buzzwords from the start. We will get into more details but this is the TL;DR in table form:

Virtualization Technologies	Unique Properties
Hypervisor Virtualization (VMware, Xen, Hyper-V, etc.)	Emulation of other operating systems and hardware on a running host machine. Setups are often mutable and not defined in code. Performance loss due to the hypervisor. Stronger isolation between VMs.
OS Virtualization (aka containers, e.g., Docker, LXC, etc.)	Shared Linux kernel with the host, abstraction of all other operating system resources. Definition as code (e.g., Dockerfile). Close to bare-metal performance. Quick deployment, highly portable across different platforms.

Orchestration system	Key Features
docker-compose (multiple containers, single host)	Ideal for simulating a network of containers on one host machine. Simple YAML definition with networking, volume mounts, port forwarding, etc.
Docker Swarm (multiple containers, multiple hosts)	All of docker-compose, plus the ability to spread containers across different hosts to create redundancy.
Kubernetes (multiple containers, multiple hosts)	All of Docker Swarm, plus additional capabilities with respect to access control, network routing, scaling up and down, automated deployment logic, and controlled rollout.
Helm (multiple containers, multiple hosts)	All of Kubernetes, plus templating. This is useful for creating exact replicas of the same environment across different host clusters, and for reusing and quickly adapting existing setups of complicated technologies (e.g., Apache Spark).

In the beginning, there were … virtual machines!

The concept of virtual machines is simple: run different operating systems on a single machine to satisfy the requirements of an organization’s software without running into conflicting version requirements. While amazing, this approach involves a so-called hypervisor, which comes with a steep performance cost. Additionally, virtual machines were often set up manually, and re-creating an existing, working image was challenging.

Containers - A New Hope

Then a new feature arrived in the Linux kernel that would later enable Docker: cgroups. In a nutshell: scrap the hypervisor, let different “containers” use the kernel’s functionality directly, and control their access to all other resources. No more performance loss due to hypervisors, and bare-metal performance for processes. At first, this was only available on Linux machines. Frameworks like LXC and Docker were born, and both made the user define the environment using code, i.e., environments became reproducible, and the instructions could be stored in code repositories. While containers are mutable in theory, the design patterns around them moved towards immutability, with persistent data stored through mounts.

One of the most popular frameworks to make use of containers on your system is Docker. The installation is very simple, and by the end, you have a Docker service running on your computer that you can interact with.

Using a single file (a Dockerfile), Docker allows you to define a completely new environment and install the software dependencies that you need for a specific piece of software. This environment is stored as a so-called “image” (docker build -t foo ….). To run it, you create a container from this image (docker run foo …). Containers also generally require far less space than virtual machines (if done right).

For example, see how the OpenTelemetry team extended the NGINX base image to use OpenTelemetry.

Comment on Windows: But what about Windows? You read that correctly: containers are a Linux feature. Docker can be installed on Windows, but we have seen many users report performance slowdowns similar to running multiple hypervisors. With the introduction of the Windows Subsystem for Linux 2 (WSL 2), performance has improved significantly, and this is no longer a problem.

Multiple Containers, Single Host

It is best practice to have one process per container. To get the most out of containerization, “you want to avoid one container being responsible for multiple aspects of your overall application”. Let’s look at containerizing WordPress as an example.

WordPress requires a database connection. Following the Unix philosophy of “make each program do one thing well”, we’ll want separate containers for the database and the application server. So now you need two containers: one for MySQL and one for WordPress (maybe one for nginx and another for php-fpm, but for simplicity we’ll use two containers). Both need to connect to each other, and ideally the MySQL database does not accept connections from anywhere else. This is the beginning of the “orchestration” of your containers.

This is the case of multiple containers running on a single machine, and it is where we start with docker-compose. On a single machine, you can define a docker-compose.yml file and set the parameters for the interactions between different containers. In newer versions of Docker, Compose ships with Docker, whereas with older versions you need to install it separately.

Sample docker-compose file for a simple WordPress environment (Source):


version: '3.1'

services:

  wordpress:
    image: wordpress
    restart: always
    ports:
      - 8080:80
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: exampleuser
      WORDPRESS_DB_PASSWORD: examplepass
      WORDPRESS_DB_NAME: exampledb
    volumes:
      - wordpress:/var/www/html

  db:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_DATABASE: exampledb
      MYSQL_USER: exampleuser
      MYSQL_PASSWORD: examplepass
      MYSQL_RANDOM_ROOT_PASSWORD: '1'
    volumes:
      - db:/var/lib/mysql

volumes:
  wordpress:
  db:

There is a lot going on in this example. First, you have a section called services. It defines the containers that you want to group together into one application. Every service in services has an image, which defines the Docker image the container is created from. Additional elements, like environment variables, can also be set.

In order to not lose the data stored by the containers in case they crash, volumes are defined and mounted into the container. Finally, the different containers can find each other through a Docker bridge network, which also has a DNS system (notice how WORDPRESS_DB_HOST is set to db).
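As a hedged sketch of the isolation goal mentioned earlier (the database should not accept connections from anywhere else), the same two services can be attached to explicitly declared networks. The network names frontend and backend are illustrative assumptions and not part of the sample above; internal is a Compose network option that creates a network without external connectivity. Volumes are left out here to keep the sketch short.

```yaml
version: '3.1'

services:

  wordpress:
    image: wordpress
    restart: always
    ports:
      - 8080:80            # only WordPress is published on the host
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: exampleuser
      WORDPRESS_DB_PASSWORD: examplepass
      WORDPRESS_DB_NAME: exampledb
    networks:
      - frontend           # reachable from the host through the published port
      - backend            # can resolve and reach "db"

  db:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_DATABASE: exampledb
      MYSQL_USER: exampleuser
      MYSQL_PASSWORD: examplepass
      MYSQL_RANDOM_ROOT_PASSWORD: '1'
    networks:
      - backend            # no ports section, so nothing is published on the host

networks:
  frontend:                # hypothetical name
  backend:                 # hypothetical name
    internal: true         # no external connectivity for this network
```

With this layout, only containers attached to backend can talk to the database, while the WordPress container remains reachable on port 8080 of the host.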

Multiple Containers, Multiple Hosts

You can see that docker-compose has a lot of built-in goodies: DNS, networking, volumes.

It has some advantages, particularly for those who are new to containerization. Its limitations become clear when applications need to be highly available and/or scale horizontally across multiple machines. This is where Kubernetes, Docker Swarm, or other orchestration tools are preferred.

In dealing with multiple hosts, communication between machines becomes part of the consideration. In principle, the setup is always the same: there are “manager” nodes and there are regular nodes.

Docker Swarm

For Docker Swarm, the setup of managers and nodes is fairly simple. We can migrate the docker-compose.yml to Docker Swarm mode. The file structure is similar, but each service can have an additional section called deploy. In the deploy section, you can define how many replicas the specific container should have, or even rollback configurations. This is already fairly complete for simple applications.
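A minimal sketch of what such a deploy section could look like for the WordPress example is shown below. The replica counts, delays, and the file format version are illustrative assumptions, and volumes are left out for brevity. Note that in Swarm mode the restart key of plain docker-compose is ignored, and the restart behaviour is configured through deploy.restart_policy instead.

```yaml
version: '3.8'

services:

  wordpress:
    image: wordpress
    ports:
      - 8080:80
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: exampleuser
      WORDPRESS_DB_PASSWORD: examplepass
      WORDPRESS_DB_NAME: exampledb
    deploy:
      replicas: 3                  # three copies, spread across the swarm nodes
      restart_policy:
        condition: on-failure      # replaces "restart: always" in Swarm mode
      update_config:
        parallelism: 1             # update one replica at a time
        delay: 10s                 # wait between updating replicas
        failure_action: rollback   # roll back if the update fails
      rollback_config:
        parallelism: 1             # roll back one replica at a time

  db:
    image: mysql:8.0
    environment:
      MYSQL_DATABASE: exampledb
      MYSQL_USER: exampleuser
      MYSQL_PASSWORD: examplepass
      MYSQL_RANDOM_ROOT_PASSWORD: '1'
    deploy:
      replicas: 1                  # a single database instance
      restart_policy:
        condition: on-failure
```

On a manager node, a file like this would typically be deployed with docker stack deploy -c docker-compose.yml blog, where blog is a hypothetical stack name.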

Kubernetes

Kubernetes is designed to automate the deployment, scaling, and management of containerized applications within its cluster of hosts. It groups the containers that make up an application into logical units (pods) for easy management and discovery. In Kubernetes, you have a set of so-called manifests, and they define what runs in your cluster. You can define everything down to the last detail: If a pod comes up, what are the resource requirements for the host that it is deployed on? How many containers can access a storage resource? You can even abstract external resources and make access points on the internal network.
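To make the idea of manifests more concrete, here is a hedged sketch of two manifests for the WordPress container: a Deployment that states resource requirements for the pod, and a PersistentVolumeClaim that restricts how the storage can be accessed. All names (wordpress, wordpress-data), sizes, and resource figures are illustrative assumptions, and the sketch assumes that a database is reachable inside the cluster under the name db.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wordpress-data           # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce              # mountable read-write by a single node
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
  labels:
    app: wordpress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      containers:
        - name: wordpress
          image: wordpress
          ports:
            - containerPort: 80
          env:
            - name: WORDPRESS_DB_HOST
              value: db          # assumes a Service named "db" exists
          resources:
            requests:            # what the scheduler reserves on a node
              cpu: 250m
              memory: 256Mi
            limits:              # upper bound enforced at runtime
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: data
              mountPath: /var/www/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: wordpress-data
```

Manifests like these are usually applied with kubectl apply -f followed by the file name. The scheduler then only places the pod on a node that can satisfy the requested CPU and memory.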

Kubernetes provides incredible freedom, but with great freedom comes great complexity – if not carefully architected.

AWS has realized that the challenge of building a team to deliver a highly available, up-to-date, and secure container platform is a barrier to adoption, and provides ECS (Elastic Container Service), which is simplified to serve most containerized applications out there. ECS provides simplicity: AWS handles operations, patching, scalability, and security of the underlying infrastructure. This reduces the number of decisions and the time needed to build, deploy, or migrate containerized applications at scale.

The setup for Kubernetes is more complicated (there are multiple components, such as the scheduler, the API server, logging, etc.), but there are frameworks like MicroK8s by Canonical that can help reduce the complexity of the setup. You can follow the installation process on multiple hosts and have a functional Kubernetes system running fairly quickly. Beyond frameworks like MicroK8s, there are managed Kubernetes environments such as AWS EKS, Azure’s AKS, or Google’s GKE. For the purpose of this article, we treat the host system as a black box that scales with the containers. In reality, more work is needed.

Helm - The Package Manager for K8s

“Someone else must have already done this”… the world of Helm

As much as we like to believe that we are unique, many technology stacks and ways of working have been created before us, and would just need to be adapted through some templating to our use case… And that is Helm.

Helm is a package manager for Kubernetes. Its packages, called charts, are parametrized sets of Kubernetes manifests, and these ready-to-go charts are hosted in repositories (see Artifact Hub).
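To illustrate the templating idea, here is a minimal sketch of a template file as it might appear in a hypothetical chart (for example under templates/deployment.yaml). It is not taken from any published chart. The values keys replicaCount, image.repository, and image.tag are assumptions that would have to be defined in the chart's values.yaml, while .Release.Name and .Values are built-in objects of Helm's template language. Because of the template expressions, the file only becomes valid YAML after Helm has rendered it.

```yaml
# templates/deployment.yaml of a hypothetical chart.
# Assumed values.yaml content:
#   replicaCount: 2
#   image:
#     repository: wordpress
#     tag: latest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-wordpress
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-wordpress
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-wordpress
    spec:
      containers:
        - name: wordpress
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 80
```

Installing such a chart with different values, for instance through the --set flag of helm install, yields a differently sized copy of the same environment without touching the template itself.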

Instead of translating the WordPress YAML from above into Kubernetes manifests yourself, just use the Helm chart. Want a Kubernetes cluster with complete Prometheus and Grafana monitoring? Got you. Want to use Traefik as your reverse proxy to your services? Say no more. Want some advanced task scheduling using Apache Airflow? I mean, it is a pain to set up yourself, but luckily someone has done it before.

You get the idea.

Conclusion

The world of containers is wonderful and full of opportunities, and ways to manage your infrastructure in a controlled manner as code.

However, it can get complicated very quickly. The use of third-party templates helps simplify the process, but carries the risk that you inherit bad configurations or dangerous defaults.

That is why: Always scan configurations.

That includes your Kubernetes files, your docker-compose YAML files, and any configurations inside containers. This way, you know that things are set up right, and you do not run the risk of being the next company in the news for a data breach.

CoGuard provides discovery and analysis tools for configuration files. We provide code scanning (or static analysis) that includes security benchmarks (OWASP, CIS, DoD STIGs, etc.) and compliance frameworks. To get started, install the CLI:

pip3 install coguard-cli

Scan your IaC repo:

coguard folder ./

Or output your current AWS cloud configuration and scan:

coguard cloud aws

We assume you have the AWS CLI with valid credentials and Docker installed and running. It also works for Google Cloud Platform and Azure.

Photo credit Lucas Alexander on Unsplash
