
For Lack of (nerd post)

Oh what a tangled web we weave when first we practice to deceive… I mean, be a system admin

I’ve been deep into a learning curve for the last couple of months, broken by required trips to see dad before he passes.

The issue at hand is that I need to reduce our infrastructure costs. They are out of hand.

My original thought, a couple of years ago, was to move to K8S. With K8S, I would be able to deploy sites and supporting architecture with ease. One control file to rule them all.

This mostly works. I have a Helm deployment for each of the standard types of sites I deploy. Which works well for me.

The problem is how people build containers.

My old method of building out a system was to create a configuration file for an HTTP/HTTPS server that then served individual websites. I would put this on a stable OS. We would then do a major OS upgrade every four years, on an OS with a six-year support tail for LTS (Long-Term Support) releases.

This doesn’t work for the new class of developers and software deployments.

Containers are the current answer to all our infrastructure ills.

What is the advantage of a container? Each container is an isolated operating system. The developer decides what libraries to install, what versions of software are installed, and what tools are involved.

Since each container is isolated from every other container, a compromise of one container does not, by itself, give an attacker access to any other container.

The method of building a container is to start with a known, trusted image. As an example, you can choose an image of Ubuntu 20.04, 22.04, or 24.04. With this as your base image, you use a control file to decide what applications and libraries are layered on top of this image.

Then your application is added to the build image you just created. Perfect isolation. Perfect environment. Perfect for your application.
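To make that concrete, here is a minimal, hypothetical Dockerfile; the base image tag, package, and site directory are illustrative, not what I actually deploy.

```dockerfile
# Start from a known, trusted base image.
FROM ubuntu:24.04

# Layer the libraries and tools the application needs on top of that base.
RUN apt-get update && \
    apt-get install -y --no-install-recommends nginx && \
    rm -rf /var/lib/apt/lists/*

# Add the application (here, a static site) as the final layer.
COPY ./my-site /var/www/html

# Containers run one application, in the foreground.
CMD ["nginx", "-g", "daemon off;"]
```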

Now for the downside. Everything is ephemeral, unless you configure the container to use some outside storage. Persistent storage.

In addition, containers are designed to run exactly one application. So a WordPress site will often have multiple containers to support it.

There is the web server, there is the database engine, there is the cache server (Redis or Memcached).

Besides all of the above, there is also the issue of networking. Do you let the container live on the actual network? You can't: if you had two web servers, both would want to use the same ports.
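A hedged sketch of what that WordPress stack can look like as a Compose file: one container per job, named volumes for the data that has to survive restarts, and only the web container publishing a port. The image tags, host port, and password are placeholders.

```yaml
services:
  web:
    image: wordpress:latest
    ports:
      - "8080:80"            # each site needs its own host-side port
    volumes:
      - site_data:/var/www/html
    depends_on:
      - db
      - cache
  db:
    image: mariadb:11
    environment:
      MARIADB_ROOT_PASSWORD: change-me
    volumes:
      - db_data:/var/lib/mysql
  cache:
    image: redis:7

volumes:
  site_data:
  db_data:
```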

In addition to all of the above, there are memory considerations.

The upshot is that to have a stable K8S container cluster, I consume nearly 80% of my resources just running the cluster itself. This would not be an issue if I had more clients, so that I could afford more services.

Which means that k8s has to go.

Backups

For many years, I’ve used Amazon S3 for my backups. I use Amanda for backup control. This is a pull style of backup. Each client is configured to allow the server to connect and run software to dump file systems. Originally, this was the “dump” command.

The power of dump is in dump levels. At level 0, the complete file system is dumped. At level 1, only the changes since the latest level 0 are dumped. At level 2, only the changes since the latest level 1 are dumped.

This means that you have large dumps followed by smaller dumps.

A common method used, in the old days, was level 0 on the first day of the month. Level 1 every week, level 2 for all other days. With this system, it only takes reading 3 dumps to restore a file system.

In addition, you can recover individual files from backup dumps.

Great, except the “dump” only works on a limited number of file systems. Enter “Gnu Tape ARchive” or gtar.

“gtar” has the same level abilities as dump. It works on just about any file system. It just works.
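A rough sketch of how the level idea maps onto gtar, using its snapshot files; the paths are made up, and Amanda does all of this bookkeeping for you.

```sh
# Level 0: full backup. The snapshot file records the state of the file system.
tar --create --listed-incremental=/var/backups/home.l0.snar \
    --file=/var/backups/home.l0.tar /home

# Level 1: work from a copy of the level-0 snapshot, so only files changed
# since the level 0 end up in the archive.
cp /var/backups/home.l0.snar /var/backups/home.l1.snar
tar --create --listed-incremental=/var/backups/home.l1.snar \
    --file=/var/backups/home.l1.tar /home
```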

With Amanda, you tell the system how long you want to be able to recover files, and it does the scheduling for you. In addition, it has tools to make it easy(ier) to restore files from a particular time.

I have a 4 mm DDS DAT drive. Each tape holds 4 GB. Other technology for DAT gets up to around 15 GB. Even using 15 GB tapes, it would take anywhere from 3 to 30 tapes to hold a dump. And that needs to happen every day. Not to mention how long it takes to write or read from tape.

The answer to tapes being too small is to use virtual tapes. This is when you treat a file or file system as a tape. Instead of writing to a physical tape, the software writes to a disk file. Now my file system is recorded in a single file.
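In Amanda this is a small configuration change; here is a hedged amanda.conf fragment from memory, where the directory, sizes, and exact directive names are assumptions to check against the Amanda docs.

```
# Use a directory of "slots" as the tape changer; each dump lands in a disk file.
tpchanger "chg-disk:/var/lib/amanda/vtapes"
tapetype VTAPE

define tapetype VTAPE {
    comment "virtual tape on disk"
    length 500 gbytes
}
```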

With Amazon S3, I can store that single file for relatively low cost. Until you start to get into the Terabytes of storage.
Which is where I am. The client software doesn’t take up that much space, but my development environment does.

I needed to stop using Amazon S3.

Instead, I configured my Ceph cluster to have a pool that is a little slower, but not as wasteful of disk space.
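"Slower but less wasteful" here means an erasure-coded pool rather than three-way replication; a hedged sketch of creating one, where the profile values (k=3, m=2) and the pool name are just examples.

```sh
# Data is split into k data chunks plus m parity chunks, so usable capacity is
# k/(k+m) of raw space instead of one third of it with 3x replication.
ceph osd erasure-code-profile set backup-ec k=3 m=2 crush-failure-domain=host

# Create a pool using that profile, and allow overwrites so RBD/CephFS can use it.
ceph osd pool create backups 64 64 erasure backup-ec
ceph osd pool set backups allow_ec_overwrites true
```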

Ceph

My original Ceph cluster consisted of two host machines plus a couple of virtual machines, each using a ZFS block device.

In short, a good toy, not production ready.

As I started the process of moving from a toy to production, my Ceph cluster started to thrash. It was scary bad.

It was moving data for days. Locking up. Never recovering. It was frustrating as heck.

The fix? Stop mucking around with virtual machines and ZFS block devices.

This required adding more nodes to the house network. So far 3 new nodes. I will be adding two more. These don’t need to be powerful CPU monsters. They just have to have good network and disk I/O.

One of the nodes came from a discard pile, two more were cheap used boxes from Amazon. Two more came from Dad, those aren’t in the Cluster yet.

Each node gets two 12 TB drives.

Once the nodes were properly installed and configured, the Cluster stabilized. I got rid of ZFS for everything except the boot and home drives.

This experience caused me to do a deep dive into Ceph, to learn what it was doing and why. In the end, I actually feel comfortable creating my own CRUSH maps.
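As an illustration of the sort of thing that becomes approachable after that deep dive, here is a hedged example of steering placement with a CRUSH rule; the rule and pool names are hypothetical.

```sh
# A replicated rule that spreads copies across hosts, so losing one node can
# never take out more than one copy of a placement group.
ceph osd crush rule create-replicated across-hosts default host

# Point an existing pool at the rule.
ceph osd pool set my-pool crush_rule across-hosts
```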

Network

I have a couple of good switches at the house, and a few cheap-as-can-be switches. The only thing that is true about them is that they all have gigabit Ethernet ports.

This was plenty fast enough in the old days. But now that we have Ceph running, things need to change.

The first step is to upgrade the switches where the different nodes are located from Gig-E to 10Gig-E. These switches aren’t all that expensive, in the grand scheme of things. What this will do is allow 10 Gbit traffic between switches and 1 Gbit from the switch to the node.

Then, as needed, I can add 10 Gbit network cards to the nodes to increase the speed. The increased speed will only be required if the disk I/O exceeds the network I/O. Which is not the case.

Mail Server

I can purchase a Google Account for $6/user/month. That's $72 per year per user. If I want to have multiple domains, which I do, then I have to pay that $72/year/user/account. That adds up.

So what is the answer? To run my own mail server again. That requires a Mail Transport Agent (MTA) and a Mail User Agent (MUA).

When you are using Outlook or Thunderbird to interact with your mail, that is a type of MUA. It uses a network protocol to fetch mail and another protocol to send mail. The sending is done with SMTP and the fetching with IMAP or POP3.

There is a nice package that handles all of this. In a container system. It is called “mailu”. It has a K8S deployment script (Helm Chart), so with a bit of tweaking, I was able to deploy mailu to my k8s cluster.
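The deployment itself is roughly the usual Helm dance; a sketch from memory, so the repository URL and values file name are assumptions to verify against the Mailu documentation.

```sh
helm repo add mailu https://mailu.github.io/helm-charts/   # repo URL from memory
helm repo update

# Site-specific settings (domain, hostnames, storage) go in a values file.
helm install mailu mailu/mailu \
    --namespace mailu --create-namespace \
    --values mailu-values.yaml
```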

Which caused the K8S cluster to become unstable. Mailu consists of 9 different containers, working together to provide a full-featured mail service.

Those 9 containers consumed enough resources that my K8S nodes became resource starved and started killing the offending processes. That took my database engine down. It took my web servers down. It took my Ceph nodes down.

In short, it broke everything. Which is why everybody has been seeing so many 503 errors.

The other issue is that crashing a database engine will sometimes corrupt part of the database. When this happens, it requires manual intervention on my part to recover. This happened to GFZ the other day.

Well, I’m pleased to report that Mailu is no longer on the K8S cluster. It has been moved to an internal docker swarm.

With mailu gone, the cluster is a bit more stable. There is more to do.

Docker Swarm

In short, because I have a Ceph Cluster, I can have multiple docker hosts access the same files. This means that my persistent storage “just works”.

The downside? I’m still trying to wrap my head around the networking issues and how to publish ports correctly.

Regardless, I can generate a compose/stack file for deploying a stack (a group of containers) that does what I need it to do.
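A hedged sketch of such a stack file, assuming every swarm node has the same CephFS mount at /mnt/cephfs; the path, image, and port are illustrative.

```yaml
# stack.yml: any node can run the container and still see the same files,
# because the bind-mounted path lives on CephFS shared by all swarm nodes.
services:
  web:
    image: nginx:stable
    ports:
      - "8080:80"            # published on every node via Swarm's routing mesh
    volumes:
      - /mnt/cephfs/sites/example/html:/usr/share/nginx/html:ro
    deploy:
      replicas: 2
```

Deploying or updating it is then a single `docker stack deploy -c stack.yml example` on a manager node.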

I still need to get an ingress running in the swarm, I’m not sure how I will do that — yet. More research.

Ansible, bootstrap

One of the things I am not doing is buying more nodes before I know exactly how I am going to set them up, and before I can configure nodes rapidly.

This takes me back to Ansible. First, YAML is not a programming language. Second, Ansible is not designed to be “fast”; it is designed to be reliable and easy to use.

It meets those requirements.

For me, one of the largest issues is to get a node to a known state. When a node first comes up, it is going to have a “standard” software load. This load might not be what is needed, and the configuration might not be ready to use.

The first step of the bootstrapping process is to discover how to access the remote node. Depending on the node provider, there can be predefined users, there can be preinstalled ssh keys. There can be no users except for root. There might be no password authentication.

The user you log in with may need a password to do administrative (root) actions. Or it might not. Everything is unknown.

I attempted to write an ansible play to figure things out. I failed.

Then I made an amazing discovery: Ansible is just a method of tossing data structures from program to program.

I wrote a module that does nothing but discover the username and password to use to log into a node. It returns that information to Ansible.

Once that information is available to the play, I have a series of tasks that configure user access and secure the original access points.

It is as simple as telling my playbook that I want to bootstrap a node. When the bootstrap is completed, the rest of the playbook is free to assume standard access methods. This allows me to just install the software needed.
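The shape of that play, heavily abridged; bootstrap_access is my stand-in name for the discovery module described above, and the user and group names are placeholders.

```yaml
- hosts: new_nodes
  gather_facts: false          # gathering facts needs working credentials, which we don't have yet
  tasks:
    - name: Discover which user/password/key combination actually works
      bootstrap_access:        # hypothetical local module that does the probing
        host: "{{ inventory_hostname }}"
      register: access
      delegate_to: localhost

    - name: Create the standard admin user using the discovered credentials
      ansible.builtin.user:
        name: sysadmin
        groups: sudo
        append: true
      vars:
        ansible_user: "{{ access.user }}"
        ansible_password: "{{ access.password }}"
      become: true
```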

Conclusion

There is still the load balancer, the router ingress, the VPN, and a host of other issues to talk about. This is more than enough for tonight.

Oh, I forgot I had to learn new Bind9 capabilities to deal with internal IP addresses versus external IP addresses.
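That is split-horizon DNS with Bind9 views; a minimal sketch, with made-up networks and zone file names.

```
// Internal clients get the zone with private addresses; everyone else gets the public one.
acl internal-nets { 10.0.0.0/8; 192.168.0.0/16; };

view "internal" {
    match-clients { internal-nets; };
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com.internal";
    };
};

view "external" {
    match-clients { any; };
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com.external";
    };
};
```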

Progress is happening.


Comments

One response to “For Lack of (nerd post)”

  1. feffrey

    Have you looked at https://www.nomadproject.io/ ?
    Single binary container orchestrator. Lot simpler than k8s and supports CSI storage as well.