But it is supposed to be more reliable!

Kubernetes (K8S) is a tool for running containers on a cluster with a high level of redundancy.

Consider the case where you have a website that must be up all the time. You containerize the website. This isolates the site from everything else on the machine, which adds a lot of security, but still gives you great flexibility.

The site is running just like it is supposed to, and you decide it is time to upgrade. You “roll out” the next version of the software. The new software is deployed, starts running, and is fully functional. The K8S cluster then changes some internal routing, and all traffic that used to go to the old container now goes to the new container.

When all traffic to the old container is done, the old container is terminated. Total downtime, zero.
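The rollout sequence above can be sketched as a toy model. This is not how K8S is implemented; the Container and Service classes and the roll_out function are hypothetical names made up for illustration, and the only point is the ordering: new version ready first, routing flipped second, old container terminated last.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    version: str
    ready: bool = False
    running: bool = True


@dataclass
class Service:
    """Toy stand-in for the cluster's internal routing (hypothetical)."""
    target: Container
    log: list = field(default_factory=list)

    def handle_request(self) -> str:
        # Every request goes to whichever container the service points at.
        return self.target.version


def roll_out(service: Service, new: Container) -> None:
    old = service.target
    new.ready = True                       # new version is up and functional
    service.log.append(f"{new.version} ready")
    service.target = new                   # routing flips to the new container
    service.log.append(f"traffic -> {new.version}")
    old.running = False                    # only now is the old one terminated
    service.log.append(f"{old.version} terminated")


site = Service(target=Container("v1", ready=True))
before = site.handle_request()
roll_out(site, Container("v2"))
after = site.handle_request()
print(before, after, site.log)
```

At no point in the sketch does the service point at a container that is not running, which is where the "zero downtime" comes from.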

Now it does work this way, MOST of the time. There are a few complexities to this which can cause issues. Your website consists of three parts: the “assets”, the “software”, and the “database”.

Since the database is running in a different container, there is no issue on the rollout of the website.

The software is part of the container. No issues there.

The assets, though… Those live on disk. The new website container must have access to those assets.

K8S resolves this through a Storage Method. A storage method is just a way for you to describe storage that exists outside the container and which persists even if the container goes away.

The reality is that the data exists on the “node” and the container must access the node storage. Some types of storage are entirely network based, and K8S has the tools to use them straight out of the box. Other storage methods are just ways of exposing the node’s file system to the container.

Consider these three methods: SAMBA, MS Shares, and NFS. All three are purely network based. You have some sort of server that serves the files over the network. K8S attaches the file systems directly to the container. Since the actual data exists back on the file server, it is possible to have multiple nodes access that data at the same time.

Now consider an iSCSI target. There is a file server that is serving that target. That target is exported as a block device. Think “hard drive”. When you attach it to a node, it is as if you had plugged a new hard drive into that node.

Once that block device is plugged in, you can do other work with it, but not as it is. You first have to format it: “format c:” for Windows users or “mkfs /dev/sda” for Linux people.

This is fine. Multiple nodes can attach the same block device. If one node formats the block device, the other nodes see it as a formatted drive that can be mounted.

Unfortunately, mounting it and writing to it from more than one node at a time will lead to data corruption. Most file systems are not designed to be modified from multiple hosts at the same time. Only a very few file systems are capable of running across multiple nodes at the same time.
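Why this corrupts data can be shown with a deliberately oversimplified model. Real file systems are far more involved; the BlockDevice and Host classes here are hypothetical, and the one assumption doing the work is that each host caches the "which blocks are free" map when it mounts and never asks the other host about it.

```python
class BlockDevice:
    """Shared disk: a free-block map plus the blocks themselves."""

    def __init__(self, size: int) -> None:
        self.free = [True] * size
        self.blocks = [None] * size


class Host:
    """A host running a single-host file system on the shared device."""

    def __init__(self, name: str, device: BlockDevice) -> None:
        self.name = name
        self.device = device
        # Assumption: the map is cached at mount time and never re-read.
        self.cached_free = list(device.free)

    def write_file(self, data: str) -> int:
        # Pick the first block that THIS host believes is free.
        block = self.cached_free.index(True)
        self.cached_free[block] = False
        self.device.free[block] = False
        self.device.blocks[block] = data
        return block


disk = BlockDevice(size=4)
node_a = Host("node-a", disk)   # both nodes mount the same formatted device
node_b = Host("node-b", disk)

a_block = node_a.write_file("assets from node-a")
b_block = node_b.write_file("assets from node-b")
print(a_block, b_block, disk.blocks)
```

Both hosts pick the same block, and the second write silently replaces the first. Neither host reports an error, which is what makes this kind of corruption so nasty.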

This is the issue we are currently having with Linode. We request an iSCSI block device from Linode. This is provided for use as a “persistent volume”. When a container needs access to a particular volume, it uses a persistent volume claim.
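For reference, a persistent volume claim is just a small manifest. Here is a minimal sketch that builds one in Python and prints it as JSON, which kubectl accepts as well as YAML. The claim name, the size, and the storage class name are all assumptions for illustration, not values from our cluster; check the storage class names your provider actually exposes.

```python
import json


def make_pvc(name: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume may be mounted read-write by one node.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical values: "site-assets", "10Gi" and the class name are assumed.
pvc = make_pvc("site-assets", "10Gi", "linode-block-storage-retain")
print(json.dumps(pvc, indent=2))
```

The accessModes line is the one that matters for the rest of this story.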

To access that volume, K8S tells the node to mount the device. The node then allows the container to mount that part of the file system. Because the file system can only be used safely by one host at a time, the PV can only be mounted on one node at a time.

To access that PV on a different node, the current node must first detach the PV. Then the container must move to the other node, and the other node must mount the PV. Only then can the container access the data again.
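The handoff described above boils down to a one-holder-at-a-time rule. A minimal sketch, with a hypothetical PersistentVolume class and made-up node names, and no claim that this is how K8S or Linode track attachments internally:

```python
class VolumeBusy(Exception):
    """Raised when a second node tries to attach an already attached volume."""


class PersistentVolume:
    def __init__(self, name: str) -> None:
        self.name = name
        self.attached_to = None          # at most one node at a time

    def attach(self, node: str) -> None:
        if self.attached_to not in (None, node):
            raise VolumeBusy(
                f"{self.name} is still attached to {self.attached_to}"
            )
        self.attached_to = node

    def detach(self, node: str) -> None:
        # Only the node that holds the volume can release it.
        if self.attached_to == node:
            self.attached_to = None


pv = PersistentVolume("site-assets")
pv.attach("node-1")

try:
    pv.attach("node-2")                  # refused: node-1 has not detached
    second_attach = "ok"
except VolumeBusy as err:
    second_attach = str(err)

pv.detach("node-1")                      # the old node lets go...
pv.attach("node-2")                      # ...and only now can node-2 attach
print(second_attach, "|", pv.attached_to)
```

Note that in this model only the holder can release the volume. If the holder never calls detach, every attach attempt from another node keeps failing.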

On Monday, a node attached the PV and then died. It did not detach the PV. K8S instantly discovered that GFZ was down and started a new container on a different node. The new node asked to attach the PV. The old node was dead, so didn’t say a damn thing. For 6 hours it sat like that until we destroyed both the new and old node and got a new new node, at which point everything started working again.

This happened because Linode forced an upgrade of our K8S, and it did not go cleanly.

On Tuesday, we upgraded K8S again. This time the new nodes all came up. Unfortunately, the node which GFZ was attempting to run on refused to attach the PV. Once Linode had looked at the issue, we kicked that node hard and it all started working again.

We need something that provides reliable, robust, multihost access. The answer is something called “Ceph”. That was my fun for the day: learning enough about Ceph to let us migrate to it for our persistent volumes.


Comments

7 responses to “But it is supposed to be more reliable!”

  1. Travis Roberts

    I’ve been running 3-node Windows clusters for about a decade without having taken the Kubernetes plunge. Those clusters connect to a SAN via SAS (iSCSI before SAS was a thing). Maybe I’m just lucky, but I’ve never had anything go down as hard as you’ve described.
    Regardless, my hat’s off to you for embracing so many different technologies at once. That is no easy task.

    1. I have something like 50TB of spinning storage attached to two machines in the house via SATA and SAS. I’ve used the technology. I have a project that I’ve put on the back burner to bring another 2050TB online via SAS.

  2. curby

    Is this why every time I get an “update” it fubars what was supposed to be “new and improved”???
    I’m an old knuckle dragger so this stuff isn’t my forte… Y’all are doing a brilliant job keeping us happy.

  3. Crawford

    I’d always take an outage for an infrastructure upgrade, especially with a shared, writable resource. But I know Linode likes to force upgrades if you don’t get to it on their schedule.

    And for all the ability of K8S to give you zero-downtime application updates, my employer still insists on off-hour deploys. If half my team weren’t in India, I’d be pushing back because I’m too old to get up in the middle of the night.

    1. The “forced upgrade” was entirely my fault. I missed their 48 hour warning.

      The issue with Linode is that they don’t offer a ReadWriteMany. They only offer ReadWriteOnce.

  4. Birdog357

    I know you’re using language, but I’ll be damned if I have a clue what you are talking about…

    1. So what is your preferred 45-70 load? *GRIN*

      That you even read a part of this is amazing. I’m ranting because I’m just upset.