Taking care of infrastructure

I started caring about computer infrastructure in the early 1980s. We fed our computer via punch cards, 9-track tape, and a few dozen hardwired terminals at 4800 baud.

We upgraded our network. We got our IBM 3090 on BITNET. I learned more about networking.

We upgraded to 10BASE2 when our Sun-3/60s arrived. More and more of campus had Ethernet.

When I arrived in Maryland, I was babysitting some supercomputers. There were nearly 1,000 computers hooked up to the network, and most of them were running some variation of Unix.

Keeping all of those machines up to date took a highly skilled team of system administrators. They handled all the machines on campus except for the supercomputers, which my team took care of.

If they needed help, the team could call on my Mentor’s team. His team was part of the group of people that defined the Internet. Yes, really.

That support team spent about 25% of their time caring for around 800 Unix machines. They spent the other 75% trying to care for the Apple and Microsoft machines, and the workload kept growing as more and more of those machines came onto campus.

By the time I left, they had to grow that support team from two skilled workers to four. Two of them did nothing but Microsoft support.

The number of Unix boxes kept increasing, yet they still took less than 20% of the team’s effort.

I wish that was still the case.

To support multiple machines, you need to be able to reach out, remotely, and configure those machines in standard ways. When you are done, you want a system that just works when you sit down to do work.

I first learned Puppet, a pull-style configuration management system. Its advantage is that the clients reach out to the puppet master, so they can be managed even from behind firewalls.

What this meant for me was that I could create an installation image that our developers could use to bring up a machine, either bare metal or virtual. Once that machine had booted, it called home and configured itself.

Over time, it continued to keep itself properly configured. I could make a configuration change on the puppet master and within 24 hours, every machine had the new configuration update.

The issue, of course, was that nothing happened until the client decided to call home, which could take up to a day. That doesn’t work when you need to push out a repair.

In addition, many of the “rules” required extensive coding in strange languages to accomplish anything, and what you “knew” about a system was only held in the flow of control of that code.

I had thousands of lines of code just to maintain our infrastructure.

Enter my current infrastructure.

I’m in the process of migrating away from K8S. The cost of using K8S is too high and the reliability just isn’t there without putting more money into servers.

The gist of K8S is that you set up a number of “nodes”. Each node acts as a master or a worker. All nodes run containers. You deploy containers to the K8S cluster and magic happens.
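
For illustration, “deploying a container” boils down to handing the cluster a manifest roughly like this one (the names and image are placeholders, not one of my actual services):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          containers:
            - name: web
              image: nginx:1.25
              ports:
                - containerPort: 80

You apply it with kubectl, the scheduler finds nodes with room, and the “magic” is the cluster continuously working to keep two copies of that container running somewhere.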

The reality is that K8S consumes resources like mad. With five nodes, I can’t keep even a dozen services running, and five nodes should be able to handle far more than a dozen services.

The issue is that I run out of memory. When a node runs out of memory, it starts killing processes, and it often kills an important container, at which point services go away.
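
Part of the problem is that Kubernetes only handles memory pressure gracefully when every container declares what it needs. Pods with no memory requests or limits are treated as best-effort and are the first candidates to be killed when a node runs short. A minimal pod spec with explicit numbers looks like this (the values are placeholders, not a recommendation):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-with-limits
    spec:
      containers:
        - name: web
          image: nginx:1.25
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"

Getting those numbers right for every service is its own ongoing tuning exercise.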

In addition, there is a lot of duplication of effort. The load balancer sends requests to the Ingress service, and the Ingress service then finds the container that is providing the named service. That named service is often running the same software as the Ingress server, so the same work gets done twice.
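
To make the chain concrete: the load balancer hands a request to the Ingress controller (often nginx or something similar), which matches a rule like the one below and forwards the request to the named Service, and the container behind that Service is frequently running the same web server software, parsing the same request all over again. The host and service names here are placeholders:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-ingress
    spec:
      rules:
        - host: service.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: example-service
                    port:
                      number: 80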

It isn’t uncommon for a K8S cluster to be running multiple database engines because that is the ‘easy’ way to deploy a service, even though a single database engine is more than capable of handling all of the database load.

So much duplication of effort.

So what are my requirements for the new infrastructure?

I need to be able to configure a newly booted system from barely functional to being part of the infrastructure.

I took this in multiple steps.

The first step was to create an LDAP service, which contains all the directory information my servers need: users, hosts, groups, and other references.

This is just the skeleton, the schema we will use, and it is only partially populated.
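
As a rough illustration (the base DN and layout here are made up rather than my actual tree), the skeleton amounts to a handful of organizational units that everything else hangs off of, and even that can be created with a small Ansible play:

    - name: Create the directory skeleton (illustrative sketch)
      hosts: ldap_server
      become: true
      tasks:
        # Needs the python-ldap library on the target; with these defaults the
        # module does a root SASL EXTERNAL bind over ldapi on the LDAP host.
        - name: Ensure the top-level organizational units exist
          community.general.ldap_entry:
            dn: "ou={{ item }},dc=example,dc=com"
            objectClass: organizationalUnit
            state: present
          loop:
            - People
            - Groups
            - Hosts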

Once this is in place, the next step can happen.

This is the first Ansible playbook. It is my first attempt, so it is very crude. It reaches out to an IP address and collects basic facts from the new system. It then installs and configures LDAP support on that system and populates the LDAP hosts entries with the names and IPs from the newly discovered host.

It then verifies the new connection methods, closes out the installation access, and leaves the server ready to configure.
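
For a sense of the shape of that playbook, here is a rough sketch; the module choices, package names, and directory layout are illustrative assumptions rather than the real thing:

    - name: Bootstrap a newly booted host (rough sketch, not the real playbook)
      hosts: new_hosts
      become: true
      gather_facts: true          # covers the "collect basic facts" step
      tasks:
        - name: Install the LDAP client pieces (Debian-style package names assumed)
          ansible.builtin.apt:
            name:
              - nslcd
              - libnss-ldapd
              - libpam-ldapd
            state: present

        - name: Point the NSS/PAM stack at the directory (template is hypothetical)
          ansible.builtin.template:
            src: nslcd.conf.j2
            dest: /etc/nslcd.conf
            mode: "0600"
          notify: Restart nslcd

        - name: Record the new host's name and address in LDAP
          community.general.ldap_entry:
            dn: "cn={{ ansible_hostname }},ou=Hosts,dc=example,dc=com"
            objectClass:
              - device
              - ipHost
            attributes:
              ipHostNumber: "{{ ansible_default_ipv4.address }}"
            server_uri: "ldap://ldap.example.com"
            bind_dn: "cn=admin,dc=example,dc=com"
            bind_pw: "{{ ldap_admin_password }}"
          delegate_to: localhost    # needs python-ldap on the control node
          become: false

      handlers:
        - name: Restart nslcd
          ansible.builtin.service:
            name: nslcd
            state: restarted

The real playbook also verifies that the new LDAP-based logins actually work and then removes the temporary installation access; those steps are left out of the sketch.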

The only thing that I have not been able to do, so far, is record an updated inventory. I’ll figure that out soon enough; Ansible is that powerful.

Once I have manually updated the correct inventory, I can configure the new server by running the playbook for the service it will provide.
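
In practice that means running the service’s playbook against the new machine, something like this (the playbook and inventory names are made up):

    ansible-playbook -i inventory/hosts.ini webserver.yml --limit newhost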

All in all, ansible is proving to be very capable.


Comments

5 responses to “Taking care of infrastructure”

  1. pkoning

    I have some exposure to Ansible and find it utterly inscrutable, not least because Yaml is a scripting notation with no perceptible definition. Whenever I can I use Python tools instead.

  2. It’s just Boris

    Would that perchance be a Cray in the picture?

  3. Chris Johnson

    Not just a Cray X/MP, but my X/MP.

    1. pkoning

      You own a Cray? Wow. Does it work?

  4. Slow Joe Crow

    Interesting, in a previous job we used Puppet and it seemed to do everything in reasonable time. It may have helped that a lot of the stuff was simply cloned VMs. I have looked at Chef, which is popular in large environments and uses Ruby, which is a relatively straightforward scripting language.