Nerd Babel

Filler

I’m exhausted. I’ve been pulling fiber for the last two days. All part of an infrastructure upgrade.

Normally, pulling cable in a modern datacenter is pretty easy. This is not a modern datacenter.

The original cable runs were CAT6 with RJ45 connectors. When the cables were installed, the installation had to be nondestructive. No holes in walls, no holes in floors. Hide the cables as best you can.

One of the cables we removed was to a defunct workstation. It had been run across the floor and then covered with a protective layer to keep it from getting cut or snagged. The outer insulation had been ripped away. There was bare copper showing. Fortunately, that particular workstation hasn’t been in place for a few years.

The backbone switch was mounted in the basement. Not a real issue. The people who pulled some of the last cable didn’t bother to put in any cable hangers, so there were loops just dangling.

There were drops that could not be identified. Those are now disconnected; nobody has complained, so apparently nothing important was taken offline.

I’ve found a new favorite cable organizer.

Cable Management Wire Organizer

These are reusable. They open fully and will hold a good number of CAT6 cables and even more fiber. They have 3M foam double-sided tape on them, which works great against smooth, clean surfaces.

Where they really shine is that they also have a hole sized for a #6 screw. In places where there were no smooth surfaces, much less clean ones, the sticky tape held them in place long enough to drive a screw.

There are no more dangling cables.

My only hope is that there are no more configuration issues with the new switch. *cough*DHCP*cough*

Networking, interrelationships

Part of the task of making a High Availability system is to make sure there is no single point of failure.

To this end, everything is supposed to be redundant.

So let’s take the office infrastructure as a starting point. We need to have multiple compute nodes and multiple data storage systems.

Every compute node needs access to the same data storage as all the other compute nodes.

We start with a small Ceph storage cluster. There are currently a total of 5 nodes in three different rooms on three different switches. Unfortunately, they are not split out evenly. We should have 9 nodes, 3 in each room.

Each of the nodes currently breaks out as 15 TB, 8 TB, 24 TB, 11 TB, and 11 TB. There are two more nodes ready to go into production, each with 11 TB of storage.

It is currently possible to power off any of the storage nodes without affecting the storage cluster. Having more nodes would make the system more redundant.
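As a back-of-the-envelope check on those numbers, here is a rough capacity sketch. The three-way replication and the 85% full-ratio margin are my assumptions for illustration, not settings read from the actual cluster.

```python
# Back-of-the-envelope capacity math for the cluster described above.
# Replication factor and full-ratio margin are assumptions, not real settings.

current_nodes_tb = [15, 8, 24, 11, 11]   # nodes currently in production
pending_nodes_tb = [11, 11]              # nodes ready to go into production

def usable_tb(nodes, replicas=3, full_ratio=0.85):
    """Very rough usable space: raw capacity / replicas, capped at the full ratio."""
    return sum(nodes) * full_ratio / replicas

print(f"raw now:   {sum(current_nodes_tb)} TB, "
      f"usable ~{usable_tb(current_nodes_tb):.0f} TB")
print(f"raw later: {sum(current_nodes_tb + pending_nodes_tb)} TB, "
      f"usable ~{usable_tb(current_nodes_tb + pending_nodes_tb):.0f} TB")
```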

Unfortunately, today, an entire room went down. What was the failure mode?

DHCP didn’t work. All the nodes in room-3 had been moved to a new 10Gbit switch; actually 4×2.5Gbit copper plus 2×10Gbit SFP+. The four 2.5Gbit ports were used to connect three nodes and one access point. One of the 10Gbit SFP+ ports was used as an uplink to the main switch.

When the DHCP leases expired, all four machines lost their IP addresses. This did not cause me to lose a network connection to them, because they had static addresses on a VLAN.

What did happen is that they lost the ability to talk to the LDAP server on the primary network. With that primary network connection gone: no LDAP, no ability to log in.

The first order of repair was to reboot the primary router. This router serves as our DHCP server. This did not fix the issue.

Next I power cycled the three nodes. This did not fix the issue.

Next I replaced the switch with the old 1Gbit switch (4x1Gbit, 4x1Gbit with PoE). This brought everything back to life.

My current best guess is that the CAT6 cable from room-3 to the main switch is questionable. The strain relief is absent and it feels floppy.
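For the next incident like this, a small reachability check would help separate a DHCP problem from a cabling problem before swapping hardware. A minimal sketch, assuming a Linux node with iproute2; the interface name and the LDAP server address are placeholders:

```python
#!/usr/bin/env python3
"""Quick triage: did this node get a DHCP address, and can it reach LDAP?

The interface name and LDAP server address are placeholders; substitute
whatever the primary (DHCP) interface and directory server really are.
"""
import socket
import subprocess

PRIMARY_IFACE = "eno1"                 # placeholder: primary, DHCP-managed interface
LDAP_SERVER = ("192.168.1.10", 389)    # placeholder: LDAP server on the primary network

def has_ipv4(iface: str) -> bool:
    """True if the interface currently holds an IPv4 address."""
    out = subprocess.run(
        ["ip", "-4", "addr", "show", "dev", iface],
        capture_output=True, text=True, check=False,
    ).stdout
    return "inet " in out

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print(f"{PRIMARY_IFACE} has an IPv4 address: {has_ipv4(PRIMARY_IFACE)}")
    print(f"LDAP {LDAP_SERVER[0]}:{LDAP_SERVER[1]} reachable: {can_reach(*LDAP_SERVER)}")
```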

More equipment shows up soon. I’ll be pulling my first fiber in 25 years. The new switch will replace the current main switch. This is temporary.

There will be three small switches, one for each room, and then a larger switch to replace the current main switch. The main switch will be linked to the three server rooms with 10Gbit fiber. The other long cables will continue to use copper.

Still, a lesson in testing.

The final configuration will be a 10Gbit backbone over OM4 fiber. The nodes will be upgraded with 10Gbit NICs, which will attach to the room switches via DAC cables. There will also be a 2.5Gbit copper network, which will be the default network used by devices.

The 10Gbit network will be for Ceph and Swarm traffic.
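Keeping the device traffic and the Ceph/Swarm traffic apart also means two address ranges. A sketch of how the addressing could be laid out with Python’s ipaddress module; the ranges and the per-room split are invented for illustration, not the actual plan:

```python
# Illustrative addressing for the two networks described above.
# The ranges and the per-room split are made up for the example.
import ipaddress

default_net = ipaddress.ip_network("10.10.0.0/24")   # 2.5Gbit copper, default for devices
storage_net = ipaddress.ip_network("10.10.10.0/24")  # 10Gbit fiber, Ceph and Swarm traffic

print(f"default : {default_net} ({default_net.num_addresses - 2} usable hosts)")

# Carve the storage network into one /26 per room, leaving one spare block.
for room, subnet in zip(["room-1", "room-2", "room-3"],
                        storage_net.subnets(new_prefix=26)):
    print(f"{room}  : {subnet} ({subnet.num_addresses - 2} usable hosts)")
```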

I’m looking forward to having this all done.


Docker Swarm?

There is this interesting point where you realize that you own a data center.

My data center doesn’t look like that beautiful server farm in the picture, but I do have one.

I have multiple servers, each with reasonable amounts of memory. I have independent nodes, capable of performing as Ceph nodes and as Docker nodes.

Which took me to a step up from K8S.
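The moving parts are pleasantly small: initialize a swarm on one manager, let the other nodes join, and create replicated services. A minimal sketch with the Docker SDK for Python; the advertise address, image, and replica count are placeholders:

```python
# Minimal Docker Swarm bootstrap using the Docker SDK for Python.
# The advertise address, image, and replica count are placeholders.
import docker
from docker.types import EndpointSpec, ServiceMode

client = docker.from_env()

# Turn this node into a swarm manager (run once, on one node only).
client.swarm.init(advertise_addr="192.168.1.21", listen_addr="0.0.0.0:2377")

# A replicated service: three copies of a web container, published on port 80.
client.services.create(
    image="nginx:stable",
    name="web",
    mode=ServiceMode("replicated", replicas=3),
    endpoint_spec=EndpointSpec(ports={80: 80}),
)

# The other nodes join with this token (docker swarm join --token ...).
client.swarm.reload()
print(client.swarm.attrs["JoinTokens"]["Worker"])
```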


High Availability Services

People get very upset when they go to visit Amazon, Netflix, or just their favorite gun blog and the site is down.

This happens when a site is not configured with high availability in mind.

The gist is that we do not want to have a single point of failure, anywhere in the system.

To take a simple example, you have purchased a full network connection to your local office. This means that there is no shared IP address. You have a full /24 (256 addresses, 254 of them usable) to work with.

This means that there is a wire that comes into your office from your provider. This attaches to a router. The router attaches to a switch. Servers connect to the server room switch which connects to the office switch.

All good.

You are running a Windows Server on bare metal with a 3 TB drive.

Now we start to analyze failure points. What if that cable is cut?

This happened to a military installation in the 90s. They had two cables coming to the site. There was one from the south gate and another from the north gate. If one cable was cut, all the traffic could be carried by the other cable.

This was great, except that somebody wasn’t thinking when they ran the last 50 feet into the building. They ran both cables through the same conduit. And when there was some street work a year or so later, the conduit was cut, severing both cables.

The site went down.
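That conduit is exactly the kind of shared dependency a tidy redundancy diagram hides. One way to catch it is to model every physical segment, shared conduits included, as a node in a graph and ask which single node would cut the site off if it failed. A toy sketch in Python; the topology below is invented to mirror the story above:

```python
# Toy single-point-of-failure finder: model the physical path as a graph
# and report any node whose loss disconnects the site from the provider.
# The topology below is invented for illustration.

TOPOLOGY = {
    "provider":       {"north_cable", "south_cable"},
    "north_cable":    {"provider", "shared_conduit"},
    "south_cable":    {"provider", "shared_conduit"},
    "shared_conduit": {"north_cable", "south_cable", "office_router"},
    "office_router":  {"shared_conduit", "office_switch"},
    "office_switch":  {"office_router", "server"},
    "server":         {"office_switch"},
}

def reachable(graph, start, removed):
    """Every node reachable from start when `removed` is treated as failed."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or node == removed:
            continue
        seen.add(node)
        stack.extend(graph[node] - seen)
    return seen

def single_points_of_failure(graph, src, dst):
    """Nodes (other than the endpoints) whose loss cuts src off from dst."""
    return [n for n in graph
            if n not in (src, dst) and dst not in reachable(graph, src, n)]

print(single_points_of_failure(TOPOLOGY, "provider", "server"))
# -> ['shared_conduit', 'office_router', 'office_switch']
```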



For Lack of (nerd post)

Oh what a tangled web we weave when first we practice to deceive... er, be a system admin

I’ve been deep into a learning curve for the last couple of months, broken by required trips to see dad before he passes.

The issue at hand is that I need to reduce our infrastructure costs. They are out of hand.

My original thought, a couple of years ago, was to move to K8S. With K8S, I would be able to deploy sites and supporting architecture with ease. One control file to rule them all.

This mostly works. I have a Helm deployment for each of the standard types of sites I deploy, which works well for me.
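For the curious, the rollout side of that is not much more than a loop over sites: one chart per site type, one values file per site. A hedged sketch; the chart paths, release names, and namespaces below are placeholders, not my actual layout:

```python
# Deploy each site from a shared chart plus a per-site values file.
# Chart paths, release names, and namespaces below are placeholders.
import subprocess

SITES = [
    {"release": "blog-example", "chart": "./charts/wordpress-site",
     "values": "sites/blog-example.yaml", "namespace": "blog-example"},
    {"release": "shop-example", "chart": "./charts/static-site",
     "values": "sites/shop-example.yaml", "namespace": "shop-example"},
]

for site in SITES:
    subprocess.run(
        ["helm", "upgrade", "--install", site["release"], site["chart"],
         "-f", site["values"],
         "--namespace", site["namespace"], "--create-namespace"],
        check=True,
    )
```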

The problem is how people build containers.

My old method of building out a system was to create a configuration file for an HTTP/HTTPS server that then served the individual websites. I would put this on a stable OS. We would then do a major OS upgrade every four years, on an OS with a six-year support tail for LTS (Long-Term Support) releases.
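For contrast, the old model is easy to sketch: one long-lived server, one config file, one server block per hosted site. Something like this little generator captures the shape of it; the site names and paths are invented, and the output is nginx-flavored purely as an example:

```python
# Sketch of the old model: emit one HTTP server block per hosted site.
# Site names and paths are invented; the output is nginx-style config.
SITES = ["example-blog.com", "example-shop.com", "example-docs.com"]

VHOST_TEMPLATE = """\
server {{
    listen 80;
    server_name {site} www.{site};
    root /var/www/{site}/public;
    access_log /var/log/nginx/{site}.access.log;
}}
"""

with open("sites-enabled.conf", "w") as conf:
    for site in SITES:
        conf.write(VHOST_TEMPLATE.format(site=site))
```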

This doesn’t work for the new class of developers and software deployments.

Containers are the current answer to all our infrastructure ills.
