People get very upset when they go to visit Amazon, Netflix, or just their favorite gun blog and the site is down.
This happens when a site is not configured with high availability in mind.
The gist is that we do not want to have a single point of failure, anywhere in the system.
To take a simple example, you have purchased a full network connection to your local office. This means that there is no shared IP address. You have a full /24 (256 addresses, 254 of them usable by hosts) to work with.
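If you want to check the size of a /24 for yourself, Python's standard `ipaddress` module will do the arithmetic. This is only a quick sketch: the 203.0.113.0/24 block is a documentation-only range that I am using as a stand-in, not anybody's real allocation.

```python
import ipaddress

# 203.0.113.0/24 is reserved for documentation; it stands in for the office block.
office_block = ipaddress.ip_network("203.0.113.0/24")

# Every address covered by the prefix, including network and broadcast.
total_addresses = office_block.num_addresses

# hosts() leaves out the network and broadcast addresses.
usable_hosts = list(office_block.hosts())

print(total_addresses)                    # 256
print(len(usable_hosts))                  # 254
print(usable_hosts[0], usable_hosts[-1])  # 203.0.113.1 203.0.113.254
```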
This means that there is a wire that comes into your office from your provider. This attaches to a router. The router attaches to a switch. Servers connect to the server room switch which connects to the office switch.
All good.
You are running a Windows Server on bare metal with a 3 TB drive.
Now we start to analyze failure points. What if that cable is cut?
This happened to a military installation in the 90s. They had two cables coming to the site. There was one from the south gate and another from the north gate. If one cable was cut, all the traffic could be carried by the other cable.
This was great, except that somebody wasn’t thinking when they ran the last 50 feet into the building. They ran both cables through the same conduit. And when there was some street work a year or so later, the conduit was cut, severing both cables.
The site went down.
Most good routers today have multiple ports. So both cables come to the router and plug into two ports in the same router.
This means that if the router fails, is turned off while updating, or any of a dozen other things, the entire site loses access to the Internet.
So you put a second router in place. Now each cable goes to a dedicated router. Now if we lose a router, the other router and cable take over. If we lose a cable, the other cable and router take over.
Except, you have a router out of production for an upgrade, and the OTHER cable goes dark. Now you have lost connectivity because of two single failures, neither of which should have mattered on its own.
So we are back to fixes. The fix is that each cable terminates in a dedicated switch. That switch has a cable going to each of the routers. This means that both routers have access to both cables. We now have High Availability to the building LAN.
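This kind of "what if X dies?" analysis can be done mechanically. Below is a minimal sketch that treats each design as a graph and asks which sets of failed components cut the LAN off from the Internet. The component names (`cable_a`, `router_1`, `switch_a`, and so on) and the three link lists are my own hypothetical models of the designs described above, not a real inventory.

```python
from itertools import combinations

def reachable(links, source, target, failed=frozenset()):
    """Return True if target can be reached from source without touching failed nodes."""
    if source in failed or target in failed:
        return False
    neighbours = {}
    for a, b in links:
        if a in failed or b in failed:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def fatal_failures(links, source, target, how_many):
    """List every set of `how_many` components whose joint failure cuts source from target."""
    parts = sorted({n for link in links for n in link} - {source, target})
    return [combo for combo in combinations(parts, how_many)
            if not reachable(links, source, target, frozenset(combo))]

# Hypothetical model: both cables plug into one router.
one_router = [("internet", "cable_a"), ("internet", "cable_b"),
              ("cable_a", "router_1"), ("cable_b", "router_1"),
              ("router_1", "lan")]

# Hypothetical model: each cable has its own dedicated router.
dedicated = [("internet", "cable_a"), ("internet", "cable_b"),
             ("cable_a", "router_a"), ("cable_b", "router_b"),
             ("router_a", "lan"), ("router_b", "lan")]

# Hypothetical model: each cable lands on a switch that feeds both routers.
cross_connected = [("internet", "cable_a"), ("internet", "cable_b"),
                   ("cable_a", "switch_a"), ("cable_b", "switch_b"),
                   ("switch_a", "router_a"), ("switch_a", "router_b"),
                   ("switch_b", "router_a"), ("switch_b", "router_b"),
                   ("router_a", "lan"), ("router_b", "lan")]

print(fatal_failures(one_router, "internet", "lan", 1))  # [('router_1',)]
print(fatal_failures(dedicated, "internet", "lan", 1))   # []
print(("cable_b", "router_a") in fatal_failures(dedicated, "internet", "lan", 2))        # True
print(("cable_b", "router_a") in fatal_failures(cross_connected, "internet", "lan", 2))  # False
```

The last two lines are the router-down-for-upgrade story: losing router A and cable B together is fatal with dedicated routers, and survivable once the cables are cross-connected. Printing the full list of pairs for the cross-connected model is worth doing too, because the new switches are components that can fail as well.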
We get HA over the network by running two or more physical networks. We attempt to keep the physical cables and equipment separated from each other. We don’t want some remote hands unplugging a bundle and taking out both networks.
So now we reach the actual server. That server has two network cards in it. This gives us the redundancy we need. Except, if one of those network cards fails, we have to take the entire system down to replace it.
The fix is to have two servers with two network cards, each of which is capable of routing all the traffic required. We now have two servers up and running, serving the same content.
Except, are they really?
The servers have a 3 TB drive each. This is fine.
Except that each drive is independent of the other. There needs to be something that keeps the drives perfect copies of each other.
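Keeping the drives in sync is a job for real replication tooling, which I am not going to sketch here. What is easy to sketch is the check that tells you the two copies have drifted apart. The example below is a minimal version, assuming the content lives in an ordinary directory tree on each server; the function names `tree_digests` and `drift` are made up for illustration, and the temporary directories stand in for the two drives.

```python
import hashlib
import tempfile
from pathlib import Path

def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch, not for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[path.relative_to(root).as_posix()] = digest
    return digests

def drift(primary_root, replica_root):
    """Return the relative paths that are missing or different on either side."""
    primary = tree_digests(primary_root)
    replica = tree_digests(replica_root)
    return sorted(p for p in primary.keys() | replica.keys()
                  if primary.get(p) != replica.get(p))

# Two throwaway directories play the part of the two servers' drives.
with tempfile.TemporaryDirectory() as server_a, tempfile.TemporaryDirectory() as server_b:
    for root in (server_a, server_b):
        Path(root, "index.html").write_text("hello")
    Path(server_a, "new-post.html").write_text("only on server A")
    print(drift(server_a, server_b))  # ['new-post.html']
```

An empty list means the two trees hold the same files with the same contents. Anything else is a page that one server will serve and the other will not.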
This analysis goes on and on and on.
When I was doing work for the military, we had to write a disaster recovery plan.
The plan started with “Somebody tossed a hand grenade into the machine room, taking out disk drives, tape drives, and computer.”
Part of the plan was to bring a DD49 drive from one site to the other site, cable it up, then write the contents needed to boot the other system. Once we had written the boot code, the DD49 disk drive could be taken back and wired back in.
We had to consider every aspect.
High availability planning makes that disaster recovery plan look simple.
Bluntly, I don’t attempt HA systems. I want to have it. I work towards it. I have not made it there yet.
I do have a few sites that are very resilient. This site will be as well, once the infrastructure is cleaned up.
Suffice to say, much of the work I am doing is based on having failover and fail-safes to create an HA infrastructure without breaking the bank.
Instead of solving every potential issue, we solve the most likely issues.