People get very upset when they go to visit Amazon, Netflix, or just their favorite gun blog and the site is down.
This happens when a site is not configured with high availability in mind.
The gist is that we do not want to have a single point of failure, anywhere in the system.
To take a simple example, say you have purchased a full network connection to your local office. This means that there is no shared IP address. You have a full /24 (256 addresses, 254 of them usable for hosts) to work with.
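If you want to check that math yourself, Python's ipaddress module will do it. The 203.0.113.0/24 block below is just a documentation range standing in for whatever block your provider actually assigns:

```python
import ipaddress

# 203.0.113.0/24 is a documentation range; substitute the block your provider assigns.
net = ipaddress.ip_network("203.0.113.0/24")

print(net.num_addresses)        # 256 total addresses
print(len(list(net.hosts())))   # 254 usable for hosts
print(net.network_address)      # 203.0.113.0   (network address)
print(net.broadcast_address)    # 203.0.113.255 (broadcast address)
```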
Physically, a wire comes into your office from your provider and attaches to a router. The router attaches to a switch. Servers connect to the server-room switch, which connects to the office switch.
All good.
You are running a Windows Server on bare metal with a 3 TB drive.
Now we start to analyze failure points. What if that cable is cut?
This happened to a military installation in the 90s. They had two cables coming to the site. There was one from the south gate and another from the north gate. If one cable was cut, all the traffic could be carried by the other cable.
This was great, except that somebody wasn’t thinking when they ran the last 50 feet into the building. They ran both cables through the same conduit. And when there was some street work a year or so later, the conduit was cut, severing both cables.
The site went down.
Most good routers today have multiple ports, so both cables come in and plug into two ports on the same router.
This means that if the router fails, is turned off while updating, or any of a dozen other things, the entire site loses access to the Internet.
So you put a second router in place, and each cable goes to a dedicated router. Now if we lose a router, the other router and cable take over. If we lose a cable, the other cable and router take over.
Except now one router is out of production for an upgrade, and the cable feeding the OTHER router goes dark. You have lost connectivity because of two single failures.
So we are back to fixes. The fix is that each cable terminates in a dedicated switch, and each switch has a cable going to each of the routers. This means that both routers have access to both cables. We now have High Availability to the building LAN.
We get HA over the network by running two or more physical networks. We attempt to keep the physical cables and equipment separated from each other. We don’t want some remote hands unplugging a bundle and taking out both networks.
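In practice the failover itself is normally handled by the routing gear (VRRP, dynamic routing, and the like), but a crude monitoring sketch shows the idea. This assumes two made-up gateway addresses, 10.0.1.1 and 10.0.2.1, one per cable/router path:

```python
import subprocess

# Hypothetical gateways, one per cable/router path.
PATHS = {"path-A": "10.0.1.1", "path-B": "10.0.2.1"}

def path_is_up(gateway: str) -> bool:
    """Return True if the gateway answers one ping within 2 seconds (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", gateway],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

status = {name: path_is_up(gw) for name, gw in PATHS.items()}

if all(status.values()):
    print("Both paths up: full redundancy.")
elif any(status.values()):
    survivor = [name for name, ok in status.items() if ok][0]
    print(f"Degraded: only {survivor} is up. Fix the other path before it becomes an outage.")
else:
    print("Both paths down: the site is off the air.")
```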
So now we reach the actual server. That server has two network cards in it. This gives us the redundancy we need. Except, if one of those network cards fails, we have to take the entire system down to replace it.
The fix is to have two servers, each with two network cards and each capable of handling all the traffic required. We now have two servers up and running, serving the same content.
Except, are they really?
The servers have a 3 TB drive each. This is fine.
Except that each drive is independent of the other. Something has to keep the two drives perfect copies of each other.
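The real answer is replication (block-level replication such as DRBD, scheduled rsync, database replication, or shared storage), but even just detecting drift between the two copies helps. Here is a minimal sketch, assuming a hypothetical content directory /srv/www that should be identical on both servers; run it on each box and compare the two digests:

```python
import hashlib
import os
import sys

def tree_digest(root: str) -> str:
    """Hash every file under root (relative paths plus contents) into one digest."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                       # walk subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/srv/www"   # hypothetical content root
    print(tree_digest(root))
```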
This analysis goes on and on and on.
When I was doing work for the military, we had to write a disaster recovery plan.
The plan started with “Somebody tossed a hand grenade into the machine room, taking out disk drives, tape drives, and computer.”
Part of the plan was to bring a DD49 drive from one site to the other site, cable it up, then write the contents needed to boot the other system. Once we had written the boot code, the DD49 disk drive could be taken back and wired back in.
We had to consider every aspect.
High availability planning makes that disaster recovery plan look simple.
Bluntly, I don’t attempt HA systems. I want to have it. I work towards it. I have not made it there yet.
I do have a few sites that are very resilient. This site will be as well, once the infrastructure is cleaned up.
Suffice it to say, much of the work I am doing is based on having failover and fail-safes to create an HA infrastructure without breaking the bank.
Instead of solving every potential issue, we solve the most likely issues.
Comments
11 responses to “High Availability Services”
what I find frustrating with this modern day life- EVERY single app, program, thingy, whiz bang gets “updated” and the new “update” SUX. it works sssslllooowwwweeerrrr IF it works at all. it requires 10x the “touches” on the touch screen than before. it “buffers” 400 times more often.. and speaking of touch screen- whose idea was it to put every control in these new vehicles on a flippin touch screen???? distracted driving!! distracted driving!!! they scream. then they put everything on a tv screen you have to squint at to fukkin use! glad its only in my company work van… I will NEVER buy a new vehicle for personal use… maybe yall could enlighten some of us knuckle dragging neanderthals..
Then you have to think of edge cases. When the hurricane hit NYC a few years ago a lot of supposedly HA sites went down because the backup generators were in the basement, which was flooded. A similar story is the company that wanted to be “green” so they used bio diesel for their generators and 6 months later it had gelled and the generators wouldn’t run.
Some of our clients run firewalls and servers in HA pairs with a second ISP. A lot don’t, because they can survive an outage.
I got talked into “biodiesel” for my furnace one winter. 10% bio fuel. After 5 nozzles and 3 filters on the tank in a month and a half I told them to bring me real fuel… internet and “modern” wifi will go away around here in any minor or major event. I can get along without it… my wife, not so much.
Corporate bean counters who have the final say in how jobs are done want network redundancy. But they don’t want to PAY for network redundancy. When this narrow-minded penny-pinching mentality meets real world conditions the inevitable result is loss of service. Having been a network admin for a hospital system for more than a decade… as a “second hat”… my primary responsibility being imaging patients… I saw this short-sighted insanity routinely. They wanted everything done as cheaply as possible. And when the inevitable failures happened, compromising patient care, they would be livid wanting to know whose fault it was. And they would become positively apoplectic when told it was THEIR fault for refusing to budget and plan for failures.
The bean counters have infected most every large business… the “dei” word is analyst… usually a 20-something college “educated” moron. Proof of concept is one of the new big words. I have taught myself to just do my job and find ways to work around their lunacy.
The bean counter thing may be how Boeing ended up with MCAS without adequate redundancy. From what I read when those 737s first drilled themselves into the ground, that system has a single angle-of-attack sensor feeding its algorithms.
I’m not an aeronautical engineer, nor even a pilot (just a skydiver), but I looked at this and said to myself: hm, a control system whose specific purpose is to push the nose down (i.e., to aim the aircraft towards the ground), that’s a safety-of-flight critical system. An SoF-critical system with non-fault-tolerant sensor inputs? Surely you’re kidding me…
I’ve worked on building network and storage systems for decades, and HA is a big piece of the design effort. For networks, it’s redundant boxes (routers and switches), more wires, and careful design of the protocols and algorithms so they detect outages and route around them quickly enough that applications don’t fail. That last part can be tricky. For one thing, it means that the system has to react, guaranteed, within N seconds. It’s not good enough to be fast enough on average, if some of the time it takes longer to reroute than it takes for applications to die. (We liked to joke that the HA deadline is “less time than it takes the customer to get p**d off and call Customer Support”.)
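(A toy illustration of that hard-deadline point, with a made-up 3-second deadline: the monitor below declares the peer dead exactly when the deadline passes, no matter how fast things usually are.)

```python
import time

DEADLINE = 3.0   # seconds; a made-up number for illustration

class HeartbeatMonitor:
    """Declare a peer failed if no heartbeat has arrived within DEADLINE seconds."""

    def __init__(self, deadline: float = DEADLINE):
        self.deadline = deadline
        self.last_seen = time.monotonic()

    def heartbeat(self) -> None:
        self.last_seen = time.monotonic()

    def is_failed(self) -> bool:
        # Hard bound: average-case speed is irrelevant, only the worst case counts.
        return time.monotonic() - self.last_seen > self.deadline

monitor = HeartbeatMonitor()
monitor.heartbeat()
time.sleep(0.5)
print(monitor.is_failed())   # False: still well inside the deadline
```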
For storage, a big piece is RAID: multiple disk drives that together store the data. A classic setup is RAID-1, a.k.a. “mirroring,” which is just what it sounds like. More efficient storage-wise but slower are RAID-5 and RAID-6, which use one or two extra drives per group of data drives. The hard part for all of these is the software for the error case. The “happy path” is easy, but it’s only about 10% of the total work.
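(For anyone who hasn’t seen it, the core of that RAID-5 parity trick is just XOR. A toy sketch, with byte strings standing in for disk stripes; a real array does this per block, with rotating parity and far more error handling:)

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three "data drives" holding one stripe each (toy data, equal lengths).
d0 = b"AAAAAAAA"
d1 = b"BBBBBBBB"
d2 = b"CCCCCCCC"

parity = xor_blocks(d0, d1, d2)      # what the extra "parity drive" stores

# Drive d1 dies: rebuild its stripe from the survivors plus parity.
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1
print("rebuilt d1:", rebuilt)
```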
For storage systems especially, I learned that “hot swap” is a big deal, and not an easy one. You build the hardware with redundant components (disk drives, power supplies, line cards). But instead of requiring a system shutdown to replace a failed part, which defeats the point of being redundant, you can pull out the bad part and plug in a replacement while the system keeps running. That requires careful design both at the electrical level and in the software, but it’s doable and it makes for systems that can keep on going for years on end.
Main and backup path fibers run into the building via the same conduit, and the two directions of a SONET ring on the same pole: both are referred to as ‘redumbdancy.’
A variation appeared in the telephone system of the nearby town of Hollis, NH. At the time it was run by a local guy, who also sold fuel oil. Their SONET “ring” was only connected to the outside world at one end; in other words, it was installed with the first fault already in place.
So when a squirrel chewed through that one connection, the entire town disappeared from view, including 911 service. I think it took a day to fix the fault; it’s not clear that the underlying mistake was ever corrected.
I learned a bit about fault tolerance design from a Navy guy, who talked to me about FDDI networks aboard ship (normally two rings, with single fault tolerance). The Navy setup ran one ring on the port side and one on the starboard side. I asked why; the answer: “so a torpedo won’t take out both of them.” Oh. “Fault tolerance” takes on an entirely different meaning when enemy action is involved, something I as a civilian wasn’t used to thinking about. Then again, in studying encryption you do learn to think of “data errors” created intentionally and maliciously rather than random things resulting from noise.
The town that I live in is running telephone and internet on buried lines that were put in in 1947… good times.
One of our local offices in a neighboring town dumped completely off of our network. What is really nuts is that I -know- it had two physically separate fiber paths, and even a private microwave T1 circuit for a last ditch connection.
Well, the T1 was up so we could at least telnet into the routers and stuff. Lo and behold ALL the fiber paths through that com room were down, including SEPARATE fiber paths for the SCADA system as well as the biz net..
Ok, beat feet to that office, and on the way there, start looking for car wrecks, fires, etc, that could get that group of cables to fail SOMEWHERE on the path.
Nothing.
Finally start pulling out the 19″ rack mount fiber fusion splice trays, and out of the 24 fibers in those trays, I think four were still intact.
All of the broken ones had at least two chews through them, resulting in a pile of 1/2″ to 1″ fiber segments scattered in the trays, interspersed with numerous rat turds.
Never found the rat. (I think it was a small pack rat as I’d seen them out in the linemans’ warehouse caught in the traps.)
I hope the little bastard at least got a bad stomach ache..