Part of the task of making a High Availability system is to make sure there is no single point of failure.
To this end, everything is supposed to be redundant.
So let’s take the office infrastructure as a starting point. We need to have multiple compute nodes and multiple data storage systems.
Every compute node needs access to the same data storage as all the other compute nodes.
We start with a small Ceph storage cluster. There are currently a total of 5 nodes in three different rooms on three different switches. Unfortunately, they are not split out evenly. We should have 9 nodes, 3 in each room.
The current nodes have 15 TB, 8 TB, 24 TB, 11 TB, and 11 TB of storage, respectively. Two more nodes are ready to go into production, each with 11 TB of storage.
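For context, here is how the raw numbers work out. A quick sketch; the 3-way replication factor is my illustrative assumption, not something the cluster configuration necessarily uses:

```python
# Quick capacity arithmetic from the node sizes above (TB).
# The size=3 replication factor is an assumption for illustration.
current = [15, 8, 24, 11, 11]   # nodes in production
pending = [11, 11]              # nodes ready to go in
raw = sum(current) + sum(pending)
usable = raw / 3                # under assumed 3-way replication
print(raw, round(usable, 1))   # -> 91 30.3
```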
It is currently possible to power off any of the storage nodes without affecting the storage cluster. Having more nodes would make the system more redundant.
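Since the goal is three nodes in each of three rooms, the natural failure domain for Ceph is the room. One way to express that is a CRUSH rule that places each replica in a different room. This is only a sketch: the rule name and bucket layout are assumptions, not the cluster's actual map.

```
# Hypothetical CRUSH rule; assumes the hosts have been moved under
# three "room" buckets in the CRUSH hierarchy. Names are illustrative.
rule replicated_by_room {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type room
    step emit
}
```

With a rule like this, losing an entire room (as happened below) costs at most one copy of any object.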
Unfortunately, today, an entire room went down. What was the failure mode?
DHCP didn’t work. All the nodes in room-3 had been moved to a new 10Gbit switch (in reality, four 2.5Gbit copper ports and two 10Gbit SFP+ ports). The four 2.5Gbit ports were used to connect the three nodes and one access point, and one of the 10Gbit SFP+ ports was used as the uplink to the main switch.
When the DHCP leases expired, all four machines lost their IP addresses. This did not cost me a network connection to them, because they also had static addresses on a VLAN.
What did happen is that they lost the ability to talk to the LDAP server on the primary network. With that primary network connection gone, there was no LDAP, and so no way to log in.
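The reason the boxes stayed reachable is this split: the primary network address came from DHCP, while the management address on the VLAN was static. A netplan-style sketch of that arrangement, with interface names and addresses made up for illustration:

```yaml
# Hypothetical netplan config (names and addresses are illustrative).
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: true            # primary network; dies with the DHCP lease
  vlans:
    vlan10:
      id: 10
      link: eno1
      addresses: [10.10.0.21/24]   # static management address survives
```

When the lease expired, only the `eno1` address disappeared; the VLAN address kept SSH alive but could not reach LDAP on the primary network.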
The first order of repair was to reboot the primary router, which serves as our DHCP server. This did not fix the issue.
Next, I power cycled the three nodes. This did not fix the issue.
Next, I replaced the new switch with the old 1Gbit switch (4x1Gbit, 4x1Gbit with PoE). This brought everything back to life.
My current best guess is that the Cat6 cable from room 3 to the main switch is questionable: the strain relief is absent and the cable feels floppy.
More equipment shows up soon. I’ll be pulling my first fiber in 25 years. The incoming switch will replace the current main switch, but only temporarily.
There will be a small switch for each of the three rooms, plus a larger switch to permanently replace the current main switch. The main switch will be linked over 10Gbit fiber to the three server rooms. The other long cable runs will continue to use copper.
Still, a lesson in testing.
The final configuration will be a 10Gbit backbone over OM4 fiber. The nodes will be upgraded with 10Gbit NICs, which will attach to the room switches via DAC cables. There will also be a 2.5Gbit copper network, which will be the default network for devices.
The 10Gbit network will be for Ceph and Swarm traffic.
I’m looking forward to having this all done.
Comments
2 responses to “Networking, interrelationships”
To err is human, to really screw things up takes(a)computer(parts)
progress is almost always 3 steps forward (oops) and 2 steps back
You just ran into Lamport’s Law (after algorithms researcher Leslie Lamport): a computer you never heard of can screw up your whole day.
I view DHCP as something to be used sparingly. It’s useful for client nodes like my laptop (though even there, while the client uses DHCP the server assigns a fixed address). But for infrastructure nodes it seems safer and easier to set fixed addresses locally and avoid DHCP entirely.
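The "client uses DHCP, server assigns a fixed address" pattern the commenter describes is a DHCP reservation. In dnsmasq it is a single line; the MAC address, hostname, and IP here are made up for illustration:

```
# Hypothetical dnsmasq reservation: the laptop still does normal DHCP,
# but the server always hands it the same address.
dhcp-host=aa:bb:cc:dd:ee:ff,laptop,192.168.1.50
```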