My internal infrastructure is getting better and better. Unfortunately, it is still not stable enough.
The router is having memory issues, and the fix is to add more memory. The problem is that I need to take the router out of production to do so, and I’ve not been willing to do that.
The symptom is that connections time out. The fix is to restart HAProxy.
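A timeout like this can be detected from the outside with a plain TCP connection attempt. The sketch below is only an illustration, not what runs here: the function name `port_responds` and the 3-second default timeout are my own assumptions.

```python
import socket


def port_responds(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it only checks that the port accepts connections,
    not that the service behind it answers requests correctly.
    """
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A cron job or small loop could call this against the proxy's listening address and, on a `False` result, alert or restart the service (for example `systemctl restart haproxy`, assuming HAProxy is managed by systemd).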
HAProxy forwards traffic to the ingress service. This should be running on multiple servers, but it currently is not. There is an issue which I have not resolved where communications from the second ingress service get lost, leading to the gateway not responding.
This means that when the server that runs the ingress service has to reboot, all ingress stops.
The network is broken into segments, each on a different subnet. Ceph prefers to be on a single subnet.
My solution was to use OpenVSwitch to create a virtual network for Ceph. This works great!
This adds a dependency on OpenVSwitch, which should not be an issue.
The underlying physical network depends on good routing. The reason I don’t use static routes is that some nodes have multiple paths and I want there to be multiple paths for every node. This adds a dependency on the routing stack.
Free Range Routing, or FRR, is the solution. It supports OSPF, which is the correct routing protocol for internal routing. It just works.
Unfortunately, FRR and the Linux kernel will sometimes stop talking to each other. When this happens, we lose routing on the physical network.
When we lose routing on the physical network, the OpenVSwitch network stops working.
If the OpenVSwitch network goes down, then the different Ceph nodes can’t talk to each other.
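The chain of failures described above can be written down as a small dependency map. This is only a sketch to make the cascade explicit: the layer names and the `affected_by` helper are labels I made up for illustration, not anything that runs on the cluster.

```python
# Each layer lists the layers it directly depends on, as described in the post.
# The names are illustrative labels, not real service or unit names.
DEPENDS_ON = {
    "ceph": ["openvswitch"],
    "openvswitch": ["physical-routing"],
    "physical-routing": ["frr"],
    "frr": [],
}


def affected_by(failed: str, depends_on: dict = DEPENDS_ON) -> set:
    """Return every layer that goes down when `failed` goes down."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for layer, deps in depends_on.items():
            # A layer fails as soon as any one of its dependencies has failed.
            if layer not in down and any(dep in down for dep in deps):
                down.add(layer)
                changed = True
    return down


if __name__ == "__main__":
    print(sorted(affected_by("frr")))
```

Asking what is affected by `"frr"` returns all four layers, which is why a single sick FRR instance is enough to take Ceph offline.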
All of this is to say, I’m sorry for the issues you have been seeing with this site. Thank you for hanging in there.
I had to find the sick FRRs and restart them. Once that happened, everything came back to life.