Historical one-room school, complete with dunce cap. Things have come a long way in the classroom.

Dunce of the Week

That would be me.

Everything finally came together with the new system. Then I went and messed it all up.

The motherboard has a weak onboard Ethernet port. It is only 10/100, which is NOT a problem for a management interface. When I upgrade the box to full redundancy, it will get a dual-port fiber card.

What it does mean is that connecting to the box over Wi-Fi via a USB dongle is faster than plugging into the onboard port.

Once the box was in position, I connected via Wi-Fi and finished configuration. I tested all the connectivity, and it all just worked.

At that point, I told it to join the cluster. It did so with pleasure, and brought the cluster to a stop.

Did you catch my mistake? Yeah, I left that dongle in.

At the bottom of the barrel, we have 10base-T. I have some old switches in boxes that might still support that. Above that is 100base-T, which is a good management speed: it can move data for upgrades and restores, but it is not the fastest. Some of my switches and routers do not support 100base-T.

Above that is where we start to get into “real” speeds: Gigabit Ethernet, or GigE. I’ve now moved to the next step, ports supporting 10G over fiber or cable, depending on the module I use. The next step up would be 25Gbit, and I’m not ready for that leap in cost.

Wi-Fi sits at around 200 Mbit/s: faster than “Fast Ethernet,” also known as 100base-T, but not at “real” speeds. Additionally, Wi-Fi is a shared medium, which means it doesn’t always deliver even that much.

So what happened? The Ceph (NAS) cluster is configured over an OVN logical network on 10.1.0.0/24. All Ceph nodes live on this network. Clients that consume Ceph services also attach to this network. No issues.

When you configure an OVN node, you tell the cluster what IP address to use for tunnels back to the new node. All well and good.
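
For a concrete picture of that step, here is a minimal sketch, assuming a plain Open vSwitch/OVN chassis where the tunnel endpoint is registered with ovs-vsctl. The address and the central-node URL are made-up examples; clustered stacks built on OVN ask for the same address through their own join tooling.

```python
import subprocess

# Invented example values: the address, the central node's URL, and the
# assumption of a plain Open vSwitch/OVN chassis configured with ovs-vsctl.
ENCAP_IP = "10.0.10.42"            # address the cluster should tunnel back to (invented)
OVN_CENTRAL = "tcp:10.0.10.1:6642" # OVN southbound database of the central node (invented)

# Register this node's tunnel endpoint; the other chassis will build
# Geneve tunnels back to ENCAP_IP.
subprocess.run(
    ["ovs-vsctl", "set", "open_vswitch", ".",
     f"external_ids:ovn-remote={OVN_CENTRAL}",
     "external_ids:ovn-encap-type=geneve",
     f"external_ids:ovn-encap-ip={ENCAP_IP}"],
    check=True,
)
```

The tunnel address itself was correct here; the trouble was which interface the kernel picked to reach one of those tunnel peers.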

The 10G network connection goes to the primary router and from there to the rest of the Ceph nodes. One of the subnets behind that router holds my work server, which provides 20 TB to the Ceph cluster.

On that subnet are also the wireless access points.

So the new node correctly sent packets to all the Ceph nodes via the 10G interface, EXCEPT for traffic to my work server. Why? Because the Wi-Fi dongle had put the new node directly on the work server’s subnet. Reaching that subnet over the 10G interface cost one hop (through the primary router), while the Wi-Fi path cost zero hops. By routing standards, the 200 Mbit Wi-Fi was the closer, “faster” connection than the one-hop 10G path.
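
Here is a toy Python sketch of that route choice from the new node’s side. All prefixes, addresses, and interface names are invented; the point is that both candidate routes match the work server’s subnet, and the directly connected zero-hop Wi-Fi route beats the one-hop route through the primary router.

```python
import ipaddress

# Toy routing table for the new node while the Wi-Fi dongle was still
# plugged in. Everything here is invented: "192.168.5.0/24" stands in for
# the subnet holding the work server and the wireless access points,
# "enp5s0" for the 10G port, "wlan0" for the USB Wi-Fi dongle.
routes = [
    # (destination,     gateway,     interface, hops)
    ("10.0.10.0/24",    None,        "enp5s0",  0),  # link to the primary router, on-link over 10G
    ("192.168.5.0/24",  "10.0.10.1", "enp5s0",  1),  # work-server subnet, one hop via the router
    ("192.168.5.0/24",  None,        "wlan0",   0),  # same subnet, on-link over the Wi-Fi dongle
]

def pick_route(dest: str):
    """Longest prefix match first, then the lowest hop count wins."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    return min(matches, key=lambda r: (-ipaddress.ip_network(r[0]).prefixlen, r[3]))

# Both /24 routes match the work server, but the zero-hop Wi-Fi route
# beats the one-hop 10G route, so storage traffic crawls over the dongle.
print(pick_route("192.168.5.20"))  # -> ('192.168.5.0/24', None, 'wlan0', 0)
```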

When I found the connectivity problem and recognized the issue, I unplugged the Wi-Fi dongle from the new node, and all my issues cleared up almost instantly.