It’s Late, Nerd Babble/status

We are in the process of moving from the image above to the image below.
Server room data center with rows of server racks. 3d illustration

At least in terms of what the infrastructure looks like.

Today I decommissioned an EdgeRouter 4 which features a “fanless router with a four-core, 1 GHz MIPS64 processor, 3 1Gbit RJ45 ports, and 1G SFP port.”

When they say “MIPS64” you can think of it as being in the same class as an ARM processor. Not a problem for what it is.

The issue was that there are only 1Gb interfaces. That and I’ve come to hate the configuration language.

This has been replaced with a pfSense router running on a TopTon “thing.” I call it a thing because it is from China and intended to be rebranded. It doesn’t have a real SKU.

It is based on an N100 with 4 cores and 8 threads. 2 2.5Gb Ethernet ports, 2 10Gb SFP+ ports. It can be upgraded and has multiple extras.

Besides the hardware, this is an entirely different animal in terms of what it can do. It is first, and foremost, a firewall. Everything else it does is above and beyond.

It is running NTP with a USB GPS unit attached. It runs DHCP, DNS, HAProxy, OSPF and a few other packages. The IDS/IPS system is running in notify mode at this time. That will be changed to full functionality very shortly.

So what’s the issue? The issue is that everything changed.

On the side, as I was replacing the router, I jiggled one of the Ceph servers. Jiggling it caused it to use just a few watts more, and the power supply gave out. It is a non-standard power supply, so it will be a day or two before the replacement arrives.

When I went to plug the fiber in, the fiber was too short. This required moving slack from the other end of the fiber back towards the router to have enough length where it was needed.

Having done this, plugging in the fiber gave me a dark result. I did a bit of diagnostic testing, isolated the issue to that one piece of fiber. I ran spare fiber to a different switch that was on the correct subnet, flashy lights.

Turns out that I had to degrade the fiber from the other router to work with the EdgeRouter 4. Once I took that off, the port did light off. But that was a few steps down the road.

Now the issue is that all the Wi-Fi access points have gone dark. Seems that they are not happy. This required reinstalling the control software and moving them from the old control software instance to the new one. Once that was done, I could see the error message from the access point complaining about a bad DHCP server.

After fighting this for far too long, I finally figured out that the pseudo Cisco like router was not forwarding DHCP packets within the same VLAN. I could not make it work. So I disabled the DHCP server on the new router/firewall and moved it back to the Cisco like router. Finally, Wi-Fi for the phones and everything seems to be working.

At which point I can’t log into the Vine of Liberty.

I can see the pages, I can’t log into the admin side. It is timing out.

3 hours later, I figured out that there was a bad DNS setting on the servers. The software reaches out to an external site for multiple reasons. The DNS lookup was taking so long that the connection was dropping.

I think this is an issue that I have just resolved.

But there’s more.

Even after I got the DNS cleaned up, many servers couldn’t touch base with the external monitoring servers. Why?

Routing all looked good, until things hit the firewall. Then it stopped.

Checking the rules, everything looks good. Checking from my box, everything works. It is only these servers.

Was it routing? Nope, that was working fine.

That was one thing that just worked. When I turned down the old router, the new router distributed routing information correctly and took over instantly.

So the issue is that pfSense “just works.” That is, there are default configurations that do the right thing out of the box.

One of those things is outbound firewall rules.

Anything on the LAN network is properly filtered and works.

But what is the definition of the LAN network? It is the subnet directly connected to the LAN interface(s).

Because I knew that I would need to be able to access the routers if routing goes wrong, my computer has a direct connection to the LAN Network attached to the routers. The Wi-Fi access points live in on the same subnet. So everything for my machine and the wireless devices “just worked”

The rest of the servers are on isolating subnets. That are part of the building LAN but they are not part of the “LAN Network”.

I know this, I defined an alias that contains all the building networks.

Once I added that to the firewall rules, it just worked.

Tomorrow’s tasks include more DHCP fights and moving away from Traefik. Which means making better use of the Ingress network.