Why Is It So Slow? Or How Many Bottlenecks?

My mentor, Mike, used to say, “There is always a bottleneck.”

What he meant was that every system has some place that limits its throughput. If you can find and eliminate that bottleneck, you improve the performance of the system, which then slams into the next bottleneck.

Consider this in light of traffic. It is obvious to everybody, because it happens every day, that traffic slows to a crawl just past the traffic signal where the road goes from four lanes to two. That is the point we want to optimize.

The state comes out and evaluates just how bad the bottleneck is. The money people argue, and 15 years later they widen the road.

They widen the road between the first and second signal. Traffic now clears the first traffic signal with no issues.

And the backup is now just past the second signal, where the road narrows again.

We didn’t “solve” the bottleneck; we just moved it.

With computers, there are many bottlenecks that are kept in balance. How fast can we move data to and from the network, how fast can we move data to and from mass storage, how fast can we move data from memory? These all balance.

As a concrete example, memory speed is not fixed by the speed of the socket. With more memory channels, or wider ones, you can move data faster.

If you have a fast CPU, but it is waiting for data from memory, it doesn’t matter. The CPU has to be balanced against the memory speed.
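A minimal sketch of that balance, assuming a system can be modeled as a handful of stages whose slowest member caps the whole. The stage names and rates are invented for illustration, not measurements from any real machine.

```python
def find_bottleneck(stages):
    """Return (name, rate) of the slowest stage; it caps the whole system."""
    name = min(stages, key=stages.get)
    return name, stages[name]

# Hypothetical sustained rates in MB/s, made up for this example.
stages = {"network": 1200, "storage": 500, "memory": 20000, "cpu": 3000}

first = find_bottleneck(stages)    # storage is the limit

# "Widen the road": swap in storage that is four times faster.
stages["storage"] = 2000
second = find_bottleneck(stages)   # the limit has moved to the network

print(first, second)
```

Fixing the slowest stage raised the ceiling from 500 to 1200, and the next slowest stage became the new limit.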

My mentor was at a major manufacturer, getting a tour and an introduction to their newest machine. He had an actual application that could also be used for benchmarking. One of the reasons it was a powerful benchmarking tool was that it was “embarrassingly parallel”.

In other words, if it had access to 2 CPUs, it would use them both and the process would run twice as fast. 8 CPUs? 8 times as fast. Since the organization he worked for purchased many big computers (two Crays), and he was the go-to guy for evaluating computers, his opinion meant something.

He ran his code on a two-CPU version and found it adequate. He then asked to see the actual designs for the machines, spent an hour or two poring over the design documents, and said:

“We want an 8 CPU version of this. That will match the compute (CPU) power to the memory bandwidth.”

The company wasn’t interested until they understood that the customer would pay for these custom machines.

Six months later, these 8 custom machines were in the QA bay being tested when another customer came by and inquired about them.

When they were told they were custom-builds, they pulled rank and took all 8 of them and ordered “many” more.

What happened was that my mentor identified the bottleneck. Having identified it, he removed it by adding more CPUs. The new bottleneck was no longer the lack of compute power; it was memory access speed.
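The reasoning can be sketched as a simple model: an embarrassingly parallel job scales linearly with CPU count until memory bandwidth runs out. Every number below is hypothetical and was chosen so the balance point lands at 8 CPUs. The real machine's figures are not given in the story.

```python
def throughput(cpus, per_cpu_rate, memory_bandwidth, bytes_per_item):
    """Items/s for an embarrassingly parallel job, capped by memory bandwidth."""
    compute_limit = cpus * per_cpu_rate               # scales with CPU count
    memory_limit = memory_bandwidth / bytes_per_item  # fixed by the memory system
    return min(compute_limit, memory_limit)

# Hypothetical machine: invented values, not taken from any real hardware.
PER_CPU_RATE = 1_000_000        # items/s one CPU can process
MEMORY_BANDWIDTH = 800_000_000  # bytes/s the memory system can deliver
BYTES_PER_ITEM = 100            # bytes moved per item

rates = {n: throughput(n, PER_CPU_RATE, MEMORY_BANDWIDTH, BYTES_PER_ITEM)
         for n in (2, 4, 8, 16)}

for cpus, rate in rates.items():
    print(f"{cpus:2d} CPUs: {rate:,.0f} items/s")
```

With these made-up values, going from 2 to 8 CPUs quadruples the throughput, while going from 8 to 16 buys nothing because memory is now the limit.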

The Tight Wire Balancing Act

I deal with systems of systems. It is one of the things that I was trained in. I.e., actual classes and instruction.

Most people have no idea of how complex a modern Internet service is. I.e., a website.

This site is relatively simple. It consists of a pair of load balancers sitting in front of an ingress server. The ingress server runs in a replicated container on a clustered set of container servers. The application has a web service provider that handles assets and delegates execution to an execution engine.

This runs a framework (WordPress) under PHP. On top of that is layered my custom code.

The Framework needs access to a database engine. That engine could be unique to just this project, but that is a waste of resources and does not allow for replication. So the DB Engine is a separate system.

The DB could run as a cluster, but that would slow it down and add a level of complexity that I’m not interested in supporting.

The DB is then replicated to two slaves with constant monitoring. If the master database engine goes offline, the monitors promote one of the slaves to be the new master. They then isolate the old master so it does not think it is still the master.

In addition, the non-promoted slave is pointed at the new master to replicate.

I wish it was that simple, but the monitors also need to reconfigure the load balancers to direct database traffic to the new master.
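The failover sequence can be sketched as a small in-memory simulation. This is not the monitoring software itself: the `Node` class, the `fail_over` function, the node names, and the `db_backend` key are all hypothetical stand-ins for whatever the real monitors and load balancers use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    role: str                        # "master", "slave", or "fenced"
    replicates_from: Optional[str] = None

def fail_over(nodes, balancer):
    """Promote a slave after the master fails; updates nodes and balancer in place."""
    old_master = next(n for n in nodes if n.role == "master")
    slaves = [n for n in nodes if n.role == "slave"]
    # Simplification: take the first slave. A real monitor would likely
    # prefer the one that is most caught up.
    new_master = slaves[0]

    # 1. Promote one slave to master.
    new_master.role = "master"
    new_master.replicates_from = None
    # 2. Isolate the old master so it cannot keep acting as the master.
    old_master.role = "fenced"
    # 3. Point the remaining slave at the new master.
    for node in slaves[1:]:
        node.replicates_from = new_master.name
    # 4. Redirect database traffic at the load balancer.
    balancer["db_backend"] = new_master.name
    return new_master

nodes = [Node("db1", "master"),
         Node("db2", "slave", "db1"),
         Node("db3", "slave", "db1")]
balancer = {"db_backend": "db1"}

promoted = fail_over(nodes, balancer)
print(promoted.name, balancer)
```

Even in this toy version there are four separate state changes, and getting the order wrong in a real system risks two nodes both believing they are the master.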

And all of this must be transparent to the website.

One of the issues I have been having recently is that in the process of making the systems more reliable, I’ve been breaking them. It sounds stupid, but it happens.

So one of the balancing acts is balancing redundancy against complexity and against security.

As another example, my network is physically secured. I am examining the option of running all my OVN tunnels over IPsec. This would encrypt all traffic, which adds a CPU load. How much will IPsec “cost” on a 10 Gigabit connection?
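A back-of-the-envelope way to frame that question, assuming the cost is dominated by how many bytes per second one core can encrypt. The per-core figure below is a placeholder, not a benchmark. It has to be measured on the actual hardware, and it covers one direction of traffic only.

```python
import math

def cores_for_ipsec(link_gbit_per_s, per_core_gbyte_per_s):
    """Cores needed to encrypt one direction of a link at full line rate."""
    link_gbyte_per_s = link_gbit_per_s / 8   # 10 Gbit/s is 1.25 GB/s
    return math.ceil(link_gbyte_per_s / per_core_gbyte_per_s)

# Hypothetical per-core encryption throughput in GB/s; replace with a measured value.
ASSUMED_PER_CORE = 0.5

cores = cores_for_ipsec(10, ASSUMED_PER_CORE)
print(f"Roughly {cores} core(s) to saturate a 10 Gbit/s link in one direction")
```

If the measured per-core rate turns out to be higher than the link rate, the answer drops to a single core. If it is much lower, encryption itself becomes the new bottleneck.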

Should my database engines be using SSDs or spinning rust? Should they be using a shared filesystem, allowing the engine to move to different servers/nodes?

It is all a balancing act.

And every decision moves the bottlenecks.

Some bottlenecks are hard to spot. Is it a slow disk or is it slow SATA links or is it slow network speed?

Is it the number of disks? Would it be faster to have 3 8TB drives or 2 12TB drives? Or maybe 4 6TB drives? Any more than 4 and there can be issues.
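One way to compare those layouts, as a rough sketch: all three give the same raw capacity, so the difference is how many drives share the work. The per-drive and link rates are invented placeholders, and the model assumes ideal scaling across drives with no redundancy overhead.

```python
def describe(count, size_tb, per_disk_mb_s, link_mb_s):
    """Return (raw capacity in TB, usable throughput in MB/s) for a drive layout."""
    raw_tb = count * size_tb
    aggregate_mb_s = count * per_disk_mb_s       # ideal scaling across drives
    return raw_tb, min(aggregate_mb_s, link_mb_s)

# Hypothetical rates, for illustration only.
PER_DISK_MB_S = 200   # sustained rate of one drive
LINK_MB_S = 1250      # 10 Gbit/s path to the clients

layouts = {"3 x 8TB": (3, 8), "2 x 12TB": (2, 12), "4 x 6TB": (4, 6)}
results = {name: describe(count, size, PER_DISK_MB_S, LINK_MB_S)
           for name, (count, size) in layouts.items()}

for name, (raw_tb, mb_s) in results.items():
    print(f"{name}: {raw_tb} TB raw, about {mb_s} MB/s")
```

Under these assumptions more, smaller drives win on throughput until the link becomes the cap, which is the same pattern of one limit handing off to the next.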

Are we CPU bound or memory bound? Will we get a speedup if we add more memory?

Conclusion

I have so many bottles in the air I can’t count them all. It requires some hard thinking to get all the infrastructure “right”.