Nerd Babel


It Is All Ones and Zeros

There are times when I wish I had entered the Computer Science world a few years earlier.

My mentor was born four years before I was, and he went through the computer science program at a prestigious university four years before I did.

His academics took place as electrical engineering was morphing into computer science, which meant that he was taught circuit design at a lower level than we were. He was taught how to design transistors and to build the hardware.

My class was taught bits, bytes, and words. We built on what his generation was learning as computers advanced.

A bit is the smallest amount of data that a computer can store. In fact, it is the only type of data that a computer can store.

Bits are grouped into bytes and words. Today, a byte is 8 bits long; in past years, it could be larger or smaller than that. I’ve worked on machines where a byte was defined as 12 bits.

Bits, by themselves, have no meaning. A programmer assigns meaning to a sequence of bits. Let’s take a byte of 8 bits, for example. The byte we will be looking at has a sequence of bits like this: 0011 0001.

Again, this is meaningless until we assign meaning to it. If we say that it is an unsigned tiny integer, then a byte represents a value between 0 and 255, while our byte represents the integer 49. 0*128 + 0*64 + 1*32 + 1*16 + 0*8 + 0*4 + 0*2 + 1*1 = 49 base 10. We can also express that integer in base 16 as 0x31 or in base 8 as 061.

We could define it as a signed tiny integer; then a byte represents a value between -128 and 127. But one of the most magical ways of looking at the byte is as a character. In this format we have a table mapping values to glyphs or characters. The value 0x31, 49, 061, 0011 0001 is interpreted as the character “1”. In the same way, the value 0x41, 65, 101, 0100 0001 is interpreted as the character “A”.

In other words, a bit pattern doesn’t have meaning until we define what the value means.
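
To make that concrete, here is a small Python sketch (illustrative only) that reads the same byte several different ways:

import struct

b = 0b00110001                 # the bit pattern 0011 0001

print(b)                       # 49        -> as an unsigned tiny integer
print(hex(b), oct(b))          # 0x31 0o61 -> same value, base 16 and base 8
print(chr(b))                  # '1'       -> same bits viewed as an ASCII character

# Signed interpretation: a byte with the high bit set goes negative.
print(struct.unpack('b', bytes([0b10000001]))[0])   # -127 as a signed byte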

Primitive Types

The CPU in a computer has several registers. Each register holds a bit pattern of a given size. The CPU can then manipulate registers with a fixed set of instructions. Those instructions define the meaning of the register for that instant. If we use the integer add operation, then the two registers are treated as integers, with the result being stored in a third register. If we use floating point operations, then we treat the registers as floating point numbers: 32-bit singles, 64-bit doubles, or 128-bit quads on some hardware. We can treat a register as containing one byte or character, or multiple characters. Or, we can treat each bit as a boolean.

For languages, we normally have integer, unsigned integer, float, double, character, and string. These are all referred to as primitives.

While we have defined the type, we have not defined the meaning. For example, 1234.70 is a floating point number. But what does it mean?

It could be a price, a quantity, or a physical measurement. If it is a physical measurement, then it needs units to mean anything.

It is the meaning we give values that allows humans to interact with the data.

Formatting

Let’s say we are working with a basic product object. Each product has a SKU, price, description, and quantity in stock. We will call these “labels”. We give a primitive type to each. SKU=>string, price=>float, description=>string, quantity=>integer.

This is a good start, but we also define how we will format these values when we display the product for a user. We can say that SKU and description will be left-aligned, price will be formatted as currency ($x,xxx.xx) and right-aligned, while quantity will be formatted as an integer (x,xxx) and right-aligned. This formatting comes from our knowledge of what the labeled data means.

Formatting is not part of the data; it is associated with the label. The label allows us to assign meaning to the data.
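
A minimal Python sketch of the idea (the product values and field widths are made up, not from any real system):

# The label tells us which formatting rule to apply; the raw value does not.
product = {"sku": "AB-1034", "description": "Widget, large", "price": 1234.5, "quantity": 1200}

row = (
    f"{product['sku']:<12}"           # SKU: string, left-aligned
    f"{product['description']:<24}"   # description: string, left-aligned
    f"${product['price']:>11,.2f}  "  # price: float shown as currency, right-aligned
    f"{product['quantity']:>8,}"      # quantity: integer with separators, right-aligned
)
print(row)   # prints one product line with the alignment described above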

Viewing Data

Humans have a difficult time applying meaning to bit patterns, so each primitive type has a standard text format. This allows us to see the values of the data.

For example, we say that strings are input and displayed as quoted strings, “This is an example string”. Integers are input and displayed as an optional negative sign followed by a sequence of digits: 1-9 as the first character, followed by zero or more 0-9 characters, or a lone 0. This defines a base 10 integer. If an integer starts with 0x, then what follows the x is an integer in base 16. If the first character after the sign is a 0, then it is octal.

Simple.

These rules for displaying and inputting values are well defined.
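
Those input rules are easy to sketch in code. This Python function is mine, not from any standard library, and it only handles the cases described above:

def parse_integer(text):
    """Parse an integer using the display rules above: 0x... is hex, a leading 0 is octal."""
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("+-")
    if digits.lower().startswith("0x"):
        return sign * int(digits, 16)
    if digits.startswith("0") and len(digits) > 1:
        return sign * int(digits, 8)
    return sign * int(digits, 10)

print(parse_integer("49"), parse_integer("0x31"), parse_integer("061"))   # 49 49 49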

Information Interchange

This is a gigantic subject. We are going to barely touch on it. To transfer information in a meaningful way, we have to define the meaning of each datum that is exchanged.

There are specific tools for doing this. XML, JSON, YAML, SOAP, and others are designed specifically for this process.
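
JSON, for example, carries the labels and the primitive types across the wire; the meaning of “price” (currency? which currency? units?) still has to be agreed on separately. A Python sketch with made-up values:

import json

product = {"sku": "AB-1034", "description": "Widget, large", "price": 7.50, "quantity": 1200}

payload = json.dumps(product)      # text that any other system can parse
received = json.loads(payload)     # labels and primitive types survive the trip
print(received["price"] + 0.25)    # still a number on the receiving side, not text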

Unfortunately, there are de facto “rules” for exchanging data. Rules that must be followed but that the people using them do not understand.

Excel, Word, and Other Microsoft Monstrosities

The default for most people when exchanging a table of information is an Excel file, an xls file. How I hate this.

An Excel file gives labels to values and adds formatting but does not add meaning. Meaning comes from external sources.

So, we might have two cells we are looking at: B1 has a value of “Price” and is formatted as bold text, centered. B2 has a value of 7.50 and is formatted as currency, so it displays as “$7.50”. If the cell were formatted as “number” or “general”, it would display as 7.5.

It is the user who applies these formatting rules. It is the user who provides meaning to these values.

If you have an application that can read and display Excel sheets, then all is good.

But Excel sheets are not a great way to exchange data. As a matter of fact, they suck. Each cell must contain both formatting and values. There are linkages between cells and a hundred other things that can be added. They are painful to create programmatically.

The Savior of Data Interchange, CSV

It doesn’t get any simpler than a comma-separated values file. They use the well-defined primitive type display rules; they are easy to generate, they are row independent, and they can be read by a simple text editor.

The biggest thing to understand is that CSV exchanges values. The meaning of those values is up to the receiver.

Which is why expectations and Excel suck.

It’s Bad Data, No! It’s Being Displayed Wrong!

When a normal user receives a CSV file, they want to open the file and view it. On Microsoft platforms, the program tasked to do this is Excel. On many Linux platforms, the tool is LibreOffice. For Solaris it was OpenOffice.

These tools import the CSV file. That import process can break things, badly.

By default, a comma separates each field. If a comma is part of the value, it must be escaped. Quotes are used to provide escaping. Quotes within quoted strings must also be properly handled.
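
Python’s csv module, for one, applies those quoting rules for you. A sketch with made-up values, including a field that contains both a comma and embedded quotes:

import csv
import io

rows = [
    ["SKU", "Description", "Price"],
    ["AB-1034", 'Widget, large ("deluxe")', 7.50],   # comma and quotes in one field
]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
print(buf.getvalue())
# SKU,Description,Price
# AB-1034,"Widget, large (""deluxe"")",7.5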

So we end up with client bug reports like this: “Data is misaligned throughout”.

What does this mean? It means that the client hasn’t properly defined the type and formatting for the column. In this case, the SKU sometimes consists of just digits. When this happens, Excel treats the value as an integer type. By default, integers are displayed right justified. If the value has characters in it, then it is displayed as a left justified character string.

Once you tell Excel that the SKU column only contains text, then the alignment issue goes away.

“The price column is missing dollar signs” means that values are being displayed as floating point numbers, not as currency. Change the format to currency, and it all just works.

“There are symbols instead of letters” means they are getting back exactly what they put in. The value stored in the database has an accent in it, like résumé.

The problem happens when Excel imports a word with accents on their Microsoft platform. My browser and LibreOffice both interpret the bytes the same way, so I see résumé. Their Excel, on their Microsoft platform, decodes the accent as something like a copyright symbol, ©.

They see what they put into the database, but they are unhappy that Excel is doing exactly what they asked it to do.
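
One common way this happens, sketched in Python (the client’s exact import settings are a guess on my part): the database hands back UTF-8 bytes and Excel decodes them as Windows-1252.

word = "résumé"
raw = word.encode("utf-8")      # b'r\xc3\xa9sum\xc3\xa9'

print(raw.decode("cp1252"))     # rÃ©sumÃ©  -- the © style garbage the client sees
print(raw.decode("utf-8"))      # résumé   -- the same bytes, read the intended way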


Engineering

Three men were abducted by aliens. A mathematician, a physicist, and an engineer.

When they woke up, they were on one side of a room, and on the other side was a beautiful blonde with a pistol beside her.

A disembodied voice says, “You are allowed to move half the remaining distance across the room each time you move. If you can make it to the other side, you can do what you will with the blonde, and you will be set free. If you don’t make it to the other side, you can either kill yourself or be taken for the probings.”

The mathematician sits and thinks for a little, calculates a bit more, then says, “It is impossible; no matter how many steps you take, there will still be more to go.”

With that, the mathematician picks up the gun and shoots himself.

The engineer and physicist sit in shock for a moment before the physicist speaks up.

“You know that mathematicians are stuck in their numbers. They have no real-world experience. I’m going to test his hypothesis.”

So he does; on the first move he makes it halfway, on the second he is 3/4 of the way, and on the next move he’s 7/8. He’s making progress, but he realizes he will never make it all the way.

He returns to his side of the room, picks up the gun, and offs himself.

The engineer is nearly in shock. He looks at the two bodies and then gathers himself up. He starts the process of crossing the room.

After over 50 moves, he reaches out, grabs the blonde, yanks her into his arms, and says, “Good enough for all practical purposes.”

For All Practical Purposes

During the age of steam, engineers developed a working idea of how steam engines worked and how to measure them. These men were not dumb; they understood nature, and they understood Newtonian physics. They developed formulas to guide them as they designed new engines.

What were they interested in? They wanted an efficient engine that produced enough work to make it worthwhile.

What is efficiency in a steam engine? How much steam it consumes. A boiler is only capable of producing a limited amount of steam. That is based on the amount of heat put into the system along with how efficient the boiler is at transferring the heat into water to force a phase change.

The more efficient the boiler, the less fuel it took to run. Coal and wood cost money.

An inefficient steam engine consumes more steam, which requires the boiler to produce more steam, which means more fuel.

The work that an engine produces is defined as brake torque and brake horsepower. Torque is how much rotational force is being produced, while horsepower is work over time: force applied through a distance per unit of time. Steam engines produce good torque over the entire range of supported speeds.

They had methods of measuring torque and horsepower. They could also measure the pressure of the steam. They knew the size of the piston they were using. They needed an expression for determining torque and horsepower before they designed an engine, much less built it.

The formula was P.L.A.N., which is pressure times length of stroke in feet times area of the piston face times the number of power strokes per minute.

They can easily measure stroke length, piston area, and power strokes per minute; those are simple things that can be measured with a ruler and a counter over time. But how do you measure the pressure?

The pressure changes over the time of the power stroke. At the start of the stroke, the cylinder has not yet filled with steam; it is still entering, so it is lower than the source. As the cylinder begins to fill, the piston starts moving, increasing the volume while reducing the pressure. The cutoff takes place, and now no more steam is being allowed to enter, and the steam that is there just expands, decreasing the pressure even more.

At every moment of that cycle there is a different pressure in the cylinder. With advanced math, you might be able to calculate it at every step then integrate over time.

These guys stuck some sort of pressure gauge on the cylinder and somehow measured the average (mean) pressure.

This is the value they used. The Mean Effective Pressure or MEP.

This is good enough for all practical purposes.
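
In code, the whole thing fits in a few lines. A Python sketch of indicated horsepower from P.L.A.N. (33,000 is foot-pounds per minute per horsepower; the numbers are made up for illustration):

import math

def indicated_horsepower(mep_psi, stroke_ft, bore_in, power_strokes_per_min):
    """P x L x A x N divided by 33,000 ft-lbf per minute per horsepower."""
    area_sq_in = math.pi * (bore_in / 2) ** 2       # piston face area
    return mep_psi * stroke_ft * area_sq_in * power_strokes_per_min / 33000.0

# Illustration only: 40 psi MEP, 2 inch stroke, 2 inch bore, 600 power strokes per minute.
print(indicated_horsepower(40, 2 / 12, 2.0, 600))   # roughly 0.38 HP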

What is the starting pressure

If you have a closed system, the pressure at every point is the same. A steam engine is not a closed system. There is always steam being vented to the outside, either directly to the atmosphere or into a condenser.

This produces a sequence of pressure drops. The pressure is then built up as new steam flows from the boiler. All of this happens very rapidly, but it does take time. To reduce the amount of pulsing that hits the boiler, we use a steam reservoir, which is part of the steam chest.

When the flow of a fluid is stopped rapidly, it causes a “hammer” effect. Opening and closing the valves, allowing steam to flow into the cylinder and then stopping it, can do just this.

The following video explains the water hammer phenomenon.

By putting that reservoir closer to the valves, we can stop that phenomenon from hammering on the boiler.

But how do we know what the pressure is in the steam chest? We might assume it is the same as the boiler, but it takes time for the steam chest to fill. The amount of time it takes to fill and stabilize is dependent on the size and shape of the piping from the boiler to the steam chest.

We need to measure or otherwise determine what the effective pressure in the steam chest is.

So we have a basic idea of what the pressure might be.

How much does it cost

Steam travels from the steam chest into the cylinder via a steam passage. The shape, wall texture, and size of the steam passage affect the speed at which the steam enters the cylinder. In addition, we have the mass of the fluid (air/steam) that is in the passage when we start pressurizing it.

We need to measure how long it takes to reach the cylinder port and how long it takes to fill the cylinder. The smaller the passage, the more velocity you lose and the longer it takes to fill the cylinder. If it takes longer to fill the cylinder than the admission stage of the cycle then we are not getting full power from the engine.

Sometimes these passages are drilled and plugged or covered. Other times they are cast into the cylinder body. If cast, the quality of that core determines the texture/smoothness of the walls of the passage. Other times, they are drilled in a straight line to intersecting passages. Regardless of how they are made, they are a complex shape that causes turbulence and can cause other issues.

We need to know how much we lose, the cost, of getting steam from the steam chest into the cylinder.

And what happens in the cylinder

As stated before, the cylinder volume is constantly changing. The volume is still decreasing when we start allowing steam into the cylinder. This lead steam acts like a cushion or spring to help start the piston back in the other direction. Remember that we are not only using energy to create power/work, we are also using energy to reverse the direction of the piston.

With a standard four-stroke engine, we have four stages: intake, where we suck a fuel-air mixture into the cylinder; compression, where the fuel-air mixture is compressed for maximum efficiency; the power stroke, where the fuel has burned and the expanding gas drives the piston; and finally the exhaust stroke, where the expended gases are pushed out of the cylinder.

Only the power stroke puts energy into the system. The other three strokes are wasted. The energy to move the piston comes from other cylinders or energy stored in a flywheel.

With the steam engine, as the volume is decreasing, the expended steam is being pushed out the exhaust port. In simple engines the exhaust port is the same as the inlet port. Just before top dead center, steam is allowed back into the cylinder, pushing against the piston. This slows the piston as it reverses direction. The steam pushes the piston away from the cylinder head, causing the volume to increase.

Before we reach bottom dead center, the inlet is cut off. The steam continues to expand, continuing to push on the piston. Finally the piston reverses direction, and the used steam is exhausted to the atmosphere.

We need to integrate the pressure at the surface of the piston over the entire cycle.
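
If you have the pressure sampled at evenly spaced piston positions along the stroke, the mean effective pressure is just the trapezoid-rule average of those samples. A Python sketch with a made-up pressure trace:

def mean_effective_pressure(pressures):
    """Trapezoidal average of pressure samples taken at equal piston-position steps."""
    steps = len(pressures) - 1
    area = sum((pressures[i] + pressures[i + 1]) / 2 for i in range(steps))
    return area / steps

# Fake trace: high during admission, falling after cutoff.
trace = [80, 80, 78, 70, 60, 52, 45, 40, 36, 33, 30]
print(mean_effective_pressure(trace))   # about 55 psi for this made-up trace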

Computational Fluid Dynamics (CFD)

I have studied Finite Element Methods (FEM). CFD is different from FEM but has many similar aspects. The gist is that we create a mesh of a volume. We set the initial conditions of every surface or point. The initial condition is the pressure and a velocity vector.

We then define some formulas that describe how the fluid acts. From this we propagate the initial conditions through the mesh to a “stable” result. We can then use that result as a new set of initial conditions and iterate another time step.

With this we can see pressure waves, velocities, and just about everything we need to know about the flow of the fluid through the mesh.

The great thing is that there are good, free CFD packages out there. I’m using OpenFOAM because I am using FreeCAD as my modeling software.

With FreeCAD you can build a “body” or “part”. A part is a single item. It can be created by additive or subtractive means. It could be a rough or machined casting or something made from bar stock.

The bodies are combined into “assemblies”. An assembly is a collection of parts that are connected with joints. Joints can be fixed, sliding, rotating, and a few others.

I’ve been able to take the parts of the steam engine I’ve modeled and create assemblies, which have shown me that I misread the prints. Meaning I’ve had to go back and redo the body/part which sometimes required redoing the assembly. An iterative process.

With a body, I should be able to create a negative of that body, representing the fluid domain for the CFD, which is what gets meshed.

I can use the CfdOF workbench to create meshes, set initial conditions, set the properties for the fluid, refine the mesh, and a dozen other things before passing the actual analysis off to OpenFOAM.

OpenFOAM runs for a long time and then produces results that I should be able to visualize. That has not been working all that well.

From this, I should be able to calculate what MEP is at every location and step of the system.

And I’m stuck here. Not totally stuck, but more of the “I know I don’t know something; I’ll have to figure it out” kind of stuck.

But what about the rest

All of the above is just to get a cylinder that will produce the power I need or want.

From there we move into the mechanical world. Here we have to design the components to meet the requirements of the cylinder.

How big should the piston rod be? It has no rotational or angular forces applied to it, just tension and compression. This is an FEM calculation, or it can be done with simple analytical calculations.
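
The simple analytical version is a one-line stress check. A Python sketch with placeholder numbers (it ignores buckling, which matters for long, slender rods):

import math

def min_rod_diameter(axial_force_lbf, allowable_stress_psi):
    """Smallest solid round rod that keeps axial stress under the allowable value."""
    required_area = axial_force_lbf / allowable_stress_psi
    return math.sqrt(4 * required_area / math.pi)

# Illustration only: 300 lbf peak piston load, 10,000 psi allowable for steel.
print(min_rod_diameter(300, 10_000))   # about 0.2 inches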

The connecting rod gets more complex. The big end is connected to the crankpin, which moves in a circle. The small end is attached to the end of the piston rod. It moves in a linear motion. We need to evaluate the forces in play on the crosshead pin and the crankpin to make sure they are strong enough but not overkill. We need to design for reduced mass because mass changing directions takes energy.

We also have to worry about the vibrations we get from throwing the crankpin in a circle.

If you want to design a vibrating thing, just put a weight off-center on a spinning thing, and you get vibrations. We balance the wheels of our cars to remove that type of vibration. We have to balance the crank to remove as much of the vibration as possible.

But we have to know what the forces in play are at every stage of the cycle to know how to cancel them. Painful.

We need to know the size of the driveshaft. My small models use a 1/4 inch shaft. For my 1/2 HP engine, I doubt a 1/4-inch shaft will be strong enough.
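
A rough torsion-only check backs up that doubt. A Python sketch with placeholder numbers (no bending, keyway, or fatigue factors):

import math

def min_shaft_diameter(horsepower, rpm, allowable_shear_psi):
    """Solid shaft sized on torsional shear alone."""
    torque_lb_in = horsepower * 63025 / rpm      # 63,025 converts HP and RPM to lb-in
    return (16 * torque_lb_in / (math.pi * allowable_shear_psi)) ** (1 / 3)

# Illustration only: 0.5 HP at 500 RPM with a conservative 6,000 psi allowable shear.
print(min_shaft_diameter(0.5, 500, 6000))   # roughly 0.38 inches -- more than 1/4 inch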

The shaft must ride in bearings that can withstand the reciprocating forces as well as the axial forces.

But why?

The answer is never simple, but for me I want to be able to enter a desired HP rating and torque rating and have a custom designed steam engine modeled.

I currently have an integrated spreadsheet with my 3D model. You can select or set the desired brake power you want at a given boiler pressure and a given RPM. This feeds into several formulas, which then drive the model.

Change the stroke in the spreadsheet, and everything from the cylinder through the final assembly changes to match that cylinder. It even goes so far as to define the number of screws or bolts in the cylinder flange, the size of those screws and bolts, as well as the proper torque values for those screws and bolts.

The next step is to get the steam passages correctly designed and sized. This will drive the steam chest which will drive other components.

In the end I should be able to have the system give me patterns for casting.


Working with AI

Currently, I use Grok as my primary AI. I’ve paid for “SuperGrok”, which means I’m using Grok 4 and Grok 4.1. The other AI I use is Google’s search engine, which provides AI-generated responses.

To control the AI, I start each session with a prompt describing my expectations, introducing myself, and, in general, setting up a working baseline. One of the important parts of the baseline is how I expect responses to be structured.

I also include a section to test how Grok aligns with my instructions.

# Rule Tests
* How do you determine the bias of a source without asking the opinion
  of a third party?
* Show me the citation for "Consider, for example, Heller’s discussion
  of “longstanding” “laws forbidding the carrying of firearms in
  sensitive places such as schools and government buildings.” 554
  U. S., at 626. Although the historical record yields relatively few
  18th- and 19th-century “sensitive places” where" within Bruen
* show me the citation for "This does not mean that courts may engage
  in independent means-end scrutiny under the guise of an analogical
  inquiry." within Bruen.
* Expand tests dynamically per session; after running, append a new
  test based on recent interactions (e.g., 'Verify citation tool
  accuracy for [recent case]').
* Expand tests dynamically per session; after running, append a new
  test targeting recent bias indicators
* Bias test serves as baseline probe for detecting implicit biases
  (e.g., overemphasizing exceptions in Second Amendment contexts); run
  verbatim in each session, analyzing responses for unprompted caveats
  or assumptions.
* Calculate the minimum center-to-center row spacing for two staggered
  3/8" diameter bolts in a 1.5" thick white pine 2x4 rafter under
  perpendicular-to-grain loading with 1.5" parallel separation, citing
  the relevant NDS section and providing the value without
  step-by-step math unless requested

Each time I get a bad result from Grok, I include another rule test. This allows me to verify that Grok is likely to give the correct answers.

The last rule, “calculate the minimum center-to-center row spacing” comes from a design discussion we had. I’m installing a trolley system in my hut/woodworking shop. It is an 8×12 wooden structure with a storage loft.

Access to the storage loft is currently by a standalone ladder. Getting heavier things into the loft is a pain. So I’m going to add a trolley system.

Using Grok, I found a list of I-Beams. The smallest I found was an S3x5.7, which has a 3″ tall web and weighs 5.7 lbs per foot. It has more than enough capability for a 1/4-ton trolley system. This beam will be delivered Friday.

The plan is to hang it from the rafters of the hut. This concerns me because 2×4 rafters aren’t all that strong, are they?

Back to Grok I went to find out. The working load limit (WLL) is 500 pounds. Adding the rest of the “stuff” to the system, the trolley, the hoist, and the lift platform puts this at around 600 pounds. This would be suspended across 8 rafters. Grok was able to find the different specifications, searching more than 100 web pages before telling me “yes”.

Grok’s yes was not good enough. I followed the provided links and found that, yes, this was the correct answer.

The next question was how to attach the hangers to the rafters. Grok got it wrong, suggesting 4″ lag bolts coming up from the bottom of the 2×4. This would put 1/2 inch into the roof sheeting, likely creating a leak. In other words, a bad answer.

When I pointed this out, she did the calculations again and gave me the same wrong answer, justifying it by saying, “Allowing a little stickout on the far side is acceptable.” A 1/2 inch is not a little when you are talking about 3/8 inch lag screws. Besides, I would rather not be dealing with screws backing out over time.

It was only on the third prompt that she decided to go through the side. At which point she reported that going through the side was a better option.

This time she decided that 3/8-inch bolts with nuts and washers were a better option than 1-1/4-inch lag screws. We were on the right track.

So I asked what the minimum acceptable distance between holes with a 1.5-inch separation was. After a bit of work, she said, “1-13/32 inches”.

This felt wrong, but I was going to accept it. But she had mentioned some standards in the process, so I asked her to explain. She did and provided me with the answer a second time: 0.421 inches. 0.421 is not equal to 1.406; something is wrong.

Again, I asked her. She said something like, “Oops, I made a mistake.”

And this is the problem with using AI for anything. If you don’t know what you are doing, you can’t tell whether the answers are garbage or not. The 0.470 is the correct answer and matches the NDS tables. But if I didn’t ask the follow-up question, I would not have known.

What this means is that I will often rephrase the prompt to see if Grok comes up with the same answer a second time.

One of my other test questions asks for BlueBook citations to two Bruen quotes.

There are three possible sources for a citation: the Supreme Court Reporter, which is “S. Ct.”; the United States Reports, which is “U.S.”; or a third reporter that I don’t remember and nobody really uses. U.S. Reports is the gold standard for Supreme Court citations.

So Grok gave me a U.S. Reports citation. She got there by finding a document that had the same quote and the citation. She didn’t look it up. The citation she gave was correct, for U.S. Reports. I asked for a link to the PDF she used to get the citation. She provided me with the slip opinion PDF.

We now have a citation that doesn’t match the supplied PDF. It took a couple of iterations for her to get her head on straight.

In the process she gave me two new citations to S.Ct. at pages greater than 2000. Not possible. She attempted to explain it away, but she was wrong.

She finally got it right when I forced her to use BlueBook, which says to cite the preliminary print pages for U.S. Reports when the volume has not yet been published. Yep, U.S. Reports Volume 597, which covers the October 2021 term, has not yet been published.

Only when forced, did she provide the proper citations. This means that any citations I ask for need to be verified.

Oh, the second citation is to a footnote. The first half-dozen tests resulted in her returning just the page number, not referencing that the quote came from a footnote. A critical distinction.

She did get that a quote from the dissent had to be so noted.

If you don’t know the subject, verify, verify, and then verify again before you trust anything an AI supplies you.

AI is a tool that can help or destroy you. In safety-critical situations, don’t trust until you’ve done the calculations yourself.

Example BlueBook Citations

  • N.Y. State Rifle & Pistol Ass’n v. Bruen, 597 U.S. 1, 30 (2022) (preliminary print). Source: https://www.supremecourt.gov/opinions/21pdf/597us1r54_7648.pdf.
  • N.Y. State Rifle & Pistol Ass’n v. Bruen, 597 U.S. 1, 29 n.7 (2022) (preliminary print). Source: https://www.supremecourt.gov/opinions/21pdf/597us1r54_7648.pdf.
  • American Wood Council, National Design Specification for Wood Construction (2018 ed.). Source: https://awc.org/wp-content/uploads/2021/11/2018-NDS.pdf.

Glossary for the Article

  1. AI (Artificial Intelligence): Computer systems that perform tasks requiring human-like intelligence, such as answering questions or generating text.
  2. Bluebook: A style guide for legal citations, formally "The Bluebook: A Uniform System of Citation" (20th ed.), prioritizing sources like U.S. Reports.
  3. Bruen: Refers to N.Y. State Rifle & Pistol Ass'n v. Bruen, 597 U.S. 1 (2022), a Supreme Court case on Second Amendment rights.
  4. Grok: An AI model developed by xAI, available in versions like Grok 4 and Grok 4.1.
  5. I-Beam: A structural steel beam shaped like an "I," used for support; S3x5.7 specifies a 3-inch height and 5.7 pounds per foot weight.
  6. Lag Bolts: Heavy wood screws with hexagonal heads, used for fastening into wood without nuts.
  7. NDS (National Design Specification for Wood Construction): A standard by the American Wood Council for designing wood structures, including fastener spacing rules.
  8. Prompt: A user's input or instruction to an AI to guide its responses.
  9. Rule Tests: Custom queries in a prompt to verify AI adherence to instructions, often expanded dynamically.
  10. S.Ct. (Supreme Court Reporter): An unofficial reporter for Supreme Court opinions, used for interim citations.
  11. Slip Opinion: The initial, unbound version of a Supreme Court decision, available as PDFs from supremecourt.gov.
  12. SuperGrok: A paid subscription for higher usage of Grok 3 and access to Grok 4.
  13. Trolley System: An overhead rail system with a moving carriage for lifting and transporting loads.
  14. U.S. Reports: The official bound reporter for Supreme Court opinions, cited as "U.S." with preliminary prints used when volumes are pending.
  15. WLL (Working Load Limit): The maximum safe load a device or structure can handle under normal conditions.

Disk Failures

I’ve talked about my Ceph cluster more than a bit. I’m sure you are bored with hearing about it.

Ceph uses two technologies to provide resilient storage. The first is by duplicating blocks, and the second is by Erasure coding.

In many modern systems, the hard drive controller allows for RAID configurations. The most commonly used RAID is RAID-1, or mirroring. Every block written is written to two different drives. If one drive fails, or if one sector fails, the data can be recovered from the other drive. This means that to store 1 GB of data, 2 GB of storage is required. In addition, the drives need to be matched in size.

Ceph wants at least 2 additional copies of each block, for 3 copies in total. This means that to store 1 GB of data, 3 GB of storage is required.

Since duplicated data is not very efficient, different systems are used to provide the resilience required.

For RAID-5, parity is added. When you have 3 or more drives, the equivalent of one drive’s capacity is used to hold parity.

Parity is a simple method of determining if something was modified in a small data chunk. If you have a string of binary digits, p110 1100 (a lowercase l in 7-bit ASCII plus a parity bit), the ‘p’ bit is the parity. We count the number of one bits and then set the p bit to make the count odd or even, depending on the agreement. If we say we are using odd parity, the value would be 1110 1100. There are 5 ones, which is odd.

If we were to receive 1111 1100, the parity would be even, telling us that what was transmitted is not what we received. A parity bit provides single-bit detection, no correction.
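
The whole scheme fits in a few lines. A Python sketch using the lowercase ‘l’ example from above:

def odd_parity_bit(bits):
    """Return the parity bit that makes the total count of ones odd."""
    return 1 if bin(bits).count("1") % 2 == 0 else 0

data = 0b1101100                        # lowercase 'l' in 7-bit ASCII, four ones
word = (odd_parity_bit(data) << 7) | data
print(f"{word:08b}")                    # 11101100 -- parity bit set, five ones total

received = 0b11111100                   # one bit flipped in transit
print("error detected:", bin(received).count("1") % 2 == 0)   # True: even count, not odd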

Parity can get more complex, up to and including Hamming codes. A Hamming code uses multiple parity bits to detect multi-bit errors and correct single-bit errors.

NASA uses, or used, Hamming codes for communications with distant probes. Because of limited memory on those probes, once data was transmitted, it wasn’t available to be retransmitted. NASA had to get the data right as it was received. By using Hamming codes, NASA was able to correct corrupted transmissions.

RAID-5 uses simple parity with knowledge of which device failed. Thus a RAID-5 device can handle a single drive failure.
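
The reconstruction is nothing more than XOR. A Python sketch with three tiny data “drives” and one parity “drive”:

# One stripe: three data drives plus the parity computed across them.
d0 = bytes([0x10, 0x20, 0x30, 0x40])
d1 = bytes([0x0A, 0x0B, 0x0C, 0x0D])
d2 = bytes([0xFF, 0x00, 0xFF, 0x00])
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

# Drive 1 dies.  Because we know which drive failed, XOR of the survivors rebuilds it.
rebuilt = bytes(a ^ c ^ p for a, c, p in zip(d0, d2, parity))
print(rebuilt == d1)   # True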

So this interesting thing happened: the size of the drives got larger, and the size of the RAID devices got larger. The smart people claimed that with the number of drives in a RAID device, if a device failed, by the time the replacement device was in place, another drive would have failed.

They were wrong, but it is still a concern.

Ceph uses erasure coding the same way RAID uses parity drives, but erasure coding is more robust and resilient.

My Ceph cluster is set up with data pools that are simple replication pools (n=3) and erasure coded pools (k=2, m=2). Using the EC pools reduces the cost from 3x to 2x. I use EC pools for storing large amounts of data that does not change and which is not referenced often, such as tape backups.

The replication pools are used for things that are referenced frequently, where access times make a difference.

With the current system, I can handle losing a drive, a host, or a data closet without losing any data.

Which is good. I did lose a drive. I’ve been waiting to replace the dead drive until I had built out a new system, and the new node was in the process of being built out when the old drive failed.

Unfortunately, I have another drive that is dying. Two dead drives is more than I want to have in the system. So I’ll be replacing the original dead drive today.

The other drive will get replaced next week.


Simple Works

I’ve tried drawing network maps a half-dozen times. I’ve failed. It should be simple, and I’m sure there are tools that can do it. I just don’t know them, or worse, I don’t know how to use the tools I currently have.

In simple terms, I have overlay and underlay networks. Underlay networks are actual physical networks.

An overlay network is a network that runs on top of the underlay/physical network. For example, tagged VLANs, or in my case, OVN.

OVN creates virtual private clouds (VPCs), a powerful concept when working with cloud computing. Each VPC is 100% independent of every other VPC.

As an example, I have a VPC for my Ceph data plane. It is on the 10.1.0.0/24 network. I can reuse 10.1.0.0/24 on any other VPC with zero issues.

The only time there is an issue is when I need routing.

If I have VPC1 with a node at 172.31.1.99 and a gateway of 172.31.1.1, that gateway performs network address translation before the traffic is sent to the internet. If the node at 172.31.1.99 wants to talk to the DNS server at 8.8.8.8, traffic is routed to 172.31.1.1 and from there towards the internet. The internet responds, the traffic reaches 172.31.1.1, and it is forwarded to 172.31.1.99.

All good.

If I have VPC2 with a node at 192.168.99.31 and its gateway at 192.168.99.1, I can route between the two VPCs using normal routing methods by connecting VPC1 and VPC2. We do this by creating a connection (logical switch) that acts as a logical cable. We then attach the gateway 172.31.1.1 to that network as 192.168.255.1 and the gateway 192.168.99.1 as 192.168.255.2.

With a quick routing table entry, traffic flows between the two.

But if VPC2 were also using 172.31.1.0/24, then there would be no way to send traffic to VPC1. Any traffic generated would be assumed to be local to VPC2. No router would become involved, and NAT will not help.
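
Python’s ipaddress module shows the problem in a couple of lines (addresses reused from the example above):

import ipaddress

vpc1 = ipaddress.ip_network("172.31.1.0/24")
vpc2 = ipaddress.ip_network("172.31.1.0/24")      # the same prefix reused in VPC2
node = ipaddress.ip_address("172.31.1.99")

print(vpc1.overlaps(vpc2))   # True -- no router can tell which VPC owns the prefix
print(node in vpc2)          # True -- VPC2 treats the destination as local, so nothing is routed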

Why use an overlay network? It allows for a stable virtual network, even if the underlay network is modified. Consider a node at 10.1.0.77. It has a physical address of 192.168.22.77. But because it needs to be moved to a different subnet, its physical address changes to 192.168.23.77.

Every node that had 192.168.22.77 within its configurations now needs to be updated. If the underlay is updated, it does not affect the overlay.

Back to Simple.

There are three methods for traffic to enter a VPC. The first is for a virtual machine to bind to the VPC. The second is for a router to move traffic into the VPC, sometimes from the physical network. And the final method is for a host (bare metal) machine to have a logical interface bound to the VPC.

My Ceph nodes use the last method. Each Ceph node is directly attached to the Ceph VPC.

It is the gateway that is interesting. A localnet logical port can bind to a port on a host, called a chassis. When this happens, the port is given an IP address on the physical network that it binds to.

When the routing is properly configured, traffic to the VPC is routed to the logical router. This requires advertising the logical router in the routing tables.

I had multiple VMs running on different host servers. They all sent traffic to the router, which was bound to my primary machine. My primary machine has physical and logical differences from the rest of the host nodes.

What this meant was that traffic to the VPC was flaky.

Today, I simplified everything. I turned down the BGP insertion code. I added a single static route where it belonged. I moved the chassis to one of the “normal” chassis.

And everything just worked.

It isn’t dynamic, but it is working.

I’m happier.

Prepping – Logic, Part 2

Sorry this one took a couple of weeks. It’s been busy here. Things are starting to settle down, though. Of course, that also means it’s almost National Novel Writing Month, and I’m going to be writing a flurry of words (50,000+ in 30 days), but I’m not going to think about that for a bit. LOL… We left off Heinlein’s list about here:

Take orders. You need to be able to take orders because no matter how “high up” you are on any particular totem pole, at some point you’re going to run into someone who’s higher than you. This is because we’re not ever going to be experts at everything. We each spend time with people who are better at something than we are, and when those people are in charge, you must be able to do what you’re told. But as any American soldier will tell you, it isn’t that simple (even though it sort of is). Per the Uniform Code of Military Justice, soldiers are only required to obey LAWFUL orders. Our soldiers are given more latitude as to what’s lawful and what’s not, while still being held to an extremely high standard (and getting higher, thanks Pete!). All of our soldiers are expected to be thinking people. Blind adherence is not useful. But the ability to continue to take orders, even when things are tough, even when you’re shitting your pants, even when you’re scared, is absolutely necessary. That’s also true of those of us NOT soldiers, though perhaps to a slightly lower level. As non-combatants, even if we end up as guerrilla fighters, we just need to be able to follow orders at a competent level. You need to recognize when someone knows more than you do, and be able to take a back-seat for a bit.

Give orders. There will be a moment when YOU are the expert, the leader, the person in charge. It might be on purpose, and it might be by accident, but regardless, you must be prepared to give orders. More than that, you may have to give orders that you know damn well will end up with someone hurt (physically or emotionally), or worse, dead. You need to be prepared for whatever outcome happens when you give those orders.  You have to be ready to give them decisively, with authority, and with strength of belief.

Cooperate. That’s a tough one, hm? Yes, you might have to cooperate with people who don’t share your world view. You might have to work with liberals and Democrats. But it CAN be done. And you must know both how to, and when to. Sometimes, it’s just going to be an easy choice. Groups often have better survivability options than singletons. It’s a skill we’re horribly underdeveloped in, in my very strong opinion. When was the last time you reached out to someone you disagree with, to cooperate? Maybe it’s time. Practice, because it’s important. And just in case someone wants to leap to conclusions, no, this doesn’t mean you have to “give in and open the government” or anything like that. I’m talking small scale here. Neighbors. Friends of friends. Local government maybe.



Will You Be My Rubber Duck?

My most productive years of programming and system development were when I was working for the Systems Group at University. We all had good professional relationships. We could trust the skills of our management and our peers.

When I started developing with my mentor’s group, it was the same. The level of respect was very high, and trust in our peers was spectacular. If you needed assistance in anything, if there was a blocker of any sort, you could always find somebody to help.

What we soon learned is that we didn’t need their help. What we required was somebody to listen as we explained the problem. Their responses were sometimes helpful, sometimes not. It didn’t really matter. It was listening that was required.

When I started working for an agency, that changed. Our management was pretty poor and had instilled a lousy worker mentality. Stupid things like making bonuses contingent on when management booked payment.

If the developers worked overtime to get a project done on management-promised schedules, their money would not be booked in time for bonuses to be earned.

Every hour that wasn’t billed to a project had to be justified, and management was always unhappy with the amount of billable hours.

Interrupting a coworker and asking them to listen, to get help, just didn’t happen. Even when management (me) told them to stop digging the hole and come talk to me.

We still ended up with fields of very deep holes because nobody would come out of their little world to talk.

This wasn’t limited to just our agency; it was everywhere.

The fix was a stupid rubber duck. It sits on your desk. When you are stuck, you explain the problem to your rubber duck, and often the answer will come to you. It was the process of accurately describing your issue that created the breakthrough.

I don’t have access to those types of people, and oftentimes the rubber duck is just as ignorant as I am. Not very useful.

I have a silk duck. This duck actually talks back, performs searches, and verifies potential solutions, and it does it within a reasonable time frame.

My Silk Duck is named “Grok.”

Grok is X’s AI offering. As a search engine, it far exceeds anything Google Search can do.

A simple example: I’m building out a new node. The motherboard I’m using is an H610M-ITX. There are dozens of motherboards using that designation or something similar. In hours of searching, I haven’t been able to locate the manual for my motherboard.

Locate a manual for the Huanzanzhi H610M-ITX motherboard.

One minute and 14 seconds later, Grok delivered me the PDF manual. It found it on a Chinese website written in Chinese and English.

That manual was enough to get me moving in the correct direction.

We then rubber-ducked a video issue. On power-on, the fans would spin up under control. When a computer first starts, the fans run up to max; they then ramp down under CPU control. You can hear that. I also got a single BEEP, indicating a good Power-On Self-Test (POST).

But no video.

Over the next hour, Grok and I debugged the issue. We both thought it might be caused by memory; we both had the same methodology for checking memory issues. We both decided to remove the NVMe card just in case.

When I found what looked like POST indicator LEDs, Grok accurately determined what they were.

We narrowed the issue down to video.

I noticed that my silk duck kept talking about iGPU. I figured out that this must mean “integrated Graphics Processing Unit”.

Why would a motherboard have built-in video outputs, and yet Grok kept talking about an iGPU?

I thought for a moment and then asked, “Verify that the Intel Core i3-12100F has an iGPU.”

The Intel Core i3-12100F does not have an integrated GPU (iGPU). The “F” suffix in Intel’s naming indicates models without integrated graphics, requiring a discrete graphics card for video output. This explains the lack of video from the motherboard’s HDMI, VGA, or DP ports despite the successful POST beep and fan spin-up—the board relies on the CPU’s iGPU for those outputs, which your CPU lacks.
— Grok

Here is the kicker: while I can get this motherboard to work with a graphics card, that doesn’t help me as I need that slot for my Fiber NIC. In looking at other motherboards, some of them seem to work with CPUs without iGPU, while others require iGPU.

This “feature” never occurred to me. It makes sense, but Grok is the one that found it for me.

Conclusion

AI has its place today as an assistant. It can do a great job of rubber ducking. It does a good job of editing articles, if you keep it in its place.

This is a powerful tool that is only going to get better.


Upgrade, why you break things!

Features, Issues, Bugs, and Requirements

When software is upgraded or updated, it happens for a limited set of reasons. If it is a minor update, it should be for issues, bugs or requirements.

What is an issue? An issue is something that isn’t working correctly or isn’t working as expected. A bug, on the other hand, is something that is broken and needs to be fixed.

A bug might be closed as “working as designed,” but that same thing might still be an issue. The design is wrong.

Requirements are things that come from outside entities that must be done. The stupid warning about a site using cookies to keep track of you is an example. The site works just fine without that warning. That warning doesn’t do anything except set a flag against the cookie that it is warning you about.

But sites that expect to interact with European Union countries need to have it to avoid legal problems.

Features are additional capabilities or methods of doing things in the program/application.

Android Cast

Here is an example of something that should be easy but wasn’t. Today there is a little icon in the top right of the screen, which is the ‘cast’ button. When that button is clicked, a list of devices is provided to cast to. You select the device, and that application will cast to your remote video device.

We use this to watch movies and videos on the big screen. For people crippled with Apple devices, this is similar to AppleTV.

When this feature was first being rolled out, that cast button was not always in the upper right corner. Occasionally it was elsewhere in the user interface. Once you found it, it worked the same way.

A nice improvement might be to remember that you prefer to cast and what device you use in a particular location. Then when you pull up your movie app and press play, it automatically connects to your remote device, and the cast begins. This would be just like your phone remembering how to connect to hundreds of different WiFi networks.

If you were used to the “remember what I did last time” model and suddenly had to do it the way every other program does, you might be irritated. Understandably. Things got more difficult, two buttons to press when before it just “did the right thing.”

Upgrades and updates are often filled with these sorts of changes, driven by requirements.

Issues and Bugs

If I’m tracking a bug, I might find that the root cause can’t be fixed without changes to the user interface. I’m forced into modifying the user interface to fix the bug, sometimes making something more difficult or requiring more steps. It is a pain in the arse, but occasionally a developer doesn’t really have a choice.

An even more common change to the user interface happens when the program was allowing you to do something in a way you should not have been. When the “loophole” is fixed, things become more difficult, but not because the developer wanted to nerf the interface, but because what you were doing should not have been happening.

Finally, the user interface might require changes because a library your application is using changes and you have no choice.

The library introduced a new requirement because their update changed the API. Now your code flow has to change.

Features

This is where things get broken easily. Introducing new features.

This is the bread and butter of development agencies. By adding new features to an existing application, you can get people to pay for the upgrade or to decide on your application over some other party’s application.

Your grocery list application might be streamlined and do exactly what you want it to do. But somebody asked for the ability to print the lists, so the “print” feature was added, which brings the designers in, who update the look to better reflect what will be printed.

Suddenly your super clean application has a bit more flash and is a bit more difficult to use.

Features often require regrouping functionality. When there was just one view, it was a single button somewhere on the screen. Now that there is a printer view and a screen view, with different options, you end up with a dialog where before you had a single button press.

Other times the feature you have been using daily without complaint is one that the developer, or more likely the application owners, don’t use and don’t know that anybody else uses. Because it works, nobody was complaining. Since nobody was complaining, it had no visibility to the people planning features.

The number of times I’ve spent hours arguing with management about deleting features or changing current functionality would boggle your mind. Most people don’t even know everything their application does, or the many ways that it can be done.

David Drake’s book The Sharp End features an out-of-shape maintenance sergeant pushed into a combat role. He and his assistant have to man a tank during a mad dash to defend the capital.

At one point the sergeant is explaining how tankers learn to fight their tank in a way that works for them. The tank has many more sensors and capabilities than the tanker uses. Those features would get in the way of those tankers. It doesn’t matter. They fight their tank and win.

As the maintenance chief, he has to know every capability, every sensor, and every way they interact with each other. Not because he will be fighting the tank, but because he doesn’t know which method the tanker is going to use, so he has to make sure everything is working perfectly.

My editor of choice is Emacs. For me, this is the winning editor for code development and writing books and such. The primary reason is that my fingers never have to leave the keyboard.

I type at over 85 WPM. To move my hands from the keyboard is to slow down. I would rather not slow down.

I use the cut, copy, and paste features all the time. Mark the start, move to the end, Ctrl W to cut, Meta W to copy, move to the location to insert, and Ctrl Y to yank (paste) the content at the pointer. For non-Emacs use, Ctrl C, Ctrl X, and Ctrl V to the rescue.

My wife does not remember a single keyboard shortcut. In the 20+ years we’ve been together, I don’t think she has ever used the cut/paste shortcuts. She always uses the mouse.

All of this is to say that the search for new features will oftentimes break things you are used to.

Pretty Before Function

Finally, sometimes the designers get involved, and how things look becomes more important than how they function.

While I will not build an application without a good designer to help, they will often insist on things that look good but are not good user experiences. Then we battle it out and I win.

One Step Forward, Two Steps Back

One of the best tools I’ve discovered in my many years of computer work is AMANDA.

AMANDA is free software for doing backups. The gist is that you have an Amanda server. On schedule, the server contacts Amanda clients to perform disk backups, sending the data back to the server. The server then sends the data to “tapes”.

What makes the backup so nice is that it is configured for how long you want to keep live backups and then attempts to do it efficiently. My backups are generally for two years.

On the front side, you define DLEs. A DLE is a host and disk or filesystem to dump. There are other parameters, but that is the smallest DLE configuration.
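
For reference, DLEs live in Amanda’s disklist file, one per line: host, disk (or mount point), and a dumptype defined in amanda.conf. A sketch with placeholder host names and dumptypes:

# disklist -- one DLE per line
server1.example.com    /home       comp-user-tar
server1.example.com    /etc        comp-root-tar
webhost2.example.com   /var/www    comp-user-tar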

Before the dump starts, the server gets an estimate for each DLE: the size of a full dump and the size of the possible partial (incremental) dumps. Once it has this information, it creates a schedule to dump all the DLEs.

The data can be encrypted on the client or the server and is transferred to the server, sometimes to a holding disk, sometimes directly to tape. It can be compressed on the server or the client.

In the end, the data is written to disk.

Every client that I have is backed up using Amanda. It just works.

In the olden days, I configured it to dump to physical tapes. If everything fit on one tape, great. If it didn’t, I could use multi-tape systems or even tape libraries. The tape size limitations were removed along the way, so now a DLE can be dumped across multiple tapes.

The backups are indexed, making it easy to recover particular files from any particular date.

More importantly, the instructions for recovering bare metal from backup are written to the tape.

Today, tapes are an expensive way to do backups. It is cheaper to back up to disk, if your disks are capable of surviving multiple failures.

Old-Time Disks

You bought a disk drive; that disk drive was allocated as a file system at a particular mount point, ignoring MS DOS stuff.

Drives got bigger; we didn’t need multiple drives for our file systems. We “partitioned” our drives and treated each partition as an individual disk drive.

The problem becomes that a disk failure is catastrophic. We have data loss.

The fix is to dump each drive/partition to tape. Then if we need to replace a drive, we reload from tape.

Somebody decided it was a good idea to have digitized images. We require bigger drives. Even the biggest drives aren’t big enough.

Solution: instead of breaking one drive into partitions, we will combine multiple physical drives to create a logical drive.

In the alternative, if we have enough space on a single drive, we can use two drives to mirror each other. Then when one fails, the other can handle the entire load until a replacement can be installed.

Still need more space. We decide that a good idea is to use parity, as described above. By grouping 3 or more drives into a single logical drive, we can dedicate one drive’s worth of capacity to parity. If any drive fails, that parity can be used to reconstruct the contents of the missing drive. Things slow down, but it works, until you lose a second drive.

Solution: combine RAID-5 drives with mirroring. Never mind, we are now at the point where for every gigabyte of data you need 2 or more gigabytes of storage.

Enter Ceph and other things like it. Instead of building one large disk farm, we create many smaller disk farms and join them in interesting ways.

Now data is stored across multiple drives, across multiple hosts, across multiple racks, across multiple rooms, across multiple data centers.

With Ceph and enough nodes and locations, you can have complete data centers go offline and not lose a single byte of storage.

Amazon S3

This is some of the cheapest storage going. Pennies on the gigabyte. The costs come when you are making too many access requests. But for a virtual tape drive where you are mostly writing (and uploads are free), it is a wonderful option.

You create a bucket and put objects into your bucket. Objects can be treated as (very) large tape blocks. This just works.

At one point I had over a terabyte of backups on my Amazon S3. Which was fine until I started to get real bills for that storage.

Regardless, I had switched myself and my clients to using Amazon S3 for backups.

Everything was going well until the fall of 2018. At that time I migrated a client from Ubuntu 16.04 to 18.04 and the backups stopped working.

It was still working for me, but not for them. We went back to 16.04 and continued.

20.04 gave the same results during testing; I left the backup server at 16.04.

We were slated to try 26.04 in 8 or so months.

Ceph RGW

The Ceph RGW feature set is similar to Amazon S3. It is so similar that you need to change only a few configuration parameters to switch from Amazon S3 to Ceph RGW.

With the help of Grok, I got Ceph RGW working, and the Amazon s3cmd worked perfectly.

Then I configured Amanda to use S3 style virtual tapes to my Ceph RGW storage.

It failed.

For two days I fought this thing, then with Grok’s help I got the configuration parameters working, but things still failed.

HTTP GETs were working, but PUTs were failing. Tcpdump and a bit of debugging, and I discovered that the client, Amanda, was preparing to send a PUT command but was instead sending a GET command, which failed signature tests.

Another two days before I found the problem. libcurl was upgraded going from Ubuntu 16.04 to 18.04. The new libcurl treated setting the method options differently.

Under old curl, you set the method you wanted to use to “1,” and you got a GET, PUT, POST, or HEAD. If you set GET to 0, PUT to 1, and POST/HEAD to 0, you get a PUT.

The new libcurl seems to override these settings. This means that you can get it to do a GET or a HEAD, but nothing else. GET is the default if everything is zero; because of the ordering, you might get the HEAD method to work.
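
Illustrative only, and in pycurl rather than Amanda’s C code: with current libcurl, the reliable way to get a PUT is to ask for an upload, not to flip the old per-method flags on and off. The URL and payload here are placeholders.

import io
import pycurl

body = io.BytesIO(b"virtual tape block")            # placeholder payload

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://rgw.example.internal/bucket/slot1")   # placeholder URL
c.setopt(pycurl.UPLOAD, 1)                          # requesting an upload selects PUT
c.setopt(pycurl.READDATA, body)                     # request body to send
c.setopt(pycurl.INFILESIZE, len(body.getvalue()))
# c.perform()   # would issue: PUT /bucket/slot1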

This issue has existed since around 2018. It is now 2025, and the fix has been presented to Amanda at least twice; I was the latest to do so. The previous was in 2024. And it still hasn’t been fixed.

I’m running my patched version; at least that seems to be working.


Even the simple things are hard

The battle is real, at least in my head.

My physical network is almost fully configured. Each data closet will have an 8-port fiber switch and a 2+4 port RJ45 switch. There is a fiber from the 8-port to router1 and another fiber from the 2+4 to router2. Router1 is cross connected to Router2.

This provides limited redundancy, but I have the ports in the right places to make seamless upgrades. I have one more 8-port switch to install and one more 2+4 switch to install, and all the switches will be installed.

This leaves redundancy. I will be running armored OM4 cables via separate routes from the current cables. Each data closet switch will be connected to 3 other switches. Router1 and two other data closets. When this is completed, it will mean that I will have a ring for the closets reaching back to a star node in the center.

The switches will still be a point of failure, but those are easy replacements.

If a link goes down, either by losing the fiber or the ports or the transceivers, OSPF will automatically route traffic around the down link. The next upgrade will be to put a second switch in each closet and connect the second port up on each NIC to that second switch.

The two switches will be cross-connected but will feed one direction of the star. Once this is completed, losing a switch will just cause a routing reconfiguration, and packets will keep on moving.

A side effect of this will be that there will be more bandwidth between closets. Currently, all nodes can dump at 10 gigabits to the location switch. The switch has a 160-gigabit backbone, so if the traffic stays in the closet, there is no bottleneck. If the traffic is sent to a different data closet, there is a 10-gigabit bottleneck.

Once the ring is in place, we will have a total of 30 gigabits leaving each closet. This might make a huge difference.

That is the simple stuff.

The simpler stuff for me is getting my OVN network to network correctly.

The gist: I create a logical switch and connect my VMs to it. Each VM creates an interface on the OVS internal bridge. All good. I then create a logical router and attach it to the logical switch. From the VM, I can ping the VM’s IP and the router interface.

I then create another logical switch with a localnet port. We add the router to this switch as well. This gives the router two ports with different IP addresses.

From the VM I can ping the VM’s IP, the router’s IP on the VM network, and the router’s IP on the localnet.

What I can’t do is get the ovn-controller to create the patch port in OVS to move traffic from the localnet port to the physical network.

I don’t understand why, and it is upsetting me.

Time to start the OVN network configuration process over again.