It Is All Ones and Zeros
There are times when I wish I had entered the Computer Science world a few years earlier.
My mentor was born four years before I was. He went through the computer science program four years before I did at a prestigious university.
His academics took place as electrical engineering was morphing into computer science, which meant that he was taught circuit design at a lower level than we were. He was taught how to design transistors and to build the hardware.
My class was taught bits, bytes, and words. We built on what his generation was learning as computers advanced.
A bit is the smallest amount of data that a computer can store. In fact, it is the only type of data that a computer can store.
Bits are grouped into bytes and words. Today, a byte is 8 bits long; in past years, it could be larger or smaller than that. I’ve worked on machines where a byte was defined as 12 bits.
Bits, by themselves, have no meaning. A programmer assigns meaning to a sequence of bits. Let’s take a byte of 8 bits, for example. The byte we will be looking at has this sequence of bits: 0011 0001.
Again, this is meaningless until we assign meaning to it. If we say that it is an unsigned tiny integer, then a byte can represent any value between 0 and 255, and our byte represents the integer 49: 0*128 + 0*64 + 1*32 + 1*16 + 0*8 + 0*4 + 0*2 + 1*1 = 49 base 10. We can also express that integer in base 16 as 0x31 or in base 8 as 061.
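The arithmetic above can be checked with a few lines of Python. This is only a sketch of the place-value sum; the variable names are mine, not part of any standard.

```python
# The bit pattern from the text, written most-significant bit first.
bits = "00110001"

# Multiply each bit by its place value (128, 64, 32, 16, 8, 4, 2, 1) and sum.
value = sum(int(bit) * 2 ** power
            for power, bit in zip(range(7, -1, -1), bits))

print(value)       # 49
print(hex(value))  # 0x31
print(oct(value))  # 0o61 (Python spells octal with a 0o prefix, not a bare 0)
```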
We could define it as a signed tiny integer; then a byte represents a value between -128 and 127. But one of the most magical ways of looking at the byte is as a character. In this format we have a table that maps values to glyphs or characters. The value 0x31, 49, 061, 0011 0001 is interpreted as the character “1”. In the same way, the value 0x41, 65, 0101, 0100 0001 is interpreted as the character “A”.
In other words, a bit pattern doesn’t have meaning until we define what the value means.
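Here is a small sketch of the same idea using Python's standard struct module. It assumes two's complement signed integers and an ASCII character table, which is what most machines use today. The variable names are just for illustration.

```python
import struct

raw = bytes([0b00110001])   # one byte holding the pattern 0011 0001

unsigned = struct.unpack("B", raw)[0]   # read it as an unsigned 8-bit integer
signed = struct.unpack("b", raw)[0]     # read it as a signed 8-bit integer
character = raw.decode("ascii")         # read it as an ASCII character

print(unsigned, signed, repr(character))   # 49 49 '1'

# A pattern with the top bit set shows where signed and unsigned part ways.
high = bytes([0b11111111])
high_unsigned = struct.unpack("B", high)[0]
high_signed = struct.unpack("b", high)[0]
print(high_unsigned, high_signed)          # 255 -1

letter = bytes([0x41]).decode("ascii")
print(letter)                              # A
```

The bits never change between those lines. Only the interpretation does.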
Primitive Types
The CPU in a computer has several registers. Each register holds a bit pattern of a given size. The CPU can then manipulate registers with a fixed set of instructions. Those instructions define the meaning of the register for that instant. If we use the integer add operation, then the two registers are treated as integers, with the result being stored in a third register. If we use floating point operations, then we treat the registers as 32- or 64-bit floating point numbers. A float is typically 32 bits and a double is typically 64 bits. We can treat the registers as containing one byte or character, or multiple characters. Or, we can treat each bit as a boolean.
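Python does not expose registers, but the same effect can be sketched by packing 32 bits once and reading them back two ways. The pattern 0x3F800000 is my own choice for the example; it happens to be the IEEE 754 single-precision encoding of 1.0.

```python
import struct

pattern = 0x3F800000                    # 32 bits, chosen for illustration
raw = struct.pack("<I", pattern)        # the four bytes, little-endian

as_int = struct.unpack("<I", raw)[0]    # treat the bits as an unsigned integer
as_float = struct.unpack("<f", raw)[0]  # treat the same bits as a 32-bit float

print(as_int)    # 1065353216
print(as_float)  # 1.0
```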
For languages, we normally have integer, unsigned integer, float, double, character, and string. These are all referred to as primitives.
While we have defined the type, we have not defined the meaning. For example, 1234.70 is a floating point number. But what does it mean?
It could be a price, a quantity, or a physical measurement. If it is a physical measurement, then it is expressed in some unit.
It is the meaning we give values that allows humans to interact with the data.
Formatting
Let’s say we are working with a basic product object. Each product has a SKU, price, description, and quantity in stock. We will call these “labels”. We give a primitive type to each. SKU=>string, price=>float, description=>string, quantity=>integer.
This is a good start, but we also define how we will format these values when we display the product for a user. We can say that SKU and description will be left-aligned, price will be formatted as currency ($x,xxx.xx) right-aligned, while quantity will be formatted as an integer (x,xxx) right-aligned. This formatting follows from knowing what the labeled data means.
Formatting is not part of the data; it is associated with the label. The label allows us to assign meaning to the data.
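A minimal sketch of the product example in Python follows. The Product class, the format_row function, and the column widths are all hypothetical; the point is only that the formatting lives with the labels, not in the stored values.

```python
from dataclasses import dataclass


@dataclass
class Product:
    sku: str
    price: float
    description: str
    quantity: int


def format_row(product: Product) -> str:
    """Apply the display rules tied to each label (widths are arbitrary)."""
    price_text = "$" + format(product.price, ",.2f")   # currency: $x,xxx.xx
    quantity_text = format(product.quantity, ",")      # integer: x,xxx
    return (product.sku.ljust(10)            # SKU: left-aligned text
            + product.description.ljust(20)  # description: left-aligned text
            + price_text.rjust(12)           # price: right-aligned
            + quantity_text.rjust(8))        # quantity: right-aligned


row = format_row(Product("A-100", 1234.7, "Widget", 1500))
print(row)
```

The stored price is still the float 1234.7. The dollar sign, the comma, and the trailing zero only exist in the displayed text.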
Viewing Data
Humans have a difficult time applying meaning to bit patterns, so each primitive type has a standard text format. This allows us to see the values of the data.
For example, we say that strings are input and displayed as quoted strings, “This is an example string”. Integers are input and displayed as an optional negative sign followed by a sequence of digits: 1-9 as the first character followed by zero or more 0-9 characters, or a single 0. This defines a base 10 integer. If the integer starts with 0x, then what follows the x is an integer in base 16. If the first character after the sign is a 0 and more digits follow, then it is octal.
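Those integer rules fit in a short function. This is a sketch of the C-style convention described above, with a hypothetical name, parse_int. It does not try to reject every malformed input. Python's own int(text, 0) is stricter and wants 0o in front of octal digits, so the bases are handled by hand here.

```python
def parse_int(text: str) -> int:
    """Parse a C-style integer: decimal, 0x hexadecimal, or leading-0 octal."""
    digits = text.strip()
    sign = 1
    if digits.startswith("-"):       # optional negative sign
        sign = -1
        digits = digits[1:]
    if digits.lower().startswith("0x"):
        return sign * int(digits[2:], 16)   # what follows the x is base 16
    if len(digits) > 1 and digits.startswith("0"):
        return sign * int(digits[1:], 8)    # leading 0 means octal
    return sign * int(digits, 10)           # otherwise plain base 10


print(parse_int("49"), parse_int("0x31"), parse_int("061"))   # 49 49 49
```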
Simple.
These rules for displaying and inputting values are well defined.
Information Interchange
This is a gigantic subject. We are going to barely touch on it. To transfer information in a meaningful way, we have to define the meaning of each datum that is exchanged.
There are specific tools for doing this. XML, JSON, YAML, SOAP, and others are designed specifically for this process.
Unfortunately, there are de facto “rules” for exchanging data, rules that must be followed but that the people using them do not understand.
Excel, Word, and Other Microsoft Monstrosities
The default for most people when exchanging a table of information is an Excel file, an xls file. How I hate this.
An Excel file gives labels to values and adds formatting but does not add meaning. Meaning comes from external sources.
So, we might have two cells we are looking at. B1 has a value of “Price”. It is formatted as bold text, centered. B2 has a value of 7.50. It is formatted as currency, so it displays as “$7.50”. If the cell were formatted as “number” or “general”, it would display as 7.5.
It is the user who applies these formatting rules. It is the user who provides meaning to these values.
If you have an application that can read and display Excel sheets, then all is good.
But Excel sheets are not a great way to exchange data. As a matter of fact, they suck. Each cell must contain both formatting and values. There are linkages between cells and a hundred other things that can be added. They are painful to create programmatically.
The Savior of Data Interchange, CSV
It doesn’t get any simpler than a comma-separated values file. These files use the well-defined primitive type display rules, they are easy to generate, each row stands on its own, and they can be read with a simple text editor.
The biggest thing to understand is that CSV exchanges values. The meaning of those values is up to the receiver.
Which is why expectations and Excel suck.
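To make the point concrete, here is a sketch using Python's standard csv module. The column names and sample row are made up. Every value arrives as text, and it is the receiver who decides which ones become numbers.

```python
import csv
import io

# Made-up sample data: a header row and one product.
data = "sku,price,description,quantity\n10045,7.50,Widget,1200\n"

rows = list(csv.DictReader(io.StringIO(data)))
raw = rows[0]
print(raw)   # every value is a str, including "10045" and "7.50"

# The receiver assigns the types, based on what each label means.
typed = {
    "sku": raw["sku"],                  # stays text even though it is all digits
    "price": float(raw["price"]),
    "description": raw["description"],
    "quantity": int(raw["quantity"]),
}
print(typed)
```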
It’s Bad Data, No! It’s Being Displayed Wrong!
When a normal user receives a CSV file, they want to open the file and view it. On Microsoft platforms, the program tasked to do this is Excel. On many Linux platforms, the tool is LibreOffice. For Solaris it was OpenOffice.
These tools import the CSV file. That import process can break things, badly.
By default, a comma separates each field. If a comma is part of the value, it must be escaped. Quotes are used to provide escaping. Quotes within quoted strings must also be properly handled.
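Python's standard csv module applies these escaping rules for you. The row below is invented for the example. It has a comma and a quote inside one value to show both cases.

```python
import csv
import io

# One value contains both a comma and a double quote.
row = ["A-100", 'Widget, 3" wide', "7.50"]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)   # A-100,"Widget, 3"" wide",7.50

# Reading the line back recovers the original three values.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)   # True
```

The field is wrapped in quotes because of the comma, and the quote inside it is doubled.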
So we end up with client bug reports like this: “Data is misaligned throughout”.
What does this mean? It means that the client hasn’t properly defined the type and formatting for the column. In this case, the SKU sometimes consists of just digits. When this happens, Excel treats the value as an integer type. By default, integers are displayed right justified. If the value has characters in it, then it is displayed as a left justified character string.
Once you tell Excel that the SKU column only contains text, then the alignment issue goes away.
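The behavior can be imitated in a few lines. To be clear, this is not Excel's actual logic. display_cell and display_as_text are hypothetical stand-ins that guess a type from the text and align accordingly.

```python
def display_cell(value: str, width: int = 10) -> str:
    """Hypothetical type guessing: all digits means integer, right-justified."""
    if value.isdigit():
        return value.rjust(width)
    return value.ljust(width)        # anything else is text, left-justified


def display_as_text(value: str, width: int = 10) -> str:
    """The column has been declared as text, so nothing is guessed."""
    return value.ljust(width)


for sku in ["10045", "A-100"]:
    print("|" + display_cell(sku) + "|", "|" + display_as_text(sku) + "|")
# |     10045| |10045     |
# |A-100     | |A-100     |
```

The left column is the “misaligned” report. The right column is what the client sees once the type is declared.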
“The price column is missing dollar signs” means that values are being displayed as floating point numbers, not as currency. Change the format to currency, and it all just works.
“There are symbols instead of letters” means they get what they put in. The value stored in the database has an accent in it, like résumé.
The problem happens when Excel imports a word with accents on their Microsoft platform. My browser and LibreOffice both read the file with the same character encoding it was written in, so I see résumé. Their Excel on their Microsoft platform most likely assumes a different encoding, so it displays the accent as something like a copyright symbol, ©.
They see what they put into the database, but they are unhappy that Excel is doing exactly what they asked it to do.
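A likely explanation for the copyright symbol is a character encoding mismatch. The sketch below assumes the file was written as UTF-8 and read back as Windows-1252, which is my guess at the setup and not something I have confirmed on the client's machine.

```python
word = "résumé"

utf8_bytes = word.encode("utf-8")        # é becomes two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("cp1252")    # read those bytes as Windows-1252
print(garbled)                           # rÃ©sumÃ©

# The bytes were never damaged, so reversing the mistake restores the word.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)                          # résumé
```

The bits in the file are the same in both cases. Only the table used to turn them into characters is different.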










