
There are times when I wish I had entered the Computer Science world a few years earlier.

My mentor was born four years before I was. He went through the computer science program four years before I did at a prestigious university.

His academics took place as electrical engineering was morphing into computer science, which meant that he was taught circuit design at a lower level than we were. He was taught how to design transistors and to build the hardware.

My class was taught bits, bytes, and words. We built on what his generation was learning as computers advanced.

A bit is the smallest amount of data that a computer can store. In fact, it is the only type of data that a computer can store.

Bits are grouped into bytes and words. Today, a byte is 8 bits long; in past years, it could be larger or smaller than that. I’ve worked on machines where a byte was defined as 12 bits.

Bits, by themselves, have no meaning. A programmer assigns meaning to a sequence of bits. Let’s take a byte of 8 bits, for example. The byte we will be looking at has a sequence of bits like this 0011 0001.

Again, this is meaningless until we assign meaning to it. If we say that it is an unsigned tiny integer, then a byte can represent any value between 0 and 255, and our byte represents the integer 49: 0*128 + 0*64 + 1*32 + 1*16 + 0*8 + 0*4 + 0*2 + 1*1 = 49 in base 10. We can also express that integer in base 16 as 0x31 or in base 8 as 061.
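
Here is a small Python sketch of that arithmetic. The variable names are mine, chosen only for illustration; the bit pattern is the one from the text.

```python
# The bit pattern from the text, written as a Python binary literal.
bits = 0b0011_0001

# Multiply each bit by its place value, most significant bit first, and sum.
place_values = [128, 64, 32, 16, 8, 4, 2, 1]
digits = [int(d) for d in format(bits, "08b")]
total = sum(d * p for d, p in zip(digits, place_values))

print(total)      # 49
print(hex(bits))  # 0x31
print(oct(bits))  # 0o61 (Python spells the octal prefix "0o" rather than a bare "0")
```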

We could define it as a signed tiny integer; then a byte represents a value between -128 and 127. But one of the most magical ways of looking at the byte is as a character. In this format we have a table that maps values to glyphs or characters. The value 0x31, 49, 061, 0011 0001 is interpreted as the character “1”. In the same way, the value 0x41, 65, 101, 0100 0001 is interpreted as the character “A”.
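
To make the point concrete, here is a sketch that reads the same byte three ways using Python's struct module. It assumes two's complement for the signed case and ASCII for the character table; the second byte, with every bit set, is my own addition to show where signed and unsigned readings diverge.

```python
import struct

raw = bytes([0b0011_0001])

as_unsigned = struct.unpack("B", raw)[0]  # read as an unsigned 8-bit integer
as_signed = struct.unpack("b", raw)[0]    # read as a signed 8-bit integer
as_char = raw.decode("ascii")             # read as an ASCII character

# A byte with the top bit set: same bits, two different integers.
high = bytes([0b1111_1111])
high_unsigned = struct.unpack("B", high)[0]
high_signed = struct.unpack("b", high)[0]

print(as_unsigned, as_signed, repr(as_char))  # 49 49 '1'
print(high_unsigned, high_signed)             # 255 -1
print(bytes([0x41]).decode("ascii"))          # A
```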

In other words, a bit pattern doesn’t have meaning until we define what the value means.

Primitive Types

The CPU in a computer has several registers. Each register holds a bit pattern of a given size. The CPU can then manipulate registers with a fixed set of instructions. Those instructions define the meaning of the register for that instant. If we use the integer add operation, then the two registers are treated as integers with the result being stored in a third register. If we use floating point operations, then we treat the registers as 32 or 64 bit floating point numbers; floats are 32 bits and doubles are 64 bits. We can treat the registers as containing one byte or character, or multiple characters. Or, we can treat each bit as a boolean.
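
Python does not expose registers, but we can imitate the idea by packing one 32-bit pattern and reading it back under different interpretations. This is only a sketch: the pattern 0x3F800000 is an arbitrary value I picked, and the little-endian byte order is an assumption of the example.

```python
import struct

pattern = 0x3F800000  # arbitrary 32-bit pattern chosen for this example

raw = struct.pack("<I", pattern)        # the 4 raw bytes, little-endian

as_int = struct.unpack("<I", raw)[0]    # treat the bits as an unsigned integer
as_float = struct.unpack("<f", raw)[0]  # treat the same bits as a 32-bit float
as_bytes = list(raw)                    # treat them as four separate bytes
as_bools = [bool((pattern >> i) & 1) for i in range(31, -1, -1)]  # one boolean per bit

print(as_int)         # 1065353216
print(as_float)       # 1.0
print(as_bytes)       # [0, 0, 128, 63]
print(sum(as_bools))  # 7 bits are set
```

Nothing about the bits changed between those lines. Only the operation applied to them did.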

For languages, we normally have integer, unsigned integer, float, double, character, and string. These are all referred to as primitives.

While we have defined the type, we have not defined the meaning. For example, 1234.70 is a floating point number. But what does it mean?

It could be a price, a quantity, or a physical measurement. If it is a physical measurement, then it also carries units.

It is the meaning we give values that allows humans to interact with the data.

Formatting

Let’s say we are working with a basic product object. Each product has a SKU, price, description, and quantity in stock. We will call these “labels”. We give a primitive type to each. SKU=>string, price=>float, description=>string, quantity=>integer.

This is a good start, but we also define how we will format these values when we display the product for a user. We can say that SKU and description will be left-aligned, price will be formatted as currency ($x,xxx.xx) right-aligned, while quantity will be formatted as an integer (x,xxx) right-aligned. This formatting is encoded in the knowledge of the meaning of the labeled data.

Formatting is not part of the data; it is associated with the label. The label allows us to assign meaning to the data.
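
A rough sketch of that separation in Python follows. The product values, the column widths, and the names product, formats, and currency are all made up for the illustration. The record holds bare values, and the display rules live in a separate table keyed by label.

```python
# Hypothetical product record: bare values, no formatting stored with them.
product = {"sku": "00417", "description": "Widget", "price": 1234.7, "quantity": 1500}

def currency(value):
    """Format a number as $x,xxx.xx."""
    return "$" + format(value, ",.2f")

# Display rules are attached to the label, not to the data.
formats = {
    "sku": lambda v: f"{v:<10}",             # left-aligned text
    "description": lambda v: f"{v:<20}",     # left-aligned text
    "price": lambda v: f"{currency(v):>12}", # currency, right-aligned
    "quantity": lambda v: f"{v:>8,}",        # integer with separators, right-aligned
}

row = {label: formats[label](value) for label, value in product.items()}
for label, text in row.items():
    print(f"{label:<12}|{text}|")
```

The same 1234.7 would print differently if we filed it under a different label, which is the whole point.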

Viewing Data

Humans have a difficult time applying meaning to bit patterns, so each primitive type has a standard text format. This allows us to see the values of the data.

For example, we say that strings are input and displayed as quoted strings, “This is an example string”. Integers are input and displayed as an optional negative sign followed by a sequence of digits: 1-9 as the first character, followed by 0 or more 0-9 characters. Or it can be a single 0. This defines a base 10 integer. If the integer starts with 0x, then what follows the x is an integer in base 16. If the first character after the sign is a 0 and more digits follow, then it is octal.
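
Those integer rules are small enough to sketch as a parser. This is my own illustration, not a standard library function: parse_int is a hypothetical name, and allowing the sign in front of a 0x value is an assumption on my part. Python's built-in int() rejects a bare leading 0 as an octal prefix, which is why the sketch does the dispatch itself.

```python
import re

# Optional sign, then one of: hex (0x...), octal (0 followed by octal digits),
# decimal (1-9 followed by any digits), or a lone 0.
_INT = re.compile(r"^(-?)(?:0[xX]([0-9a-fA-F]+)|0([0-7]+)|([1-9][0-9]*)|(0))$")

def parse_int(text):
    """Parse an integer literal using the rules described above."""
    m = _INT.match(text)
    if m is None:
        raise ValueError(f"not an integer literal: {text!r}")
    sign, hex_digits, oct_digits, dec_digits, _zero = m.groups()
    if hex_digits is not None:
        value = int(hex_digits, 16)
    elif oct_digits is not None:
        value = int(oct_digits, 8)
    elif dec_digits is not None:
        value = int(dec_digits, 10)
    else:
        value = 0
    return -value if sign else value

print(parse_int("49"), parse_int("0x31"), parse_int("061"))  # 49 49 49
```

Note that "08" is rejected: the leading 0 announces octal, and 8 is not an octal digit.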

Simple.

These rules for displaying and inputting values are well defined.

Information Interchange

This is a gigantic subject. We are going to barely touch on it. To transfer information in a meaningful way, we have to define the meaning of each datum that is exchanged.

There are specific tools for doing this. XML, JSON, YAML, SOAP, and others are designed specifically for this process.

Unfortunately, there are de facto “rules” for exchanging data. Rules that must be followed but that the people using them do not understand.

Excel, Word, and Other Microsoft Monstrosities

The default for most people when exchanging a table of information is an Excel file, an xls file. How I hate this.

An Excel file gives labels to values and adds formatting but does not add meaning. Meaning comes from external sources.

So, we might have two cells we are looking at. B1 has a value of “Price”. It is formatted as bold text, centered. B2 has a value of 7.50. It is formatted as currency, so it displays as “$7.50”. If the cell was formatted as “number” or “general” it would display as 7.5.

It is the user who applies these formatting rules. It is the user who provides meaning to these values.

If you have an application that can read and display Excel sheets, then all is good.

But Excel sheets are not a great way to exchange data. As a matter of fact, they suck. Each cell must contain both formatting and values. There are linkages between cells and a hundred other things that can be added. They are painful to create programmatically.

The Savior of Data Interchange, CSV

It doesn’t get any simpler than a comma-separated values file. They use the well-defined primitive type display rules; they are easy to generate, they are row independent, and they can be read by a simple text editor.

The biggest thing to understand is that CSV exchanges values. The meaning of those values is up to the receiver.

Which is why expectations and Excel suck.

It’s Bad Data, No! It’s Being Displayed Wrong!

When a normal user receives a CSV file, they want to open the file and view it. On Microsoft platforms, the program tasked to do this is Excel. On many Linux platforms, the tool is LibreOffice. For Solaris it was OpenOffice.

These tools import the CSV file. That import process can break things, badly.

By default, a comma separates each field. If a comma is part of the value, it must be escaped. Quotes are used to provide escaping. Quotes within quoted strings must also be properly handled.
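
Python's csv module handles this quoting for us, which makes it a convenient way to see the rules in action. The row below is invented for the example; it has a description containing both a comma and a quote.

```python
import csv
import io

rows = [
    ["sku", "description", "price"],
    ["A100", 'Bolt, 3" zinc', "7.50"],  # value contains a comma and a quote
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()
print(text)
# sku,description,price
# A100,"Bolt, 3"" zinc",7.50

# A real CSV reader recovers the original three fields.
parsed = list(csv.reader(io.StringIO(text)))

# Splitting on commas by hand does not: it sees four fields.
naive = text.splitlines()[1].split(",")
print(len(parsed[1]), len(naive))  # 3 4
```

The field with the comma is wrapped in quotes, and the quote inside it is doubled. Any importer that skips either rule shifts every column after it.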

So we end up with client bug reports like this: “Data is misaligned throughout”.

What does this mean? It means that the client hasn’t properly defined the type and formatting for the column. In this case, the SKU sometimes consists of just digits. When this happens, Excel treats the value as an integer type. By default, integers are displayed right justified. If the value has characters in it, then it is displayed as a left justified character string.

Once you tell Excel that the SKU column only contains text, then the alignment issue goes away.
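
I do not know Excel's actual import logic, so treat the following as a simplified stand-in for what a spreadsheet importer might do: guess a type per cell, then align by type. The functions guess_cell and display, and the sample SKUs, are hypothetical.

```python
def guess_cell(text):
    """Guess a cell's type from its text (a simplified, assumed heuristic)."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text

def display(value, width=8):
    """Right-align numbers, left-align everything else."""
    if isinstance(value, (int, float)):
        return f"{value:>{width}}"
    return f"{value:<{width}}"

skus = ["AB-1001", "02134", "77120"]
for raw in skus:
    print(f"|{display(guess_cell(raw))}|")
# |AB-1001 |
# |    2134|
# |   77120|
```

The all-digit SKUs jump to the right edge, and "02134" loses its leading zero along the way. Skipping guess_cell, which is what declaring the column as text amounts to, keeps every SKU as the string that was sent.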

“The price column is missing dollar signs” means that values are being displayed as floating point numbers, not as currency. Change the format to currency, and it all just works.

“There are symbols instead of letters.” means they get what they put in. The value stored in the database has an accent in it. Like résumé.

The problem happens when Excel imports a word with accents on their Microsoft platform. My browser and LibreOffice both read the file as UTF-8, so I see résumé. Their Excel on their Microsoft platform appears to assume a different character encoding, so it displays the accent as something like a copyright symbol, ©.
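
One plausible mechanism, and this is my assumption rather than something I confirmed on the client's machine, is that the file is written as UTF-8 and read back as a Windows code page such as cp1252. The sketch below reproduces the copyright symbol that way.

```python
word = "r\u00e9sum\u00e9"  # "résumé", spelled with escapes so the source file's encoding cannot matter

utf8_bytes = word.encode("utf-8")  # each é becomes the two bytes C3 A9

as_utf8 = utf8_bytes.decode("utf-8")      # reader agrees with the writer
as_cp1252 = utf8_bytes.decode("cp1252")   # reader assumes a Windows code page

print(as_utf8)    # résumé
print(as_cp1252)  # rÃ©sumÃ©

# Writing with "utf-8-sig" prepends a byte order mark, which some
# applications use as a hint that the file is UTF-8.
with_bom = word.encode("utf-8-sig")
print(with_bom[:3])  # b'\xef\xbb\xbf'
```

The bytes are identical in both cases. Only the table used to turn them into characters differs.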

They see what they put into the database, but they are unhappy that Excel is doing exactly what they asked it to do.

2 thoughts on “It Is All Ones and Zeros”
  1. Re text being displayed wrong: for decades this was a very problematic area, because languages other than English need letters not found in ASCII. French and Swedish both have that need, but the specific extra characters are different. In 7 bits this was handled by “national characters” — things like square brackets were formally just national characters, a particular country’s or language’s decision what to use those codes for. An ASCII file with [ ] in it might show up on a Swedish terminal as having accented letters instead. Similarly, # on a US computer would become a pound-sterling sign on a British one.
    To fix this, DEC adopted an 8 bit code where the accents show up in the upper 128 codes. ISO took the idea and tweaked it slightly, calling it Latin-1. The “-1” was in recognition of the fact that this code works for all the western European languages, but not for (say) Hungarian, never mind Maltese or Greenlandic.
    Non-Latin scripts like Cyrillic (in many variants), Greek, Hebrew, etc. also fit in 8 bits but again as separate codes. Microsoft went way overboard with this using their “code pages” which also in standard MS fashion were entirely their own and different from any international standard.
    Then there are Chinese and Japanese, which have several thousand characters. Originally those were country-specific 16 bit sets, a couple for Chinese (red China vs other Chinese-speaking places) and a different one for Japanese.
    At some point the idea arrived to clean this up with an international set, a 16-bit generalization of the 8-bit Latin-n structures (with the first 32 code points of each 128 entry “page” treated as control characters). That was the original ISO 10646. It was published and instantly rejected vociferously, with a grass roots effort creating Unicode instead (which is a straight 16 bit code without those control character aberrations). That includes using a single code for Chinese and Japanese. I’ve long had my doubts about the sanity of doing that, but it was the choice. Eventually that became the new ISO 10646, with the original thing disappearing without a trace. And also eventually Unicode grew to be 21 bits instead of just 16, because 16 bits isn’t quite enough to do “all the characters of the world”.

    The end result is that if you use Unicode, and the applications assume that, all is well. If you have applications that default to something else, you get the wrong output. Unfortunately, Unicode (or UTF-8, the most common encoding) is not self-describing in its standard form, though once again MS has created a non-standard hack to pretend to make it self-describing.

  2. Those dang Massachusetts zip codes have been screwing up amateur mailing lists for decades. The spreadsheets auto-format 02XXX zip codes as numbers then drop the leading 0.
