Disk Failures
I’ve talked about my Ceph cluster more than a bit. I’m sure you are bored with hearing about it.
Ceph uses two techniques to provide resilient storage: replication, where each block is stored multiple times, and erasure coding.
In many modern systems, the drive controller allows for RAID configurations. The most familiar is RAID-1, or mirroring. Every block written is written to two different drives. If one drive fails, or even a single sector, the data can be recovered from the other drive. This means that to store 1 GB of data, 2 GB of storage is required. In addition, the drives need to be matched in size.
Ceph's replication defaults to three copies of each block (the original plus two replicas). This means that to store 1 GB of data, 3 GB of storage is required.
Since duplicated data is not very efficient, different systems are used to provide the resilience required.
For RAID-5, parity is added. With three or more drives, one drive's worth of capacity is given over to parity. (Strictly speaking, RAID-5 spreads the parity across all the drives; a dedicated parity drive is RAID-4, but the overhead is the same.)
Parity is a simple method of determining whether something was modified in a small chunk of data. Take a string of binary digits, p110 1100 (a lowercase 'l' in 7-bit ASCII), where the 'p' bit is the parity bit. We count the number of one bits and then set the p bit to make the count odd or even, depending on the agreement. If we say we are using odd parity, the value would be 1110 1100. There are five ones, which is odd.
If we were to receive 1111 1100, the count of ones would be even, telling us that what was transmitted is not what we received. A lone parity bit gives single-bit error detection with no correction.
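Parity is easy to see in a few lines of code. Here's a minimal Python sketch of odd parity using the lowercase 'l' example above (the function names are mine, just for illustration):

```python
# A minimal sketch of odd parity. The 7 data bits are 110 1100; the
# parity bit is chosen so the total number of 1s, parity included, is odd.

def odd_parity_bit(data_bits: str) -> str:
    ones = data_bits.count("1")
    # If the data already has an odd number of 1s, the parity bit is 0;
    # otherwise it is 1, bringing the total to an odd count.
    return "0" if ones % 2 == 1 else "1"

def check_odd_parity(frame: str) -> bool:
    # A received frame (parity bit plus data) is valid if the total
    # number of 1s is odd.
    return frame.count("1") % 2 == 1

data = "1101100"                     # 'l' in 7-bit ASCII
frame = odd_parity_bit(data) + data  # -> "11101100", five 1s
print(frame, check_odd_parity(frame))          # 11101100 True

corrupted = "11111100"               # one bit flipped in transit
print(corrupted, check_odd_parity(corrupted))  # 11111100 False
```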
Parity can get more complex, up to and including Hamming codes. A Hamming code uses multiple parity bits to detect multi-bit errors and correct at least single-bit errors.
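To make that concrete, here's a toy sketch of a Hamming(7,4) code in Python: three parity bits protect four data bits, and recomputing the parity checks on the receiving end points directly at a single flipped bit so it can be flipped back. (A real implementation would add an overall parity bit for double-error detection; this just shows the idea.)

```python
# Toy Hamming(7,4): codeword layout (1-indexed) is p1 p2 d1 p3 d2 d3 d4.

def hamming74_encode(d: list[int]) -> list[int]:
    d1, d2, d3, d4 = d
    # Each parity bit covers an overlapping subset of the data bits.
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c: list[int]) -> list[int]:
    p1, p2, d1, p3, d2, d3, d4 = c
    # Recompute the parity checks; read as a binary number, they give
    # the 1-indexed position of a single bad bit (0 means all is well).
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s3 * 4 + s2 * 2 + s1
    fixed = c[:]
    if syndrome:
        fixed[syndrome - 1] ^= 1   # flip the bit the syndrome points at
    return fixed

word = hamming74_encode([1, 0, 1, 1])
damaged = word[:]
damaged[5] ^= 1                                # flip one bit "in transit"
print(hamming74_correct(damaged) == word)      # True
```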
NASA uses, or used, Hamming codes for communications with distant probes. Because of limited memory on those probes, once data was transmitted, it wasn’t available to be retransmitted. NASA had to get the data right as it was received. By using Hamming codes, NASA was able to correct corrupted transmissions.
RAID-5 uses simple parity with knowledge of which device failed. Thus a RAID-5 device can handle a single drive failure.
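The "with knowledge of which device failed" part is what makes recovery possible: the parity block is the XOR of the data blocks, so the missing block is just the XOR of everything that survived. A minimal sketch (the stripe layout here is invented for illustration):

```python
# RAID-5-style recovery of one lost block, given we know which drive died.

def xor_blocks(blocks: list[bytes]) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks plus one parity block for this stripe.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Drive 1 dies; rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([d0, d2, parity])
print(rebuilt == d1)   # True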
So this interesting thing happened: drives got larger, and RAID arrays got larger. The smart people claimed that with so many large drives in an array, rebuilds would take so long that, by the time the replacement drive was rebuilt, another drive would have failed.
They were wrong, but it is still a concern.
Ceph uses erasure coding the same way RAID uses parity drives, but erasure coding is more robust and resilient.
My Ceph cluster is set up with data pools that are simple replication pools (n=3) and erasure coded pools (k=2, m=2). Using the EC pools reduces the cost from 3x to 2x. I use EC pools for storing large amounts of data that does not change and which is not referenced often, such as tape backups.
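The cost difference is simple arithmetic. A rough sketch for the two pool types just mentioned:

```python
# Rough storage-overhead arithmetic for the pools described above.
# Replication stores `size` copies of every object; erasure coding splits
# an object into k data chunks and adds m coding chunks.

def replication_overhead(size: int) -> float:
    return float(size)                 # 3 copies -> 3x raw storage

def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k                 # k=2, m=2 -> 2x raw storage

print(replication_overhead(3))   # 3.0 -> 1 GB of data needs 3 GB
print(ec_overhead(2, 2))         # 2.0 -> 1 GB of data needs 2 GB
```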
The replication pools are used for things that are referenced frequently, where access times make a difference.
With the current system, I can handle losing a drive, a host, or a data closet without losing any data.
Which is good, because I did lose a drive. I had been waiting to replace the dead drive until I had built out a new system, and the new node was still being built out when the old drive failed.
Unfortunately, I have another drive that is dying. Two dead drives are more than I want in the system, so I'll be replacing the original dead drive today.
The other drive will get replaced next week.








