Shielding your files with Reed-Solomon codes

(July 2008)

(Based on an idea I had a decade ago)

Changelog:

August 3, 2008:	Slashdotted!
August 4, 2008:	Made a plain-C package available, to support 64-bit OSes (as well as OS/X and Cygwin users).
March 9, 2009:	Made a FUSE-based filesystem that transparently uses these tools.
June 14, 2010:	Fixed memory issues reported by Valgrind - now works with all GCC versions.
October 10, 2020:	Streaming support, a hands-on corruption example, and dd options

Shield my files? Why?

You know why!

Have you never lost a file because of storage media failure? That is, have a hard drive or USB stick lose a bunch of sectors (bad sectors) that simply happened to be the ones hosting parts (or all) of your file?

I have. More than once... :‑)

The way storage media quality has been going in the last years, it is bound to happen to you, too. When it does, believe me, you'll start to think seriously about ways to protect your data. And you'll realize that there's quite a lot of technology to choose from...

"Backup, backup, backup! And take it with you!" This is a valid and wise suggestion, but it doesn't address the details of backing up... There's more than one way to backup ; my personal favorite has a lot of features you'll probably like. Unfortunately, backups themselves are also stored in some kind of storage, so the question is how are you certain that your backup storage won't fail? And equally important, how often will you backup? If you do it once per week, you might end up losing a week's worth of work - is that acceptable in your line of work?
Others will advocate RAID. Use a RAID scheme on more than one disks, and when one fails, the machine will keep on working with the rest - at least in theory. In practice, faulty RAID controllers (especially the on-board garden variety) can wreck havoc just as much as the faulty storage media can. If you decide to go for RAID, I suggest you use your OS support for software RAID (e.g. Linux md), and I also suggest using the simplest possible building blocks: RAID1 (mirrors), or if you really need speed, RAID10 (stripes of mirrors). Even with RAID though, nothing can protect you from "silent corruption" errors - e.g. IDE cable errors between the controller and the drivers, faulty power supplies, etc.
Then again, neither backup nor RAID would save you from accidental deletions or file corruptions. Today's word processors (thinking of Mr. Clippy, not LaTEX) are such complex beasts that their crashing is considered a normal everyday activity (which is why they were "enhanced" years ago with periodic auto-saves). On one of these crashes, chances are you'll find your document corrupted. A solution, you ask? Simple: Use version control... Subversion or Git are wonders of the world - the former even has nice GUIs for non-technical folk. You can then recover from deletions and corruptions, since your repository would provide the file again.
Then again, you may be forced to work in "lone wolf" mode... Working with your laptop in airport lounges and dark, secluded caves (known as "hotels"). Access to the web may be missing or firewalled, and therefore there may be no way to hook your laptop to your company's repository...
Burning to re-writable DVDs? Chances are that when disaster strikes, you will find your precious backup DVD is scratched... Or that your USB stick in your keychain didn't survive the constant scratching from your keys...

My point?

There's no such thing as "enough protection" for your data - the more you have, the better the chances that your data will survive disasters.

What follows is a simple description of a way I use to additionally "shield" my important files, so that even if some sectors hosting them are lost, I still end up salvaging everything.

Algorithm

The idea behind this process is error correcting codes, like for example the ubiquitous Reed-Solomon. With Reed-Solomon, parity bytes are used to protect a block of data from a specified maximum number of errors per block. In the tools described below, a block of 223 bytes is shielded with 32 bytes of parity. The original 223 bytes are then morphed into 255 "shielded" ones, and can be recovered even if 16 bytes from inside the "shielded" block turn to noise...

Storage media are of course block devices, that work or fail on 512-byte sector boundaries (for hard disks and floppies, at least - in CDs and DVDs the sector size is 2048 bytes). This is why the shielded stream must be interleaved every N bytes (that is, the encoded bytes must be placed in the shielded file at offsets 1,N,2N,...,2,2+N,etc): In this way, 512 shielded blocks pass through each sector (for 512 byte sectors), and if a sector becomes defective, only one byte is lost in each of the shielded 255-byte blocks that pass through this sector. The algorithm can handle 16 of those errors, so data will only be lost if sector i, sector i+N, sector i+2N, ... up to sector i+15N are lost! Taking into account the fact that sector errors are local events (in terms of storage space), chances are quite high that the file will be completely recovered, even if a large number of sectors (in this implementation: up to 127 consecutive ones) are lost.

I implemented this scheme back in 2000 for my diskettes (remember them?). Recently, I discovered that Debian comes with a similar utility called rsbep, which after a few modifications is perfect for providing adequate shielding to your files.

Download

Here is the source code for my customization of rsbep, a utility that implements the kind of Reed-Solomon-based "shielding" that we talked about (the customized code is also available from my GitHub repo). The package includes 32-bit x86 assembly that makes it an order of magnitude faster than plain C ; if however you are not on a 32bit x86 platform, it will fallback to a portable C version instead (a lot slower, unfortunately). rsbep is part of dvbackup, so some Debian users might already have it installed; my version however addresses some issues toward the goal we are seeking here, which is error-resiliency for files against the common, bursty types of media errors. More information on what was changed is below.

The package is easily installed under Linux, Mac OS/X, Windows(cygwin) and Free/Net/OpenBSD, with the usual

./configure 
make 
make install

Results

Here is a self-healing session in action:

home:/var/tmp/recovery$ ls -la
total 4108
drwxr-xr-x 2 ttsiod ttsiod    4096 2008-07-30 22:21 .
drwxrwxrwt 5 root   root      4096 2008-07-30 22:21 ..
-rw-r--r-- 1 ttsiod ttsiod 4194304 2008-07-30 22:21 data

home:/var/tmp/recovery$ freeze.sh data > data.shielded
home:/var/tmp/recovery$ ls -la
total 9204
drwxr-xr-x 2 ttsiod ttsiod    4096 2008-07-30 22:21 .
drwxrwxrwt 5 root   root      4096 2008-07-30 22:21 ..
-rw-r--r-- 1 ttsiod ttsiod 4194304 2008-07-30 22:21 data
-rw-r--r-- 1 ttsiod ttsiod 5202000 2008-07-30 22:21 data.shielded

home:/var/tmp/recovery$ melt.sh data.shielded > data2
home:/var/tmp/recovery$ md5sum data data2
9440c7d2ff545de1ff340e7a81a53efb  data
9440c7d2ff545de1ff340e7a81a53efb  data2

home:/var/tmp/recovery$ echo Will now create artificial corruption 
home:/var/tmp/recovery$ echo of 127 times 512 which is 65024 bytes

home:/var/tmp/recovery$ dd if=/dev/zero of=data.shielded bs=512 \
			    count=127 conv=notrunc
127+0 records in
127+0 records out
65024 bytes (65 kB) copied, 0,00026734 seconds, 243 MB/s

home:/var/tmp/recovery$ melt.sh data.shielded > data3

rsbep: number of corrected failures   : 64764
rsbep: number of uncorrectable blocks : 0

home:/var/tmp/recovery$ md5sum data data2 data3
9440c7d2ff545de1ff340e7a81a53efb  data
9440c7d2ff545de1ff340e7a81a53efb  data2
9440c7d2ff545de1ff340e7a81a53efb  data3

For those of you that don't speak UNIX, what you see above is a simple exercise in destruction: we "shield" a file with the freeze.sh script, which is part of my package; we then melt.sh the frozen file, and verify (through md5sum) that the new generated file is exactly the same as the original one. We then proceed to deliberately destroy 64KB of the shielded file (that's a lot of consecutive sectors!), using dd to overwrite 127 sectors with zeros. We invoke melt.sh again, and we see that the new generated file (data3) has the same MD5 sum as the original one - it was recovered perfectly.

Reed-Solomon FS (a FUSE-based filesystem)

Based on these tools, I did a quick implementation of a Reed-Solomon protected filesystem, using Python/FUSE bindings:

bash$ poorZFS.py -f /reed-solomoned-data /strong

This command will mount a FUSE-based filesystem in /strong (using the /reed-solomoned-data directory to store the actual files and their "shielded" versions). Any file you create in /strong, will in fact exist under /reed-solomoned-data and will also be shielded there (via freeze.sh). When opening for reading any file in /strong, data corruption is detected (via melt.sh) and in case of corruption the file will be corrected using the Reed-Solomon "shielded" version of the file (which is stored alongside the original, and named as originalFilename.frozen.RS). The .frozen.RS versions of the files are not visible in the /strong directory, and are automatically created (in /reed-solomoned-data) when a file (opened for writing or appending) is closed.

I coded this mini-fs using Python-FUSE in a couple of hours on a boring Sunday afternoon, so don't trust your bank account data with it... It's just a proof of concept (not to mention dog-slow - due to the necessary data interleaving). Still, if your machine is only equipped with one drive, this will in fact transparently shield you against bad sectors, faulty power supplies, messy IDE cabling, etc.

Note: I coded this filesystem adding 20 or so lines of Python (spawning my freeze/melt scripts) into the Python/FUSE basic example. Anyone who has ever coded a filesystem driver for Windows knows why this justifies a heart attack - FUSE (and Python/FUSE) rock!

Changeset from original rsbep

In case you are wondering why I had to modify rsbep here's where my version differs from the original...

The original version wrote 3 parameters of Reed-Solomon as a single line before the "shielded" data, and this made the stream fragile (if this information was lost, decoding failed...)
It uses a default value of 16*255=4080 for parameter R, and it can thus tolerate 4080*16=65280 consecutive bytes to be lost anywhere in the stream, and still recover...
It adds file size information in the shielded stream, so the recovery process re-creates an exact copy of the original.
I added autoconf/automake support, to detect whether a fast 32bit x86 asm version can be used and otherwise fall back to a plain C (slow) implementation. The tools thus compile and install cleanly on many operating systems (Linux, Mac OS/X, Free/Net/OpenBSD, even Windows with Cygwin).
Python-FUSE support.

UPDATE, many years later: Streaming support, a hands-on corruption, and dd options

While answering some questions I received about usage of rsbep for streaming processes, I realized I could demonstrate dd’s recovery from actual corruption at raw device level, via the functionality offered by dmsetup’s error. This is a mandatory part that rsbep depends on, i.e. that when errors happen during reading, we still get some data, any data for them. We - i.e. the algorithm - can then recover the lost data.

The example below will create a device formed from two loop devices, with an erroneous zone between them.

First, we create the two loop devices, serializing their data into two 1MB files:

# mkdir example
# cd example
# truncate -s 1M a.img b.img
# losetup -f a.img
# losetup -f b.img
# losetup -a

/dev/loop1: [65024]:7984102 (/home/ttsiod/tmp/Milan/b.img)
/dev/loop0: [65024]:7984101 (/home/ttsiod/tmp/Milan/a.img)

Now let’s fill up the devices with data:

# i=0; while printf 'A%06d' $i ; do i=$((i+1)) ; done > /dev/loop0
-bash: printf: write error: No space left on device

# i=0; while printf 'B%06d' $i ; do i=$((i+1)) ; done > /dev/loop1
-bash: printf: write error: No space left on device

This wrote a series of counters in them, one after the other:

# hexdump -C /dev/loop0 | head -3
00000000  41 30 30 30 30 30 30 41  30 30 30 30 30 31 41 30  |A000000A000001A0|
00000010  30 30 30 30 32 41 30 30  30 30 30 33 41 30 30 30  |00002A000003A000|
00000020  30 30 34 41 30 30 30 30  30 35 41 30 30 30 30 30  |004A000005A00000|

Now let’s create the joined-and-errored device:

# dmsetup create DeviceWithBadSectors << EOF
0 2000 linear /dev/loop0 0
2000 96 error
2096 2000 linear /dev/loop1 48
EOF

# blockdev --getsz /dev/mapper/DeviceWithBadSectors
4096

This new device (/dev/mapper/DeviceWithBadSectors) is made of the first 2000 sectors of /dev/loop0, followed by 96 bad sectors; and then by the last 2000 sectors from /dev/loop1. This, as we saw above, makes up for a device with a total of 4096 sectors, 96 of which - in the middle - are bad.

Now let’s try reading the data of this device - first, with ddrescue, a tool specifically made to read from bad devices:

# ddrescue /dev/mapper/DeviceWithBadSectors recovered
GNU ddrescue 1.25
Press Ctrl-C to interrupt
     ipos:    1072 kB, non-trimmed:        0 B,  current rate:   2048 kB/s
     opos:    1072 kB, non-scraped:        0 B,  average rate:   2048 kB/s
non-tried:        0 B,  bad-sector:    49152 B,    error rate:    139 kB/s
  rescued:    2048 kB,   bad areas:        1,        run time:          0s
pct rescued:   97.65%, read errors:       98,  remaining time:          0s
                              time since last successful read:         n/a
Finished

# ls -l recovered
-rw-r--r-- 1 root root 2097152 Oct 10 13:41 recovered

# blockdev --getsz recovered
4096

Indeed, we got data for all 4096 sectors - including the 96 bad ones, for which ddrescue will have placed zeroes in the output. This is exactly what rsbep needs to happen for the Reed-Solomon algorithm to function properly; i.e. we need the data from the bad sectors to be there - bad, but in their place. We can’t afford to miss them!

OK - but ddrescue writes into a file. What about streaming operations? And also, most people won’t have it installed - can’t we use dd for the same purpose?

Let’s establish what is the output data we want - what is the MD5 checksum of the recovered data?

# md5sum recovered
d2ae90b3a830d34c7e92717e8549b909  recovered

Now let’s see what happens with dd - used with the proper options:

# dd if=/dev/mapper/DeviceWithBadSectors conv=noerror,sync bs=512 > /dev/null
...
dd: error reading '/dev/mapper/DeviceWithBadSectors': Input/output error
2000+95 records in
2095+0 records out
1072640 bytes (1.1 MB, 1.0 MiB) copied, 0.00982337 s, 109 MB/s
4000+96 records in
4096+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.024313 s, 86.3 MB/s

# dd if=/dev/mapper/DeviceWithBadSectors conv=noerror,sync bs=512 2>/dev/null | wc -c
2097152

# dd if=/dev/mapper/DeviceWithBadSectors conv=noerror,sync bs=512 2>/dev/null | md5sum
d2ae90b3a830d34c7e92717e8549b909

So we see that dd read the same data as ddrescue - the MD5 checksum is good, and we have indeed read 2097152 bytes - i.e. 4096 sectors of 512 bytes each. This means rsbep will be able to perfectly recover, even though we are streaming the data out.

The “magic” options for dd, are, as shown above:

noerror: don’t stop on hard drive read error, and
sync: pad every input block with NULs to input block size

So, we now have all the ingredients to use rsbep in a streaming scenario… e.g. sending data over SSH to another machine, or whatever.

Here’s a standalone example via tar - first, creating a backup:

# cd /path/to/backup
# tar cpf - ./ | xz | \
    rsbep  -B 255 -D 223 -R 4080 > /var/tmp/shielded.data.xz.rsbep

And here’s recovering:

# cd /path/to/restore
# dd if=/var/tmp/shielded.data.xz.rsbep conv=noerror,sync bs=512 | \
    rsbep -d -B 255 -D 223 -R 4080 | tar Jtvf -

Conclusion

These tools works fine for me, and I always use them when I backup data or move them around (e.g. from work to home). As an example, when I move my Git repository around, I always...

cd /path/to/mygit/
git gc 
cd ..
git clone --bare mygit mygit.bare 
tar jcpf mygit.tar.bz mygit.bare
freeze.sh mygit.tar.bz2 > /mnt/usbStick/mygit.tar.bz2.shielded

If you so wish, feel free to add a GUI layer over them... (I am a console kind of guy - I never code GUIs unless I really have to :‑)
And yes, they have already saved my data a couple of times.

Index

Updated: Sat Oct 8 12:33:59 2022