Recovery of a ReiserFS Drive with Bad Blocks

There doesn’t seem to be a good detailed HOW-TO, guide, or tutorial compiled on the subject of ReiserFS recovery from partial drive failure, so I figured I’d give it an ol’ writeup, seeing as I have some time to kill (about 10M blocks left to be corrected).

Before I start out, I should mention that good backup procedures are a really good idea. If you’re actually using this advice, you probably realize this post facto. All hard drives fail. No two ways about it, so if you care about your data, have good backup procedures. I heartily recommend rdiff-backup, which not only does rsync-based transfers, but also, as the name implies, keeps increments. For Mac, I recommend psync to a sparse-image DMG for preserving resource forks. Carbon Copy Cloner does some pretty useful automation (in addition to ghosting great). Backupninja looks like a good rdiff-backup script, but I haven’t used it personally. Now, if only I could find a good transparent compressed volume format…

All that being said, sometimes you can get sloppy or lazy. In my case, I was building a new RAID system when a drive on my old file server crapped out… (no need for condolences, this happened a long time ago, I just got around to doing recovery now).

The first thing you want to do if you notice block errors is to copy everything off. You might want to check smartctl and hddtemp to see if there’s anything really horrible going on before you do that. If your data is really important, shut down and send it to the professionals immediately. Ontrack is the most high profile, but there are probably others that can help you. Otherwise, try to copy what you can elsewhere.
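As a rough sketch of that pre-copy health check: the device name /dev/hdb is a placeholder, and the run/check_drive helpers are my own illustration rather than part of any tool. By default it only prints the commands (DRY_RUN=1), so you can see what would run before pointing it at a sick drive.

```shell
#!/bin/sh
# Quick look at a suspect drive before copying anything off it.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 and run as root to execute.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_drive() {
    dev="$1"                 # placeholder device, e.g. /dev/hdb
    run smartctl -H "$dev"   # overall SMART health verdict
    run smartctl -A "$dev"   # attribute table: watch reallocated / pending sector counts
    run hddtemp "$dev"       # current drive temperature
}

check_drive "${1:-/dev/hdb}"
```

If the reallocated or pending sector counts are climbing between runs, that's a hint to stop poking at the drive and get the data off (or to the professionals) sooner rather than later.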

If your copies were successful, at this point you can probably chuck your drive; recovery won’t be worth your time. If you can’t mount, or your copy wasn’t successful, then you’ll want two things: a recovery drive with enough space to hold an image of your drive, and your drive unmounted. In general I’m of the opinion that it’s better not to spin down, since the drive might not come back up, but it’s sort of hard to say what’s more of a risk. (If you have a drive that won’t spin back up, usually accompanied by a click-whir, and sending it for data recovery is out of the question, I recommend giving freezing a try; it’s worked for me in the past.) If it appears to just be bad sectors and you don’t have a recovery drive yet, I’d say turn it off until you’re ready to proceed.

If the drive affected is your primary partition, you may need to reboot with a boot CD. RIP is good (the advantage is it has dd_rhelp preinstalled), but if you have more esoteric hardware (say a separate HighPoint controller) you may want to go directly to Knoppix.

Once you have your old drive [hdbad] and new drive [hdgood], you’ll want to run dd_rescue, or probably better, the dd_rescue helper script dd_rhelp:

dd_rescue -A -v /dev/[hdbad] /dev/[hdgood]

This will replicate the current drive onto the new drive (you can dd_rescue to a disk image file instead and mount it with -o loop if you’d rather). From this point on we’ll try our recovery on this duplicate copy. Note that this differs from what others have said. They recommend backing up (w/ dd) and then recovering on the bad drive. Please read what they have to say, but IMO that’s a bad idea:
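For the image-file variant, here's a minimal sketch. The partition, image path, and mount point are all placeholders, the image_and_mount helper is hypothetical, and I'm assuming you imaged a single partition (a whole-disk image would need a partition offset for the loop mount). It defaults to a dry run that just prints the commands.

```shell
#!/bin/sh
# Image a failing partition to a file on the good drive, then loop-mount the copy read-only.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 and run as root to execute.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

image_and_mount() {
    bad="$1"   # failing partition (placeholder), e.g. /dev/hdb1
    img="$2"   # image file on the recovery drive (placeholder path)
    mnt="$3"   # empty directory to mount the image on

    # -A writes zeros for blocks that can't be read, instead of leaving
    #    whatever was already at that spot in the destination;
    # -v is verbose; -l logs errors and a summary to a file.
    run dd_rescue -A -v -l "$img.log" "$bad" "$img"

    # Look at the copy without touching the failing drive again.
    run mount -t reiserfs -o loop,ro "$img" "$mnt"
}

image_and_mount "${1:-/dev/hdb1}" "${2:-/mnt/good/hdb1.img}" "${3:-/mnt/recovery}"
```

The mount may well fail until the filesystem on the image has been checked and repaired; that's expected, and it's exactly why the repair happens on the copy.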

  1. A plain dd won’t handle bad blocks correctly, so the backup probably will be messed up
  2. Working on the bad drive will probably make things degrade more, very quickly. Running badblocks -b 4096 actually led to creating more bad blocks (including my superblock, sigh)
  3. Even after feeding the bad blocks file into reiserfsck, it’ll still barf on bad hardware (again, speaking from firsthand experience)

So, I say, work on your dd_rescue’d image directly. This “backup” won’t do you any good if you can’t fix it anyway. Once you’ve moved your image onto a good drive, you can run a check:

reiserfsck --check /dev/[hdgood]

It’ll tell you whether you’ve escaped unscathed or whether you’ll need to proceed through the levels of being ‘fscked’: --fix-fixable, --rebuild-tree, --rebuild-sb. reiserfsck is adequately descriptive in informing you on the level of your woes.
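If you'd rather script the escalation than read the output each time, something like the following might help. It's a sketch only: the next_step helper is hypothetical, and the exit codes are the ones listed in the EXIT CODES section of the reiserfsck man page as I know it, so check them against your version of reiserfsprogs before trusting them.

```shell
#!/bin/sh
# Suggest the next reiserfsck step from the exit status of `reiserfsck --check`.
# Codes assumed from the reiserfsck man page (EXIT CODES); verify on your system.

next_step() {
    case "$1" in
        0) echo "clean: nothing to do" ;;
        4) echo "reiserfsck --rebuild-tree" ;;   # fatal corruption left uncorrected
        6) echo "reiserfsck --fix-fixable" ;;    # fixable corruption left uncorrected
        *) echo "status $1: read the reiserfsck output" ;;
    esac
}

# Example use (as root, on the copy, never on the failing drive):
#   reiserfsck --check /dev/[hdgood]; next_step $?
```

Anything that falls through to the last branch, including a superblock that can't be found (where reiserfsck itself will point you at --rebuild-sb), still needs a human reading the messages.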

Note, I had more success w/ reiserfsck recovery running knoppix26 than the regular 2.4 kernel. If you run into an impasse w/ one you might want to try the other.

It’ll be a couple more hours before I know how this turns out.

Resources: