A story of file corruption, with a happy ending
News category: Linux Tips
Source: reddit.com
Ten years ago, I finally set up my first raid. Three Samsung EcoGreen F4 2TB hard drives. I felt real fancy.
I had a bunch of rare, random file corruption. No hard drive errors, no operating system errors, nothing. Just files gone. Seriously demoralizing.
Part of the problem, I realized, to my great horror, is that the default file systems of all major operating systems, including ext4, do not checksum your data: there's no way to detect silent file corruption. So even if I had a backup to restore the corrupted files from, there was no way to find out which files were corrupted.
Six years later (four years ago now), I ended up replacing all four of my hard drives without ever finding the source of the problem, and the problem went away. Not satisfying.
Fortunately, on Linux, there is a well-supported file system with checksums all over the place: zfs. It even includes functionality similar to raid5 but better, called raidz1: one drive's worth of capacity goes to parity (basically another layer of redundancy), so a single drive can fail without losing data.
On my new drives, I used zfs raidz1.
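Creating such a pool is a one-liner. A minimal sketch, assuming a pool named `tank` and placeholder device names (for real pools, use the stable `/dev/disk/by-id/` paths):

```shell
# Create a raidz1 pool named "tank" from three drives (device names are placeholders).
# One drive's worth of capacity is used for parity, so one drive can fail safely.
zpool create tank raidz1 \
  /dev/disk/by-id/ata-DRIVE_A \
  /dev/disk/by-id/ata-DRIVE_B \
  /dev/disk/by-id/ata-DRIVE_C

# Check pool health and layout.
zpool status tank
```

Swapping `raidz1` for `raidz2` (with at least four drives) gives two-drive fault tolerance, at the cost of a second drive's worth of capacity.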
Recently, I learned raidz1 is not recommended on drives over 1TB, because of the higher chance that when a drive fails, and you replace it, the rebuild will cause a second drive failure. My drives are 3TB, so I'm planning to upgrade to raidz2, which can handle 2 drive failures.
Meanwhile, I decided to copy the 2TB of data I had only on that raidz1 to one of those old drives with the data corruption, using zfs. An unreliable copy, which I can verify, is better than no copy.
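Even without a checksumming file system on the destination, the same "copy I can verify" idea can be approximated by recording checksums yourself. A minimal sketch using sha256sum, with made-up scratch directories and file names for illustration:

```shell
# Make a scratch source directory with a couple of files (illustration only).
mkdir -p /tmp/src_demo /tmp/dst_demo
echo "hello" > /tmp/src_demo/a.txt
echo "world" > /tmp/src_demo/b.txt

# Record checksums of the originals...
(cd /tmp/src_demo && sha256sum *.txt > /tmp/manifest.sha256)

# ...copy the data...
cp /tmp/src_demo/*.txt /tmp/dst_demo/

# ...and verify the copy against the manifest. Any corrupted file is reported.
(cd /tmp/dst_demo && sha256sum -c /tmp/manifest.sha256)
```

zfs does all of this automatically, per block, on every read; the sketch just shows what the verification buys you.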
During the 6-hour copy, I checked the hard drive for hardware errors with smartctl. It gave me a dire warning: this model of hard drive, when any program in any operating system asks it to identify itself, will give up on any write it was in the middle of in order to answer. That's bad. How uncommonly bad? The page about the cause of the corruption: https://www.smartmontools.org/wiki/SamsungF4EGBadBlocks
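Checking a drive's SMART health takes seconds. A sketch assuming smartmontools is installed; `/dev/sdX` is a placeholder for the real device node:

```shell
# Print the drive's identity, SMART health, and full attribute table.
# smartctl also warns about known-bad firmware on some models, as it did here.
smartctl -a /dev/sdX

# Or just the overall health verdict.
smartctl -H /dev/sdX
```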
Finally, after all these years, I know the cause of the corruption. I had never been entirely sure it was specific to those hard drives. There is a firmware upgrade to fix it. The program that warned me about the problem also triggered the problem. I wanted to say "unfortunately", but honestly, it was a joy to witness zfs doing its thing, on the same problem that was so traumatizing to me.
After the 6 hour copy, I ran a zpool scrub, which verifies all the checksums in the zfs file system. It also took 6 hours, and told me 11 files were corrupted. I copied those files again from the source. After running another zpool scrub, zpool status is showing no errors.
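The scrub-and-repair loop above comes down to two commands, again assuming the pool is named `tank`:

```shell
# Read every block in the pool and verify it against its checksum.
zpool scrub tank

# Watch progress; with -v, any files that failed verification are listed by name.
zpool status -v tank
```

After re-copying the damaged files from a good source, `zpool clear tank` resets the error counters so the next scrub starts from a clean slate.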
I upgraded the hard drive's firmware, fixing the corruption problem. The updater was a DOS executable, so I used this method: https://wiki.archlinux.org/index.php/Flashing_BIOS_from_Linux#Using_a_FreeDOS-provided_Disk_Image_+_USB_stick_on_Linux
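That Arch wiki method boils down to writing a FreeDOS image to a USB stick, copying the vendor's DOS flasher onto it, and booting from the stick. A rough sketch; the image, device, and updater file names are all placeholders, and `dd` will destroy whatever is on the target device, so double-check `of=`:

```shell
# Write a bootable FreeDOS image to the USB stick (DESTROYS its contents).
dd if=FreeDOS.img of=/dev/sdX bs=4M status=progress && sync

# Mount the stick's first partition and add the vendor's DOS firmware updater.
mount /dev/sdX1 /mnt
cp FIRMWARE.EXE /mnt/    # placeholder name for the vendor's flasher
umount /mnt

# Then reboot, boot from the stick, and run the updater at the DOS prompt.
```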
I now have all my files on at least two file systems, and can handle any 2 hard drive failures.
It is so immensely satisfying to finally defeat this demon.
Hard drives fail. Back up your shit. Use a file system with checksums if you can. I'm planning to build a computer out of old parts (primarily with these old 2TB F4EG drives), to finally keep a backup on a different machine. And hopefully keep it at somebody else's house.
I love zfs.