Cursed AMD64 box

July 9, 2006
My AMD Athlon 64 X2 4400+ box had been working fine for about a year with a Western Digital 10K RPM SATA drive. It has a DFI Lanparty motherboard and a 450W PSU, IIRC. The machine was up 24/7 for most of that year, since it was acting as my mailserver amongst other things. A few weeks ago the drive started acting erratically: I would wake up in the morning and find that the ext3 filesystem on it had been remounted read-only because filesystem corruption had been detected. I was able to fsck the filesystem back into sanity, and the drive would then act fine for several days. Well, these stories always end the same way, with a drive that won't complete a boot, and that was the case for this idiot too.

The particular disease was that the area of the disc that contained the LVM structure -- Fedora sets up LVM by default now -- was spewing hard IO errors when touched. The boot therefore couldn't get past trying to bring up the LVM and simply dropped dead. I documented the evasive actions I took in this fedora-list mail; basically, I was able to recover the ext3 filesystem that was inside the LVM block onto a new SATA drive. LVM's physical footprint here is basically a 0x30000-byte header before the ext3 filesystem starts. I installed FC5 on the new drive, brought over most of the data from the copy of the ext3 filesystem from the damaged drive, and went on pretty much as normal, with brief interruptions while I fished out something I had forgotten I needed from the old filesystem.

But then, to my disbelief, after just a week the new drive -- the only drive in the machine -- blew chunks in a similar way, with hard IO errors one morning. I came into my work room and heard it performing the click of death. I recovered from this rather grimly from backups; I did not fancy attempting a second recovery of 60GB of data from a second drive inside of a week.

I stared at the AMD box for a minute or two though... I could think of two likely causes, the most likely one being the power supply. If it was having trouble with its 12VDC line, serious trouble, it might cause the drive to reset itself repeatedly, as if it were being powered on each time. It's not hard to imagine that a series of such resets at random intervals might eventually catch the drive out in its initialization phase and cause it to throw a fit, ending in its head scratching the surface. The other possible cause is a bit more uncertain: both boxes were running the new FC5 2.6.17 kernel, which has had a lot of work going on with libata and the kernel code for SATA. I wonder if that is repeatedly attempting drive resets as a last resort.

Anyway, the box had caused enough trouble. I swore off it and migrated back to running from this Centrino Duo laptop, which is plenty fast enough for a main workstation. I have to have XP for Protel, which I shall probably have cause to write about another time, and it runs on top of Fedora Core thanks to VMware Workstation. One nice feature of VMware is that the XP running inside it has no idea that it has moved machine, so there is no activation crap -- although this is of course a genuine retail copy of XP, one of two I own.
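Going back to the recovery step: the idea of skipping the 0x30000-byte LVM header to get at the ext3 filesystem can be sketched in a few lines of Python. This is not the procedure from the fedora-list mail, only a minimal illustration under some loud assumptions: the filesystem is taken to start exactly 0x30000 bytes into the partition (the figure quoted above, which will not hold for every LVM setup), the logical volume is assumed to be one contiguous run, and the function names and file paths are made up for the example. The check relies on the ext2/ext3 superblock sitting 1024 bytes into the filesystem with the magic number 0xEF53 at offset 56 inside it.

```python
import struct

LVM_HEADER_BYTES = 0x30000     # offset quoted in the post; an assumption, not universal
EXT_SUPERBLOCK_OFFSET = 1024   # ext2/ext3 superblock starts 1024 bytes into the filesystem
EXT_MAGIC_OFFSET = 56          # position of the s_magic field inside the superblock
EXT_MAGIC = 0xEF53             # ext2/ext3 magic number, stored little-endian


def looks_like_ext(src, fs_start=LVM_HEADER_BYTES):
    """Return True if an ext2/ext3 magic number is found where the
    superblock would be, assuming the filesystem begins at fs_start."""
    src.seek(fs_start + EXT_SUPERBLOCK_OFFSET + EXT_MAGIC_OFFSET)
    raw = src.read(2)
    return len(raw) == 2 and struct.unpack("<H", raw)[0] == EXT_MAGIC


def copy_filesystem(src, dst, fs_start=LVM_HEADER_BYTES, chunk_size=1 << 20):
    """Copy everything from fs_start to the end of src into dst.
    Returns the number of bytes copied."""
    src.seek(fs_start)
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


# Hypothetical usage; both paths are placeholders:
# with open("damaged-partition.img", "rb") as src, open("rescued-ext3.img", "wb") as dst:
#     if looks_like_ext(src):
#         copy_filesystem(src, dst)
```

The sketch makes no attempt to cope with read errors, so on a drive that is actively throwing hard IO errors it would simply stop with an exception at the first bad sector. It is better suited to an image already pulled off the disc with a tool that tolerates bad sectors.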