Over the weekend there was a power failure at our disaster recovery site. Since everything should be connected to the UPS, this should not have had any impact… unfortunately, the people responsible for the cabling apparently did not provide enough power connections from the UPS. The result: one of our storage systems for the test systems (all volumes in several RAID5 virtual disks) lost power, and 10 hard disks switched into a failed state once the power was stable again (I was told there were several short power failures that day). After telling the software to have a look at the drives again, all physical disks were accepted.
All volumes on one of the virtual disks were damaged beyond repair (actually, the virtual disk itself was damaged), and we had to recover them from backup.
None of the ZFS-based mountpoints on the good virtual disks showed bad behavior (we ran zpool clear + zpool scrub on those which showed checksum errors, mostly to make us feel better). For the UFS-based ones… some caused a panic after reboot, and we had to run fsck on them before trying a second boot.
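For reference, the clear-and-scrub step above is run at the pool level with the `zpool` command. Here is a minimal sketch of that sequence, not the exact commands we typed: the pool name `tank` is a placeholder, and the `run` wrapper with its `DRY_RUN` switch is a hypothetical helper I added so the commands can be previewed without touching a pool.

```shell
#!/bin/sh
# POOL is a placeholder (assumption); substitute the name of your own pool.
POOL="${POOL:-tank}"

# Hypothetical helper: with DRY_RUN=1 the command is only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_pool() {
    run zpool status -x "$POOL"   # report the pool only if it has problems
    run zpool clear "$POOL"       # reset the device error counters
    run zpool scrub "$POOL"       # start re-reading and verifying all blocks
    run zpool status -v "$POOL"   # show scrub progress and any affected files
}
```

Calling `DRY_RUN=1 check_pool` prints the four commands; calling `check_pool` as root runs them. The scrub runs in the background, so the last `zpool status -v` has to be repeated until it reports that the scrub has completed.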
We spent a lot more time getting UFS back online than getting ZFS back online. After this experience it looks like our future Solaris 10u8 installs will have root on ZFS (our workstations are already set up like this, but our servers are still at Solaris 10u6).
Tags: bad behavior, checksum errors, disaster recovery, harddisks, power connections, power failure, power failures, storage systems, test systems, virtual disks