ZFS & power-failure: stable

At the weekend there was a power failure at our disaster-recovery site. As everything should be connected to the UPS, this should not have had any impact… unfortunately the guys responsible for the cabling seem not to have provided enough power connections from the UPS. Result: one of our storage systems for the test systems (all volumes in several RAID5 virtual disks) lost power, and 10 harddisks switched into failed state when the power was stable again (I was told there were several small power failures that day). After telling the software to have a look at the drives again, all physical disks were accepted.

All volumes on one of the virtual disks were damaged beyond repair (actually, the virtual disk itself was damaged), and we had to recover from backup.

All ZFS-based mountpoints on the good virtual disks showed no bad behavior (we ran a zpool clear + zpool scrub on those which showed checksum errors, to make us feel better). For the UFS-based ones… some caused a panic after reboot, and we had to run fsck on them before trying a second boot.
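For reference, a minimal sketch of the commands involved, assuming a placeholder pool name (testpool) and a placeholder UFS device path; the actual pool names and devices on our systems are different:

    # Check which pools report checksum errors after the power failure
    zpool status -v

    # Reset the error counters and verify all data on the affected pool
    zpool clear testpool        # "testpool" is a placeholder pool name
    zpool scrub testpool
    zpool status testpool       # watch scrub progress and results

    # For a UFS filesystem that paniced the machine, check it before the next boot
    fsck -y /dev/rdsk/c0t0d0s6  # placeholder device; run from single-user mode

The big difference in practice: the ZFS side is online the whole time (the scrub runs in the background), while the UFS side keeps the machine down until fsck is finished.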

We spent a lot more time getting UFS back online than getting ZFS back online. After this experience it looks like our future Solaris 10u8 installs will be with root on ZFS (our workstations are already like this, but our servers are still at Solaris 10u6).