ZFS & power-failure: stable

At the weekend there was a power failure at our disaster-recovery site. As everything should be connected to the UPS, this should not have had an impact… unfortunately the people responsible for the cabling seem not to have provided enough power connections from the UPS. Result: one of our storage systems (all volumes in several RAID5 virtual disks) for the test systems lost power, and 10 harddisks switched into the failed state once the power was stable again (I was told there were several small power failures that day). After telling the software to have a look at the drives again, all physical disks were accepted.

All volumes on one of the virtual disks were damaged beyond repair (actually, the virtual disk itself was damaged), and we had to recover from backup.

All ZFS based mountpoints on the good virtual disks did not show bad behavior (zpool clear + zpool scrub for those which showed checksum errors, to make us feel better). For the UFS based ones… some caused a panic after reboot, and we had to run fsck on them before trying a second boot.
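The recovery steps above can be sketched as a few commands; the pool name and device path below are placeholders for illustration, not the ones from our setup.

```shell
# Check pool health and per-device checksum error counters
# ("tank" is a placeholder pool name):
zpool status -v tank

# Once the hardware is stable again, reset the logged error counters:
zpool clear tank

# Start a scrub so every block is re-verified against its checksum:
zpool scrub tank

# Watch scrub progress and, later, the result:
zpool status tank

# For a damaged UFS filesystem, run a manual check before the next
# boot attempt (placeholder device path):
fsck -y /dev/rdsk/c1t0d0s6
```

The scrub is what gives the "feel better" assurance mentioned above: unlike fsck, it validates the actual data blocks, not just the filesystem metadata.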

We spent a lot more time getting UFS back online than getting ZFS back online. After this experience it looks like our future Solaris 10u8 installs will be with root on ZFS (our workstations are already set up like this, but our servers are still at Solaris 10u6).

EMC^2/Legato Networker 7.5.1.6 status

We updated Networker 7.5.1.4 to 7.5.1.6, as Networker support thought it would fix at least one of our problems (“ghost” volumes in the DB). Unfortunately the update does not fix any of the bugs we see in our environment.

Especially for the “post-command runs 1 minute after the pre-command even if the backup is not finished” bug this is not satisfying: no consistent DB backup is possible where the application has to be stopped together with the DB to get a consistent snapshot (FS+DB in sync).