EMC^2/Legato Networker 7.5.1.6 status

We updated Networker from 7.5.1.4 to 7.5.1.6 because the Networker-Support thought it would fix at least one of our problems (“ghost” volumes in the DB). Unfortunately the update does not fix any of the bugs we see in our environment.

Especially for the “post-command runs one minute after the pre-command even if the backup is not finished” bug this is not satisfying: no consistent DB backup is possible where the application has to be stopped together with the DB to get a consistent snapshot (FS+DB in sync).
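
To make the consistency requirement concrete, here is a minimal Python sketch of what a pre/post command pair for this setup has to do (the service names and the systemctl calls are placeholders, not our real stop/start scripts, and pre/post stand for whatever the backup invokes as pre- and post-command). If the post command fires a fixed minute after the pre-command instead of after the end of the save set, the DB and the application come back up while the frozen state is still being read.

    #!/usr/bin/env python3
    """Minimal sketch of what our pre/post command pair has to guarantee.

    The service names and the systemctl calls are placeholders (assumptions),
    not our real scripts. The point: the post command is only safe to run
    once the save set of the frozen state has actually finished.
    """
    import subprocess
    import sys

    APP_SERVICE = "ourapp"   # placeholder for the application service
    DB_SERVICE = "ourdb"     # placeholder for the database service

    def run(cmd):
        # Abort on the first failing step so we never back up a live DB.
        subprocess.run(cmd, check=True)

    def pre():
        # Stop the application first, then the DB, so the files on disk
        # and the DB content describe one and the same state.
        run(["systemctl", "stop", APP_SERVICE])
        run(["systemctl", "stop", DB_SERVICE])

    def post():
        # Must only run after the backup is finished, not a minute after pre().
        run(["systemctl", "start", DB_SERVICE])
        run(["systemctl", "start", APP_SERVICE])

    if __name__ == "__main__":
        {"pre": pre, "post": post}[sys.argv[1]]()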

EMC^2/Legato Networker 7.5.1.4 tests

Regarding our recent problems with NW:

  • OK: the “restart NW-server directly after deleting a client with index entries” crash is fixed
  • Mostly OK: shutting down a storage node does not crash the NW-server anymore… most of the time (sometimes there is some strange behavior in this regard; we do not have enough evidence, but there may still be some sleeping dragons)
  • ?: we have not yet checked the disaster recovery part
  • NOK: the post-cmd is still run one minute after the pre-cmd in some cases. Maybe this is related to a session/save set for which the pre-cmd has already run but which has not yet started; if so, this could also affect the case where there is more than one minute of delay between the end of one session/save set on a machine and the start of another session/save set on the same machine (the support is investigating)
  • NOK: some Oracle-RMAN backups (custom save command, a perl script) show a running session in the NW-monitoring and some do not; after the backup, mminfo sometimes lists the group of an RMAN save set and sometimes not (for the same client). This is under investigation by the support; see the sketch after this list for one way to spot the affected save sets
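
One way to spot the affected RMAN save sets is to query mminfo for the client and report save sets without a group. A small sketch, assuming mminfo on our 7.5.x servers accepts the -q/-r query and report options with the ssid, group and name attributes and a character-separated report via -xc (the exact flags and output format may need adjusting):

    #!/usr/bin/env python3
    """Sketch: report save sets of one client that have no group in mminfo."""
    import subprocess
    import sys

    client = sys.argv[1]

    # Assumption: mminfo supports a character-separated report via -xc.
    out = subprocess.run(
        ["mminfo", "-q", f"client={client}", "-r", "ssid,group,name", "-xc;"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in out.splitlines():
        parts = line.split(";")
        if len(parts) < 3 or not parts[0].strip().isdigit():
            continue  # skip header or unexpected lines
        ssid, group, name = parts[0].strip(), parts[1].strip(), parts[2].strip()
        if not group:
            print(f"save set {ssid} ({name}) has no group recorded")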

So, for us 7.5.1.4 is still a beta version.

EMC^2/Legato Networker 7.5.1.4 status

The update to 7.5.1.4 went fine. No major problems encountered. So far we have not seen any regressions. The complete system feels a little bit more stable (no restarts necessary so far; before the update, some were necessary from time to time). We still have to test all our problem cases:

  • restart the NW-server directly after deleting a client with index entries (a manual copy of /nsr is needed beforehand, in case the mediadb corruption bug is not fixed as promised)
  • shut down a storage node to test if the NW-server still crashes in this case
  • start with an empty mediadb but populated clients (empty /nsr/mm, but untouched /nsr/res) and scan some tapes to check if “shadow clients” (my term for clients that correspond to an existing client but get newly created during scanning with a new client ID and a name of “~<original-name>-<number>”) still get created instead of the index of the correct client being populated

The first two are supposed to be fixed; the last one may not be.

Not fixed (according to the support) is the problem of needing a restart of the NW-server when moving a tape library from one storage node to another. It also seems that our problem with the manual cloning of save sets is not solved: there are still some clone processes which do not get out of the “server busy” loop, no matter how idle the NW-server is. In this case it can be seen that nsrclone is waiting in nanosleep (use pstack or dtrace to see it). The strange thing is that a save set which “fails” with this behavior will always show it. We need to take a deeper look to see if we can find similarities between such save sets and differences from save sets which can be cloned without problems.
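
A quick sketch of that pstack check (assuming pgrep and pstack are available on the server; the same information can be had with dtrace on Solaris):

    #!/usr/bin/env python3
    """Sketch: check whether running nsrclone processes are waiting in nanosleep."""
    import subprocess

    # Assumption: pgrep and pstack are installed (Linux/Solaris).
    pids = subprocess.run(["pgrep", "nsrclone"],
                          capture_output=True, text=True).stdout.split()

    for pid in pids:
        stack = subprocess.run(["pstack", pid],
                               capture_output=True, text=True).stdout
        if "nanosleep" in stack:
            print(f"nsrclone {pid}: waiting in nanosleep (stuck in the server busy loop?)")
        else:
            print(f"nsrclone {pid}: not in nanosleep")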