EMC^2/Legato Networker status

The update went fine, with no major problems encountered, and so far we have not seen any regressions. The whole system feels a little more stable: no restarts have been necessary so far, whereas before some were necessary from time to time. We still have to test all our problem cases:

  • restart the NW-server directly after deleting a client with index entries (a manual copy of /nsr is needed beforehand, in case the mediadb corruption bug is not fixed as promised)
  • shut down a storage node to test whether the NW-server still crashes in this case
  • start with an empty mediadb but populated clients (empty /nsr/mm, but untouched /nsr/res) and scan some tapes to check whether “shadow clients” still get created instead of the index of the correct client being populated (“shadow clients” is my term for clients which should share the client ID of an existing client but are instead newly created during scanning, with a new client ID and a name of “~<original-name>-<number>”)

The first two are supposed to be fixed; the last one may not be.
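For the third test case, the shadow-client naming pattern described above can be matched mechanically once the scan has finished. The following is only a minimal sketch: the function name and the sample client list are made up for illustration, and it assumes the client names have already been exported from the NW-server into a plain list by some other means.

```python
import re

# Matches the "~<original-name>-<number>" naming scheme of shadow clients.
SHADOW_RE = re.compile(r"^~(?P<original>.+)-(?P<number>\d+)$")


def find_shadow_clients(client_names):
    """Map each original client name to the shadow client names found for it."""
    shadows = {}
    for name in client_names:
        match = SHADOW_RE.match(name)
        if match:
            shadows.setdefault(match.group("original"), []).append(name)
    return shadows


if __name__ == "__main__":
    # Hypothetical client list, not real data from our server.
    names = ["mail01", "~mail01-1", "db-02", "~db-02-1", "~db-02-2", "web"]
    for original, found in sorted(find_shadow_clients(names).items()):
        print(original, "->", ", ".join(found))
```

An empty result after scanning the tapes would suggest that the index of the correct client got populated; any entry points at a client for which a shadow was created.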

Not fixed (according to support) is the problem of needing to restart the NW-server when moving a tape library from one storage node to another. It also seems that our problem with the manual cloning of save sets is not solved. There are still some clone processes which do not get out of the “server busy” loop, no matter how idle the NW-server is. In this case it can be seen that nsrclone is waiting in nanosleep (use pstack or dtrace to see it). The strange thing is that a save set which is “failing” with such behavior will always cause this behavior. We need to take a deeper look to see whether we can find similarities between such save sets, and differences from save sets which can be cloned without problems.
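The comparison of failing and working save sets could be started with something as simple as the sketch below. It assumes the attributes of each save set have already been collected into one dictionary per save set; the attribute names and values in the example are purely hypothetical placeholders, not findings.

```python
def distinguishing_attributes(failing, working):
    """Return attribute/value pairs shared by every failing save set
    and not present in any working save set."""
    if not failing:
        return {}
    # Start with the pairs of the first failing save set ...
    common = set(failing[0].items())
    # ... keep only those that every other failing save set has as well ...
    for record in failing[1:]:
        common &= set(record.items())
    # ... and drop anything that also occurs in a working save set.
    for record in working:
        common -= set(record.items())
    return dict(common)


if __name__ == "__main__":
    # Hypothetical attributes, for illustration only.
    failing = [
        {"level": "full", "spans_volumes": True, "client": "a"},
        {"level": "incr", "spans_volumes": True, "client": "b"},
    ]
    working = [
        {"level": "full", "spans_volumes": False, "client": "a"},
    ]
    print(distinguishing_attributes(failing, working))
```

Whatever comes out of such a comparison is only a candidate for the cause; it would still have to be confirmed by cloning further save sets with and without that property.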
