Alexander Leidinger

Just another weblog

Aug
10

EMC^2/Legato Net­worker 7.5.1 problems

In July a colleague switched our backup system from a Networker (NW) server 7.1.1 with a single physical tape library to a NW server 7.5.1 on a Solaris Cluster plus 2 storage nodes with 2 virtual tape libraries and 2 physical tape libraries. At the beginning of the project several months ago he specifically decided to install 7.5.1 instead of 7.5, as he did not trust a “.0-release” (going with 7.4 was not an option because of some requirements for the Sharepoint backup for our Windows people). Everything went well until we crossed the point of no return. Then we got a corrupt media DB (the metadata which tells which backup of which client is on which tape) just 3 days before he went on holiday. Yeah! :-(

As I had agreed to be his backup and polish the NW setup during his holiday, this resulted in a lot of overtime for me to get everything back into shape. The bad part was that the incident we opened with NW support took about one and a half days to reach a point where someone was trying to reproduce the core dumps we were getting. Until then the overseas support people followed the standard trial & error problem fixing procedure (we had one support guy on-site, he did really good support, and he was astonished about this behavior too). And this despite the fact that the help request was priority one. We had also clearly stated several times in the ticket that our media DB was corrupt (with an empty media DB everything was working, only with the populated media DB the server crashed on startup). The problem was that you are not allowed to delete a client which has entries in the media DB without running “nsrim -X” by hand or taking a backup of the bootstrap (which runs “nsrim -X”). This is supposed to be fixed in 7.5.1.4 (released because of our problem, it seems). The manager from EMC^2 who came because of this was a poor guy. Imagine him being there to represent EMC^2, me and my boss to represent the on-site admins of our client, one of the Windows people of our client, and 5 people from our client in the same room (this means 1 vs. 7)… and our client was not happy.

As we did not have a working bootstrap backup (do not ask, a mix of bad luck and strange behavior of NW), we had to do a disaster recovery (this means scanning all tapes with the backups we had made so far). The bad part is that this results in some clients which are supposed to have the same client ID showing up with a different client ID. In plain English this means the disaster recovery procedure does not really work as expected. Yes, we can still recover the data, but it is not just a matter of starting the recover program and choosing what to restore. You first have to find the data in the media DB (with the mminfo tool), print out the new client ID, create an entry in /etc/hosts on the NW server with a dummy name (I decided to give them the IP 127.0.1.X, in case we need it), create a client in NW with the same dummy name (e.g. z-recover-<client-name>) and then recover your data normally. This is a bug for which we do not yet know whether it is fixed in 7.5.1.4. I can tell that this procedure works: we had to recover some data as soon as we were in a state where it was possible to recover at least some of it (the setup was OK, but not all tapes had been scanned).
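
The /etc/hosts step of this workaround gets tedious when many clients are affected. Here is a minimal sketch of how the dummy entries could be generated, assuming Python is available somewhere. The function name, the way the X in 127.0.1.X is counted up, and the example client names are made up for illustration; only the 127.0.1.X range and the z-recover-<client-name> naming come from the procedure described above. It does not talk to NW at all, you still have to look up the new client IDs with mminfo and create the dummy clients yourself.

```python
import ipaddress


def dummy_host_entries(client_names, network="127.0.1.0/24", prefix="z-recover-"):
    """Return /etc/hosts lines that map dummy recovery names to loopback IPs.

    Hypothetical helper: the first client gets 127.0.1.1, the next 127.0.1.2,
    and so on. The prefix follows the z-recover-<client-name> convention.
    """
    # iter() makes sure we can call next() regardless of what hosts() returns.
    addresses = iter(ipaddress.ip_network(network).hosts())
    entries = []
    for name in client_names:
        try:
            ip = next(addresses)
        except StopIteration:
            raise ValueError("more clients than addresses in " + network) from None
        entries.append(f"{ip}\t{prefix}{name}")
    return entries


if __name__ == "__main__":
    # Example client names are invented; replace them with the affected clients.
    for line in dummy_host_entries(["mailsrv", "fileserver01"]):
        print(line)
```

The printed lines can be reviewed and appended to /etc/hosts on the NW server by hand, which keeps the risky part (editing the hosts file on the backup server) a manual step.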

In addition to this we stumbled upon other bugs: moving a VTL to the other storage node needs a restart of the NW server, otherwise NW will moan about the license not being OK and will not give access to the VTL; if the cloning contacts the media DB at the wrong time (when it is busy), the cloning never starts and stays in “server busy” mode, even when the server is not busy anymore; the postcmd runs one minute after the precmd, without detecting that a backup is still running; … So I have to say that NW 7.5.1 is not really production ready. For me NW 7.5.1 is beta quality software. During the time I took care of fixing the problems with our backup system, the thing I told people most often when they asked whether I thought the backup would be OK that night was: please cross your fingers.

My colleague is back from holiday, and on Wednesday he is going to install 7.5.1.4. It is supposed to fix most of our problems. I am keeping my fingers crossed.

I can also report some good things about NW. The GUI is much better than the one from 7.1.1. And once 7.5.x is as stable as our 7.1.1 version was (there were some problems from time to time, but the workaround was to restart the server, and as this did not happen often, it was OK for us), it will be a really nice backup system. And boy, this thing is fast (OK, part of this is because of the VTL instead of a physical tape library, but NW is now also able to read from several tapes in parallel for the same restore, if the data is on multiple tapes, of course).

Bot­tom line: if you use 7.5.x, update to 7.5.1.4 as soon as possible.
