Alexander Leidinger

Just another weblog

Aug
10

EMC^2/Legato Networker 7.5.1 problems

In July a colleague switched our backup system from a Networker (NW) server 7.1.1 with a single physical tape library to a NW server 7.5.1 on a Solaris Cluster plus 2 storage nodes with 2 virtual tape libraries and 2 physical tape libraries. At the beginning of the project several months ago he deliberately decided to install 7.5.1 instead of 7.5, as he did not trust a “.0-release” (going with 7.4 was not an option because of some requirements for the Sharepoint backup for our Windows people). Everything went well until we crossed the point of no return. Then we got a corrupt media DB (the metadata which tells which backup for which client is on which tape) just 3 days before he went on holiday. Yeah! :-(

As I had agreed to be his backup and to polish the NW setup during his holiday, this resulted in a lot of overtime for me to get everything back into shape. The bad part was that the incident we opened with NW support took about one and a half days to reach the point where someone actually tried to reproduce the core dumps we were getting. Until then the overseas support people followed the standard trial-and-error problem-fixing procedure (we had one support guy on-site, he did really good support, and he was astonished about this behavior too). And this despite the fact that the help request was priority one. We had also clearly stated several times in the ticket that our media DB was corrupt (with an empty media DB everything worked, only with the populated media DB did the server crash on startup). The problem was that you are not allowed to delete a client which has entries in the media DB without running “nsrim -X” by hand or taking a backup of the bootstrap (which runs “nsrim -X”). This is supposed to be fixed in 7.5.1.4 (released because of our problem, it seems). The manager from EMC^2 who came because of this was a poor guy. Imagine him being there to represent EMC^2, me and my boss to represent the on-site admins of our client, one of the Windows people of our client, and 5 people from our client in the same room (this means 1 vs. 7)… and our client was not happy.

As we did not have a working bootstrap backup (do not ask, a mix of bad luck and strange behavior of NW), we had to do a disaster recovery (this means scanning all tapes with the backups we had done so far). The bad part is that this results in some clients which are supposed to have the same client ID showing up with a different client ID. In plain English this means the disaster recovery procedure does not really work as expected. Yes, we can still recover the data, but it is not just a matter of starting the recover program and choosing what to restore. You first have to find the data in the media DB (with the mminfo tool), print out the new client ID, create an entry in /etc/hosts on the NW server with a dummy name (I decided to give them the IP 127.0.1.X, in case we need it), create a client in NW with the same dummy name (e.g. z-recover-<client-name>) and then recover your data normally. This is a bug for which we do not yet know whether it is fixed in 7.5.1.4. I can tell you that this procedure works: we had to recover some data as soon as we were in a state where it was possible to recover at least some data (the setup was OK, but not all tapes had been scanned).
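
If you have to do this for more than a handful of clients, the /etc/hosts part can be scripted. What follows is only a minimal sketch in Python, not something we ran during the incident: the function name recovery_hosts_entries and the client names in the usage example are made up for illustration, and it only covers the naming scheme described above (z-recover-<client-name> on 127.0.1.X). Looking up the new client ID with mminfo and creating the client in NW still have to be done by hand.

```python
import ipaddress


def recovery_hosts_entries(client_names, network="127.0.1.0/24", prefix="z-recover-"):
    """Build /etc/hosts lines for dummy recovery clients.

    Hypothetical helper: it only follows the naming scheme from the text
    (z-recover-<client-name> on 127.0.1.X) and does not talk to NW at all.
    """
    # Deduplicate and sort so that the name-to-IP mapping is reproducible.
    names = sorted(set(client_names))
    # Usable addresses of the /24 network: 127.0.1.1 .. 127.0.1.254
    addresses = list(ipaddress.ip_network(network).hosts())
    if len(names) > len(addresses):
        raise ValueError("more clients than free addresses in " + network)
    return [f"{ip}\t{prefix}{name}" for ip, name in zip(addresses, names)]


if __name__ == "__main__":
    # Made-up client names, purely for illustration.
    for line in recovery_hosts_entries(["mail01", "db02"]):
        print(line)
```

The printed lines can be reviewed and then appended to /etc/hosts on the NW server. Sorting the names first means a second run with the same client list gives every dummy name the same IP again.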

Together with the other bugs we stumbled upon (e.g. moving a VTL to the other storage node needs a restart of the NW server, otherwise NW moans about the license not being OK and does not give access to the VTL; if the cloning contacts the media DB at the wrong time (when it is busy), the cloning never starts and stays in “server busy” mode, even when the server is not busy anymore; the postcmd runs one minute after the precmd, without detecting that a backup is still running; …), this makes me say that NW 7.5.1 is not really production ready. For me NW 7.5.1 is beta-quality software. While I was taking care of fixing the problems with our backup system, the thing I said most often to people who asked whether I thought the backup would be OK that night was: please cross your fingers.

My colleague is back from holiday, and on Wednesday he is going to install 7.5.1.4. It is supposed to fix most of our problems. I am crossing my fingers.

I can also report some good things about NW. The GUI is much better than the one from 7.1.1. And when 7.5.x is as stable as our 7.1.1 version was (there were some problems from time to time, but restarting the server worked around them, and as this did not happen often, it was OK for us), it will be a really nice backup system. And boy, this thing is fast (OK, part of this is because of the VTL instead of a physical tape library, but NW is now also able to read from several tapes in parallel for the same restore, if the data is on multiple tapes, of course).

Bottom line: if you use 7.5.x, update to 7.5.1.4 as soon as possible.
