In July a colleague switched our backup system from a NetWorker (NW) server 7.1.1 with a single physical tape library to a NW server 7.5.1 on a Solaris Cluster plus 2 storage nodes with 2 virtual tape libraries and 2 physical tape libraries. He had specifically decided, at the beginning of the project several months ago, to install 7.5.1 instead of 7.5, as he did not trust a ".0-release" (going with 7.4 was not an option he had, because of some requirements for the SharePoint backup of our Windows people). Everything went well until we crossed the point of no return. Then, just 3 days before he went on holiday, we got a corrupt media DB (the metadata which tells which backup for which client is on which tape). Yeah!
As I had agreed to be his backup and to polish the NW setup during his holiday, this resulted in a lot of overtime for me to get everything back into shape. The bad part was that the incident opened with NW support took about one and a half days until it reached the point where someone tried to reproduce the core dumps we were getting. Until then, the standard trial & error problem-fixing procedure was done by the overseas support people (we had one support guy on-site, who did really good work, and he was astonished about this behavior too). And this despite the fact that the help request was at priority one, and that we had clearly stated several times in the ticket that our media DB was corrupt (with an empty media DB everything was working; only with the populated media DB did the server crash on startup). The root cause: you are not allowed to delete a client which still has entries in the media DB without first running "nsrim -X" by hand or taking a backup of the bootstrap (which runs "nsrim -X" for you). This is supposed to be fixed in a 7.5.1 patch release (released because of our problem, it seems). The manager from EMC^2 who came because of all this was a poor guy. Imagine him representing EMC^2, me and my boss representing the on-site admins of our client, one of our client's Windows people, and 5 people from our client in the same room (this means 1 vs. 7)… and our client was not happy.
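For anyone running into the same trap, this is roughly what the safe order of operations looks like on the NW server before deleting a client. A hedged sketch from my notes, not an official EMC procedure; the client name is an example:

```shell
# List the save sets still recorded in the media DB for the client you
# want to delete (mminfo is the NetWorker media DB query tool;
# "oldclient" is an example name).
mminfo -avot -q "client=oldclient"

# Run the index/media DB cross-check and cleanup by hand. This is the
# step a bootstrap backup would normally trigger for you.
nsrim -X

# Only after that, delete the client resource (via nsradmin or the GUI).
```

The whole point is simply that "nsrim -X" (or a bootstrap backup) happens before the client resource goes away, not after.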
As we did not have a working bootstrap backup (do not ask; a mix of bad luck and strange NW behavior), we had to do a disaster recovery, which means scanning all tapes with the backups we had done so far. The bad part: some clients which are supposed to keep the same client ID show up with a different client ID afterwards. In plain English: the disaster recovery procedure does not really work as expected. Yes, we can still recover the data, but it is not just starting the recover program and choosing what to restore. You first have to find the data in the media DB (with the mminfo tool), note the new client ID, create an entry in /etc/hosts on the NW server with a dummy name (I decided to give them IPs from 127.0.1.X, in case we need them), create a client in NW with the same dummy name (e.g. z-recover-<client-name>), and then recover your data normally. This is a bug for which we do not know yet whether it is fixed in the new patch release or not. I can tell you that this procedure works: we had to recover some data right after we reached a state where recovering at least some data was possible (the setup was OK, but not all tapes were scanned yet).
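Condensed into commands, the workaround looks roughly like this. Again a hedged sketch of what we did, with example names and an example 127.0.1.X address; the mminfo report fields may vary by NW version:

```shell
# 1. Query the media DB for the client's save sets and note the *new*
#    client ID the disaster recovery assigned (field names from memory,
#    check "mminfo" in your version's command reference).
mminfo -avot -q "client=myclient" -r "client,clientid,ssid,volume"

# 2. Map the dummy name to a loopback address on the NW server.
echo "127.0.1.42  z-recover-myclient" >> /etc/hosts

# 3. Create a NW client resource named z-recover-myclient (via nsradmin
#    or the GUI) and set its client ID to the value from step 1.

# 4. Recover as usual, pointing the recover program at the dummy client.
recover -c z-recover-myclient
```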
On top of the other bugs we stumbled upon (e.g. moving a VTL to the other storage node needs a restart of the NW server, otherwise NW moans about the license not being OK and denies access to the VTL; if the cloning contacts the media DB at the wrong time, when it is busy, the cloning never starts and stays in "server busy" mode even after the server is no longer busy; the postcmd runs one minute after the precmd, without detecting that a backup is still running; …), I have to say that NW 7.5.1 is not really production-ready. For me, NW 7.5.1 is beta-quality software. During the time I took care of fixing the problems with our backup system, the one thing I told other people most often, when they asked if I thought the backup would be OK that night, was: please cross your fingers.
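The precmd/postcmd race at least can be worked around with a small wrapper at the top of the postcmd script. A minimal sketch under my own assumptions: the process name to watch and the timeout are examples (the NW client-side backup process is usually called "save"):

```shell
#!/bin/sh
# Workaround sketch for the postcmd firing while the backup still runs:
# poll until no matching process is left, then continue with the real
# post-backup work. Adjust process name and timeout for your setup.
wait_for_backup() {
    proc="$1"
    timeout="${2:-3600}"    # give up after this many seconds
    while [ "$timeout" -gt 0 ]; do
        if ! pgrep -x "$proc" >/dev/null 2>&1; then
            return 0        # no backup process left, safe to proceed
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1                # still running after the timeout
}

# Example: wait up to two hours for "save" before the real postcmd work.
# wait_for_backup save 7200 && echo "backup done, running postcmd steps"
```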
My colleague is back from holiday, and on Wednesday he is going to install the new 7.5.1 patch release. It is supposed to fix most of our problems. I cross my fingers.
I can also report some good things about NW. The GUI is much better than the one in 7.1.1. And once 7.5.x is as stable as our 7.1.1 installation was (there were some problems from time to time, but the workaround was to restart the server; as this did not happen often, it was OK for us), it will really be a nice backup system we have here. And boy, this thing is fast (OK, part of this is because of the VTL instead of a physical tape library, but NW is now also able to read from several tapes in parallel for the same restore, if the data is spread over multiple tapes, of course).
Bottom line: if you use 7.5.x, update to the latest 7.5.1 patch release as soon as possible.
Tags: backup system, corrupt media, error problem, legato networker, meta data, point of no return, priority one, storage nodes, tape libraries, virtual tape