legato networker | Alexander Leidinger

EMC^2/Legato Networker 7.5.1.6 status

We updated Networker 7.5.1.4 to 7.5.1.6 because Networker support thought it would fix at least one of our problems (“ghost” volumes in the DB). Unfortunately the update did not fix any of the bugs we see in our environment.

This is especially unsatisfying for the “post-command runs 1 minute after pre-command even if the backup is not finished”-bug: we cannot get a consistent DB backup in cases where the application has to be stopped together with the DB to get a consistent snapshot (FS+DB in sync).
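To tell whether a given run was hit by this bug, it is enough to compare three timestamps: when the pre-command ran, when the post-command ran, and when the backup really ended. The following is only a minimal sketch in Python. The function name and the way the timestamps are obtained (e.g. from your own pre/post scripts writing the time to a log) are assumptions, nothing NW provides in this form.

```python
from datetime import datetime


def post_cmd_ran_early(pre_cmd, post_cmd, backup_end):
    """Check one backup run for the "post-command too early" symptom.

    All three arguments are datetime objects collected by your own
    scripts (hypothetical; NW does not hand them out like this).
    Returns a tuple (ran_early, delay_seconds), where delay_seconds is
    the time between pre-command and post-command.
    """
    delay_seconds = (post_cmd - pre_cmd).total_seconds()
    ran_early = post_cmd < backup_end
    return ran_early, delay_seconds


if __name__ == "__main__":
    # Made-up example times, for illustration only.
    pre = datetime(2009, 9, 1, 22, 0, 0)
    post = datetime(2009, 9, 1, 22, 1, 0)
    end = datetime(2009, 9, 1, 23, 30, 0)
    print(post_cmd_ran_early(pre, post, end))
```

A run where the first value is true and the delay is about 60 seconds matches the behavior described above.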

EMC^2/Legato Networker 7.5.1.4 tests

Regarding our last problems with NW:

OK: the “restart NW-server directly after deleting a client with index entries”-crash is fixed
Mostly OK: shutting down a storage node does not crash the NW-server anymore… most of the time (sometimes there is some strange behavior in this regard; we do not have enough evidence, but there may still be some sleeping dragons)
?: we did not yet check the disaster recovery part
NOK: the post-cmd is still run one minute after the pre-cmd in some cases. Maybe this happens when the pre-cmd has already run but the session/save-set has not started yet. If so, it could also affect the case where more than one minute passes between the end of one session/save-set on a machine and the start of another session/save-set on the same machine (the support is investigating)
NOK: some Oracle-RMAN backups (custom save command, Perl script) show a running session in the NW-monitoring and some do not. After the backup, mminfo sometimes lists the group of an RMAN-save-set and sometimes does not (for the same client). This is under investigation by the support

So, for us 7.5.1.4 is still a beta version.

EMC^2/Legato Networker 7.5.1.4 status

The update to 7.5.1.4 went fine. No major problems encountered. So far we did not see any regressions. The complete system feels a little bit more stable (no restarts necessary so far, whereas before some were necessary from time to time). We still have to test all our problem cases:

restart NW-server directly after deleting a client with index entries (manual copy of /nsr needed before, in case the mediadb corruption bug is not fixed as promised)
shut down a storage node to test if the NW-server still crashes in this case
start with an empty mediadb but populated clients (empty /nsr/mm, but untouched /nsr/res) and scan some tapes to check if “shadow clients” (my term for clients which have the same client ID but get newly created during the scanning with a new client ID and a name of “~<original-name>-<number>”) still get created instead of populating the index of the correct client

The first two are supposed to be fixed; the last one may not be fixed.

Not fixed (according to the support) is the problem of needing a restart of the NW-server when moving a tape library from one storage node to another storage node. It also seems that our problem with the manual cloning of save sets is not solved. There are still some clone processes which do not get out of the “server busy” loop, no matter how idle the NW-server is. In this case it can be seen that nsrclone is waiting in nanosleep (use pstack or dtrace to see it). The strange thing is that a save set which is “failing” with such behavior will always cause this behavior. We need to have a deeper look to see if we find similarities between such save sets and differences to save sets which can be cloned without problems.
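If you want to watch for such hanging clone processes without looking at every stack by hand, you can feed several pstack dumps of the same nsrclone process, taken some time apart, into a small check. This is only a rough sketch in Python. It assumes you already saved the pstack output as text, the function names are made up, and "nanosleep shows up in every sample" is my heuristic, not something from the NW documentation.

```python
def mentions_nanosleep(pstack_output):
    """True if any line of one saved pstack dump mentions nanosleep."""
    return any("nanosleep" in line for line in pstack_output.splitlines())


def looks_stuck(samples):
    """Heuristic: the process looks stuck in the "server busy" loop if
    every dump (taken some time apart) shows it waiting in nanosleep.

    samples is a list of strings, each the text of one pstack dump.
    An empty list gives False, as there is no evidence at all.
    """
    return bool(samples) and all(mentions_nanosleep(s) for s in samples)
```

One dump in nanosleep proves nothing, as a healthy process may sleep from time to time too. This is why the check wants all samples to agree.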

EMC^2/Legato Networker 7.5.1 problems

In July a colleague switched our backup system from a Networker (NW) server 7.1.1 with a single physical tape library to a NW server 7.5.1 on a Solaris Cluster plus 2 storage nodes with 2 virtual tape libraries and 2 physical tape libraries. At the beginning of the project several months ago he deliberately decided to install 7.5.1 instead of 7.5, as he did not trust a “.0‑release” (going with 7.4 was not an option because of some requirements for the SharePoint backup for our Windows people). Everything went well until we crossed the point of no return. Then we got a corrupt media DB (the meta data which tells which backup for which client is on which tape) just 3 days before he went on holiday. Yeah! 🙁

As I had agreed to be his backup and to polish the NW setup during his holiday, this resulted in a lot of overtime for me to get everything back into shape. The bad part of this was that the incident opened with the NW-support took about one and a half days until it reached a point where someone was looking at trying to reproduce the core dumps we were getting. Until then the standard trial&error problem fixing procedure was done by the overseas support people (we had one support guy on-site, and he did really good support, and he was astonished about this behavior too). And this despite the fact that the help-request was on priority one. We also had clearly stated several times in the ticket that our media DB was corrupt (with an empty media DB everything was working, just with the populated media DB the server crashed on startup). The problem was that you are not allowed to delete a client which has entries in the media DB without running “nsrim ‑X” by hand or taking a backup of the bootstrap (this runs “nsrim ‑X”). This is supposed to be fixed in 7.5.1.4 (released because of our problem, it seems). The manager from EMC^2 who came because of this was a poor guy. Imagine him being there to represent EMC^2, me and my boss to represent the on-site admins of our client, one of the Windows people of our client, and 5 people from our client in the same room (this means 1 vs. 7)… and our client was not happy.

As we did not have a working bootstrap backup (do not ask, a mix of bad luck and strange behavior of NW), we had to do a disaster recovery (this means scanning all tapes with backups we did so far). The bad part is that this results in some clients which are supposed to have the same client ID showing up with a different client ID. In plain English this means the disaster recovery procedure does not really work as expected. Yes, we can still recover the data, but it is not just starting the recover program and making your choice what to restore. You first have to find the data in the media DB (with the mminfo tool), print out the new client ID, create an entry in /etc/hosts on the NW-server with a dummy name (I decided to give them the IP 127.0.1.X, in case we need it), create a client in NW with the same dummy name (e.g. z‑recover-<client-name>) and then recover your data normally. This is a bug for which we do not know yet if it is fixed in 7.5.1.4 or not. I can tell that this procedure works: we had to recover some data as soon as we were in a state where it was possible to recover at least some data (the setup was OK, but not all tapes were scanned).
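The bookkeeping part of this workaround (dummy names and /etc/hosts lines) is easy to get wrong when there are many affected clients, so here is a small sketch in Python which generates both. It follows the naming from above (z-recover-<client-name>, IPs from 127.0.1.X). The function names and the start value are my own assumptions, and it does not touch NW or mminfo at all: finding the new client ID and creating the client in NW stay manual steps.

```python
def recovery_entries(client_names, start=1):
    """Return (ip, dummy_name) pairs, one per affected client.

    IPs are taken from 127.0.1.X, counting up from `start`; the dummy
    name is "z-recover-<client-name>".
    """
    entries = []
    for offset, name in enumerate(client_names):
        last_octet = start + offset
        if not 1 <= last_octet <= 254:
            raise ValueError("no addresses left in 127.0.1.X")
        entries.append((f"127.0.1.{last_octet}", f"z-recover-{name}"))
    return entries


def hosts_lines(entries):
    """Format the pairs as lines to append to /etc/hosts by hand."""
    return [f"{ip}\t{dummy}" for ip, dummy in entries]


if __name__ == "__main__":
    # "dbhost" and "web01" are placeholder client names.
    for line in hosts_lines(recovery_entries(["dbhost", "web01"])):
        print(line)
```

The printed lines are meant to be reviewed and appended to /etc/hosts on the NW-server. The same dummy names are then used when creating the clients in NW.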

In addition to this, we stumbled upon other bugs (e.g. moving a VTL to the other storage node needs a restart of the NW-server, otherwise NW will moan about the license not being OK and will not give access to the VTL; if the cloning contacts the media DB at the wrong time (when it is busy), the cloning will never start and stays in “server busy”-mode, even if the server is not busy anymore; the postcmd is running one minute after the precmd, it is not detecting that a backup is still running; …). I have to say that NW 7.5.1 is not really production ready. For me NW 7.5.1 is beta quality software. During the time when I took care of fixing the problems with our backup system, the one thing I told other people most often when they asked if I think the backup will be OK in the night was: please cross your fingers.

My colleague is back from holiday, and on Wednesday he is going to install 7.5.1.4. It is supposed to fix most of our problems. I am crossing my fingers.

I can also report some good things about NW. The GUI is much better than the one from 7.1.1. And when 7.5.x is as stable as our 7.1.1 version was (there were some problems from time to time, but the workaround was to restart the server, and as this did not happen often, it was OK for us), it will be a really nice backup system which we have here. And boy, this thing is fast (OK, a part of this is because of the VTL instead of a physical tape library, but NW is now also able to read from several tapes in parallel for the same restore – if the data is on multiple tapes, of course).

Bottom line: if you use 7.5.x, update to 7.5.1.4 as soon as possible.
