Commercial | Alexander Leidinger

Solaris 10/11(.3) boot panic/crash after moving rpool to a new storage system

Situation

The boot disks of some Solaris LDOMs were migrated from one storage system to another one via ZFS mirroring the rpool to the new system and detaching the old LUN.

Issue

After a reboot from the new storage system, Solaris 10 and 11(.3) panic at boot.

Cause

rpool not on slice 0 but on slice 2
A bug in Solaris when doing such a mirror and “just” rebooting. This is the real issue: it seems Solaris cannot handle a change of the name of the underlying device of an rpool, as just moving the partitioning to slice 0 does not fix the panic.

Fix

# boot from network (or an alternate pool which was not yet moved), import/export the pools, boot from the pools
boot net -
# go to shell
# if needed: change the partitioning so that slice 0 has the same values as slice 2 (or otherwise make sure the rpool is in slice 0)
zpool import -R /tmp/yyy rpool
zpool export rpool
reboot
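
For the "if needed" step, a quick way to see whether slice 0 and slice 2 already have the same values is to compare them in the prtvtoc output. The following is only a sketch: the function name slices_match is made up for illustration, and it assumes the usual prtvtoc column layout (partition, tag, flags, first sector, sector count, last sector) and an awk which understands POSIX syntax.

```shell
# Sketch: compare slice 0 and slice 2 in prtvtoc-style output read from stdin.
# Prints "match" (exit status 0) if first sector and sector count are equal,
# otherwise prints "differ" (exit status 1).
slices_match() {
    awk '
        /^\*/   { next }    # prtvtoc comment lines start with "*"
        NF < 5  { next }    # skip blank or incomplete lines
        $1 == "0" { first0 = $4; count0 = $5; seen0 = 1 }
        $1 == "2" { first2 = $4; count2 = $5; seen2 = 1 }
        END {
            if (seen0 && seen2 && first0 == first2 && count0 == count2) {
                print "match"
            } else {
                print "differ"
                exit 1
            }
        }
    '
}

# Usage (the device name is a placeholder, use the disk of your rpool):
#   prtvtoc /dev/rdsk/c0d0s2 | slices_match
```

If this prints "differ", the partitioning needs to be changed as described in the comment above before importing and exporting the pool.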

New users in Solaris 10 branded zones on Solaris 11 not handled automatically

A colleague noticed that on a Solaris 11 system a Solaris 10 branded zone “gains” two new daemons which are running with UID 16 and 17. Those users are not automatically added to /etc/passwd, /etc/shadow (and /etc/group)… at least not when the zones are imported from an existing Solaris 10 zone.

I added the two users (netadm, netcfg) and the group (netadm) to the Solaris 10 branded zones by hand (copy&paste of the lines in /etc/passwd, /etc/shadow, /etc/group + run pwconv) for our few Solaris 10 branded zones on Solaris 11.
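
To find the zones which still need this, a small check of the zone's files is enough. This is a minimal sketch, not what I ran at the time: the function name missing_entries and the zone path in the usage line are made up, and it assumes an awk which supports -v.

```shell
# Sketch: print every given name which has no entry in a passwd- or
# group-style file (the first colon-separated field is the name).
missing_entries() {
    file=$1
    shift
    for name in "$@"; do
        # exit status of awk is 0 only if a line with this exact name exists
        if ! awk -F: -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$file"; then
            echo "$name"
        fi
    done
}

# Usage (the zone root path is a placeholder):
#   missing_entries /zones/myzone/root/etc/passwd netadm netcfg
#   missing_entries /zones/myzone/root/etc/group netadm
```

Empty output means all the names are already present in the file.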

Increase of DNS requests after a critical patch update of Solaris 10

Some weeks ago we installed critical patch updates (CPU) on a Solaris 10 system (an internal system with a year's worth of CPU to install; nothing in it affected us or was considered a security risk, but we decided to apply it regardless to not fall behind too much). Afterwards we noticed that two zones were doing a lot of DNS requests. We had noticed this already before the zones went into production, and at that time we configured a positive time to live in nscd.conf for “hosts”. Additionally we noticed a lot of DNS requests for IPv6 addresses (AAAA lookups), while absolutely no IPv6 address is configured in the zones (not even for localhost… and those are exclusive IP zones). Apparently the caching behaviour changed with one of the patches in the CPU; I am not sure if we had the AAAA lookups before.

Today I got some time to debug this. After adding caching of “ipnodes” in addition to “hosts” (and I configured a negative time to live for both at the same time), the DNS requests came down to a sane amount.
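
To check a zone for this, it helps to see which of the time-to-live directives are set for both caches. The sketch below assumes the usual nscd.conf line format of directive, cache name and seconds; the function name check_nscd_ttl is made up, and the actual TTL values are something you need to choose for your environment.

```shell
# Sketch: report whether positive-time-to-live and negative-time-to-live
# are set for the "hosts" and "ipnodes" caches in an nscd.conf-style file.
check_nscd_ttl() {
    conf=$1
    for cache in hosts ipnodes; do
        for key in positive-time-to-live negative-time-to-live; do
            # a directive line has the form: <key> <cache> <seconds>
            if awk -v k="$key" -v c="$cache" \
                '$1 == k && $2 == c && $3 ~ /^[0-9]+$/ { ok = 1 } END { exit !ok }' "$conf"; then
                echo "$key $cache: set"
            else
                echo "$key $cache: missing"
            fi
        done
    done
}

# Usage:
#   check_nscd_ttl /etc/nscd.conf
```

Commented-out lines are reported as missing, because their first field is "#" or starts with it.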

For the AAAA lookups I have not found a solution. By my reading of the documentation I would assume there are no IPv6 DNS lookups if no IPv6 address is configured.

Status crypto cards HOWTO: problems with the card reader (support could be better)

After hours (spread over weeks) I have come to the conclusion that there is a lot of potential to improve the documentation of card readers (but I doubt the card reader vendors will do it) and the pcsc documentation. It is not easy to arrive at a point where you understand everything. The compatibility list does not help much, as the card readers are partly past their end of life and the models which replace them are not listed. And the one I bought does not support all the features I need. I even ported the driver to FreeBSD (not committed, I wanted to test everything first) and a lot of stuff works, but one critical part is that I cannot store a certificate on the crypto card, as the card reader or the driver does not support extended APDUs (needed to transfer more than 255 bytes to the card reader).

Well, the status so far:

I have a HOWTO on what to install to use crypto cards in FreeBSD
I have a HOWTO on what to install / configure in Windows
I have a HOWTO on creating keys on an OpenPGP v2 card and how to use this key with ssh on FreeBSD (or any other unix-like OS which can run pcsc)
I have a card reader which does not support extended APDUs
I want to make sure what I write in the HOWTOs is also suitable for the use with Windows / PuTTY
it seems Windows needs a certificate and not only a key when using the Windows CAPI (using the vendor supplied card reader driver) in PuTTY-CSC (works at work with a USB token)
the pcsc pkcs11 Windows DLL is not suitable yet for use on Windows 8 64bit
I asked the card reader vendor whether the card reader or the driver is the problem regarding the extended APDUs
I found problems in gpg4win / pcsc on Windows 8
I have sent some money to the developers of gpg4win to support their work (if you use gnupg on Windows, try to send a few units of money to them, the work stagnated as they need to spend their time on paid work)

So either I need a new card reader, or I have to wait for an update of the vendor's Linux driver… which probably means it may be a lot faster to buy a new card reader. When looking for one with at least a PIN pad, I either do not find anything which is listed as supported by pcsc on the vendor pages (it is incredible how hard it is to navigate the websites of some companies… a lot of buzzwords but no way to get to the real products), or they only list updated models where I do not know if they will work.

When I have something which works with FreeBSD and Windows, I will publish all the HOWTOs here at once.

Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4‑2

The problem I see at work: a T4‑2 with 3 guest LDOMs and virtualized disks and networks lost all network connectivity “out of the blue” once, and maybe “sporadically” directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.

1st problem:
Total network loss of the machine (no zone, no guest LDOM and not the primary LDOM was able to receive or send IP packets). This happened once. No idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least I get this impression based upon the description of the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only client which reported this kind of problem; the patch for this problem is based upon an internal bug report from internal tests.
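
To see if a machine was already hit by this, the system log can be searched for the warning. A minimal sketch, assuming the log is in /var/adm/messages (the Solaris default) unless another file is given; the function name count_handshake_warnings is made up.

```shell
# Sketch: count the log lines with the vnet handshake warning.
# Takes an optional log file, defaults to /var/adm/messages.
count_handshake_warnings() {
    # n + 0 makes awk print 0 instead of an empty line if nothing matched
    awk '/exceeded number of permitted handshake attempts/ { n++ } END { print n + 0 }' \
        "${1:-/var/adm/messages}"
}

# Usage:
#   count_handshake_warnings
#   count_handshake_warnings /var/adm/messages.0
```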

2nd problem:
After cold boots sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.

In case someone else sees one of those problems on their machines too, please get in contact with me so that we can see what we have in common, try to track this down further, and share info which may help in reproducing the problems.

System setup:

T4‑2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on additional network card)
3 guest LDOMs and one io+control domain (both in the primary domain)
the guest LDOMs use SAN disks over the 4 HBAs
the primary domain uses a mirrored zpool on SSDs
5 vswitch in the hypervisor
4 aggregates (aggr1 – aggr4 with L2-policy), each one with one igb and one nxge NIC
each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication, all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
primary domain uses 2 vnets connected to the vswitch which is connected to aggr2 and aggr3 (consistency with the other LDOMs on this machine) and has no zones
this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging in the primary domain directly instead of in the vnet config). To have IPMP working, this means having vsw-linkprop=phys-state. We have two systems with the same setup; on one system we already changed this and it is working as before. As we don’t know how to reproduce the 1st problem, we don’t know if the problem is fixed or not, or what the probability is of getting hit by this problem again.

Ideas / suggestions / info welcome.
