Solaris 10/11(.3) boot panic/crash after mov­ing rpool to a new stor­age system

Sit­u­a­tion

The boot disks of some Solaris LDOMs were migrated from one storage system to another by ZFS-mirroring the rpool to the new system and then detaching the old LUN.

Issue

After a reboot on the new storage system, Solaris 10 and 11(.3) panic at boot.

Cause

  • rpool not on slice 0 but on slice 2
  • a bug in Solaris when doing such a mirror and “just” doing a reboot. This is the real issue: it seems Solaris cannot handle a change of the name of the underlying device for an rpool, as just moving the partitioning to slice 0 does not fix the panic.

Fix

# boot from net­work (or an alter­nate pool which was not yet moved), import/export the pools, boot from the pools
boot net -
# go to shell
# if need­ed: change the par­ti­tion­ing so that slice 0 has the same val­ues as slice 2 (respec­tive­ly make sure the rpool is in slice 0)
zpool import -R /tmp/yyy rpool
zpool export rpool
reboot
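
To see up front which slice the rpool devices sit on, the output of zpool status can be filtered. The following is only a sketch: the function name rpool_slice_check is made up for illustration, and it assumes the devices are listed in the usual cXtYdZsN form and that a POSIX awk is available.

```shell
# Read `zpool status` output on stdin and print every vdev named like
# cXtYdZsN, noting whether it is on slice 0 or on another slice.
# (rpool_slice_check is a hypothetical helper name.)
rpool_slice_check() {
  awk '$1 ~ /^c[0-9]+.*d[0-9]+s[0-9]+$/ {
    slice = $1
    sub(/^.*s/, "", slice)          # keep only the digits after the last "s"
    if (slice == "0") print $1 ": slice 0"
    else print $1 ": slice " slice " (not slice 0)"
  }'
}
```

Usage would be something like: zpool status rpool | rpool_slice_check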

 

New users in Solaris 10 brand­ed zones on Solaris 11 not han­dled automatically

A col­league noticed that on a Solaris 11 sys­tem a Solaris 10 brand­ed zone “gains” two new dae­mons which are run­ning with UID 16 and 17. Those users are not auto­mat­i­cal­ly added to /etc/passwd, /etc/shadow (and /etc/group)… at least not when the zones are import­ed from an exist­ing Solaris 10 zone.

I added the two users (netadm, netcfg) and the group (netadm) to the Solaris 10 brand­ed zones by hand (copy&paste of the lines in /etc/passwd, /etc/shadow, /etc/group + run pwconv) for our few Solaris 10 brand­ed zones on Solaris 11.
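
To find out which branded zones still lack the entries, a small check against the zone's files can help. This is a minimal sketch: the function name missing_accounts and the zone path in the usage line are made up for illustration, and it only looks at the first field of passwd-style files.

```shell
# Print each of the given names that has no entry in a passwd-style file
# (works for /etc/passwd, /etc/shadow and /etc/group alike).
# (missing_accounts is a hypothetical helper name.)
missing_accounts() {
  file=$1
  shift
  for name in "$@"; do
    # An entry is a line starting with "name:".
    grep "^${name}:" "$file" >/dev/null 2>&1 || echo "$name"
  done
}
```

For example, with a hypothetical zone root: missing_accounts /zones/myzone/root/etc/passwd netadm netcfg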

Increase of DNS requests after a crit­i­cal patch update of Solaris 10

Some weeks ago we installed a critical patch update (CPU) on a Solaris 10 system (an internal system with a year's worth of CPUs to install; nothing in them affected us or was considered a security risk, but we decided to apply this one anyway so as not to fall too far behind). Afterwards we noticed that two zones were doing a lot of DNS requests. We had already noticed this before the zones went into production, and had configured a positive time to live in nscd.conf for “hosts”. Additionally we noticed a lot of DNS requests for IPv6 addresses (AAAA lookups), although absolutely no IPv6 address is configured in the zones (not even for localhost… and those are exclusive IP zones). Apparently the caching behaviour changed with one of the patches in the CPU; I am not sure whether we had the AAAA lookups before.

Today I got some time to debug this. After adding caching of “ipn­odes” in addi­tion to “hosts” (and I con­fig­ured a neg­a­tive time to live for both at the same time), the DNS requests came down to a sane amount.
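
For illustration, the relevant part of /etc/nscd.conf could look roughly like this. The attribute names are the standard nscd.conf ones, but the time-to-live values are only assumptions and not necessarily the ones we used.

```
# cache "hosts" and "ipnodes", both with a positive and a negative TTL (seconds)
enable-cache            hosts     yes
positive-time-to-live   hosts     3600
negative-time-to-live   hosts     5
enable-cache            ipnodes   yes
positive-time-to-live   ipnodes   3600
negative-time-to-live   ipnodes   5
```

After changing the file, nscd has to be restarted (on Solaris 10: svcadm restart name-service-cache).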

For the AAAA lookups I have not found a solution. By my reading of the documentation I would assume there are no IPv6 DNS lookups if there is no IPv6 address configured.

Sta­tus cryp­to cards HOWTO: prob­lems with the card read­er (sup­port could be better)

After hours (spread over weeks) I have come to the conclusion that there is a lot of potential to improve the documentation of card readers (but I doubt the card reader vendors will do it) and the pcsc documentation. It is not easy to arrive at a point where you understand everything. The compatibility list does not help much, as the card readers are partly past their end of life and the models which replace them are not listed. And the one I bought does not support all the features I need. I even ported the driver to FreeBSD (not committed, I wanted to test everything first) and a lot of stuff works, but one critical problem is that I cannot store a certificate on the crypto card, as the card reader or the driver does not support extended APDUs (needed to transfer more than 255 bytes to the card reader).
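
To illustrate where the 255 byte limit comes from: in a short APDU the length of the command data (the Lc field, as defined in ISO/IEC 7816-4) is a single byte, while an extended APDU uses three bytes (a zero byte followed by a 16 bit length). A minimal sketch of the encoding, with encode_lc being a made-up helper name:

```shell
# Print the Lc field (command data length) of an APDU as hex.
# 1..255 bytes fit into the one-byte short form; 256..65535 bytes need the
# three-byte extended form (00 followed by the length as two bytes).
# (encode_lc is a hypothetical helper name.)
encode_lc() {
  len=$1
  if [ "$len" -ge 1 ] && [ "$len" -le 255 ]; then
    printf '%02X\n' "$len"
  elif [ "$len" -ge 256 ] && [ "$len" -le 65535 ]; then
    printf '00%04X\n' "$len"
  else
    echo "length out of range: $len" >&2
    return 1
  fi
}
```

A reader or driver which only handles the short form can therefore not pass on any command carrying more than 255 bytes of data in one piece.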

Well, the sta­tus so far:

  • I have a HOWTO on what to install to use crypto cards in FreeBSD
  • I have a HOWTO on what to install / configure in Windows
  • I have a HOWTO on creating keys on an openpgp v2 card and how to use such a key with ssh on FreeBSD (or any other unix-like OS which can run pcsc)
  • I have a card read­er which does not sup­port extend­ed APDUs
  • I want to make sure what I write in the HOW­TOs is also suit­able for the use with Win­dows / PuTTY
  • it seems Win­dows needs a cer­tifi­cate and not only a key when using the Win­dows CAPI (using the ven­dor sup­plied card read­er dri­ver) in PuTTY-CSC (works at work with a USB token)
  • the pcsc pkcs11 Win­dows DLL is not suit­able yet for use on Win­dows 8 64bit
  • I contacted the card reader vendor to ask whether the card reader or the driver is the problem regarding the extended APDUs
  • I found prob­lems in gpg4win / pcsc on Win­dows 8
  • I have sent some money to the developers of gpg4win to support their work (if you use gnupg on Windows, try to send a few units of money to them, the work stagnated as they need to spend their time on paid work)

So either I need a new card read­er, or have to wait for an update of the lin­ux dri­ver of the ven­dor… which prob­a­bly means it may be a lot faster to buy a new card read­er. When look­ing for one with at least a PIN pad, I either do not find any­thing which is list­ed as sup­port­ed by pcsc on the ven­dor pages (it is incred­i­ble how hard it is to nav­i­gate the web­sites of some com­pa­nies… a lot of buzz­words but no way to get to the real prod­ucts), or they only list updat­ed mod­els where I do not know if they will work.

When I have some­thing which works with FreeB­SD and Win­dows, I will pub­lish all the HOW­TOs here at once.

Com­plete net­work loss on Solaris 10u10 CPU 2012-10 on vir­tu­al­ized T4‑2

The problem I see at work: a T4-2 with 3 guest LDOMs and virtualized disks and networks lost its complete network connectivity “out of the blue” once, and possibly also sporadically directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.

1st prob­lem:
Total network loss of the machine (no zone, guest LDOM or the primary LDOM was able to receive or send IP packets). This happened once. No idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least that is my impression based upon the description of the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only client which reported this kind of problem; the patch for it is based upon an internal bug report from internal tests.

2nd prob­lem:
After cold boots some­times some machines (not all) are not able to con­nect to an IP on the T4. A reboot helps, as does remov­ing an inter­face from an aggre­gate and direct­ly adding it again (see below for the sys­tem con­fig). To try to repro­duce the prob­lem, we did a lot of warm reboots of the pri­ma­ry domain, and the prob­lem nev­er showed up. We did some cold reboots, and the prob­lem showed up once.

In case some­one else sees one of those prob­lems on his machines too, please get in con­tact with me to see what we have in com­mon to try to track this down fur­ther and to share info which may help in maybe repro­duc­ing the problems.

Sys­tem setup:

  • T4‑2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on addi­tion­al net­work card)
  • 3 guest LDOMs and one io+control domain (both in the pri­ma­ry domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the pri­ma­ry domain uses a mir­rored zpool on SSDs
  • 5 vswitch in the hypervisor
  • 4 aggre­gates (aggr1 – aggr4 with L2-policy), each one with one igb and one nxge NIC
  • each aggre­gate is con­nect­ed to a sep­a­rate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication; all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • pri­ma­ry domain uses 2 vnets con­nect­ed to the vswitch which is con­nect­ed to aggr2 and aggr3 (con­sis­ten­cy with the oth­er LDOMs on this machine) and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
  • each vnet has VLAN tag­ging con­fig­ured in the hyper­vi­sor (with the zones being in dif­fer­ent VLANs than the LDOMs)

The pro­posed change by Ora­cle is to replace the 2 vnet inter­faces in the pri­ma­ry domain with 2 vsw inter­faces (which means to do VLAN tag­ging in the pri­ma­ry domain direct­ly instead of in the vnet con­fig). To have IPMP work­ing this means to have vsw-linkprop=phys-state. We have two sys­tems with the same set­up, on one sys­tem we already changed this and it is work­ing as before. As we don’t know how to repro­duce the 1st prob­lem, we don’t know if the prob­lem is fixed or not, respec­tive­ly what the prob­a­bil­i­ty is to get hit again by this problem.
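
The vsw side of this change can be sketched as follows. This is not our exact procedure: the helper name vsw_linkprop_cmds and the vswitch names are made up, and the sketch only prints the ldm set-vsw commands so they can be reviewed before running them by hand in the control domain.

```shell
# Print (do not run) one "ldm set-vsw linkprop=phys-state" command per
# virtual switch name given as argument.
# (vsw_linkprop_cmds and the vswitch names below are hypothetical.)
vsw_linkprop_cmds() {
  for vsw in "$@"; do
    echo "ldm set-vsw linkprop=phys-state $vsw"
  done
}

# Example: the two vswitches used by the primary domain.
vsw_linkprop_cmds primary-vsw2 primary-vsw3
```

Printing the commands first instead of executing them keeps the step harmless on a production control domain.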

Ideas / sug­ges­tions / info welcome.