Alexander Leidinger

Just another weblog

Jan
15

Com­plete net­work loss on Solaris 10u10 CPU 2012-10 on vir­tu­al­ized T4-2

The prob­lem I see at work: A T4-2 with 3 guest LDOMs, vir­tu­al­ized disks and net­works lost the com­plete net­work con­nec­tiv­ity “out of the blue” once, and maybe “spo­radic” directly after a cold boot. After a lot of dis­cus­sion with Ora­cle, I have the impres­sion that we have two prob­lems here.

1st prob­lem:
Total net­work loss of the machine (no zone or guest LDOM or the pri­mary LDOM was able to have receive or send IP pack­ets). This hap­pened once. No idea how to repro­duce it. In the logs we see the mes­sage “[ID 920994 kern.warning] WARNING: vnetX: exceeded num­ber of per­mit­ted hand­shake attempts (5) on chan­nel xxx”. Accord­ing to Ora­cle this is sup­posed to be fixed in 148677 – 01 which will come with Solaris 10u11. They sug­gested to use a vsw inter­face instead of a vnet inter­face on the pri­mary domain to at least lower the prob­a­bil­ity of this prob­lem hit­ting us. They were not able to tell us how to repro­duce the prob­lem (seems to be a race con­di­tion, at least I get this impres­sion based upon the descrip­tion of the Ora­cle engi­neer han­dling the SR). Only a reboot helped to get the prob­lem solved. I was told we are the only client which reported this kind of prob­lem, the patch for this prob­lem is based upon an inter­nal bugre­port from inter­nal tests.

2nd prob­lem:
After cold boots some­times some machines (not all) are not able to con­nect to an IP on the T4. A reboot helps, as does remov­ing an inter­face from an aggre­gate and directly adding it again (see below for the sys­tem con­fig). To try to repro­duce the prob­lem, we did a lot of warm reboots of the pri­mary domain, and the prob­lem never showed up. We did some cold reboots, and the prob­lem showed up once.

In case some­one else sees one of those prob­lems on his machines too, please get in con­tact with me to see what we have in com­mon to try to track this down fur­ther and to share info which may help in maybe repro­duc­ing the problems.

Sys­tem setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on addi­tional net­work card)
  • 3 guest LDOMs and one io+control domain (both in the pri­mary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the pri­mary domain uses a mir­rored zpool on SSDs
  • 5 vswitch in the hypervisor
  • 4 aggre­gates (aggr1 — aggr4 with L2-policy), each one with one igb and one nxge NIC
  • each aggre­gate is con­nected to a sep­a­rate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnets con­nected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr 3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone com­mu­ni­ca­tion, all LDOMs are addi­tion­ally con­nected to the machine-internal-only vswitch via the 3rd vnet)
  • pri­mary domain uses 2 vnets con­nected to the vswitch which is con­nected to aggr2 and aggr3 (con­sis­tency with the other LDOMs on this machine) and has no zones
  • this means each entity (pri­mary domain, guest LDOMs and each zone) has two vnets in and those two vnets are con­fig­ured in a link-based IPMP setup (vnet-linkprop=phys-state)
  • each vnet has VLAN tag­ging con­fig­ured in the hyper­vi­sor (with the zones being in dif­fer­ent VLANs than the LDOMs)

The pro­posed change by Ora­cle is to replace the 2 vnet inter­faces in the pri­mary domain with 2 vsw inter­faces (which means to do VLAN tag­ging in the pri­mary domain directly instead of in the vnet con­fig). To have IPMP work­ing this means to have vsw-linkprop=phys-state. We have two sys­tems with the same setup, on one sys­tem we already changed this and it is work­ing as before. As we don’t know how to repro­duce the 1st prob­lem, we don’t know if the prob­lem is fixed or not, respec­tively what the prob­a­bil­ity is to get hit again by this problem.

Ideas / sug­ges­tions / info welcome.

GD Star Rat­ing
load­ing…
GD Star Rat­ing
load­ing…

Nov
25

Which crypto card to use with FreeBSD (ssh/gpg)

The recent secu­rity inci­dent trig­gered a dis­cus­sion how to secure ssh/gpg keys.

One way I want to focus on here (because it is the way I want to use at home), is to store the keys on a crypto card. I did some research for suit­able crypto cards and found one which is called Feit­ian PKI Smart­card, and one which is called OpenPGP card. The OpenPGP card also exists in a USB ver­sion (basi­cally a small ver­sion of the card is already inte­grated into a small USB card reader).

The Feit­ian card is reported to be able to han­dle RSA keys upto 2048 bits. They do not seem to han­dle DSA (or ECDSA) keys. The smart­card quick starter guide they have  (the Tun­ing smart­card file sys­tem part) tells how to change the para­me­ters of the card to store upto 9 keys on it.

The spec of the OpenPGP card tells that it sup­ports RSA keys upto 3072 bits, but there are reports that it is able to han­dle RSA keys upto 4096 bits (you need to have at least GPG 2.0.18 to han­dle that big keys on the crypto card). It looks to me like the card is not han­dle DSA (or ECDSA) cards. There are only slots for upto 3 keys on it.

If I go this way, I would also need a card reader. It seems a class 3 one (hard­ware PIN pad and dis­play) would be the most “future-proof” way to go ahead. I found a Reiner SCT cyber­Jack sec­oder card reader, which is believed to be sup­ported by OpenSC and seems to be a good bal­ance between cost and fea­tures of the Reiner SCT card readers.

If any­one read­ing this can sug­gest a bet­ter crypto card (keys upto 4096 bits, more than 3 slots, and/or DSA/ECDSA  sup­port), or a bet­ter card reader, or has any prac­ti­cal expe­ri­ence with any of those com­po­nents on FreeBSD, please add a comment.

GD Star Rat­ing
load­ing…
GD Star Rat­ing
load­ing…

Sep
30

Forc­ing a route in Solaris?

I have a lit­tle prob­lem find­ing a clean solu­tion to the fol­low­ing problem.

A machine with two net­work inter­faces and no default route. The first inter­face gets an IP at boot time and the cor­re­spond­ing sta­tic route is inserted dur­ing boot into the rout­ing table with­out prob­lems. The sec­ond inter­face only gets an IP address when the shared-IP zones on the machine are started, dur­ing boot the inter­face is plumbed but with­out any address. The net­works on those inter­faces are not con­nected and the machine is not a gate­way (this means we have a machine–admin­is­tra­tion net­work and a production-network). The sta­tic routes we want to have for the addresses of the zones are not added to the rout­ing table, because the next hop is not reach­able at the time the routing-setup is done. As soon as the zones are up (and the inter­face gets an IP), a re-run of the routing-setup adds the miss­ing sta­tic routes.

Unfor­tu­nately I can not tell Solaris to keep the sta­tic route even if the next hop is not reach­able ATM (at least I have not found an option to the route com­mand which does this).

One solu­tion to this prob­lem would be to add an address at boot to the inter­face which does not have an address at boot-time ATM (prob­a­bly with the dep­re­cated flag set). The prob­lem is, that this sub­net (/28) has not enough free addresses any­more, so this is not an option.

Another solu­tion is to use a script which re-runs the routing-setup after the zones are started. This is a prag­matic solu­tion, but not a clean solution.

As I under­stand the in.routed man-page in.routed is not an option with the default con­fig, because the machine shall not route between the net­works, and shall not change the rout­ing based upon RIP mes­sages from other machines. Unfor­tu­nately I do not know enough about it to be sure, and I do not get the time to play around with this. I have seen some inter­st­ing options regard­ing this in the man-page, but play­ing around with this and sniff­ing the net­work to see what hap­pens, is not an option ATM. Any­one with a config/tutorial for this “do not broad­cast any­thing, do not accept any­thing from outside”-case (if possible)?

GD Star Rat­ing
load­ing…
GD Star Rat­ing
load­ing…

Tags: , , , , , , , , ,
Jan
10

Why are game console/TV com­pa­nies not imple­ment­ing this?

At the week­end a friend vis­ited me. We have not seen since each other since a long time. As we stud­ied both com­puter sci­ence, parts of our dis­cus­sion where off course tech­nol­ogy related. Parts of the dis­cus­sion where about cur­rent TV’s and game con­soles (he par­tic­i­pated in the design of the PS3 CPU, so he is well aware about the tech­ni­cal lim­i­ta­tions of the hard­ware the cur­rent game con­soles use).

Dur­ing our dis­cus­sion we talked about the soft­ware lim­i­ta­tions of such hardware.

Cur­rent TV’s come for exam­ple with some pre­de­fined inter­net chan­nels, but not with a real web browser. We think that peo­ple which keep a TV for 10 years or longer (like for exam­ple our par­ents and prob­a­bly both of us too) this will result in a loss of fea­tures after some years, because those chan­nels will get less atten­tion of case to exist at all. There is also no way to switch to alter­na­tives then, except by buy­ing a new TV (we expect that there will be no firmware update in such a case). With a real web browser this would not be an issue (it may be more easy to enter URL’s with a real key­board than with a remote con­trol, but let us do small steps here). Game con­soles are a bit bet­ter in this regard, but there we have the prob­lem that some web­sites are too much mem­ory hun­gry (they do not include the user agent of the game con­sole browsers in the same class as smart phones or tablet PCs… from the size aspect they are not, but from the mem­ory and com­put­ing power aspect they are more similar).

I would expect that the TV sta­tions do not want to have TVs with really good browsers, because then you may not need a TV sta­tion any­more. But this is what users would use if it would be there.

Another deficit is that there is not a mail pro­gram in game con­soles and TV’s. For writ­ing mails you need a real key­board, but for a quick check if there is mail (e.g. X unread mails, or maybe even dis­play­ing the sub­ject line of the emails) or maybe to just read with­out answer­ing a solu­tion with­out a key­board con­nected would already be enough.

I expect that con­sole man­u­fac­tur­ers do not want to spend money for some­thing peo­ple are not will­ing to give much money for, respec­tively for some­thing where they can not make money with (an email ser­vice from the con­sole com­pany would be another mail ser­vice addi­tional to the one for the PC and maybe addi­tional to the one of the smart phone… peo­ple do not need 10 email accounts, one is enough).

Another over­looked fea­ture is some kind of VoIP+Video fea­ture (at least for the game con­soles which have option­ally a cam­era, but IMO this is also pos­si­ble for the next gen­er­a­tion of TV’s with build-in web­cams). At least the offer­ings from Sony and Microsoft are pow­er­ful enough to come with some kind of video con­fer­enc­ing soft­ware. It does not mat­ter much if this is Skype or the Google ver­sion of this, or some other wide­spread one (MS surely wants to use their own stuff), it just has to be one which is in wide­spread use to be adopted by the people.This does not need to be in HD, even a small video would already be much more than what is avail­able ATM.

Basi­cally I gave the answer to my ques­tion (the title of this post­ing) myself (except for the video con­fer­enc­ing stuff)… but on the other hand this would be some­thing which could set a prod­uct apart from oth­ers. For the PS3 this may be now one of the things which could show up in the Home­brew scene, now that the secu­rity of the PS3 is com­pro­mised. For the Wii at least the email part could be eas­ily done. The rest… would have to catch up in case some­thing like this shows up for the PS3 and is used extensively.

GD Star Rat­ing
load­ing…
GD Star Rat­ing
load­ing…

Tags: , , , , , , , , ,
Oct
08

Fight­ing with the Ora­cle Direc­tory Server 7 (DSEE7) on Solaris 10 update 9

After mov­ing our sec­ondary man­age­ment site (our team is split up into 2 dif­fer­ent loca­tions) to a new build­ing, we decided to clean-up some things. One of those things involves mov­ing the LDAP to a dif­fer­ent machine (more or less a new server for the new site, it is inde­pen­dent regard­ing LDAP/homes/… from the pri­mary site). While I am at it, I take the oppor­tu­nity to move from DSEE5 to DSEE7 (my pre­vi­ous post about the DSEE6 migra­tion was at the pri­mary site). This time I took the pack­age dis­tri­b­u­tion instead of the zip dis­tri­b­u­tion (the main rea­son is that I can get patch-listings with an auto­matic tool, and the sec­ondary man­age­ment site has no disaster-recovery require­ments for the appli­ca­tions… we just will setup a new sec­ondary site some­where else if necessary).

Here my expe­ri­ences with the instal­la­tion instruc­tions of DSEE7.

  • The install instruc­tions refer to the web inter­face for the DSEE7 man­age­ment, but I have not seen some­thing which tells you first have to setup an appli­ca­tion server (this was bet­ter in the DSEE6 instructions).
  • When using the Glass­fish appli­ca­tion server which comes with Solaris 10 for the web inter­face, you will get an excep­tion after deploy­ing the dscc7.war, as it is using an out­dated JVM. After some fight­ing and Googling, I found that I have to change the AS_JAVA value in /usr/appserver/config/asenv.conf to a more recent JVM as it is point­ing to the very out­dated j2se 1.4.x. I pointed it to /usr/java (which is a sym­link to the most recent ver­sion installed as a pack­age). Instead of the orig­i­nal excep­tion I got another one now (after a redi­rec­tion in the web-browser), some­thing that it can not find the AntMain class (Glass­fish uses ANT from /usr/sfw, this is the one which comes with Solaris 10 update 9). I tried with Java 5 instead of Java 6, but I get the same error. In the net there are some dis­cus­sions about such errors (it is even a FAQ at the ANT site), but this Glassfish/DSEE7 thing is a black box for me, so what am I sup­posed to do here (I do not want to put the sys­tem into an unof­fi­cial state by installing my own ANT for Glassfish/DSEE7)?
    It was not men­tioned in the Appen­dix of the DSEE7 install instruc­tions which explains how to install the .war in Glass­fish that you have to change to a more recent JVM, and I still fight with the AntMain prob­lem (hey Ora­cle, there is room for improve­ment in the prod­uct com­pat­i­bil­ity test­ing and doc­u­men­ta­tion ver­i­fi­ca­tion process).

I will update this post­ing when I make some advance­ments. For now I let the web inter­face in the bad state as it is and con­cen­trate on fin­ish­ing the LDAP move to the new sys­tem (installing an DSEE on a backup sys­tem, con­fig­ur­ing repli­ca­tion, switch­ing the clients to them). The web inter­face is inde­pen­dent enough to han­dle it later (hints wel­come, that is the main pur­pose why I write this pos­ing in the mid­dle of the work).

GD Star Rat­ing
load­ing…
GD Star Rat­ing
load­ing…

Tags: , , , , , , , , ,