How to debug silent hang on shutdown of Solaris 10?


We're experiencing a mysterious hang on shutdown of a newly-imaged Oracle/Sun Solaris 10 SPARC box. It is repeatable (in the same spot ... from what we can tell). On several attempts we waited 5-10 minutes for it to work itself out, and it never progressed.

I've never seen this happen before.

The last thing displayed on the console is that syslogd was sent signal 15. Prior to us disabling snmpdx on the box, the last thing on the console was that snmpdx was sent signal 15 (after syslogd was sent signal 15).

Hangs like this were always very rare, but in earlier Solaris releases I'd have had a better idea from experience where the problem might be, and could then narrow things down further with silly (but effective) debugging echo statements in the /etc/*.d scripts. With SMF in the picture, I'm not sure where to start.

We forced a crash dump via sync at the {ok} prompt for later analysis, and then let the box come up because it's a production server and our scheduled outage window was closing.

/var/adm/messages shows nothing of use.

How would you debug this situation? Is there a way to stop hiding what is happening at shutdown and show, for example, each step as it is performed? (I never did like how much Solaris 10 hides at boot time, either.)

Running ::ps in mdb against the saved crash dump shows that the following processes were running at hang time (afsd is the OpenAFS client, and that many instances are expected):

> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000001 00000000018387c0 sched
R    108      0      0      0      0 0x00020001 00000600110fe010 zpool-silmaril-p
R      3      0      0      0      0 0x00020001 0000060010b29848 fsflush
R      2      0      0      0      0 0x00020001 0000060010b2a468 pageout
R      1      0      0      0      0 0x4a024000 0000060010b2b088 init
R   1327      1   1327    329      0 0x4a024002 00000600176ab0c0 reboot
R    747      1      7      7      0 0x42020001 0000060017f9d0e0 afsd
R    749      1      7      7      0 0x42020001 00000600180104d0 afsd
R    752      1      7      7      0 0x42020001 0000060017cb44b8 afsd
R    754      1      7      7      0 0x42020001 0000060017fc8068 afsd
R    756      1      7      7      0 0x42020001 0000060017fcb0e8 afsd
R    760      1      7      7      0 0x42020001 00000600177f4048 afsd
R    762      1      7      7      0 0x42020001 000006001800f8b0 afsd
R    764      1      7      7      0 0x42020001 000006001800ec90 afsd
R    378      1    378    378      0 0x42020000 0000060013aee480 inetd
R      7      1      7      7      0 0x42020000 0000060010b28008 svc.startd
R    329      7    329    329      0 0x4a024000 00000600110ff850 sh
Z    317      7    317    317      0 0x4a014002 0000060013b3a490 sac 
unix
solaris
asked on Server Fault Feb 21, 2011 by jblaine • edited Feb 23, 2011 by jblaine

5 Answers


You can still use the old-fashioned approach of putting echo statements in the scripts under SMF. Just go into /lib/svc/method and edit away. Judging from that process list alone, I'd say it was AFS related, but I haven't used AFS.
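
A minimal sketch of that kind of instrumentation, assuming a Bourne-compatible shell. The trace function and the TRACE_LOG path are made up for illustration and are not part of SMF. The log goes under /var/tmp rather than /tmp so that the lines survive the reboot.

```sh
# Hypothetical helper to paste near the top of a method script.
# TRACE_LOG is an assumed location, not an SMF convention.
TRACE_LOG=${TRACE_LOG:-/var/tmp/shutdown-trace.log}

trace() {
    # Append a timestamped line naming the script and the step reached.
    echo "`date '+%H:%M:%S'` $0: $*" >> "$TRACE_LOG"
}

trace "stop method entered"
# ... the script's existing stop logic stays here ...
trace "stop method finished"
```

After the next boot, the last line in the log shows how far the script got before the hang.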

answered on Server Fault Feb 21, 2011 by JOTN

Two things come to my mind immediately. Firstly, I've had NFS do this when a client was shutting down and there was no server available.

Alternatively, for debugging I'd look at using the -x option to sh or ksh in the startup scripts, assuming you can still get at them.
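
As a sketch of what that looks like inside a script: set -x makes the shell print each command to stderr before running it. The XTRACE_LOG path below is an assumption, and the subshell is only there to keep the example self-contained. In a real script the exec and set -x lines would sit at the top of the file.

```sh
# XTRACE_LOG is an assumed path chosen for this example.
XTRACE_LOG=${XTRACE_LOG:-/var/tmp/method-xtrace.log}

(
    exec 2>>"$XTRACE_LOG"   # append stderr, where the trace goes, to the log
    set -x                  # print each command before it runs
    : "stop logic would run here"
)
```

The last command recorded in the trace log is the one that was running, or about to run, when the shutdown stalled.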

answered on Server Fault Feb 21, 2011 by Mei

/var/svc/log contains log files for all services that fall under SMF control. That's at least a starting point for debugging issues with SMF processes.
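
To see which services logged last before the hang, one option is to sort that directory by modification time. The recent_logs function below is a hypothetical helper, not a Solaris tool, and it assumes the log file names contain no spaces.

```sh
recent_logs() {
    # $1: log directory (defaults to /var/svc/log)
    # $2: number of most recently modified files to show (defaults to 5)
    dir=${1:-/var/svc/log}
    count=${2:-5}
    for f in `ls -t "$dir" | head -n "$count"`; do
        echo "== $f"
        tail -3 "$dir/$f"   # last three lines of each log
    done
}
```

Once the box is back up, recent_logs /var/svc/log 5 would print the tail of the five logs written to most recently.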

I agree with David: check that no NFS mounts from this server are being held open, and that this server does not mount NFS filesystems from other hosts that may be unavailable.

answered on Server Fault Feb 22, 2011 by Shaun Dewberry

It's been a while, but from what I recall, if you boot the box with the kernel arguments -v -m verbose, you'll get kernel messages (-v) and SMF service messages (-m verbose) displayed on the console. At least that way, you could get a better idea of what it's trying to do...

answered on Server Fault Feb 23, 2011 by Mark Round

Running tail /var/svc/log/rc6.log helps if you performed an init 6. Note, however, that any fmd problem can also cause the shutdown to hang.

answered on Server Fault Dec 13, 2013 by jpy

User contributions licensed under CC BY-SA 3.0