We're experiencing a mysterious hang on shutdown of a newly-imaged Oracle/Sun Solaris 10 SPARC box. It is repeatable (in the same spot ... from what we can tell). We let it try to work itself out multiple times for 5-10 minutes and it never progressed.
I've never seen this happen before.
The last thing displayed on the console is that syslogd was sent signal 15. Prior to us disabling snmpdx on the box, the last thing on the console was that snmpdx was sent signal 15 (after syslogd was sent signal 15).
Hangs like this were very rare, but in Solaris days past I'd have had a better idea from experience where the problem might be, and could then narrow things down further with silly (but effective) debugging echo statements in the /etc/*.d scripts. With SMF in the picture, I'm not really sure where to start.
We forced a crash dump via sync at the {ok} prompt for later analysis, and then let the box come up because it's a production server and our scheduled outage window was closing.
/var/adm/messages shows nothing of use.
How would you debug this situation? Is there a way to stop hiding what is happening at shutdown, to show, for example, each thing that is being done? (I never did like that a lot is hidden since Solaris 10 at boot-time too)
Running ::ps in mdb against the savecore shows the following processes were running at hang time (afsd is the OpenAFS client, and that many instances are expected):
> ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 0 0 0 0 0 0x00000001 00000000018387c0 sched
R 108 0 0 0 0 0x00020001 00000600110fe010 zpool-silmaril-p
R 3 0 0 0 0 0x00020001 0000060010b29848 fsflush
R 2 0 0 0 0 0x00020001 0000060010b2a468 pageout
R 1 0 0 0 0 0x4a024000 0000060010b2b088 init
R 1327 1 1327 329 0 0x4a024002 00000600176ab0c0 reboot
R 747 1 7 7 0 0x42020001 0000060017f9d0e0 afsd
R 749 1 7 7 0 0x42020001 00000600180104d0 afsd
R 752 1 7 7 0 0x42020001 0000060017cb44b8 afsd
R 754 1 7 7 0 0x42020001 0000060017fc8068 afsd
R 756 1 7 7 0 0x42020001 0000060017fcb0e8 afsd
R 760 1 7 7 0 0x42020001 00000600177f4048 afsd
R 762 1 7 7 0 0x42020001 000006001800f8b0 afsd
R 764 1 7 7 0 0x42020001 000006001800ec90 afsd
R 378 1 378 378 0 0x42020000 0000060013aee480 inetd
R 7 1 7 7 0 0x42020000 0000060010b28008 svc.startd
R 329 7 329 329 0 0x4a024000 00000600110ff850 sh
Z 317 7 317 317 0 0x4a014002 0000060013b3a490 sac
You can still do the old-fashioned trick of putting echos in the scripts when using SMF. Just go into /lib/svc/method and edit away. Just from that process list I'd say it was AFS related, but I haven't used AFS.
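A minimal sketch of what those echos might look like, assuming you paste it near the top of a method script. The names dbg and DEBUG_LOG are made up for illustration. /dev/msglog is the console message device that many stock method scripts already write to, so the markers should show up on the console during shutdown.

```shell
# Hypothetical debug helper for a script under /lib/svc/method.
# DEBUG_LOG and dbg are illustrative names, not part of SMF.
# /dev/msglog is the Solaris console message device.
DEBUG_LOG=${DEBUG_LOG:-/dev/msglog}

dbg() {
    # Append a timestamped marker tagged with the script's name.
    echo "`date '+%H:%M:%S'` [`basename "$0"`] $*" >> "$DEBUG_LOG"
}
```

Calling something like `dbg "stop: about to kill daemon"` before and after each step of the stop method should tell you which script, and which line, was the last to run before the hang.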
Two things come to my mind immediately. Firstly, I've had NFS do this when a client was shutting down and there was no server available.
Alternately, in debugging I'd look at using the -x option to sh or ksh in the startup, assuming you can get at the scripts still.
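To show what -x gives you, here is a small self-contained sketch. The script demo_method.sh is a made-up stand-in, not a real method script. With -x the shell prints each command to stderr, prefixed with "+", before running it.

```shell
# Hypothetical stand-in for a method script (demo_method.sh is an assumed name).
cat > /tmp/demo_method.sh <<'EOF'
echo "stopping demo service"
sleep 0
EOF

# Run it under -x and capture the trace, which goes to stderr.
sh -x /tmp/demo_method.sh 2> /tmp/demo_method.trace

# Each traced command appears on its own line with a leading "+".
cat /tmp/demo_method.trace
```

For a real method script you would more likely add `set -x` near the top, or append -x to the interpreter on the first line. As far as I recall, SMF sends a method's stdout and stderr to that service's log file under /var/svc/log, so that is where I'd expect the trace to land.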
/var/svc/log contains log files for all services that fall under SMF control. That's at least a starting point for debugging issues with SMF processes.
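After a hung shutdown, the logs written to last are usually the interesting ones. A rough sketch for listing them once the box is back up, where recent_svc_logs is a made-up helper name:

```shell
# Hypothetical helper: list the N most recently modified logs in a directory.
# Defaults assume the standard SMF log location.
recent_svc_logs() {
    dir=${1:-/var/svc/log}
    n=${2:-5}
    # ls -t sorts newest first; keep only the first N names.
    ls -t "$dir" | head -n "$n"
}
```

Run `recent_svc_logs` with no arguments and then tail the files it names. `svcs -x` is also worth a look, since it explains services that are not in their expected state.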
I agree with David, check there are no NFS mounts from this server that might be held open and that this server does not mount other NFS filesystems that may not be available.
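A quick way to check the client side is to pull the NFS entries out of /etc/mnttab, whose first three fields are the resource, the mount point and the filesystem type. This is only a sketch, and list_nfs_mounts is a made-up name:

```shell
# Hypothetical helper: print "mountpoint<TAB>resource" for every NFS entry.
# Reads /etc/mnttab by default; pass another file to test against a copy.
list_nfs_mounts() {
    # Field 3 is the filesystem type; match nfs, nfs3, nfs4 and so on.
    awk '$3 ~ /^nfs/ { print $2 "\t" $1 }' "${1:-/etc/mnttab}"
}
```

If this prints anything, make sure each of those servers is reachable before the next shutdown attempt. For the other direction, `share` and `showmount -a` are the usual places to see what this box exports and which clients have it mounted.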
It's been a while, but from what I recall, if you boot the box with the kernel arguments -v -m verbose, you'll get kernel messages (-v) and SMF service messages (-m verbose) displayed on the console. At least that way, you could get a better idea of what it's trying to do...
tail /var/svc/log/rc6.log helps if you performed an init 6. However, any instance of fmd problems can cause it to hang.