Healthy Dose of MDB
For the past several months, we had been encountering a baffling problem with Solaris 10 systems hanging on
their way up after a reboot (or simply a boot) into multi-user mode (run level 3). Since we do our system
patching at regular intervals (and since Solaris runs so well that we don't need to reboot our servers otherwise),
we didn't notice this until we came upon our latest patch cycle last month.
We noticed that our systems were not booting up the normal way. The only way to boot was to "trick" the system
into booting into single-user mode and then exiting out to the multi-user run level.
This problem was happening only on systems running Solaris 10 with Veritas Volume Manager 5.0 MP1, so
naturally we first tackled it as a possible VxVM problem (introduced by some patch during the
patch cycle). The patch rev we had decided to apply included Veritas Storage Foundation 5.0 MP1 RP4.
Description:
Systems running Solaris 10 and Veritas Volume Manager tend to hang when rebooted with a normal init 6 (i.e., while
booting into run level 3).
Workaround:
Boot the system in run level S (single user mode) and then exit out of it to boot into multi-user mode.
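As a console session, the workaround looks roughly like this (prompts and wording are illustrative, not a verbatim capture):

```
ok boot -s              # from the OBP: boot into run level S (single-user)
...
Root password for system maintenance (control-d to bypass):
# exit                  # leave the single-user shell; boot continues to run level 3
```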
Details:
Opened a case with Veritas and sent them VRTSexplorer output and copies of our messages file from problem
host nodeA. After several iterations of generating explorers and Veritas configuration information, Veritas still
didn't have anything substantial they could pin this issue on. They asked me to generate a crash dump of the
hanging system (you should usually be able to force a core dump of a running host by breaking into it and running
"sync" from the OBP). After repeated attempts at generating the core, I was unable to do so. It seems the system
hangs before the dump device is initialized/configured by the OS. Using the workaround, I enabled the Solaris
deadman timer (which, incidentally, we should have on all our servers). This involves setting the following line in
the /etc/system file (note that lines beginning with * in /etc/system are comments, so the setting must be
uncommented to take effect):
set snooping = 1
What the deadman timer does is send a high-priority hardware interrupt to each CPU (or core or strand,
depending on the platform) and update a counter upon a successful response from the CPU. If the system is
hung (especially due to hardware issues), this count might not increase with each clock tick (interrupts
are sent to the CPUs every tick). When the counter is not incremented, the kernel panics and kills itself. This didn't
work here because we weren't encountering that kind of problem (but it is a good idea to have it enabled nonetheless).
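After the reboot, you can read the variable back out of the running kernel to confirm the timer is armed (a quick sketch; requires root):

```
# echo 'snooping/D' | mdb -k    # prints 1 when the deadman timer is enabled
```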
Next, boot the server in kernel debug mode (on SPARC systems, by running boot -k from the OBP). This loads
the kmdb module into the kernel as it boots up. Breaking into the system while in kernel debug mode will not drop it
into the OBP but will instead launch an mdb interface.
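Roughly, the kmdb workflow looks like this (console sketch; the CPU id in the prompt will vary):

```
ok boot -k      # OBP: boot with the kernel debugger loaded
...
# send a break to the console (Stop-A / L1-A, or ~# on some console servers);
# instead of dropping to the ok prompt, the system enters kmdb:
[0]>
```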
From the kmdb prompt, I tried to force a crash dump with the systemdump macro, which promptly hit a bad trap:
[17]> $<systemdump
panic[cpu17]/thread=2a101a3fca0: BAD TRAP: type=9 rp=2a101a3f760 addr=0 mmu_fsr=0
So, back to kernel debug mode again. This time I decided to investigate the kernel state by hand (you can do this
against a coredump as well).
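For the coredump case, the same dcmds work post mortem: point mdb at the saved kernel image and core (the paths below are the Solaris defaults for host nodeA):

```
# cd /var/crash/nodeA
# mdb unix.0 vmcore.0
>
```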
[17]> ::msgbuf
MESSAGE
/pci@780/pci@0/pci@9/scsi@0 (mpt0):
mpt0 supports power management.
/pci@780/pci@0/pci@9/scsi@0 (mpt0):
DMA restricted to lower 4GB due to errata
/pci@780/pci@0/pci@9/scsi@0 (mpt0):
mpt0 Firmware version v1.9.0.0 (IR)
/pci@780/pci@0/pci@9/scsi@0 (mpt0):
mpt0: IOC Operational.
/pci@780/pci@0/pci@9/scsi@0 (mpt0):
mpt0: Initiator WWNs: 0x5080020000262858-0x508002000026285b
PCI-device: scsi@0, mpt0
mpt0 is /pci@780/pci@0/pci@9/scsi@0
sd1 at mpt0: target 0 lun 0
sd1 is /pci@780/pci@0/pci@9/scsi@0/sd@0,0
/pci@780/pci@0/pci@9/scsi@0/sd@0,0 (sd1) online
root on /pci@780/pci@0/pci@9/scsi@0/disk@0,0:a fstype ufs
px1 at root: 0x7c0 0x0
px1 is /pci@7c0
PCI Express-device: pci@0, pxb_plx5
pxb_plx5 is /pci@7c0/pci@0
PCI-device: pci@1, pxb_plx6
pxb_plx6 is /pci@7c0/pci@0/pci@1
PCI-device: pci@0, px_pci0
px_pci0 is /pci@7c0/pci@0/pci@1/pci@0
PCI-device: ide@8, uata0
uata0 is /pci@7c0/pci@0/pci@1/pci@0/ide@8
WARNING: px1: spurious interrupt from ino 0x4
uata-0#0
cpu13: UltraSPARC-T1 (cpuid 13 clock 1000 MHz)
cpu13 initialization complete - online
cpu14: UltraSPARC-T1 (cpuid 14 clock 1000 MHz)
cpu14 initialization complete - online
cpu15: UltraSPARC-T1 (cpuid 15 clock 1000 MHz)
cpu15 initialization complete - online
cpu16: UltraSPARC-T1 (cpuid 16 clock 1000 MHz)
cpu16 initialization complete - online
cpu17: UltraSPARC-T1 (cpuid 17 clock 1000 MHz)
cpu17 initialization complete - online
cpu18: UltraSPARC-T1 (cpuid 18 clock 1000 MHz)
cpu18 initialization complete - online
cpu19: UltraSPARC-T1 (cpuid 19 clock 1000 MHz)
cpu19 initialization complete - online
cpu20: UltraSPARC-T1 (cpuid 20 clock 1000 MHz)
cpu20 initialization complete - online
cpu21: UltraSPARC-T1 (cpuid 21 clock 1000 MHz)
cpu21 initialization complete - online
cpu22: UltraSPARC-T1 (cpuid 22 clock 1000 MHz)
cpu22 initialization complete - online
cpu23: UltraSPARC-T1 (cpuid 23 clock 1000 MHz)
cpu23 initialization complete - online
USB 1.10 device (usb3eb,3301) operating at full speed (USB 1.x) on USB 1.10 root
hub: hub@1, hubd1 at bus address 2
hubd1 is /pci@7c0/pci@0/pci@1/pci@0/usb@6/hub@1
/pci@7c0/pci@0/pci@1/pci@0/usb@6/hub@1 (hubd1) online
PCI-device: pci@0,2, px_pci1
px_pci1 is /pci@7c0/pci@0/pci@1/pci@0,2
PCI-device: pci@1, pxb_plx1
pxb_plx1 is /pci@780/pci@0/pci@1
PCI-device: pci@2, pxb_plx2
pxb_plx2 is /pci@780/pci@0/pci@2
PCI-device: pci@8, pxb_plx3
pxb_plx3 is /pci@780/pci@0/pci@8
PCI-device: pci@2, pxb_plx7
pxb_plx7 is /pci@7c0/pci@0/pci@2
PCI-device: pci@8, pxb_plx8
pxb_plx8 is /pci@7c0/pci@0/pci@8
PCI-device: pci@9, pxb_plx9
pxb_plx9 is /pci@7c0/pci@0/pci@9
NOTICE: e1000g0 registered
Intel(R) PRO/1000 Network Connection, Driver Ver. 5.1.11
NOTICE: pciex8086,105e - e1000g[0] : Adapter copper link is down.
NOTICE: e1000g1 registered
Intel(R) PRO/1000 Network Connection, Driver Ver. 5.1.11
NOTICE: pciex8086,105e - e1000g[1] : Adapter copper link is down.
NOTICE: e1000g3 registered
Intel(R) PRO/1000 Network Connection, Driver Ver. 5.1.11
NOTICE: pciex8086,105e - e1000g[3] : Adapter copper link is down.
NOTICE: pciex8086,105e - e1000g[0] : Adapter 1000Mbps full duplex copper link is
up.
NOTICE: pciex8086,105e - e1000g[1] : Adapter 1000Mbps full duplex copper link is
up.
pseudo-device: devinfo0
devinfo0 is /pseudo/devinfo@0
NOTICE: pciex8086,105e - e1000g[3] : Adapter 1000Mbps full duplex copper link is
up.
NOTICE: VxVM vxdmp V-5-0-34 added disk array DISKS, datype = Disk
NOTICE: VxVM vxdmp V-5-3-1700 dmpnode 300/0x0 has migrated from enclosure FAKE_ENCLR_SNO to enclosure DISKS
NOTICE: e1000g2 registered
Intel(R) PRO/1000 Network Connection, Driver Ver. 5.1.11
[17]> ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 0 0 0 0 0 0x00000001 0000000001859480 sched
R 3 0 0 0 0 0x00020001 000006001172d838 fsflush
R 2 0 0 0 0 0x00020001 000006001172e450 pageout
R 1 0 0 0 0 0x4a004000 000006001172f068 init
R 79 1 78 78 0 0x42010000 0000060017f30028 ssmagent.bin
R 9 1 9 9 0 0x42000000 000006001292f070 svc.configd
R 7 1 7 7 0 0x42000000 000006001172c008 svc.startd
R 54 7 7 7 0 0x4a004000 0000060017e43080 vxvm-sysboot
R 56 54 7 7 0 0x4a004000 0000060017e42468 vxconfigd
R 57 56 57 57 0 0x42020000 0000060017dc8018 vxconfigd
[17]> 0000060017e43080::findstack -v
kmdb: thread 60017e43080 isn't in memory
[17]> 0000060017e43080::walk thread
30003eba480
[17]> 0000060017e43080::walk thread | ::findstack -v
stack pointer for thread 30003eba480: 2a101514ff1
[ 000002a101514ff1 cv_wait_sig_swap_core+0x130() ]
000002a1015150a1 waitid+0x484(0, 60017e42468, 0, 60017e430e8, 0, 1)
000002a101515171 waitsys32+0x10(0, 38, ffbffa80, 83, 39590, 3a57c)
000002a1015152e1 syscall_trap32+0xcc(0, 38, ffbffa80, 83, 39590, 3a57c)
[17]>
That last one turned something up, but no immediate red flags. Then I noticed ssmagent running. SSM Agent
is an SNMP agent used to monitor our systems and report back to Netcool, our fault management server.
[17]> ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 0 0 0 0 0 0x00000001 0000000001859480 sched
R 3 0 0 0 0 0x00020001 000006001172d838 fsflush
R 2 0 0 0 0 0x00020001 000006001172e450 pageout
R 1 0 0 0 0 0x4a004000 000006001172f068 init
R 79 1 78 78 0 0x42010000 0000060017f5e028 ssmagent.bin
R 43 1 42 42 0 0x42020000 0000060017db7078 dhcpagent
R 9 1 9 9 0 0x42000000 0000060012922458 svc.configd
R 7 1 7 7 0 0x42000000 000006001172c008 svc.startd
R 54 7 7 7 0 0x4a004000 000006001172cc20 vxvm-sysboot
R 56 54 7 7 0 0x4a004000 0000060017e10468 vxconfigd
R 57 56 57 57 0 0x42020000 0000060017db4018 vxconfigd
[17]> 0000060017f5e028::walk thread | ::findstack -v
stack pointer for thread 30003fb55a0: 2a101e02961
[ 000002a101e02961 sema_p+0x130() ]
000002a101e02a11 biowait+0x6c(60017dda100, 0, 1870000, 30003dbe000, 2000, 60017dda100)
000002a101e02ac1 ufs`ufs_getpage_miss+0x2ec(60019a0f300, 40000, 4de, 600129cea20, fdba0000, 2a101e03760)
000002a101e02bc1 ufs`ufs_getpage+0x694(300014b7e00, 40000, 1, 0, 1, 3)
000002a101e02d21 fop_getpage+0x44(60019a0f300, 600114659c0, 60011403978, 3, fdba0000, 3)
000002a101e02df1 segvn_fault+0xb04(8000, 600129cea20, 3, 2000, 40000, 0)
000002a101e02fc1 as_fault+0x4c8(600129cea20, 600129d9200, fdba0000, 60011736320, 189eb00, 0)
000002a101e030d1 pagefault+0x68(fdba14a8, 0, 3, 0, 60017f5e028, 600117362a8)
000002a101e03191 trap+0xd50(2a101e03b90, 10000, 0, 3, fdba14a8, 0)
000002a101e032e1 utl0+0x4c(ff3f40fc, ff3f5a70, 1, 0, ff3f4910, 821)
stack pointer for thread 30003f08820: 2a102005091
[ 000002a102005091 cv_wait_sig_swap_core+0x130() ]
000002a102005141 lwp_park+0x130(0, 1, 30003f089c6, 30003f08820, 0, 100000)
000002a102005231 syslwp_park+0x54(0, 0, 0, 0, ff092010, 1)
000002a1020052e1 syscall_trap32+0xcc(0, 0, 0, 0, ff092010, 1)
stack pointer for thread 30003fb4560: 2a101d9f091
[ 000002a101d9f091 cv_wait_sig_swap_core+0x130() ]
000002a101d9f141 lwp_park+0x130(0, 1, 30003fb4706, 30003fb4560, 0, 100000)
000002a101d9f231 syslwp_park+0x54(0, 0, 0, 0, ff092020, 1)
000002a101d9f2e1 syscall_trap32+0xcc(0, 0, 0, 0, ff092020, 1)
stack pointer for thread 30003ef77c0: 2a1008e9091
[ 000002a1008e9091 cv_wait_sig_swap_core+0x130() ]
000002a1008e9141 lwp_park+0x130(0, 1, 30003ef7966, 30003ef77c0, 0, 100000)
000002a1008e9231 syslwp_park+0x54(0, 0, 0, 0, ff092030, 1)
000002a1008e92e1 syscall_trap32+0xcc(0, 0, 0, 0, ff092030, 1)
stack pointer for thread 30003fd35c0: 2a101db6f91
[ 000002a101db6f91 cv_timedwait_sig+0x16c() ]
000002a101db7041 cv_waituntil_sig+0x8c(60017dc8592, 60017dc8558, 2a101db7ad0, 2, 18f0800, 2)
000002a101db7111 poll_common+0x4e8(60012996580, 60017f5e028, 2a101db7ad0, 0, fe57bcd0, 2)
000002a101db7201 pollsys+0xf8(fe57bcd0, 1, fe57bd70, 0, 2a101db7ad0, 0)
000002a101db72e1 syscall_trap32+0xcc(fe57bcd0, 1, fe57bd70, 0, fe57bd70, 0)
stack pointer for thread 30003efa460: 2a101df3091
[ 000002a101df3091 cv_wait_sig_swap_core+0x130() ]
000002a101df3141 lwp_park+0x130(0, 1, 30003efa606, 30003efa460, 0, 100000)
000002a101df3231 syslwp_park+0x54(0, 0, 0, 0, ff092050, 1)
000002a101df32e1 syscall_trap32+0xcc(0, 0, 0, 0, ff092050, 1)
stack pointer for thread 30003fe55e0: 2a10089efc1
[ 000002a10089efc1 cv_timedwait_sig+0x16c() ]
000002a10089f071 cv_waituntil_sig+0x8c(30003fe5786, 30003fe5788, 2a10089fa10, 2, 18f0800, 2)
000002a10089f141 lwp_park+0x130(fdffbd50, 0, 30003fe5786, 30003fe55e0, 0, 100000)
000002a10089f231 syslwp_park+0x54(0, fdffbd50, 0, 0, ff092060, 1)
000002a10089f2e1 syscall_trap32+0xcc(0, fdffbd50, 0, 0, ff092060, 1)
stack pointer for thread 30003ff4a20: 2a10083f091
[ 000002a10083f091 cv_wait_sig_swap_core+0x130() ]
000002a10083f141 lwp_park+0x130(0, 0, 30003ff4bc6, 30003ff4a20, 0, 100000)
000002a10083f231 syslwp_park+0x54(0, 0, 0, 0, ff092070, 1)
000002a10083f2e1 syscall_trap32+0xcc(0, 0, 0, 0, ff092070, 1)
[17]>
It wasn't vxconfigd after all; it was ssmagent (a Netcool/Micromuse/IBM SNMP monitoring agent) that was
sitting in a blocked I/O wait (biowait), effectively preventing VxVM from starting up the rootdg and
mounting the encapsulated volumes. An svcadm disable ssmagent from the running OS fixed the problem, and the
system now boots just fine.
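For reference, the fix and its verification as a shell sketch (the short service name is as reported on our hosts; yours may be a longer FMRI):

```
# svcadm disable ssmagent   # stop the agent and keep it from starting at boot
# svcs ssmagent             # confirm: state should now read 'disabled'
```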