
Solaris™ Crash Analysis Tool

Sun Microsystems, Inc.


901 San Antonio Road
Palo Alto, CA 94303
U.S.A. 650-960-1300

October 2002
Copyright © 2002 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. Patents Pending.
U.S. Government Rights - Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and
applicable provisions of the FAR and its supplements.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in
the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd.

Sun, Sun Microsystems, the Sun logo, Java and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and
other countries.

All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other
countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc.

Products covered by and information contained in this service manual are controlled by U.S. Export Control laws and may be subject to the
export or import laws in other countries. Nuclear, missile, chemical biological weapons or nuclear maritime end uses or end users, whether
direct or indirect, are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion
lists, including, but not limited to, the denied persons and specially designated nationals lists is strictly prohibited.

DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,
INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT,
ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Contents

Preface

1 Getting Started with Solaris CAT
Firing up the tool
Customizing the environment
Getting help
What now?

2 Examining a Running Kernel
What am I looking for?
Types, sizes and the header files
Writing variables on the fly
Making your own corefile

3 Examining corefiles from a system crash
What am I looking for?
Matching a known bug
Filesystem corruption
Bad Traps
Flipped bits
Stack Overflows

4 Examining corefiles from a system hang
What am I looking for?
Deadlock
Memory Exhaustion
Runaway Processes

Preface

The Solaris™ Crash Analysis Tool (Solaris CAT) is an analysis tool for examining post mortem corefiles
and running Solaris kernels. The purpose of this document is to familiarize the reader with its
capabilities by providing annotated examples of some of the more common uses of Solaris CAT. Some knowledge
of Solaris internals along with Solaris system administration is assumed.

Should You Read This Book?


This book will be useful to those interested in determining what’s happening in the Solaris kernel. System
administrators will learn how to get the current value of kernel parameters that affect performance.
Administrators and driver developers alike will learn to examine Solaris system corefiles to determine
why their system crashed or hung. This document will also help the curious gain a better understanding
of how Solaris and the SPARC architecture work.

This document assumes a basic knowledge of SPARC architecture and assembly language, as well as
Solaris internals. See “Related Documentation” for helpful references in these areas.

How This Book Is Organized


Chapter 1 - How to get started with the analysis tool.
Chapter 2 - Using Solaris CAT to examine and modify the contents of a kernel that’s up and running.
Chapter 3 - Several examples detailing corefile analysis from system crashes.
Chapter 4 - Several examples detailing corefile analysis from system hangs.

Typographic Conventions

Typeface    Meaning                               Examples

AaBbCc123   The names of commands, files,         Edit your .login file.
            and directories; on-screen            Use ls -a to list all files.
            computer output                       % You have mail.

AaBbCc123   What you type, when contrasted        % su
            with on-screen computer output        Password:

AaBbCc123   Book titles, new words or terms,      Read Chapter 6 in the User’s Guide.
            words to be emphasized                These are called class options.
                                                  You must be superuser to do this.

            Command-line variable; replace        To delete a file, type rm filename.
            with a real name or value

Shell Prompts

Shell                                   Prompt
C shell                                 machine_name%
C shell superuser                       machine_name#
Bourne shell and Korn shell             $
Bourne shell and Korn shell superuser   #

Related Documentation
■ Drake, Chris and Brown, Kimberly. PANIC! UNIX System Crash Dump Analysis.
  SunSoft Press, 1995. ISBN 0-13-149386-8
■ Mauro, Jim and McDougall, Richard. Solaris Internals: Core Kernel Architecture. Sun
  Microsystems Press, 2001. ISBN 0-13-022496-0
■ Paul, Richard. SPARC Architecture, Assembly Language Programming, and C, 2nd
  Edition. Prentice Hall, 2000. ISBN 0-13-025596-3
■ The SPARC Architecture Manual, Version 9. Prentice Hall, 1998. ISBN 0-13-09927-5
■ Solaris Tunable Parameters Reference Manual. Sun Microsystems, 2002.

Accessing Sun Documentation Online


The docs.sun.com℠ Web site enables you to access Sun technical documentation on the Web. You can
browse the docs.sun.com archive or search for a specific book title or subject at: http://docs.sun.com

CHAPTER 1

Getting Started with Solaris CAT

Since you’re reading this, it seems likely that the package installation instructions are a little redundant.
So now that you’ve got everything installed, let’s look at what Solaris CAT can do.

Firing Up the Tool


When a Solaris system panics, it generates two files for use in debugging. The unix.X and vmcore.X files
(where X is a number incremented each time a panic occurs) are typically located in /var/crash/`uname
-n`. You can use Solaris CAT to examine these corefiles, or a live system.
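
For example, on a host whose first crash dump has already been saved by savecore, the directory might
look like this (the listing is purely illustrative; the bounds file is maintained by savecore itself):

# ls /var/crash/`uname -n`
bounds   unix.0   vmcore.0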

To look at a corefile, simply invoke the tool with the corefile number you’re interested in:

# cd /var/crash/moonstone
# /opt/SUNWscat/bin/scat 0
[copyright message omitted]
core file: /var/crash/moonstone/vmcore.0
user: Super-User (root:0)
release: 5.9 (64-bit)
version: Generic
machine: sun4u
node name: moonstone
domain: foo.com
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
hostid: 80800000
boothowto: 0x40 (DEBUG)
time of crash: Mon Aug 19 16:55:51 MDT 2002 (core is 38 days old)
age of system: 20 days 9 hours 9 minutes 31.92 seconds
panic cpu: 0 (ncpus: 2)
panic string: sync initiated



running sanity checks.../etc/system...per-cpu...misc...done

scat4.0beta2(vmcore.0):0>

The above command leaves you at a prompt that works very much like a shell. The information printed
before the initial prompt is gathered from the corefile and presented here, because it’s information you’d
likely want at the start. The prompt indicates which file you’re currently looking at, and gives a history
number. The most interesting piece of information above is the panic string. Here, it indicates that the
corefile was generated by typing ‘sync’ at the ok prompt. In older releases of Solaris, the panic string
would simply have been ‘zero.’
To use Solaris CAT to examine a currently running kernel, simply invoke it with no arguments:

# /opt/SUNWscat/bin/scat

The only initial differences will be these:

files: /dev/ksyms /dev/kmem


time in kernel: Fri Sep 27 00:12:19 MDT 2002
age of system: 24 days 16 hours 3 minutes 9.29 seconds

scat4.0beta2(live):1>

Note that there is no panic string, and the prompt is reminding you that you’re looking at the running
kernel.

Customizing the Environment


As you become more familiar with Solaris CAT, you can modify its appearance and behavior with two files
that are read when the tool is first started up. The ~/.scatinit file is used for setting variables, aliases,
and modifying the appearance of the tool itself. This file is read after Solaris CAT is started, but before the
corefile or running kernel is opened.

% cat ~/.scatinit
set -o vi
alias more=less
alias l=’ls -l’
alias lr=’ls -lart’
scatenv dis_synth_only on
scatenv thr_flags off
scatenv thr_lwp off
scatenv thr_cpu off
scatenv thr_syscall off
scatenv stk_switch on



scatenv stk_s_addr off
scatenv stk_s_args off
scatenv stk_l_sym off
scatenv sym_module on
scatenv stabs_charp on
scatenv stk_l_symonly off

The scatenv command modifies the behavior of Solaris CAT, and will explain what all of the variables do
when invoked with no arguments. The ~/.scatstartup file is used to run commands at startup time, after
the core image has been opened. For example, if you wanted to automatically dump out the message
buffer for every corefile you open, simply do this:

% cat ~/.scatstartup
msgbuf

Note – These two startup files were split from the previously used ~/.scatrc file.
Users who have a ~/.scatrc file will need to split the contents out accordingly.

Getting Help
Solaris CAT has several resources available to help explore the command set.

Online Help
The online help will provide a list of all commands when the help command is invoked without options.
The help engine also supports regular expressions with the help -? regex syntax.
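
For instance, to list every command whose help entry matches a pattern (the pattern and prompt shown
here are purely illustrative):

scat4.0beta2(live):2> help -? thread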

HTML documentation
These documents (starting with guide.html) demonstrate all the commands and are available in the
SUNWscat package.

Email alias
The SolarisCAT_Feedback@sun.com email alias is available for questions, bug reports and requests for
enhancements.

Note – The SolarisCAT_Feedback@sun.com email alias is strictly for feedback regarding
the tool. It is not intended to be an avenue for support in corefile analysis.

Web-based Support Forum
A Web-based forum is available where users can post questions about Solaris CAT.

http://supportforum.sun.com/solariscat



What Now?
Now, you’re ready to start experimenting with Solaris CAT. Use the chapters that follow as a guide to
using the tool.



CHAPTER 2

Examining a Running Kernel

This chapter is a guide to reading and (carefully) writing values in a running kernel. The purpose of
putting this chapter before crashes and hangs is that many users will not have corefiles available to
practice on. Writing values to a running kernel is a great way to make some; we can examine them later.

What am I looking for?


The main purpose of opening up a running kernel is to check on the value of a variable, or to verify that
an /etc/system setting took effect after a reboot. In more unusual circumstances, an administrator might
wish to get a list of all threads on a system when it’s not behaving as expected, yet not completely hung.

Types, sizes and the header files


The data in the kernel, as visible to Solaris CAT, is laid out across a range of virtual memory addresses.
The contents of memory at a given address can be interpreted differently depending on how many bytes
the tool reads. Consequently, it’s important to know the data type of the value you’re looking for. Some
variables are four bytes long, some just one, and sometimes it depends on whether you’re running a 32-bit
or 64-bit kernel. Solaris CAT has several commands for helping you get at the data you want.

Consider the rootvp variable, which is a pointer to a struct of type vnode. On a 32 bit kernel, a pointer is
4 bytes long, whereas on a 64 bit kernel a pointer is twice that length. Solaris CAT does some of the work
here for you; the rd command will read one “word” from memory. This corresponds to four bytes on a
32 bit kernel, and eight on a 64 bit kernel.

scat4.0beta2(live):19> rd rootvp
0x10438e28 genunix:rootvp+0x0 = 0x70451f04



scat4.0beta2(live):13> rd rootvp
0x14762c0 genunix:rootvp+0x0 = 0x300007f9eb0

The first example above, from a 32-bit system, shows the value we’re interested in as a four-byte
quantity. The second example comes from a 64-bit system, and shows an eight-byte quantity. The commands
rd32 and rd64 would have produced the same output for the first and second examples shown above
respectively.

Now that we know what rootvp points to, let’s dump out the memory starting at that address as a
vnode structure:
scat4.0beta2(live):20> sdump 0x70451f04 vnode
{
v_lock = {
_opaque = [ NULL 0xbaddcafe ]
}
v_flag = 0x0
v_count = 0x1
v_vfsmountedhere = NULL
v_op = 0x1044008c (specfs:spec_vnodeops+0x0)
v_vfsp = NULL
v_stream = NULL
v_pages = NULL
v_type = 0x3 (undefined enum constant)
v_rdev = 0x3740000 (221(vxio),0)
v_data = 0x70451f00
v_filocks = NULL
v_shrlocks = NULL
v_cv = {
_opaque = 0x0
}
v_locality = 0xbaddcafe
}

The above command could also have been written sdump *rootvp vnode and produced the same output.

Without the source code, it is more difficult to make sense of a corefile, but the header files found in /usr/
include are a great help. In the example above, we saw a vnode, perhaps for the first time. To find out
more, we could look in the header files.

% find /usr/include -type f | xargs grep "typedef struct vnode"


/usr/include/sys/vnode.h:typedef struct vnode {
/usr/include/sys/vnode.h:typedef struct vnodeops {

From there, we can look in /usr/include/sys/vnode.h for the type definition of a vnode (awfully
formatted to show one element per line):



typedef struct vnode {
kmutex_t v_lock; /* protects vnode fields */
uint_t v_flag; /* vnode flags (see below) */
uint_t v_count; /* reference count */
struct vfs *v_vfsmountedhere; /* ptr to vfs mounted here */
struct vnodeops *v_op; /* vnode operations */
struct vfs *v_vfsp; /* ptr to containing VFS */
struct stdata *v_stream; /* associated stream */
struct page *v_pages; /* vnode pages list */
struct vnode *v_next; /* vnode list pointers are */
struct vnode *v_prev; /* used when v_pages is set */
enum vtype v_type; /* vnode type */
dev_t v_rdev; /* device (VCHR, VBLK) */
caddr_t v_data; /* private data for fs */
struct filock *v_filocks; /* ptr to filock list */
struct shrlocklist *v_shrlocks; /* ptr to shrlock list */
kcondvar_t v_cv; /* synchronize locking */
void *v_locality; /* hook for locality info */
krwlock_t v_nbllock; /* sync for NBMAND locks */
} vnode_t;

Each of the structs shown above is also documented similarly under the /usr/include directory, and
many of the header files have comments like those above explaining what the different elements of the
struct do.

In addition to the files in /usr/include, Solaris CAT provides a means of showing the elements of a
structure: the stype command.

scat4.0beta2(live):2> stype vnode


struct vnode { (size: 0x88 bytes)
typedef kmutex_t = struct mutex { (size: 0x8 bytes)
void *[0x1] _opaque; (offset 0x0 bytes, size 0x8 bytes)
} v_lock; (offset 0x0 bytes, size 0x8 bytes)
typedef uint_t = unsigned int v_flag; (offset 0x8 bytes, size 0x4 bytes)
typedef uint_t = unsigned int v_count; (offset 0xc bytes, size 0x4 bytes)
struct vfs *v_vfsmountedhere; (offset 0x10 bytes, size 0x8 bytes)
struct vnodeops *v_op; (offset 0x18 bytes, size 0x8 bytes)
struct vfs *v_vfsp; (offset 0x20 bytes, size 0x8 bytes)
struct stdata *v_stream; (offset 0x28 bytes, size 0x8 bytes)
struct page *v_pages; (offset 0x30 bytes, size 0x8 bytes)
struct vnode *v_next; (offset 0x38 bytes, size 0x8 bytes)
struct vnode *v_prev; (offset 0x40 bytes, size 0x8 bytes)
enum vtype {VNON=0, VREG=1, VDIR=2, VBLK=3, VCHR=4, VLNK=5, VFIFO=6,
VDOOR=7, VPROC=8, VSOCK=9, VBAD=10} v_type; (offset 0x48 bytes, size 0x4 bytes)
typedef dev_t = unsigned long/long long v_rdev; (offset 0x50 bytes, size
0x8 bytes)
typedef caddr_t = char *v_data; (offset 0x58 bytes, size 0x8 bytes)



struct filock *v_filocks; (offset 0x60 bytes, size 0x8 bytes)
struct shrlocklist *v_shrlocks; (offset 0x68 bytes, size 0x8 bytes)
typedef kcondvar_t = struct _kcondvar { (size: 0x2 bytes)
typedef ushort_t = unsigned short _opaque; (offset 0x70 bytes, size 0x2
bytes)
} v_cv; (offset 0x70 bytes, size 0x2 bytes)
void *v_locality; (offset 0x78 bytes, size 0x8 bytes)
typedef krwlock_t = struct _krwlock { (size: 0x8 bytes)
void *[0x1] _opaque; (offset 0x80 bytes, size 0x8 bytes)
} v_nbllock; (offset 0x80 bytes, size 0x8 bytes)
} ;

Writing variables on the fly


The procedures contained in this section are inherently dangerous. A mistake here can result in
downtime, and even an unbootable system. Bearing that in mind, writing to the running kernel is still
worth knowing how to do. By default, Solaris CAT is started in read-only mode for obvious safety
reasons. In order to open the kernel with write privileges, the --write flag must be specified on the
command line:

Tip – Don’t do the example below. You have been warned.

# /opt/SUNWscat/bin/scat --write
...
scat4.0beta2(live):0> rd memscrub_verbose
0x140b904 unix:memscrub_verbose+0x0 = 0x0
scat4.0beta2(live):1> wr memscrub_verbose 1
scat4.0beta2(live):2> rd memscrub_verbose
0x140b904 unix:memscrub_verbose+0x0 = 0x1

At this point, the astute reader will note that the above was done without knowing the type of
memscrub_verbose, and was therefore a very dangerous operation indeed. We wrote the value of 0x1,
but how many bytes did we write? The answer in this case is the default, a word. This is a very bad thing
for us, as memscrub_verbose is an unsigned int (4 bytes) and we’ve now written 8 bytes, because this is a
64 bit kernel.

scat4.0beta2(live):5> rdb memscrub_verbose 10


0x140b904 unix:memscrub_verbose+0x0 = 00
0x140b905 unix:memscrub_verbose+0x1 = 00
0x140b906 unix:memscrub_verbose+0x2 = 00
0x140b907 unix:memscrub_verbose+0x3 = 00
0x140b908 unix:memscrub_all_idle+0x0 = 00



0x140b909 unix:memscrub_all_idle+0x1 = 00
0x140b90a unix:memscrub_all_idle+0x2 = 00
0x140b90b unix:memscrub_all_idle+0x3 = 0x01
0x140b90c unix:memscrub_span_pages+0x0 = 00
0x140b90d unix:memscrub_span_pages+0x1 = 00

You can see that we’ve poked a 0x1 into the wrong spot. After correcting this problem, and thanking the
kernel gods that we got away with a colossal blunder, we issue the correct command this time, not taking
the default size:

scat4.0beta2(live):15> wr hword memscrub_verbose 1


scat4.0beta2(live):16> rd memscrub_verbose
0x140b904 unix:memscrub_verbose+0x0 = 0x100000000
scat4.0beta2(live):17> rdh memscrub_verbose
0x140b904 unix:memscrub_verbose+0x0 = 0x1

Note that the rd command now appears to give the wrong answer, but that’s simply because it’s the
wrong tool for the job. The rdh output gives us what we want, as we can verify by looking at the messages
file.

% tail -1 /var/adm/messages
Sep 30 19:55:40 moonstone unix: [ID 217272 kern.notice] NOTICE: Memory scrubber
read 0x2 pages starting at 0xbf000000

The moral of the story is that writing to the running kernel is a dangerous business, and you probably
shouldn’t do it unless you’re prepared to deal with the possible fallout. If you must do it, keep in mind
that there are several ways to avoid writing the wrong amount of data at your target address.
Use rdb and mdump to examine memory one byte at a time. Use calc "sizeof(uint_t)" to
determine how many bytes make up a uint_t, another type, or even a struct. Last but not least, use the
information in the header files and in the Tunable Parameters Reference Manual (available at http://
docs.sun.com/?p=/doc/806-7009/) to make the most informed decisions possible.
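
Putting that advice together before a write might look something like this (a minimal sketch reusing the
commands shown above; the calc output format is illustrative):

scat4.0beta2(live):18> calc "sizeof(uint_t)"
0x4
scat4.0beta2(live):19> rdb memscrub_verbose 4
0x140b904 unix:memscrub_verbose+0x0 = 00
0x140b905 unix:memscrub_verbose+0x1 = 00
0x140b906 unix:memscrub_verbose+0x2 = 00
0x140b907 unix:memscrub_verbose+0x3 = 00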

Making your own corefile


In order to practice looking at corefiles, you might want to generate a corefile from your own system.
There are a few ways to do this. It should be noted that the first two methods shown here are potentially
hazardous to the health of your system.
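
Before trying any of these methods, it is worth confirming that crash dumps are enabled and where
savecore will write them. The dumpadm command with no arguments prints the current configuration
(the output below is illustrative):

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/moonstone
  Savecore enabled: yes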

Method one
While the system is running, abruptly bring it to the ok> prompt by using the Stop-A key sequence or by
sending a BREAK signal to the serial port. Once at the ok> prompt, typing sync will cause the system to
reboot, and produce a corefile.



Method two
This method is more interesting than the first, because it will be more similar to what might be found in
a “real” corefile. By introducing bad data into the running kernel, a system panic can be caused. In this
case, by zeroing out rootdir, which is a pointer to the vnode that represents the root directory, we can
cause the system to panic on the next filename lookup.

# /opt/SUNWscat/bin/scat --write
...
opening /dev/ksyms /dev/kmem ...symtab...core...done
loading core data: modules...panic...memory...time...misc...done
loading stabs...patches...done

files: /dev/ksyms /dev/kmem


user: Super-User (root:0)
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
hostid: 808ab867
kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)
boothowto: 0x40 (DEBUG)
time in kernel: Mon Oct 7 09:38:49 MDT 2002
age of system: 1 hours 18 minutes 7.91 seconds
ncpus: 2

running sanity checks.../etc/system...dump flags...misc...done


scat4.0beta2(live):0> wr rootdir 0

Shortly after this, the system will panic and produce a corefile useful for practicing the commands
illustrated in the following chapters.

Method three
A third method of corefile creation exists for systems that have a dedicated dump device. A dedicated
dump device is a partition set aside for crash dumps that is not also a swap device. These systems can use
the savecore -L command to produce a corefile without bringing the system down. This functionality
was introduced in Solaris 7. One drawback to this method is that the core image is collected without
stopping the running kernel, and so inconsistencies may appear in the corefile. For example, it may be
difficult to walk a linked list that was being updated at the time.
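
For example (a minimal sketch; the dump numbering and file names depend on what is already in the
savecore directory):

# savecore -L
# cd /var/crash/`uname -n`
# ls
bounds   unix.1   vmcore.1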



CHAPTER 3

Examining corefiles from a system crash

This chapter deals with the analysis of corefiles produced when a system calls the panic routine. If you
created a corefile with method two described at the end of Chapter two, it can be used to follow along
with the examples given below.

What am I looking for?


Initial investigation for this class of corefiles generally starts with the panic string and an examination of
the panic_thread. This information gives us our initial direction. Often, and especially in the case of
hardware failures, the message buffer will contain important clues.

Matching a known bug


In this example, we’re looking at a corefile from a panic, and checking to see if this is a bug that’s already
known. This is typically one of the first things to check off the list, as not checking will cause duplication
of effort if this is a known bug. If the corefile turns out not to be a known bug, that’s still a valuable piece
of information to have. Let’s look at the corefile:

% scat 0
files: /cores/checkpage/*.0
release: 5.6
version: Generic_105181-09
machine: sun4u
hw_provider: Sun_Microsystems



system type: SUNW,Ultra-Enterprise
time of crash: Thu Aug 24 23:13:14 MDT 2000 (core is 766 days old)
age of system: 8 days 1 hours 44 minutes 24.57 seconds
panic cpu: 5 (ncpus: 4)
panic string: trap

running sanity checks.../etc/system...rmap...dump flags...misc...done

Without running any commands, we don’t know too much about this problem. The system paniced with
a bad trap, but that could be virtually anything. We need more information. Note that the output of
Solaris CAT may appear different depending on the contents of ~/.scatinit. The output that follows was
produced with the .scatinit file shown in Chapter 1.

scat4.0beta2(vmcore.0):0> panic
panic on cpu 5
panic string: trap
==== panic kernel thread: 0x3021fe80 pid: 2 on cpu: 5 ====
cmd: pageout

t_stk: 0x3021fce0 sp: 0x3021f7a0 t_stkbase: 0x3021e000


t_pri: 97(SYS)
t_procp: 0x629d2f48(proc_pageout) p_as: 0x1041834c(kas)
idle: 425 ticks (4.25 seconds)
start: Wed Aug 16 21:34:57 2000
age: 697097 seconds (8 days 1 hours 38 minutes 17 seconds)
stime: 25057 (8 days 1 hours 37 minutes 53.62 seconds earlier)

pc: 0x1001ed98 unix:complete_panic+0x24: call unix:setjmp


startpc: 0x100d4a38 genunix:pageout_scanner+0x0: save %sp, -0x88, %sp

-- on kernel thread’s stack --


unix:complete_panic+0x24
unix:do_panic+0x158
genunix:vcmn_err+0x190
genunix:cmn_err+0x1c
unix:die+0xa0
unix:trap+0x830
unix:sfmmu_tsb_miss+0x58c
unix:prom_rtt+0x0
-- trap data type: 0x31 (data access MMU miss) rp: 0x3021fb60 --
pc: 0x100d4f6c genunix:checkpage+0xc4: lduh [%o0 + 0x8], %o0
npc: 0x100d4f70 genunix:checkpage+0xc8: btst %o0, %o1
global: %g1 0x10414800 %g2 0x1407e5ed %g3 0x20
%g4 0x7 %g5 0x8000d52bca45a698 %g6 0 %g7 0x3021fe80
out: %o0 0 %o1 0x1000 %o2 0x10414cc8 %o3 0x1048e330
%o4 0x12034b80 %o5 0 %sp 0x3021fbf0 %o7 0x100d4f04
loc: %l0 0x8 %l1 0 %l2 0x7 %l3 0



%l4 0x1043a000 %l5 0xe %l6 0 %l7 0x3021fbc8
in: %i0 0x12034b80 %i1 0x2 %i2 0x10437a40 %i3 0x10428800
%i4 0x10428800 %i5 0 %fp 0x3021fc58 %i7 0x100d4d10
<trap>genunix:checkpage+0xc4
genunix:pageout_scanner+0x2d8
unix:thread_start+0x4
-- end of kernel thread’s stack --

The panic command displays the stack of the panic thread, along with various bits of helpful information
about that thread. The command thread *panic_thread would deliver the same results. Reading
the stack from the bottom up, we can tell we’re looking at the pageout command, that the routine
pageout_scanner called checkpage, and something went terribly wrong in the checkpage routine.

The trap frame tells us that the %pc register contained the instruction lduh [%o0 + 0x8], %o0 when
things went bad. Looking at this instruction, we can see that we were taking the contents of register
%o0, adding eight, dereferencing an unsigned halfword at that address, and putting the result back into %o0.
Further, the trap frame shows us that %o0 was 0x0 at the time; the scanner was trying to dereference memory
address 0x8, and 0x8 isn’t a good address. So now the question becomes: why was %o0 NULL? At
this point, we should check to see if this is a known bug. Plugging the string “checkpage
pageout_scanner” into SunSolve brings up bug 4188132, which was closed as a duplicate of 4169509. Bug
4169509 was fixed in kernel patch 105181-13, and we already know from the Solaris CAT preamble that
this system was behind on patches.
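
If the machine is still reachable, a quick way to confirm which revision of the kernel patch is installed is
showrev; any 105181 revision at or above -13 contains the fix (the command below is only a sketch):

% showrev -p | grep 105181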

Filesystem corruption
The purpose of calling panic and creating a corefile is to stop processing once the kernel determines that
continuing could cause further problems. An example of this is trying to free an object that is already
free. Let’s look at a corefile that illustrates this.
% scat 9
files: /cores/freeing/*.9
release: 5.6
version: Generic_105181-06
machine: sun4d
hw_provider: Sun_Microsystems
system type: SUNW,SPARCserver-1000
time of crash: Fri Mar 12 06:44:23 MST 1999 (core is 1298 days old)
age of system: 14 minutes 18.53 seconds
panic cpu: 5 (ncpus: 6)
panic string: free: freeing free frag, dev:0x800080, blk:232, cg:120,
ino:683524, fs:/disk4
running sanity checks.../etc/system...rmap...dump flags...misc...done

scat4.0beta2(vmcore.9):0> panic



panic on cpu 5
panic string: free: freeing free frag, dev:0x800080, blk:232, cg:120,
ino:683524, fs:/disk4
==== panic user thread: 0xf0f73bc0 pid: 339 on cpu: 5 ====
cmd: /usr/sbin/nsr/nsrd

t_stk: 0xe0b9ab80 sp: 0xe0b9a4a0 t_stkbase: 0xe0b99000


t_pri: 50(TS) t_lwp: 0xf0b8b050 machpcb: 0xe0b9ab80
t_procp: 0xf0a4c898 p_as: 0xf023d9c0 hat: 0xf032f4d0
size: 3117056 rss: 2134016
last cpuid: 5
idle: 50 ticks (0.50 seconds)
start: Fri Mar 12 06:44:16 1999
age: 7 seconds (7 seconds)
stime: 85565 (0 seconds later)
syscall: open (0x1, 0xdffffac0, 0xdffffac0)
tstate: TS_ONPROC - thread is being run on a processor
tflg: none set
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SLOAD - in core
SULOAD - u-block in core
SJCTL - SIGCLD sent when children stop/continue

pc: 0xe004bf6c unix:complete_panic+0x24: call unix:setjmp

unix:complete_panic+0x24 (0xe024f000, 0x0, 0x1, 0xe0243800, 0x0, 0x0)


unix:do_panic+0xa4 (0x1, 0xe0b9a73c, 0x0, 0xf03fd570, 0xf0fd75e8, 0x40401ae4)
genunix:vcmn_err+0x180 (0x3, 0xf03f6078, 0xe0b9a73c, 0x3, 0x0, 0x40401ae5)
ufs:real_panic_v+0x6c (0x0, 0xf03f6078, 0xe0b9a73c, 0x1, 0xf0ff4be0,
0x40401ae6)
ufs:ufs_fault_v+0x4c (0x5b, 0xf03f6078, 0xe0b9a73c, 0x404010e7, 0x0,
0xe0b9a880)
ufs:ufs_fault+0x1c (0xe0b9a880, 0xf03f6078, 0x800080, 0xe8, 0x78, 0xa6e04)
ufs:free+0x42c (0xe8, 0x0, 0xf0f4dd10, 0xe8, 0x1c00, 0xf0c12000)
ufs:ufs_itrunc+0x79c (0x3, 0xe0b9a804, 0xffffffff, 0x8, 0xf105e410,
0xf0c12000)
ufs:ufs_trans_itrunc+0x50 (0xf105e4fc, 0x0, 0x0, 0x0, 0xf0248f68, 0xf105e410)
ufs:ufs_setattr+0xfc (0xf105e498, 0x0, 0x0, 0xf0248f68, 0x1, 0xf105e4fc)
genunix:vn_open+0x334 (0x0, 0x0, 0x207, 0x1a4, 0x0, 0x0)
genunix:copen+0x84 (0x168bb8, 0x207, 0x1a4, 0x7efefeff, 0x0, 0x0)
unix:syscall_trap+0x104 (0x168bb8, 0x206, 0x1a4, 0xdf6e2e8c, 0xdf6e2e8c,
0xdf68674c)
-- switch to user thread’s user stack --



In this case, the coreinfo generated gives most of the information needed to correctly diagnose this
panic, and generally this type of panic will fall into one of two categories. The first category, and the most
common case, is that this type of panic shows up shortly after the system was ungracefully shut down
or a mounted filesystem was fscked. Once the corruption is introduced into the filesystem, the system
will panic shortly thereafter. We can then umount the filesystem at fault, and fsck the raw device
associated with it, using the -o f option.
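
For example (the device name below is hypothetical; -o f forces the check even if the filesystem is
marked clean):

# umount /disk4
# fsck -F ufs -o f /dev/rdsk/c1t0d0s6
# mount /disk4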

The second category can be more complicated, and involves failing hardware or misconfiguration. In
these cases, it’s important to use the msgbuf command to examine the message buffer for hardware errors.
The /var/adm/messages file should also be examined. If there are no obvious signs of a hardware problem,
then it is likely that there is a configuration problem; check for a swap partition that overlaps a mounted
filesystem. It’s also worth checking that the system is up to date on related patches (ufs, scsi, sd, and so on).
In the case above, the disk containing the /disk4 filesystem was replaced, and the problem was not seen
again.

Bad Traps
Traps happen all the time while the system is running. For example, system calls and references to pages
not currently in main memory both cause traps. Normally, this isn’t a problem, but when something goes
wrong in handling a trap, the system must panic and generate a corefile. Here is an example of a trap that
went bad when the virtual address requested could not be found while in privileged mode.

% scat 0
files: /cores/badtrap/*.0
release: 5.6
version: Generic_105181-20
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-Enterprise
boothowto: 0x1000 (RECONFIG)
time of crash: Wed Nov 15 22:37:12 MST 2000 (core is 684 days old)
age of system: 7 days 2 hours 14 minutes 59.24 seconds
panic cpu: 10 (ncpus: 2)
panic string: trap

running sanity checks.../etc/system...rmap...dump flags...misc...done


scat4.0beta2(vmcore.0):0> panic
cpu 6 had the panic
panic string: trap

==== panic thread: 0x30023e80 ==== cpu: 6 ====


==== thread: 0x30023e80 ==== pid: 0 ====
cmd: sched



t_wchan: 0x0
t_stk: 0x30023ce0 sp: 0x30023428 t_stkbase: 0x30022000
pc: 0x1001e378 unix:complete_panic+0x24: call unix:setjmp
startpc: 0x100cda04 genunix:thread_create_intr+0: save %sp,
-0x68, %sp
t_pri: 104(SYS) t_lwp: 0x0
t_procp: 0x10417f10(proc_sched) p_as: 0x10417ea8(kas)
bound cpuid: 6 last cpuid: 6
idle: 128 ticks (1.28 seconds)
age: 145969 ticks (24 minutes 19.69 seconds)
interrupted (pinned) thread: 0x3002be80
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_INTR_THREAD - thread is an interrupt thread
T_TALLOCSTK - thread structure allocated from stk
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_CSTART - setrun() by lwp_continue()
pflag: SSYS - system resident process
SLOAD - in core
SLOCK - process cannot be swapped
SULOAD - u-block in core

unix:complete_panic+0x24 (0xc, 0x10407400, 0x0, 0x3, 0x0, 0x0)


unix:do_panic+0x174 (0x10404000, 0x1, 0x1040cfb8, 0x0, 0x20, 0x0)
genunix:vcmn_err+0x190 (0x3, 0x1040e188, 0x3, 0x30023604, 0x0, 0x10413530)
genunix:cmn_err+0x1c (0x3, 0x1040e188, 0x3002bc20, 0x14d, 0x14d, 0x10407400)
unix:die+0xa0 (0x31, 0x300237e8, 0x6506e000, 0x0, 0x1040e188, 0x19)
unix:trap+0x830 (0x300237e8, 0x0, 0x6506e000, 0x1, 0x0, 0x6)
unix:sfmmu_tsb_miss+0x58c (0x1041c498, 0x0, 0x31, 0x0, 0x6018ff80, 0x6506e000)
unix:prom_rtt+0 (0x7, 0x0, 0x10407000, 0x0, 0x2, 0x7f)
-- trap data type: 0x31 (data access MMU miss) rp: 0x300237e8 --
pc: 0x1000bb0c unix:blkdone+0x288: ldd [%i0 + %i1], %o4
npc: 0x1000bb10 unix:blkdone+0x28c: std %o4, [%i1]
global: %g1 0x10431000 %g2 0x6276d998 %g3 0x1
%g4 0 %g5 0 %g6 0 %g7 0x30023e80
out: %o0 0x7 %o1 0 %o2 0x10407000 %o3 0
%o4 0x2 %o5 0x7f %sp 0x30023878 %o7 0x1000bafc
loc: %l0 0 %l1 0x627520f8 %l2 0x627520f8 %l3 0x6040e98c
%l4 0x6537ff54 %l5 0x6537fe70 %l6 0 %l7 0x304b9ae0
in: %i0 0xfd7ef9b0 %i1 0x6787e650 %i2 0x4 %i3 0x18
%i4 0x1000bf20 %i5 0 %fp 0x30023a20 %i7 0x60439558
<trap>unix:blkdone+0x288 (0xfd7ef9b0, 0x6787e650, 0x4, 0x18, 0x1000bf20, 0x0)
fcaw:fca_cmd_complete+0x418 (0x6787e038, 0x0, 0x4, 0x1, 0xc0000000, 0x6787e600)
fcaw:fca_highintr+0x603c (0x0, 0x10f, 0x1016, 0x53f, 0x10414cc0, 0x626c2000)



sbus:sbus_intr_wrapper+0x18 (0x61cef110, 0x5, 0x601cae84, 0x104132a8, 0x38b0,
0x6038eaf8)
unix:intr_thread+0x88 (0x0, 0xa, 0x0, 0x10413020, 0x0, 0x0)
unix:prom_rtt+0 (0x0, 0x1003beac, 0x0, 0x0, 0x0, 0x63d58000)
-- interrupt data rp: 0x3002bb90
pc: 0x1002e38c unix:splx+0x1c: retl
npc: 0x1002e390 unix:splx+0x20: mov %o1, %o0
global: %g1 0xba0 %g2 0x1041862c %g3 0
%g4 0 %g5 0 %g6 0 %g7 0x3002be80
out: %o0 0 %o1 0xa %o2 0 %o3 0x10413020
%o4 0 %o5 0 %sp 0x3002bc20 %o7 0x10039c28
loc: %l0 0x1e02 %l1 0x1e %l2 0x5 %l3 0
%l4 0 %l5 0 %l6 0 %l7 0x3002bbf0
in: %i0 0x10413020 %i1 0x1041463c %i2 0x23a31 %i3 0
%i4 0 %i5 0 %fp 0x3002bc80 %i7 0x100382dc
unix:splx+0x1c (0x10413020, 0x1041463c, 0x23a31, 0x0, 0x0, 0x0)
unix:disp_getwork - frame recycled
unix:idle+0xa0 (0x10413020, 0x6, 0x0, 0x10417f10, 0x0, 0x0)
unix:thread_start+0x4 (0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
-- error reading next frame --

Reading this stack from the bottom up, we can see that things are moving along smoothly until we’re deep
into a routine called blkdone. The blkdone routine is actually a subroutine within another routine called
bcopy.

scat4.0beta2(vmcore.0):1> rdi fca_cmd_complete+0x418


fcaw:fca_cmd_complete+0x418: call unix:bcopy

A little searching through /usr/include tells us that the bcopy routine takes three arguments, a source
address, a destination address, and a number of bytes to copy from source to destination. The question
now becomes what arguments does bcopy get from the routine that calls it? In order to find out, we need
to examine the instructions in fca_cmd_complete just prior to the transfer of control to bcopy. We can take
advantage of the fact that in the SPARC architecture, the %o registers of the calling routine become the %i
registers of the called routine to determine what arguments (the %i regs) bcopy was called with.

scat4.0beta2(vmcore.0):2> rdi -p fca_cmd_complete+0x418 10


fcaw:fca_cmd_complete+0x3f0: lduw [%fp - 0x1c], %l0
fcaw:fca_cmd_complete+0x3f4: lduw [%l0 + 0x14], %l0
fcaw:fca_cmd_complete+0x3f8: ba fcaw:fca_cmd_complete+0x404 (1f)
fcaw:fca_cmd_complete+0x3fc: stw %l0, [%fp - 0x20]
fcaw:fca_cmd_complete+0x400: clr [%fp - 0x20]
fcaw:fca_cmd_complete+0x404: 1: lduw [%fp - 0x1c], %l0
fcaw:fca_cmd_complete+0x408: lduw [%fp - 0x20], %l1
fcaw:fca_cmd_complete+0x40c: add %l0, %l1, %l0
fcaw:fca_cmd_complete+0x410: add %l0, 0x18, %o0
fcaw:fca_cmd_complete+0x414: add %i5, 0xc, %o1
fcaw:fca_cmd_complete+0x418: call unix:bcopy



fcaw:fca_cmd_complete+0x41c: lduw [%fp - 0x14], %o2

So we can see that the source address is %l0 + 0x18, the destination address is %i5 + 0xc, and the number
of bytes is the unsigned word at the address %fp - 0x14. We’re almost there; we just need
to know what the contents of %l0, %i5 and %fp were when fca_cmd_complete was on the stack. Lucky
for us, Solaris CAT provides us with a command to do just that.

scat4.0beta2(vmcore.0):3> frame fca_cmd_complete

frame(%sp) @ 0x30023a20 on thread stack, size 0xa8(MINFRAME+0x48)


fcaw:fca_cmd_complete+0x418 call unix:bcopy

loc: %l0 0x6506dfa4 %l1 0x8 %l2 0x60 %l3 0xc0e0


%l4 0x603a7800 %l5 0 %l6 0x61cd4a88 %l7 0
in: %i0 0x6787e038 %i1 0 %i2 0x4 %i3 0x1
%i4 0xc0000000 %i5 0x6787e600 %fp 0x30023ac8 %i7 0x60435adc

Now, we can use the values from the frame command to determine the arguments passed to bcopy.

scat4.0beta2(vmcore.0):4> rd 0x6506dfa4+0x18
0x6506dfbc = 0x70000200

scat4.0beta2(vmcore.0):5> rd 0x6787e600+0xc
0x6787e60c = 0x70000200

scat4.0beta2(vmcore.0):6> rd 0x30023ac8-0x14
0x30023ab4 = 0x60

Examining the 96 (or 0x60) bytes at the source address, we can see the problem.

scat4.0beta2(vmcore.0):9> mdump 0x6506dfbc 0x60


0x6506dfbc 70 00 02 00 00 00 00 12 00 00 00 00 04 00 00 00 : p...............
0x6506dfcc 00 00 00 00 00 00 00 00 00 02 02 00 00 00 00 00 : ................
0x6506dfdc 00 00 00 00 60 0d 52 68 65 06 c0 18 60 0d 52 94 : ....‘.Rhe...‘.R.
0x6506dfec 67 97 7f e0 65 06 df 50 65 06 c0 50 00 00 00 02 : g...e..Pe..P....
0x6506dffc kvm_read failed for char @ 0x6506e000

The bcopy routine was told to copy 0x60 bytes from a buffer that is only 0x44 (0x6506e000-0x6506dfbc)
bytes long. The vendor of the fcaw driver produced a patch for this bug.



Flipped bits
There are cases in which the cause of the panic is a flipped bit that changes an address, a piece of data, or
even an instruction just enough to panic a system. Often enough, examination of a corefile will show that
a piece of data was corrupted, although we often can’t say with certainty which piece of hardware caused
the bit to flip. Usually, the panicing cpu is the safest bet, and in this case, the cpu running the panic
thread is probably at fault.

% scat 8

core file: /cores/Bit_Flip/vmcore.8


release: 5.7 (64-bit)
version: Generic
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-4
time of crash: Thu Dec 7 09:16:29 MST 2000 (core is 667 days old)
age of system: 10 hours 34 minutes 26.01 seconds
panic cpu: 3 (ncpus: 2)
panic string: trap

running sanity checks.../etc/system...rmap...dump flags...misc...done


scat4.0beta2(vmcore.8):0> panic
panic on cpu 3
panic string: trap
==== panic kernel thread: 0x2a1001fdd60 pid: 0 ====
cmd: sched
t_wchan: 0x30002115c28 sobj: mutex
t_stk: 0x2a1001fdb50 sp: 0x2a1001fc141 t_stkbase: 0x2a1001fa000
t_pri: 60(SYS)
t_procp: 0x104233f0(proc_sched) p_as: 0x10423320(kas)
idle: 0 ticks (0 seconds)
start: Wed Dec 6 22:42:58 2000
age: 38011 seconds (10 hours 33 minutes 31 seconds)
stime: 3135 (10 hours 33 minutes 29.63 seconds earlier)

pc: 0x100101d0 unix:complete_panic+0x20: call unix:setjmp


startpc: 0x100ca30c genunix:background+0x0: save %sp, -0xd0, %sp

-- on kernel thread’s stack --


unix:complete_panic+0x20
unix:do_panic+0x158
genunix:vcmn_err+0x198
genunix:cmn_err+0x1c



unix:die+0xb0
unix:trap+0x85c
unix:sfmmu_tsb_miss+0x644
unix:prom_rtt+0x0
-- trap data type: 0x31 (data access MMU miss) rp: 0x2a1001fd000 --
pc: 0x100e3ad8 genunix:turnstile_interlock+0x4: ldx [%i1], %i3
npc: 0x100e7adc genunix:anon_init+0x30: sra %i2, 0x0, %g4
global: %g1 0x22aed2533d36
%g2 0 %g3 0
%g4 0x10107800 %g5 0x300012c3b70
%g6 0 %g7 0x2a1001fdd60
out: %o0 0x4000 %o1 0
%o2 0x3c %o3 0x10471c70
%o4 0x28201f09 %o5 0x300012c3b70
%sp 0x2a1001fc8a1 %o7 0xffffffffffffffff
loc: %l0 0 %l1 0x30000aee900
%l2 0x80 %l3 0
%l4 0 %l5 0
%l6 0 %l7 0xedc28
in: %i0 0x1045aea0 %i1 0x4190
%i2 0x30000ce9090 %i3 0
%i4 0 %i5 0x300004c18d0
%fp 0x2a1001fc951 %i7 0x100e3d94
<trap>genunix:turnstile_interlock+0x4
genunix:turnstile_block+0x168
unix:mutex_vector_enter+0x2d8
unix:mutex_enter - frame recycled
genunix:rmvq_noenab+0x4c
genunix:strrput+0x260
unix:putnext+0xf8
tcp:tcp_rput_data+0x2f18
unix:putnext+0x68
ip:ip_rput_local - frame recycled
ip:ip_rput+0x17c
unix:putnext+0xf8
ge:gersrv+0x14c
genunix:runservice+0x3c
genunix:background+0xd8
unix:thread_start+0x4
-- end of kernel thread’s stack --

The cause of the panic can be seen by looking at the trap frame %pc and examining the memory address
that instruction was attempting to dereference. Here, the system was attempting to dereference
the contents of %i1, which was 0x4190 at the time, and that is an invalid address. We can confirm this with
the msgbuf command.

scat4.0beta2(vmcore.8):3> msgbuf



...
SUNW,hme0: Using Internal Transceiver
SUNW,hme0: 100 Mbps full-duplex Link Up
BAD TRAP: cpu=3 type=0x31 rp=0x2a1001fd000 addr=0x4190 mmu_fsr=0x0
sched: trap type = 0x31
addr=0x4190
pid=0, pc=0x100e3ad8, sp=0x2a1001fc8a1, tstate=0x9980001607, context=0x0
g1-g7: 22aed2533d36, 0, 0, 10107800, 300012c3b70, 0, 2a1001fdd60
Begin traceback... sp = 2a1001fc8a1
Called from 100e3d94, fp=2a1001fc951, args=1045aea0 4190 30000ce9090 0 0
300004c18d0
...

Note that the addr= section of the BAD TRAP message confirms our hypothesis. The next step involves
trying to determine why that register contains the wrong value, but in this case, the trap frame holds more
information worth examining. Note that the %pc and %npc contain instructions from two different
functions, and neither of them is a transfer of control instruction. Examination of the code surrounding
both the %pc and %npc indicates that they cannot possibly be consecutive, even considering the “delay
slot” used with transfer of control instructions. One of them must be wrong. Given the previous function
on the stack, the %npc looks like it must be wrong.

scat4.0beta2(vmcore.8):5> rdi genunix:turnstile_block+0x168


genunix:turnstile_block+0x168: call genunix:turnstile_interlock

It’s possible that the instructions were corrupted in memory, so dump them out from the corefile.

scat4.0beta2(vmcore.8):7> rdi turnstile_interlock 6


genunix:turnstile_interlock+0x0: save %sp, -0xb0, %sp
genunix:turnstile_interlock+0x4: ldx [%i1], %i3
genunix:turnstile_interlock+0x8: set 0x1045a800, %g4
genunix:turnstile_interlock+0xc: mov %i0, %i2
genunix:turnstile_interlock+0x10: add %g4, 0x318, %i0
genunix:turnstile_interlock+0x14: cmp %i3, %i2

Since turnstile_interlock+0x8 is correct in memory, we need to determine how the %npc got loaded with
something else. Knowing what ought to be there helps in this search. The %pc contains 0x100e3ad8, and
SPARC instructions are four bytes long; therefore, the %npc should contain 0x100e3adc but instead
contains 0x100e7adc. How many bits of difference are there between what is there and what ought to be
there?

scat4.0beta2(vmcore.8):9> bits 0x100e3adc 0x100e7adc


1 0 0 e. 3 a d c
1 0000 0000 1110.0011 1010 1101 1100
*
1 0000 0000 1110.0111 1010 1101 1100
1 0 0 e. 7 a d c
------------------------------------



8 7654 3210 9876 5432 1098 7654 3210
2 1 0
1 bits differ (14)

Another handy command for getting at the same information is the flip command.

scat4.0beta2(vmcore.8):10> flip -i genunix:turnstile_interlock+0x8


bit 0: 0x100e3add genunix:turnstile_interlock+0x9 illegaltrap
0x00116ab4
bit 1: 0x100e3ade genunix:turnstile_interlock+0xa fba,pt 0x2b410
bit 2: 0x100e3ad8 genunix:turnstile_interlock+0x4 ldx [%i1], %i3
bit 3: 0x100e3ad4 genunix:turnstile_interlock+0x0 save %sp, -0xb0, %sp
bit 4: 0x100e3acc genunix:turnstile_exit+0x28 call unix:disp_lock_exit
bit 5: 0x100e3afc genunix:turnstile_interlock+0x28 tst %o0
bit 6: 0x100e3a9c genunix:turnstile_lookup+0x60 ret
bit 7: 0x100e3a5c genunix:turnstile_lookup+0x20 add %g4, 0x318, %l1
bit 8: 0x100e3bdc genunix:turnstile_interlock+0x108 ldub [%g7 + 0xa1],
%g4
bit 9: 0x100e38dc genunix:turnstile_pi_inherit+0x18 ldx [%i0 + 0x20],
%g4
bit 10: 0x100e3edc genunix:turnstile_wakeup+0xc4 ldx [%i4], %g4
bit 11: 0x100e32dc genunix:tsd_exit+0x10c ldx [%i2 + 0x18], %o0
bit 12: 0x100e2adc genunix:savectx+0x2c bne,a,pt %xcc,
genunix:savectx+0x18
bit 13: 0x100e1adc genunix:nanosleep+0x270 add %fp, 0x7af, %o0
bit 14: 0x100e7adc genunix:anon_init+0x30 sra %i2, 0x0, %g4
bit 15: 0x100ebadc genunix:as_ctl+0x2cc ldx [%l4 + 0x18], %o1
bit 16: 0x100f3adc genunix:vn_remove+0x2c add %fp, 0x7d7, %o4
bit 17: 0x100c3adc genunix:strgetmsg+0x3c clr [%o1]
bit 18: 0x100a3adc genunix:segkp_cache_free+0x94 cmp %l0, 0x0
bit 19: 0x10063adc genunix:packint32+0x34 restore
bit 20: 0x101e3adc tcp:tcp_lookup_reversed+0x104 ldub [%i3 + 0xe], %g4
bit 21: 0x102e3adc rlmod:rlmodclose+0xb8 cmp %o1, 0x0
bit 22: 0x104e3adc illegaltrap 0x00000000
bit 23: 0x108e3adc illegaltrap 0x00000000
bit 24: 0x110e3adc illegaltrap 0x00000000

Stack Overflows
Each time one function calls another, the calling function must save enough of its state that when
the callee returns, the caller can pick up where it left off. For almost every function in the kernel that gets
called, a “stack frame” is saved in memory. This can result in a problem if the frames pushed by the
functions called in a single call chain grow beyond the finite amount of memory allocated to the kernel
thread associated with the stack. When a stack grows beyond the stack size its thread has been allocated,
the system will panic. There are several telltale signs in a corefile that identify it as a stack overflow.

% scat 0
...
core file: /cores/stack_overflow/vmcore.0
release: 5.7 (64-bit)
version: Generic_106541-12
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-Enterprise
time of crash: Thu Dec 21 13:02:00 MST 2000 (core is 653 days old)
age of system: 1 days 18 hours 2 minutes 25.63 seconds
panic cpu: 1 (ncpus: 16)
panic string: Kernel panic at trap level 2

running sanity checks.../etc/system...


WARNING: lwp_default_stksize set to multiple values
set to 0x6000
set to 0x4000
WARNING: rpcmod:svc_run_stksize set to multiple values
set to 0x6000
set to 0x4000
rmap...dump flags...misc...done

Before even typing the first command for this corefile, we have two decent clues that suggest this may be
a stack overflow corefile. First is the panic string; this does not by any means indicate a definite stack
overflow, but it is the most common panic string for this type of panic. The second clue is that the
/etc/system sanity checks failed. The variables lwp_default_stksize and rpcmod:svc_run_stksize control the
amount of space a kernel thread is allotted for stack growth. The system administrator of this system set these
parameters twice. When the /etc/system file is read, if a value is defined twice, the line nearest the
bottom of the file is used. Here, the administrator set the values twice, and had the misfortune of putting
the value that would have prevented this panic first.
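
The offending /etc/system entries would have looked something like this (a sketch reconstructed from
the sanity-check warnings above; the second pair silently overrides the first):

set lwp_default_stksize=0x6000
set rpcmod:svc_run_stksize=0x6000
...
set lwp_default_stksize=0x4000
set rpcmod:svc_run_stksize=0x4000

The rdh output below confirms that the smaller value is the one that took effect.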

scat4.0beta2(vmcore.0):0> rdh lwp_default_stksize


0x1045e19c genunix:lwp_default_stksize+0x0 = 0x4000

The output from the panic command is not going to be overly useful here. The nature of the panic
prevents the tool from showing the full stack correctly.

scat4.0beta2(vmcore.0):6> panic
panic on cpu 1
panic string: Kernel panic at trap level 2
==== panic user thread: 0x30014c5afc0 pid: 15744 on cpu: 1 ====
cmd: cp -r /u01/app/oracle/admin/devTAP /u01/app/oracle/admin/stestAT



t_stk: 0x2a1013f3af0 sp: 0x1040b7a1 t_stkbase: 0x2a1013f0000
t_pri: 0(TS)
t_procp: 0x300168f2078 p_as: 0x30000a5af88 hat: 0x30000b10e18 cnum: 0x1ea
size: 1277952 rss: 1081344
idle: 2 ticks (0.02 seconds)
start: Thu Dec 21 13:01:54 2000
age: 6 seconds (6 seconds)
stime: 15123097 (0.02 seconds earlier)

pc: 0x10010550 unix:complete_panic+0x20: call unix:setjmp

-- on ptl1_stk --
unix:complete_panic+0x20
unix:do_panic+0x150
unix:panic+0x1c
unix:user_rtt+0x0
unix:page_get_freelist - frame recycled
unix:page_create_va+0x328
-- switch to user thread’s stack --
unix:segkmem_alloc+0x38
-- error reading next frame @ 0x0 --

In order to get a more accurate picture of what was happening, use the findstk command.

scat4.0beta2(vmcore.0):7> findstk
...
==== stack @ 0x2a1013f02b0 (sp: 0x2a1013efab1) ====
-- on user panic_thread’s stack --
0x0(genunix:kmem_slab_create?)
genunix:kmem_cache_alloc_global+0x40
genunix:kmem_cache_alloc+0x144
genunix:kmem_alloc+0x2c
unix:kalloca+0x5c
unix:i_ddi_mem_alloc+0x154
unix:i_ddi_mem_alloc_lim+0x80
genunix:ddi_iopb_alloc+0x6c
fcaw:fca_dma_zalloc+0x29c
fcaw:fca_pkt_dmactl+0x1acc
fcaw:fca_tran_init_pkt+0x228
scsi:scsi_init_pkt+0x54
sd:make_sd_cmd+0x130
sd:sdstart+0x114
sd:sdstrategy+0x3a4
emcp:PowerPlatformBottomDispatch+0x270
emcp:PowerDispatch+0x150
emcpcg:CgDispatch+0x1e0
emcp:PowerDispatch+0x13c



emcpmp:MpDispatchDown+0xa0
emcpmp:MpDispatch+0x398
emcp:PowerDispatch+0x13c
emcppn:PnCommonDispatch+0x3a8
emcppn:PnPassDispatch+0xe4
emcp:PowerDispatch+0x13c
emcp:emcp_start+0xd0
emcp:power_strategy+0x1c0
genunix:bdev_strategy+0x9c
vxio:voldiskiostart+0x6ac
vxio:vol_subdisksio_start+0x38
vxio:volkcontext_process+0x50c
vxio:volkiostart+0x690
vxio:vxiostrategy+0x78
genunix:bdev_strategy+0x9c
vxfs:vx_snap_strategy - frame recycled
vxfs:vx_io_start+0x444
vxfs:vx_io_ext+0x38
vxfs:vx_nalloc_getpage+0x1ec
vxfs:vx_do_getpage+0x7b0
vxfs:vx_do_read_ahead+0xf0
vxfs:vx_read_ahead+0x328
vxfs:vx_do_getpage+0xa40
vxfs:vx_getpage1+0x3c8
vxfs:vx_getpage+0x3c
genunix:segvn_fault+0x7b4
genunix:as_fault+0x3a4
unix:pagefault+0xc4
unix:trap+0x874
unix:prom_rtt+0x0
-- trap data type: 0x31 (data access MMU miss) rp: 0x2a1013f32c0 --
pc: 0x1000e7e4 unix:copyin_blalign+0x38: ldda [%l7] ASI_BLK_AIUS, %f0
npc: 0x1000e7e8 unix:copyin_blalign+0x3c: inc 0x40, %l7
global: %g1 0x1000f990
%g2 0xa9d64 %g3 0xa9d64
%g4 0 %g5 0x10420000
%g6 0 %g7 0x30014c5afc0
out: %o0 0x3000fd99a58 %o1 0x2a7593f8000
%o2 0 %o3 0xfffffffffffffff8
%o4 0x1 %o5 0x30000b11f88
%sp 0x2a1013f2b61 %o7 0x100b26dc
loc: %l0 0 %l1 0x10457920
%l2 0x30001303f40 %l3 0xe2000
%l4 0x2a7593fa000 %l5 0x1fe4a000
%l6 0x1 %l7 0xff250000
in: %i0 0x2a7593f8000 %i1 0xff251ff0
%i2 0x10 %i3 0x1fc0



%i4 0x30 %i5 0
%fp 0x2a1013f2d61 %i7 0x1009e5d8
<trap>unix:copyin_blalign+0x38
genunix:xcopyin - frame recycled
genunix:uiomove+0xb0
vxfs:vx_write_default+0x244
vxfs:vx_write1+0x980
vxfs:vx_write+0x168
genunix:write+0x208
genunix:write32+0x30
unix:syscall_trap32+0xa8
-- switch to unknown stack (0xffbe6907) --
-- switch to user panic_thread’s user stack --

The sheer length of this stack is another indication of a stack overflow. Note that the trap frame near the
bottom of the stack is not a bad trap, but rather an access to an alternate address space. Solaris CAT prints
out the trap frame for this, although it isn’t related to the problem at hand. More hints can be gleaned by
looking at what the system was doing at trap level 1. Recall that the panic string indicated the system
paniced at trap level two. Using the ptl1 argument to the panic command shows us data specific to the
different trap levels.

scat4.0beta2(vmcore.0):8> panic ptl1


ptl1_panic_cpu: 2 (cpu number 1)
ptl1_panic_tr: 2 PTL1_BAD_KMISS
ptl1_stk: 0x104083c0
ptl1_stk_top: 0x1040c3c0
tl1 tstate: 0x4480001606
tick: 0x376743b224e4
tpc: 0x10033a40
unix:page_get_mnode_freelist+0x14: stw %i0, [%sp + 0x8eb]
tnpc: 0x10033a44
unix:page_get_mnode_freelist+0x18: srl %g4, 0x0, %o2
tt: 0x68 data access MMU miss
tl2 tstate: 0x9180001507
tick: 0x376743b224bb
tpc: 0x10006fa0
unix:have_win+0x28: stx %l0, [%l7 + 0x80]
tnpc: 0x10006fa4
unix:have_win+0x2c: stx %l1, [%l7 + 0x88]
tt: 0x68 data access MMU miss

Note that at TL1, the %pc is trying to write a value onto the stack; the presence of the %sp (stack pointer)
register in the instruction indicates this. Another hint that this is a stack overflow is the %pc at TL2.
During a stack overflow, it will often be in have_win.
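
As a cross-check, the span between t_stkbase and t_stk in the panic output above gives a rough idea of
how much stack this thread had. Computing it with the calc command (a sketch that assumes calc
accepts simple arithmetic; the output format is illustrative):

scat4.0beta2(vmcore.0):9> calc 0x2a1013f3af0-0x2a1013f0000
0x3af0 (15088)

That is just under the 0x4000 bytes allowed by the smaller lwp_default_stksize setting that took effect.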



Removing the second instance of the tunable from /etc/system solved this problem. This sort of panic is
more common than one might imagine, because the installation of VXFS adds those lines to /etc/system
as part of the installation routine. This behavior is documented in bug 4630695.



CHAPTER 4

Examining corefiles from a system hang

This chapter deals with the analysis of corefiles produced when a system is hung and unresponsive.
Corefiles produced using method one from chapter two can be used to follow along with the examples in
this chapter, although a corefile taken from a normally running system is significantly less interesting to
look at.

What am I looking for?


Analyzing corefiles from hung systems can be more intimidating than panics because there often isn’t an
obvious starting point. The panic string, normally ‘zero,’ does not give any more information than that the
system was brought to the ok> prompt and sync was typed. Likewise, the panic_thread is also not very
helpful. Solaris CAT provides commands that examine the state of the system, pointing out unusual
conditions like large numbers of threads sleeping on a lock or a memory shortage. The corefiles below
illustrate exactly these kinds of conditions.

Deadlock
This corefile demonstrates a locking bug that eventually causes the system to hang completely.

% scat 1
core file: /cores/reader_lock/vmcore.1
version: Generic_108528-02
machine: sun4u
hw_provider: Sun_Microsystems



system type: SUNW,Ultra-4
time of crash: Thu Sep 28 13:09:18 MDT 2000 (core is 738 days old)
age of system: 6 days 4 hours 58 minutes 26.85 seconds
panic cpu: 2 (ncpus: 2)
panic string: zero

The above information doesn’t go very far in explaining the source of the panic. Below, the thread
summary command gives us a quick overview of what the state of the system was when the corefile was
generated. In this case, it provides us with clear direction for our next step.

scat4.0beta2(vmcore.1):0> thread summary
reference clock = panic_lbolt: 0x3324805
8 threads ran since 1 second before current tick (0 user, 8 kernel)
11 threads ran since 1 minute before current tick (0 user, 11 kernel)

0* TS_RUN threads
2 TS_STOPPED threads (0 user, 2 kernel)
19 TS_FREE threads (0 user, 19 kernel)
0 !TS_LOAD (swapped) threads

0 threads sleeping on a mutex


382* threads sleeping on a rwlock (124 user, 258 kernel)
37 threads sleeping on a condition variable (1 user, 36 kernel)
2* threads sleeping on a semaphore (0 user, 2 kernel)
0 threads sleeping on a user-level sobj
0 threads sleeping on a shuttle (door)

0 threads in biowait()

0 threads in dispatch queues


1* interrupt threads running (0 user, 1 kernel)
1 thread_reapcnt
9 lwp_reapcnt

446 total threads in allthreads list (125 user, 321 kernel)


455 nthread

Of the 455 threads running on the system, 382 are asleep waiting for another thread to release
an rwlock. Corefiles caused by rwlock hangs can be difficult to diagnose because the information about
the resource owner is only recorded when the lock is acquired for writing, and read lock acquisition is the
much more common case. The next step is to examine the threads that are waiting on this resource.

scat4.0beta2(vmcore.1):3> tlist -t sobj rwlock

382 threads with that sobj found.



top mutex/rwlock owners:
count thread
382 read-locked rwlocks (count:rwlock 376:0x30001af9020 4:0x300052059f8
2:0x78071860)

The above information shows that almost all of the threads waiting to acquire an rwlock on this system
are waiting for the same lock. If the lock was acquired as a write lock, we can immediately get the owner,
and examine that thread to determine why it isn’t releasing the lock. If the lock was acquired as a read
lock, Solaris CAT provides an educated guess about what thread is holding this lock, and causing the
problem.

scat4.0beta2(vmcore.1):4> rwlock 0x30001af9020


read locked: holdcnt: 1
waiters: true writewanted: true

Above, we see that the lock is being held as a read lock, so we need Solaris CAT to help determine possible
owners of this lock. In order to do this, the stack of every thread on the system is examined to see if the
lock address (the wchan) is there. From this list, the threads that are waiting for the lock may be excluded,
because a thread that tries to acquire a lock it already owns will panic the system. The remainder are threads
likely to own the lock in question.

scat4.0beta2(vmcore.1):5> rwlock -L 0x30001af9020


read locked: holdcnt: 1
waiters: true writewanted: true
turnstile @ 0x30002120b80
ts_next: 0x0 ts_free: 0x3000994a900 ts_sobj: 0x30001af9020
ts_waiters: 371 ts_epri: 22 ts_inheritor: 0x0
ts_prioinv: 0x0
writer sleepq:
thread pri idle pid wchan command
0x300067f4d80 30 17m30.32s 25887 0x30001af9020 /usr/local/bin/perl /
private/etc/setquota 17267 /nfs/stak/u1/l/lekha
reader sleepq:
thread pri idle pid wchan command
0x3000490a820 60 17m30.15s 25877 0x30001af9020 /private/samba/bin/smbd
0x2a100eb1d40 60 17m29.81s 0 0x30001af9020 sched
0x2a101271d40 60 17m29.49s 0 0x30001af9020 sched
0x2a1001c5d40 60 17m29.49s 0 0x30001af9020 sched
...
0x30007da8a80 24 7m19.66s 26805 0x30001af9020 /private/samba/bin/smbd
0x30007c6ca80 18 9m10.80s 26607 0x30001af9020 /private/samba/bin/smbd
0x3000a59fa40 0 2m52.88s 27045 0x30001af9020 reboot
building list of rwlock waiters...found 371
building list of segkp threads...setting up segkp lists
found 372
possible read-lock owners:
thread pri idle pid wchan command

0x30007cb62c0 32 10m30.31s 26451 0x300052059f8 /private/samba/bin/smbd

Only one thread had the lock address on its stack and wasn't on the sleepq.

scat4.0beta2(vmcore.1):7> thread 0x30007cb62c0


==== user thread: 0x30007cb62c0 pid: 26451 ====
cmd: /private/samba/bin/smbd
t_wchan: 0x300052059f8 sobj: reader/writer lock owner: (reader locked)
t_stk: 0x2a101617af0 sp: 0x2a101616be1 t_stkbase: 0x2a101614000
t_pri: 32(TS)
t_procp: 0x30007712a98 p_as: 0x300068d0208 hat: 0x300077679f8 cnum: 0x1ad6
size: 5881856 rss: 3989504
idle: 63031 ticks (10 minutes 30.31 seconds)
start: Thu Sep 28 12:58:47 2000
age: 631 seconds (10 minutes 31 seconds)
stime: 53563854 (10 minutes 30.31 seconds earlier)

pc: 0x10109120 genunix:turnstile_block+0x544: call unix:swtch

-- on user thread’s stack --


genunix:turnstile_block+0x544
unix:rw_enter_sleep+0x17c
unix:rw_enter - frame recycled
ufs:ufs_create+0x1dc
lofs:lo_create+0x38
genunix:vn_create+0x438
genunix:vn_open+0xd4
genunix:copen+0x94
unix:syscall_trap32+0xa8
-- switch to user thread’s user stack --

The above thread is itself asleep because another thread holds an rwlock that it requires. The same
procedure can be used to determine which thread is holding this one back.

scat4.0beta2(vmcore.1):8> rwlock -L 0x300052059f8


read locked: holdcnt: 2
waiters: true writewanted: true
turnstile @ 0x30008530c48
...
possible read-lock owners:
thread pri idle pid wchan command
0x30007cf0ac0 60 14m0.57s 26098 0x30001af9020 /private/samba/bin/smbd
0x3000604e320 60 12m50.50s 26219 0x30001af9020 /private/samba/bin/smbd

In this case, two threads are holding the lock, which is allowed for readers, and both of them are waiting
on a lock we have seen before. Thread 0x30007cb62c0 holds 0x30001af9020 as a reader and is sleeping
until rwlock 0x300052059f8 comes free. rwlock 0x300052059f8 is held as a reader by 0x30007cf0ac0 and
0x3000604e320, and both of those threads are sleeping until lock 0x30001af9020 comes free. None of
these threads will ever wake up, because the resource each one needs is held by the other side: they are
caught in a deadlock. This particular instance was matched to bug 4310608, which was fixed in a later
kernel patch.
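
For reference, the classic way to end up in this state is a lock-ordering bug, where two code paths take the
same pair of locks in opposite order. The following is a minimal, hypothetical sketch of that pattern using
the standard kernel rwlock interfaces; it is not the actual code behind bug 4310608, and it is simplified to
write locks so that the hang does not depend on a queued writer forcing new readers to block.

#include <sys/ksynch.h>

static krwlock_t lock_a;
static krwlock_t lock_b;

void
locks_init(void)
{
        rw_init(&lock_a, NULL, RW_DRIVER, NULL);
        rw_init(&lock_b, NULL, RW_DRIVER, NULL);
}

/* Code path one: takes lock_a, then lock_b. */
void
code_path_one(void)
{
        rw_enter(&lock_a, RW_WRITER);
        rw_enter(&lock_b, RW_WRITER);   /* blocks forever if path two holds lock_b */
        /* ... work ... */
        rw_exit(&lock_b);
        rw_exit(&lock_a);
}

/* Code path two: takes the same locks in the opposite order. */
void
code_path_two(void)
{
        rw_enter(&lock_b, RW_WRITER);
        rw_enter(&lock_a, RW_WRITER);   /* blocks forever if path one holds lock_a */
        /* ... work ... */
        rw_exit(&lock_a);
        rw_exit(&lock_b);
}

If one thread is inside code_path_one() holding lock_a while another is inside code_path_two() holding
lock_b, each blocks in its second rw_enter() call and neither ever returns, which is the same circular wait
seen in the corefile above.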

Memory exhaustion
In some cases, a system might be inaccessible because of a critical resource shortage. In the following
corefile, the system was up, but nothing could be done on it; every command typed produced a fork
failed: not enough space error message.

% scat 5
...
core file: /cores/memory/vmcore.5
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)
boothowto: 0x40 (DEBUG)
time of crash: Mon Oct 7 08:19:08 MDT 2002
age of system: 48 minutes 53.71 seconds
panic cpu: 2 (ncpus: 2)
panic string: sync initiated

scat4.0beta2(vmcore.5):0> thread summary


reference clock = panic_lbolt: 0x4754b
0* threads ran since 1 second before current tick
0 threads ran since 1 minute before current tick
0 threads ran since 5 minutes before current tick
last thread ran 48 minutes 41.71 seconds earlier

0* TS_RUN threads
1 TS_STOPPED threads (1 user, 0 kernel)
0 TS_FREE threads
1* !TS_LOAD (swapped) threads (0 user, 1 kernel)

0 threads sleeping on a mutex


0 threads sleeping on a rwlock
0 threads sleeping on a condition variable
0 threads sleeping on a semaphore
0 threads sleeping on a user-level sobj
0 threads sleeping on a shuttle (door)

0 threads in biowait()

2 threads in dispatch queues (1 user, 1 kernel)


2 thread_reapcnt
4 lwp_reapcnt

2 total threads in allthreads list (1 user, 1 kernel)


252 nthread

There wasn't much in the summary to suggest the nature of the problem; however, the fork failed
messages indicate that we might be dealing with a memory problem, as does the message buffer.

scat4.0beta2(vmcore.5):2> msgbuf
...
WARNING: /tmp: File system full, swap space limit exceeded

panic[cpu2]/thread=2a1001b1d20: sync initiated


...

scat4.0beta2(vmcore.5):3> proc sort swresv


addr pid ppid uid size rss swresv time command
------------- ------ ------ ------ ---------- -------- -------- ------ ------
---
0x30006a400b0 100686 100528 83750 1.93G 624K 1.93G 61693 leak
0x300043f8a88 100385 100380 60001 78.2M 13.1M 17.4M 119 /usr/
java1.2/bin/../jre/bin/../bin/sparc/native_threads/java org.apache.jserv.J
0x300074f94c8 100633 100627 83750 40.5M 29.2M 10.8M 991 /usr/
dist/share/netscape,v7.0beta/netscape-bin -UILocale en-US -contentLocale U
0x3000442b480 100398 100351 83750 248M 37.1M 8.23M 657 /usr/
openwin/bin/Xsun :0 -nobanner -dev /dev/fb0 -dev /dev/fb1 -auth /var/dt/A:
0x30006952aa8 100497 100488 83750 20.9M 7.85M 2.77M 320 dtwm
0x30004411488 100381 100354 60001 5.39M 3.66M 1.90M 5 /usr/
apache/bin/httpd
0x30006976aa0 100634 100354 60001 5.37M 3.22M 1.89M 3 /usr/
apache/bin/httpd
0x300043f9490 100384 100354 60001 5.36M 2.87M 1.88M 6 /usr/
apache/bin/httpd

By looking at the process list sorted by reserved swap (swresv), we can see that user 83750 launched a
process called leak, which caused the memory exhaustion on this system. The meminfo command can be
used to show that this process has reserved nearly all of the swap on the system.

scat4.0beta2(vmcore.5):4> meminfo
...
initial swap available for reservation 275459 pages (2.10G)
k_anoninfo.ani_max + MAX((availrmem_initial - swapfs_minfree), 0)
current swap available for reservation 549 pages (4.28M)

(k_anoninfo.ani_max - k_anoninfo.ani_phys_resv) +
MAX((availrmem - swapfs_minfree), 0)
...
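
A process does not have to touch much memory to put a system into this state; reserving it is enough.
The sketch below is a hypothetical guess at what a program like leak might do (its real source is not part
of the corefile). Because swap is reserved when anonymous memory is allocated rather than when it is
touched, a loop like this also explains how the process above can show 1.93G of swresv while staying at
only 624K resident.

#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        size_t chunk = 8 * 1024 * 1024;         /* reserve 8MB per iteration */

        for (;;) {
                if (malloc(chunk) == NULL)      /* reservation failed: swap is gone */
                        sleep(1);
                /*
                 * The memory is never written to, let alone freed.  Swap is
                 * charged against the process at allocation time, which is
                 * why swresv grows even though the resident set stays tiny.
                 */
        }
}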

Runaway processes
Solaris imposes a limit on the number of processes that can exist simultaneously on a system. If a
process inadvertently starts spawning children and hits that limit, no other regular user can create new
processes. A small number of extra process slots is reserved for root for the purpose of cleaning up, but
often a corefile will be generated instead once users can no longer get any work done.

% scat 4
...
core file: /cores/fork/vmcore.3
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)
boothowto: 0x40 (DEBUG)
time of crash: Sun Oct 6 18:19:49 MDT 2002
age of system: 3 days 6 hours 21 minutes 15.03 seconds
panic cpu: 2 (ncpus: 2)
panic string: sync initiated

running sanity checks.../etc/system...dump flags...misc...done


scat4.0beta2(vmcore.3):0> msgbuf
...
NOTICE: out of per-user processes for uid 83750
NOTICE: out of per-user processes for uid 83750
NOTICE: out of per-user processes for uid 83750
NOTICE: out of per-user processes for uid 83750

panic[cpu2]/thread=2a1001b1d20: sync initiated


...

What is the per-user process limit on this system?

scat4.0beta2(vmcore.3):9> rdh maxuprc


0x183f9e0 unix:maxuprc+0x0 = 0x1705

How many processes are there on this system?

scat4.0beta2(vmcore.3):7> rdh nproc

0x1839ae8 unix:nproc+0x0 = 0x1705

Both values are 0x1705 (5893 decimal): the number of processes on the system has reached the per-user limit. So we now know what the problem is; we just need to find out why it happened.

scat4.0beta2(vmcore.3):1> proc tree


0 sched
3 fsflush
2 pageout
1 /etc/init -
119631 /bin/sh /usr/dist/pkgs/netscape,v7.0beta/bin/netscape -install
119641 /bin/ksh -p /usr/dist/share/netscape,v7.0beta/netscape -install
119651 /bin/sh /usr/dist/share/netscape,v7.0beta/run-mozilla.sh /usr/dist/share/netsca
119657 /usr/dist/share/netscape,v7.0beta/netscape-bin -UILocale en-US
-contentLocale U
119352 xmms
119193 /usr/local/bin/Eterm -x -O --scrollbar=off -g 80x66+700+0 -c grey -F
courier --
119201 -zsh
120298 ./fork
120300 ./fork
120302 ./fork
120304 ./fork
125482 ./fork
120306 ./fork
120308 ./fork
120311 ./fork
120313 ./fork
120315 ./fork
125726 ./fork
125498 ./fork
125725 ./fork
125389 ./fork
125493 ./fork
125411 ./fork
125418 ./fork
120317 ./fork
125401 ./fork
120319 ./fork
120321 ./fork
120323 ./fork
125483 ./fork
120325 ./fork
125485 ./fork
120327 ./fork
125379 ./fork

120329 ./fork
120331 ./fork
125387 ./fork
...

It doesn't take much more digging to see that UID 83750 launched a process that consumed every
available process slot on the system.

scat4.0beta2(vmcore.3):4> proc 119201


addr pid ppid uid size rss swresv time command
------------- ------ ------ ------ ---------- -------- -------- ------ ------
---
0x3000f92caf0 119201 119193 83750 2957312 73728 868352 40 -zsh
thread: 0x3000436eec0 state: slp wchan: 0x3000436f04e sobj: condition
var (from genunix:sigsuspend+0x9c)
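
For completeness, a runaway program like the ./fork seen in the process tree usually amounts to little
more than an unbounded fork loop. The sketch below is hypothetical; the actual source of ./fork is not
part of the corefile.

#include <unistd.h>

int
main(void)
{
        for (;;) {
                if (fork() == -1)       /* at the limit: back off briefly and retry */
                        sleep(1);
                /* Both parent and child fall through and call fork() again. */
        }
}

Every successful pass through the loop multiplies the number of copies until the per-user limit is reached,
after which each copy simply retries, producing the stream of out of per-user processes messages seen in
the message buffer.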
