Solaris™ Crash Analysis Tool
October 2002
Copyright © 2002 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. Patents pending.
U.S. Government Rights - Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and
applicable provisions of the FAR and its supplements.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in
the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd.
Sun, Sun Microsystems, the Sun logo, Java and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and
other countries.
All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other
countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc.
Products covered by and information contained in this service manual are controlled by U.S. Export Control laws and may be subject to the
export or import laws in other countries. Nuclear, missile, chemical biological weapons or nuclear maritime end uses or end users, whether
direct or indirect, are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion
lists, including, but not limited to, the denied persons and specially designated nationals lists is strictly prohibited.
DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,
INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT,
ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents
Preface
4 Examining corefiles from a system hang
    What am I looking for?
    Deadlock
    Memory Exhaustion
    Runaway Processes
Preface
The Solaris™ Crash Analysis Tool (Solaris CAT) is an analysis tool for examining post mortem corefiles
and running Solaris kernels. The purpose of this document is to familiarize the reader with its
capabilities by providing annotated examples of some of the more common uses of Solaris CAT. Some
knowledge of Solaris internals and Solaris system administration is assumed.
This document assumes a basic knowledge of SPARC architecture and assembly language, as well as
Solaris internals. See “Related Documentation” for helpful references in these areas.
Typographic Conventions
Typeface    Meaning                            Examples
AaBbCc123   Book titles, new words or terms,   Read Chapter 6 in the User’s Guide.
            words to be emphasized             These are called class options.
                                               You must be superuser to do this.
Shell Prompts
Shell                                   Prompt
C shell                                 machine_name%
C shell superuser                       machine_name#
Bourne shell and Korn shell             $
Bourne shell and Korn shell superuser   #
Related Documentation
■ Drake, Chris, and Kimberly Brown. PANIC! UNIX System Crash Dump Analysis. SunSoft Press, 1995.
CHAPTER 1
Since you’re reading this, it seems likely that package installation instructions would be redundant.
So now that you have everything installed, let’s look at what Solaris CAT can do.
To look at a corefile, simply invoke the tool with the corefile number you’re interested in:
# cd /var/crash/moonstone
# /opt/SUNWscat/bin/scat 0
[copyright message omitted]
core file: /var/crash/moonstone/vmcore.0
user: Super-User (root:0)
release: 5.9 (64-bit)
version: Generic
machine: sun4u
node name: moonstone
domain: foo.com
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
hostid: 80800000
boothowto: 0x40 (DEBUG)
time of crash: Mon Aug 19 16:55:51 MDT 2002 (core is 38 days old)
age of system: 20 days 9 hours 9 minutes 31.92 seconds
panic cpu: 0 (ncpus: 2)
panic string: sync initiated
scat4.0beta2(vmcore.0):0>
The above command leaves you at a prompt that works very much like a shell. The information printed
before the initial prompt is gathered from the corefile and presented here, because it’s information you’d
likely want at the start. The prompt indicates which file you’re currently looking at, and gives a history
number. The most interesting piece of information above is the panic string. Here, it indicates that the
corefile was generated by typing ‘sync’ at the ok prompt. In older releases of Solaris, the panic string
would simply have been ‘zero.’
To use Solaris CAT to examine a currently running kernel simply invoke it with no arguments:
# /opt/SUNWscat/bin/scat
scat4.0beta2(live):1>
Note that there is no panic string, and the prompt is reminding you that you’re looking at the running
kernel.
Commands that customize the Solaris CAT environment can be placed in the ~/.scatinit file:
% cat ~/.scatinit
set -o vi
alias more=less
alias l=’ls -l’
alias lr=’ls -lart’
scatenv dis_synth_only on
scatenv thr_flags off
scatenv thr_lwp off
scatenv thr_cpu off
scatenv thr_syscall off
scatenv stk_switch on
The scatenv command modifies the behavior of Solaris CAT, and when launched with no arguments it
will explain what all the variables do. The ~/.scatstartup file is used to run commands at startup time,
after the core image has been opened. For example, if you wanted to automatically dump out the message
buffer for every corefile you opened, you would simply do this:
% cat ~/.scatstartup
msgbuf
Note – These two startup files were split from the previously used ~/.scatrc file.
Users who have a ~/.scatrc file will need to split the contents out accordingly.
Getting Help
Solaris CAT has several resources available to help explore the command set.
Online Help
The online help will provide a list of all commands when the help command is invoked without options.
The help engine also supports regular expressions with the help -? regex syntax.
HTML documentation
These documents (starting with guide.html) demonstrate all the commands and are available in the
SUNWscat package.
Email alias
The SolarisCAT_Feedback@sun.com email alias is available for questions, bug reports and requests for
enhancements.
Support forum
http://supportforum.sun.com/solariscat
This chapter is a guide to reading and (carefully) writing values in a running kernel. The reason for
putting this chapter before crashes and hangs is that many users will not have corefiles available to
practice on. Writing values to a running kernel is a great way to make some; we can examine them later.
Consider the rootvp variable, which is a pointer to a struct of type vnode. On a 32-bit kernel, a pointer is
4 bytes long, whereas on a 64-bit kernel a pointer is twice that length. Solaris CAT does some of the work
here for you; the rd command will read one “word” from memory. This corresponds to four bytes on a
32-bit kernel, and eight on a 64-bit kernel.
scat4.0beta2(live):19> rd rootvp
0x10438e28 genunix:rootvp+0x0 = 0x70451f04
The first example above, from a 32-bit system, shows the address we’re interested in as a four byte
quantity. The second example comes from a 64-bit system, and is an eight byte quantity. The commands
rd32 and rd64 would have produced the same output for the first and second examples, respectively.
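The relationship between word size and pointer size can be sketched in a few lines of Python (used here for illustration only, since the result depends on whatever machine runs it rather than on scat):

```python
import ctypes

# The native pointer width is what determines how many bytes the
# "rd" command reads: one machine word.
ptr_bytes = ctypes.sizeof(ctypes.c_void_p)

# A 32-bit kernel uses 4-byte pointers; a 64-bit kernel uses 8-byte
# pointers, so the same "rd rootvp" reads twice as much data.
word_bits = ptr_bytes * 8
print(f"pointer size: {ptr_bytes} bytes ({word_bits}-bit)")
```

On a 64-bit system this prints `pointer size: 8 bytes (64-bit)`, matching the eight byte quantity rd shows there.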
Now that we know where rootvp is in memory, let’s dump out the memory starting at that address as a
vnode structure:
scat4.0beta2(live):20> sdump 0x70451f04 vnode
{
v_lock = {
_opaque = [ NULL 0xbaddcafe ]
}
v_flag = 0x0
v_count = 0x1
v_vfsmountedhere = NULL
v_op = 0x1044008c (specfs:spec_vnodeops+0x0)
v_vfsp = NULL
v_stream = NULL
v_pages = NULL
v_type = 0x3 (undefined enum constant)
v_rdev = 0x3740000 (221(vxio),0)
v_data = 0x70451f00
v_filocks = NULL
v_shrlocks = NULL
v_cv = {
_opaque = 0x0
}
v_locality = 0xbaddcafe
}
The above command could also have been written sdump *rootvp vnode and would have produced the
same output.
Without the source code, it is more difficult to make sense of a corefile, but the header files found in
/usr/include are a great help. In the example above, we saw a vnode, perhaps for the first time. To find
out more, we could look in the header files.
From there, we can look in /usr/include/sys/vnode.h for the type definition of a vnode (awfully
formatted to show one element per line):
Each of the structs shown above are also documented similarly under the /usr/include directory, and
many of the header files have comments like those above explaining what the different elements of the
struct do.
In addition to the files in /usr/include, Solaris CAT provides a means for showing the elements of a
structure: the stype command.
# /opt/SUNWscat/bin/scat --write
...
scat4.0beta2(live):0> rd memscrub_verbose
0x140b904 unix:memscrub_verbose+0x0 = 0x0
scat4.0beta2(live):1> wr memscrub_verbose 1
scat4.0beta2(live):2> rd memscrub_verbose
0x140b904 unix:memscrub_verbose+0x0 = 0x1
At this point, the astute reader will note that the above was done without knowing the type of
memscrub_verbose, and was therefore a very dangerous operation indeed. We wrote the value of 0x1,
but how many bytes did we write? The answer in this case is the default: a word. This is a very bad thing
for us, as memscrub_verbose is an unsigned int (4 bytes) and we’ve now written 8 bytes, because this is a
64-bit kernel.
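A small Python sketch makes the damage concrete. It models two adjacent 4-byte kernel variables as big-endian bytes (SPARC is big-endian); the neighboring variable and its value are hypothetical, invented for illustration:

```python
import struct

# Simulated big-endian kernel memory: memscrub_verbose (4 bytes)
# followed by a hypothetical neighboring 4-byte variable.
mem = bytearray(8)
struct.pack_into(">I", mem, 4, 0xCAFE)   # neighbor's original value

# "wr memscrub_verbose 1" with the default word size on a 64-bit
# kernel writes one big-endian 8-byte word at offset 0 ...
struct.pack_into(">Q", mem, 0, 1)

verbose = struct.unpack_from(">I", mem, 0)[0]
neighbor = struct.unpack_from(">I", mem, 4)[0]
# ... leaving memscrub_verbose itself at 0 and poking the 0x1 into
# the neighboring variable instead.
print(hex(verbose), hex(neighbor))   # 0x0 0x1
```

This is exactly the "0x1 in the wrong spot" described next: the intended variable ends up zeroed while its neighbor is silently clobbered.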
You can see that we’ve poked a 0x1 into the wrong spot. After correcting this problem, and thanking the
kernel gods that we got away with a colossal blunder, we issue the correct command this time, not taking
the default size:
Note that the rd command now appears to give the wrong answer, but that’s simply because it’s the
wrong tool for the job. The rdh output gives us what we want, as we can verify by looking at the messages
file.
% tail -1 /var/adm/messages
Sep 30 19:55:40 moonstone unix: [ID 217272 kern.notice] NOTICE: Memory scrubber
read 0x2 pages starting at 0xbf000000
The moral of the story is that writing to the running kernel is a dangerous business, and you probably
shouldn’t do it unless you’re prepared to deal with the possible fallout. If you must do it, keep in mind
that there are various tricks to keep you from writing the wrong amount of data at your target address.
Use rdb and mdump to examine memory one byte at a time. Use calc "sizeof(uint_t)" to
determine how many bytes make up a uint_t, another type, or even a struct. Last but not least, use the
information in the header files and in the Solaris Tunable Parameters Reference Manual (available at
http://docs.sun.com/?p=/doc/806-7009/) to make the most informed decisions possible.
Method one
While the system is running, abruptly bring it to the ok> prompt by using the Stop-A key sequence or by
sending a BREAK signal to the serial port. Once at the ok> prompt, typing sync will cause the system to
reboot, and produce a corefile.
Method two
# /opt/SUNWscat/bin/scat --write
...
opening /dev/ksyms /dev/kmem ...symtab...core...done
loading core data: modules...panic...memory...time...misc...done
loading stabs...patches...done
Shortly after this, the system will panic and produce a corefile useful for practicing the commands
illustrated in the following chapters.
Method three
A third method of corefile creation exists for systems that have a dedicated dump device. A dedicated
dump device is a partition set aside for crash dumps that is not also a swap device. These systems can use
the savecore -L command to produce a corefile without bringing the system down. This functionality
was introduced in Solaris 7. One drawback to this method is that the core image is collected without
stopping the running kernel, and so inconsistencies may appear in the corefile. For example, it may be
difficult to walk a linked list that was being updated at the time.
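The linked-list hazard can be sketched with a toy model of a live core image; the addresses and node contents below are made up for illustration:

```python
# Toy model of a live core image: a snapshot of "memory" taken while
# the kernel was still updating a linked list.
snapshot = {
    0x1000: {"data": "a", "next": 0x2000},
    0x2000: {"data": "b", "next": 0x3000},
    # 0x3000 was freed just as the snapshot was taken, so that
    # address is absent from the image even though 0x2000 still
    # points at it.
}

def walk(mem, head):
    """Walk the list, reporting where the snapshot became inconsistent."""
    seen = []
    addr = head
    while addr is not None:
        node = mem.get(addr)
        if node is None:
            seen.append(f"dangling pointer -> {hex(addr)}")
            break
        seen.append(node["data"])
        addr = node["next"]
    return seen

print(walk(snapshot, 0x1000))
```

A debugger walking such a snapshot hits the dangling pointer the same way; the corefile is not wrong so much as caught mid-update.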
This chapter deals with the analysis of corefiles produced when a system calls the panic routine. If you
created a corefile with method two, described at the end of Chapter 2, it can be used to follow along
with the examples given below.
% scat 0
files: /cores/checkpage/*.0
release: 5.6
version: Generic_105181-09
machine: sun4u
hw_provider: Sun_Microsystems
Without running any commands, we don’t know too much about this problem. The system panicked with
a bad trap, but that could be virtually anything. We need more information. Note that the output of
Solaris CAT may appear different based on the contents of ~/.scatinit. The output that follows was
produced with the .scatinit file shown in Chapter 1.
scat4.0beta2(vmcore.0):0> panic
panic on cpu 5
panic string: trap
==== panic kernel thread: 0x3021fe80 pid: 2 on cpu: 5 ====
cmd: pageout
The panic command displays the stack of the panic thread, along with various bits of helpful information
about that thread. The command thread *panic_thread would deliver the same results. Reading
the stack from the bottom up, we can tell we’re looking at the pageout command, that the routine
pageout_scanner called checkpage, and that something went terribly wrong in the checkpage routine.
The trap frame tells us that the %pc register contained the instruction lduh [%o0 + 0x8], %o0 when
things went bad. Looking at this instruction, we can see that we were trying to take the contents of register
%o0, add eight, dereference an unsigned halfword, and put the result back into %o0. Further, the trap
frame shows us that %o0 was 0x0 at the time; the scanner was trying to dereference memory address 0x8,
and 0x8 isn’t a good address. So now the question becomes: why was %o0 NULL? At this point, we should
check to see if this is a known bug. Plugging the string “checkpage pageout_scanner” into SunSolve
brings up bug 4188132, which was closed as a duplicate of 4169509. Bug 4169509 was fixed in kernel
patch 105181-13, and we already know from the Solaris CAT preamble that this system was behind on
patches.
Filesystem corruption
The purpose of calling panic and creating a corefile is to stop processing once the kernel determines that
continuing could cause further problems. An example of this is trying to free an object that is already
free. Let’s look at a corefile that illustrates this.
% scat 9
files: /cores/freeing/*.9
release: 5.6
version: Generic_105181-06
machine: sun4d
hw_provider: Sun_Microsystems
system type: SUNW,SPARCserver-1000
time of crash: Fri Mar 12 06:44:23 MST 1999 (core is 1298 days old)
age of system: 14 minutes 18.53 seconds
panic cpu: 5 (ncpus: 6)
panic string: free: freeing free frag, dev:0x800080, blk:232, cg:120,
ino:683524, fs:/disk4
running sanity checks.../etc/system...rmap...dump flags...misc...done
scat4.0beta2(vmcore.9):0> panic
The second category can be more complicated, and involves failing hardware or misconfiguration. In
these cases, it’s important to use the msgbuf command to examine the message buffer for hardware errors.
The /var/adm/messages file should also be examined. If there are no obvious signs of a hardware problem,
then it is likely that there is a configuration problem; check for a swap partition that overlaps a mounted
filesystem. It’s also worth checking that the system is up to date on related patches (ufs, scsi, sd, and so on).
In the case above, the disk containing the /disk4 filesystem was replaced, and the problem was not seen
again.
Bad Traps
Traps happen all the time while the system is running; for example, system calls and references to pages
not currently in main memory both cause traps. Normally this isn’t a problem, but when something goes
wrong in handling a trap, the system must panic and generate a corefile. Here is an example of a trap that
went bad when the virtual address requested could not be found while in privileged mode.
% scat 0
files: /cores/badtrap/*.0
release: 5.6
version: Generic_105181-20
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-Enterprise
boothowto: 0x1000 (RECONFIG)
time of crash: Wed Nov 15 22:37:12 MST 2000 (core is 684 days old)
age of system: 7 days 2 hours 14 minutes 59.24 seconds
panic cpu: 10 (ncpus: 2)
panic string: trap
Reading this stack from the bottom up, we can see that things are moving along smoothly until we’re deep
into a routine called blkdone. The blkdone routine is actually a subroutine within another routine called
bcopy.
A little searching through /usr/include tells us that the bcopy routine takes three arguments: a source
address, a destination address, and a number of bytes to copy from source to destination. The question
now becomes: what arguments does bcopy get from the routine that calls it? In order to find out, we need
to examine the instructions in fca_cmd_complete just prior to the transfer of control to bcopy. We can take
advantage of the fact that in the SPARC architecture, the %o registers of the calling routine become the %i
registers of the called routine, to determine what arguments (the %i regs) bcopy was called with.
So we can see that the source address is %l0 + 0x18, the destination address is %i5 + 0xc, and the number
of bytes is the unsigned word at the address pointed to by %fp - 0x14. We’re almost there; we just need
to know what the contents of %l0, %i5, and %fp were when fca_cmd_complete was on the stack. Luckily
for us, Solaris CAT provides a command to do just that.
Now, we can use the values from the frame command to determine the arguments passed to bcopy.
scat4.0beta2(vmcore.0):4> rd 0x6506dfa4+0x18
0x6506dfbc = 0x70000200
scat4.0beta2(vmcore.0):5> rd 0x6787e600+0xc
0x6787e60c = 0x70000200
scat4.0beta2(vmcore.0):6> rd 0x30023ac8-0x14
0x30023ab4 = 0x60
Examining the 96 (or 0x60) bytes at the source address, we can see the problem.
The bcopy routine was told to copy 0x60 bytes from a buffer that is only 0x44 (0x6506e000-0x6506dfbc)
bytes long. The vendor of the fcaw driver produced a patch for this bug.
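The overrun arithmetic can be double-checked with a few lines of Python, using the values scat reported above (the buffer's end address comes from the 0x6506e000-0x6506dfbc calculation in the text):

```python
# Recomputing the fcaw overflow from the values scat reported.
src_addr = 0x6506dfbc    # source address (%l0 + 0x18)
buf_end = 0x6506e000     # end of the source buffer
copy_bytes = 0x60        # third argument passed to bcopy

buf_len = buf_end - src_addr
print(f"buffer holds {buf_len:#x} bytes, bcopy was asked for {copy_bytes:#x}")

overrun = copy_bytes - buf_len
print(f"bcopy reads {overrun:#x} bytes past the end of the buffer")
```

The buffer holds 0x44 bytes, so the 0x60-byte copy runs 0x1c bytes past the end of the buffer.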
% scat 8
The cause of the panic can be seen by looking at the trap frame %pc, and examining the contents of
memory that instruction was attempting to dereference. Here, the system was attempting to dereference
the contents of %i1, which was 0x4190 at the time, and that is an invalid address. We can confirm this with
the msgbuf command.
scat4.0beta2(vmcore.8):3> msgbuf
Note that the addr= section of the BAD TRAP message confirms our hypothesis. The next step involves
trying to determine why that register contains the wrong value, but in this case, the trap frame holds more
information worth examining. Note that the %pc and %npc contain instructions from two different
functions, and neither of them is a transfer of control instruction. Examination of the code surrounding
both the %pc and %npc indicates that they cannot possibly be consecutive, even considering the “delay
slot” used with transfer of control instructions. One of them must be wrong. Given the previous function
on the stack, the %npc looks like it must be wrong.
It’s possible that the instructions were corrupted in memory, so dump them out from the corefile.
Since turnstile_interlock+0x8 is correct in memory, we need to determine how the %npc got loaded with
something else. Knowing what ought to be there helps in this search. The %pc contains 0x100e3ad8, and
SPARC instructions are four bytes long; therefore, the %npc should contain 0x100e3adc but instead
contains 0x100e7adc. How many bits of difference are there between what is there and what ought to be
there?
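The question answers itself with a few lines of Python, using the register values from the corefile: XOR the two addresses and count the set bits.

```python
pc = 0x100e3ad8
npc = 0x100e7adc          # value actually found in %npc
expected_npc = pc + 4     # SPARC instructions are 4 bytes long

# XOR leaves a 1 bit everywhere the two values disagree.
diff = npc ^ expected_npc
print(f"expected %npc: {expected_npc:#x}")
print(f"bits differing: {bin(diff).count('1')} (mask {diff:#x})")
```

Exactly one bit differs (mask 0x4000), the classic signature of a single flipped bit.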
Another handy command for getting at the same information is the flip command.
Stack Overflows
Each time one function calls another, the calling function must save enough of its state that when the
callee returns, the caller can pick up where it left off. For almost every function in the kernel that gets
called, a “stack frame” is saved in memory. This can become a problem if the chain of functions called
grows so deep that the stack frames exceed the space set aside for the thread’s stack.
% scat 0
...
core file: /cores/stack_overflow/vmcore.0
release: 5.7 (64-bit)
version: Generic_106541-12
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-Enterprise
time of crash: Thu Dec 21 13:02:00 MST 2000 (core is 653 days old)
age of system: 1 days 18 hours 2 minutes 25.63 seconds
panic cpu: 1 (ncpus: 16)
panic string: Kernel panic at trap level 2
Before even typing the first command for this corefile, we have two decent clues that suggest this may be
a stack overflow corefile. First is the panic string; this does not by any means indicate a definite stack
overflow, but it is the most common panic string for this type of panic. The second clue is that the
/etc/system sanity checks failed. The variables lwp_default_stksize and rpcmod:svc_run_stksize control
the amount of space a kernel thread is allotted for stack growth. The administrator of this system set these
parameters twice. When the /etc/system file is read, if a value is defined twice, the line nearest the
bottom of the file is used. Here, the administrator set the values twice, and had the misfortune of putting
the value that would have prevented this panic first.
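The last-value-wins behavior is easy to model. The /etc/system fragment below is hypothetical (the actual values on the panicked system are not shown in the corefile output), but the parsing rule is the one described above: later lines simply overwrite earlier ones.

```python
# Hypothetical /etc/system fragment: each tunable is set twice, with
# the larger (safe) value first and the smaller value last.
etc_system = """
set lwp_default_stksize=0x6000
set rpcmod:svc_run_stksize=0x6000
set lwp_default_stksize=0x4000
set rpcmod:svc_run_stksize=0x4000
"""

tunables = {}
for line in etc_system.splitlines():
    line = line.strip()
    if line.startswith("set "):
        name, _, value = line[4:].partition("=")
        tunables[name] = int(value, 16)   # later lines overwrite earlier ones

print(tunables)   # both tunables end up at 0x4000
```

As in the corefile's /etc/system, the value nearest the bottom wins, so the thread stacks end up with the smaller allotment.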
The output from the panic command is not going to be overly useful here. The nature of the panic
prevents the tool from showing the full stack correctly.
scat4.0beta2(vmcore.0):6> panic
panic on cpu 1
panic string: Kernel panic at trap level 2
==== panic user thread: 0x30014c5afc0 pid: 15744 on cpu: 1 ====
cmd: cp -r /u01/app/oracle/admin/devTAP /u01/app/oracle/admin/stestAT
-- on ptl1_stk --
unix:complete_panic+0x20
unix:do_panic+0x150
unix:panic+0x1c
unix:user_rtt+0x0
unix:page_get_freelist - frame recycled
unix:page_create_va+0x328
-- switch to user thread’s stack --
unix:segkmem_alloc+0x38
-- error reading next frame @ 0x0 --
In order to get a more accurate picture of what was happening, use the findstk command.
scat4.0beta2(vmcore.0):7> findstk
...
==== stack @ 0x2a1013f02b0 (sp: 0x2a1013efab1) ====
-- on user panic_thread’s stack --
0x0(genunix:kmem_slab_create?)
genunix:kmem_cache_alloc_global+0x40
genunix:kmem_cache_alloc+0x144
genunix:kmem_alloc+0x2c
unix:kalloca+0x5c
unix:i_ddi_mem_alloc+0x154
unix:i_ddi_mem_alloc_lim+0x80
genunix:ddi_iopb_alloc+0x6c
fcaw:fca_dma_zalloc+0x29c
fcaw:fca_pkt_dmactl+0x1acc
fcaw:fca_tran_init_pkt+0x228
scsi:scsi_init_pkt+0x54
sd:make_sd_cmd+0x130
sd:sdstart+0x114
sd:sdstrategy+0x3a4
emcp:PowerPlatformBottomDispatch+0x270
emcp:PowerDispatch+0x150
emcpcg:CgDispatch+0x1e0
emcp:PowerDispatch+0x13c
The sheer length of this stack is another indication of a stack overflow. Note that the trap frame near the
bottom of the stack is not a bad trap, but rather an access to an alternate address space. Solaris CAT prints
out the trap frame for this, although it isn’t related to the problem at hand. More hints can be gleaned by
looking at what the system was doing at trap level 1. Recall that the panic string indicated the system
panicked at trap level 2. Using the ptl1 argument to the panic command shows us data specific to the
different trap levels.
Note that at TL1, the %pc is trying to write a value onto the stack; the presence of the %sp (stack pointer)
register in the instruction indicates this. Another hint that this is a stack overflow is the %pc at TL2.
During a stack overflow, it will often be in have_win.
This chapter deals with the analysis of corefiles produced when a system is hung and unresponsive.
Corefiles produced using method one from chapter two can be used to follow along with the examples in
this chapter, although a corefile taken from a normally running system is significantly less interesting to
look at.
Deadlock
This corefile demonstrates a locking bug that eventually causes the system to hang completely.
% scat 1
core file: /cores/reader_lock/vmcore.1
version: Generic_108528-02
machine: sun4u
hw_provider: Sun_Microsystems
The above information doesn’t go very far in explaining the source of the hang. Below, the thread
summary command gives us a quick overview of the state of the system when the corefile was
generated. In this case, it provides us with clear direction for our next step.
scat4.0beta2(vmcore.1):0> thread summary
reference clock = panic_lbolt: 0x3324805
8 threads ran since 1 second before current tick (0 user, 8 kernel)
11 threads ran since 1 minute before current tick (0 user, 11 kernel)
0* TS_RUN threads
2 TS_STOPPED threads (0 user, 2 kernel)
19 TS_FREE threads (0 user, 19 kernel)
0 !TS_LOAD (swapped) threads
0 threads in biowait()
Of the 455 threads on the system, 382 are asleep waiting for another thread to release an rwlock.
Corefiles caused by rwlock hangs can be difficult to diagnose because information about the resource
owner is only recorded when the lock is acquired for writing, and read lock acquisition is the much more
common case. The next step is to examine the threads that are waiting on this resource.
The above information shows that almost all of the threads waiting to acquire an rwlock on this system
are waiting for the same lock. If the lock was acquired as a write lock, we can immediately get the owner,
and examine that thread to determine why it isn’t releasing the lock. If the lock was acquired as a read
lock, Solaris CAT provides an educated guess about what thread is holding this lock, and causing the
problem.
Above, we see that the lock is being held as a read lock, so we need Solaris CAT to help determine possible
owners of this lock. In order to do this, the stack of every thread on the system is examined to see if the
lock address (the wchan) is there. From this list, the threads that are waiting for the lock may be excluded,
because a thread that tries to acquire a lock it already owns will panic the system. The remainder are
threads likely to own the lock in question.
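The heuristic just described can be sketched in a few lines of Python. The thread records below are made up for illustration (real thread IDs and stacks come from the corefile); only the filtering rule mirrors what Solaris CAT does:

```python
# Sketch of the read-lock owner heuristic: any thread whose stack
# contains the lock's address is a candidate owner, unless it is
# itself sleeping on that lock.
LOCK = 0x300052059f8

threads = [
    {"id": 0x30001af9020, "stack": [LOCK, 0xdead], "wchan": None},  # candidate
    {"id": 0x30007cb62c0, "stack": [LOCK],         "wchan": LOCK},  # waiter
    {"id": 0x30001234560, "stack": [0xbeef],       "wchan": None},  # unrelated
]

def likely_owners(threads, lock):
    return [
        t["id"]
        for t in threads
        if lock in t["stack"]      # lock address appears on the stack
        and t["wchan"] != lock     # but the thread is not waiting on it
    ]

print([hex(t) for t in likely_owners(threads, LOCK)])
```

The result is a short list of suspects rather than a definitive owner, which is why the text calls it an educated guess.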
Only one thread had the lock address in its stack and wasn’t on the sleepq.
The above thread is also asleep because another thread is holding an rwlock that it requires. The exact
same procedure can be used to determine what thread is holding this one back.
In this case, two threads are holding this lock, which is allowed for readers. Both of these threads are
waiting on a lock that we’ve seen before. Thread 0x30007cb62c0 has 0x30001af9020 as a reader, and is
sleeping waiting for rwlock 0x300052059f8 to come free. rwlock 0x300052059f8 is owned as a reader by
Memory exhaustion
In some cases, a system might be inaccessible because of a critical resource shortage. In the following
corefile, the system was up, but nothing could be done on it; every command typed produced a fork
failed: not enough space error message.
% scat 5
...
core file: /cores/memory/vmcore.5
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)
boothowto: 0x40 (DEBUG)
time of crash: Mon Oct 7 08:19:08 MDT 2002
age of system: 48 minutes 53.71 seconds
panic cpu: 2 (ncpus: 2)
panic string: sync initiated
0* TS_RUN threads
1 TS_STOPPED threads (1 user, 0 kernel)
0 TS_FREE threads
1* !TS_LOAD (swapped) threads (0 user, 1 kernel)
There wasn’t much in the summary to give us an idea about the nature of the problem; however, the fork
failed messages indicate we might be dealing with a memory problem, as does the message buffer.
scat4.0beta2(vmcore.5):2> msgbuf
...
WARNING: /tmp: File system full, swap space limit exceeded
By looking at the process list sorted by the reserved swap size, we can see that user 83750 launched a
process called leak that caused the memory exhaustion on this system. The meminfo command can be
used to show that this process is using all the swap on this system.
scat4.0beta2(vmcore.5):4> meminfo
...
initial swap available for reservation 275459 pages (2.10G)
k_anoninfo.ani_max + MAX((availrmem_initial - swapfs_minfree), 0)
current swap available for reservation 549 pages (4.28M)
Runaway processes
Solaris systems impose a limit on the number of processes that can simultaneously exist on a system. If a
process inadvertently starts spawning children and hits the limit, no other regular user can spawn new
processes. A limited number of extra processes are reserved for root for the purpose of cleaning up, but
often a corefile will be generated when users cannot continue to work.
% scat 3
...
core file: /cores/fork/vmcore.3
machine: sun4u
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-60
kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)
boothowto: 0x40 (DEBUG)
time of crash: Sun Oct 6 18:19:49 MDT 2002
age of system: 3 days 6 hours 21 minutes 15.03 seconds
panic cpu: 2 (ncpus: 2)
panic string: sync initiated
Now that we know what the problem is, we just need to find out why it happened.
It doesn’t require too much more digging to see that UID 83750 launched a process that consumed all the
available process slots on the system.