ST350 - Sun Systems Fault Analysis Workshop - Ig - 0596
ST-350
About This Course
Overview
The primary objective of this course is to teach a systematic fault
analysis technique for troubleshooting intermediate and some advanced
Solaris system faults.
Copyright 1997 Sun Microsystems, Inc. All Rights Reserved. SunService May 1996
Course Prerequisites
Monday
Tuesday
Wednesday
Thursday
Friday
● Have the instructor initial your completed lab projects and fault
forms.
John Shedaker
Sun Microsystems
2550 Garcia Ave., MS UMIL06-01
Mountain View, CA 94043
The following table describes the type changes and symbols used in
this book.
Typeface or Symbol    Meaning    Example
Code samples are included in this book and may display the following:
Hardware Requirements
Troubleshooting
● Bad video cable, Part Number 530-1440; color CRT cable 1.2M
DB13W3 to DB13W3; pull out pin 5.
Workshop Environment
Normal Operation
● A white board in the lab during the fault workshops listing all the
teams and the status of all the faults. For example:
● When all the faults are done, the workstation becomes free for a
new fault, which will be added to the list.
● Ask the students to keep track of their progress by using the Fault
Tracker form in the Student Guide.
● Ensure that all students work all the faults before finally removing
them from the workstation.
Course Philosophy
Stress that this course is about developing an approach to fixing
system errors. The students are measured not by the number of
faults they fix but by the approach they use.
Stress, above all, that they are troubleshooting the fault, not what
the instructor has done to emulate the fault. If they try to
outguess the instructor, they will miss the point of attending the class.
Scripts
The script files have been written to operate from the
/opt/st350_scripts directory. You must modify the scripts if they
are loaded into another directory.
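If the scripts are loaded somewhere other than /opt/st350_scripts, the path change can be applied in bulk. The following Python sketch is one possible way to do that; the function name and its parameters are invented for this illustration, and it assumes the scripts refer to the install directory by its literal path. Work on a copy of the scripts.

```python
from pathlib import Path

DEFAULT_BASE = "/opt/st350_scripts"  # install path named in this guide


def retarget_scripts(script_dir, new_base, old_base=DEFAULT_BASE):
    """Replace old_base with new_base in every regular file in script_dir.

    Hypothetical helper: it assumes the scripts contain the install
    directory as a literal string. Returns the names of changed files.
    """
    changed = []
    for path in sorted(Path(script_dir).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Work on bytes so files with unusual encodings are left intact.
        updated = data.replace(old_base.encode(), new_base.encode())
        if updated != data:
            path.write_bytes(updated)
            changed.append(path.name)
    return changed
```

Calling retarget_scripts with the directory holding the scripts and the new base path rewrites the files in place and lists the ones it touched.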
The script files are located with the ST-350 compressed tar file for this
course (available from the chocolate.ebay server). For contractors,
please contact the manager you report to for access to these files.
ftp forward.Ebay.sun.com
name: st350
passwd: st350
Error Detection Overview
  Introduction
  Error Types
  Error Reporting Mechanisms
    Bus Errors
    Interrupts for Reporting
    Resets
  Type of Errors
    Software Errors
    Hardware-Corrected Errors
    Recoverable Errors
    Fatal Errors
    CPU Watchdog Reset
    System Watchdog Reset
    Critical Errors
  Primary Buses
  Sun-4u
  Memory Management Unit (MMU)
  Number Base Conversion Chart
  Page Table Entry – Sun-4 Architecture
    Sun-4 PTE Format
    Examples of Valid PTEs
  Page Table Entry – Sun-4c Architecture
    Sun-4c PTE Format
    Examples of Valid PTEs
  Page Table Entry – Sun-4m Architecture
    Access Code
    Examples of Valid PTEs
  Page Table Entry – Sun-4d Architecture
    Access Code
    Example of Valid PTEs
  Sun-4 Error Detection Workshop
  Sun-4c Error Detection Workshop
    Example 1
    Example 2
  Sun-4m Error Detection Workshop
    Example 1
    Example 2
  Sun-4d Error Detection Workshop
    Example 1
    Example 2
  Skills Checklist
    System Fault Status Register (sfsr) Format
POST Diagnostics
  Diagnostics Overview
    Hardware Tests
    Additional References
  Installing SunVTS Software
  The SunVTS Graphical User Interface
  Selecting and Setting Up Tests
  SunVTS Testing Options
  Tests Switch
    Option Files
  Running the SunVTS Tests
    System Status Panel
    Test Status Panel
    Performance Monitor Panel
  Reviewing SunVTS Test Results
    System Status Panel
    Console Window Messages
    Log Files
  Using SunVTS in TTY Mode
  Negotiating the SunVTS TTY Interface
  Using SunVTS Remotely
    Kernel Interface
    User Interface
  Lab Overview
    Lab Objectives
    Equipment
    Lab Tasks
SunSolve
  Overview
  Distribution
  SunSolve Online Account
  Installing SunSolve
    Installing SunSolve Using File Manager
    Installation GUI Window
    Sharing SunSolve
  Starting SunSolve
    Starting From an Installed Server
    Starting From the CD-ROM
    The SunSolve Window
  Search Tool
    Configuring SunSolve
    SearchTool Properties
  Troubleshooting Using SearchTool
    Setting Up the Search
    Keyword Logical Connectors
    Starting the Search
  Datasets and Collections to Search
    kadb Workshop Introduction
    kadb Description
    Invoking and Exiting kadb
    Mapping UNIX Data Structures
    Related Data Structures
  Kernel Crash Dump Analysis Workshop 5
  Kernel Crash Dump Analysis Workshop 6
  Kernel Crash Dump Analysis Workshop 7
  Watchdog Reset Workshop 8 (Sheet 1 of 2) Optional
    Bug Install
  Program Debugging – Optional
  ps Command Workshop 9 – Optional
    Introduction
  Sequence of Procedures (Do Not Execute)
    Setting Base Level
    Acquiring Base-Level Information
    Tracing OpenWindows Processes
  Setting Base Level
  Acquiring Base-Level Information
  Base-Level Processes (1 of 2)
  Tracing OpenWindows Processes
  Workshop Summary Exercise
  Skills Checklist
Fault Tracker Progress Chart
Fault Worksheets - Student Guide
  Requirements
  Resources
  System Configurations
  Fault Worksheet #1 - Blank Monitor
  Fault Worksheet #2 - Device Error During Boot
  Fault Worksheet #3 - File Errors During Boot
  Fault Worksheet #4 - Incomplete Boot to Solaris Operating System
  Fault Worksheet #5 - Login Problem
  Fault Worksheet #6 - adb Macro Error
  Fault Worksheet #7 - Feckless
  Fault Worksheet #8 - Incomplete Boot to Solaris Operating System
  Fault Worksheet #9 - Turn the Page
  Fault Worksheet #10 - Login Problem
  Fault Worksheet #11 - Network Problem
  Fault Worksheet #12 - OpenWindows Problem
  Fault Worksheet #13 - Shutdown When Opening Windows
  Fault Worksheet #14 - Network Printer Problem
Faults - Instructor Guide Only
  Fault #1 - Blank Monitor
  Fault #2 - Device Error During Boot
  Fault #3 - File Errors During Boot
  Fault #4 - Incomplete Boot to Solaris Operating System
  Fault #5 - Login Problem
  Fault #6 - adb Macro Error
  Fault #7 - Feckless
  Fault #8 - Incomplete Boot to Solaris Operating System
  Fault #9 - Turn the Page
  Fault #10 - Login Problem
  Fault #11 - Network Problem
  Fault #12 - OpenWindows Problem
  Fault #13 - Shutdown When Opening Windows
  Fault #14 - Network Printer Problem
  Fault #15 - Incomplete Boot to Solaris Operating System
  Fault #16 - Constant Power Down or Reboot Problem
  Fault #17 - The ps Command Returns Nothing
  Fault #18 - Password.Not.Found
  Fault #19 - Network Problem
  Fault #20 - OpenWindows Problem
  Fault #21 - Banner Logo Has Been Changed
  Fault #22 - Do Not Tread on Me
  Fault #23 - vi Editor Problem
  Fault #24 - “Hacker” Intrudes the System
  Fault #25 - No OpenWindows Environment
  Fault #26 - Login Problem
  Fault #27 - “Hangs” on Boot
  Fault #28 - No Network
  Fault #29 - Where It Is At
  Fault #30 - Seedy ROM
  Fault #31 - See It Now
  Fault #32 - Cannot Log In as Root
  Fault #33 - No Network or Interface
  Fault #34 - Script “Hangs” System
  Fault #35 - No shcat
  Fault #36 - Login Problem
  Fault #37 - Noel Two
  Fault #38 - Client-Server ftp Problem
  Fault #39 - Network Problem
  Fault #40 - Slow and Fast Perceptions
  Fault #41 - Cannot Boot Diskless Client
  Fault #42 - Logs Out During OpenWindows Startup
  Fault #43 - Sorry User
  Fault #44 - No Window, Use SunSolve
  Fault #45 - NIS+ Password
Fault Analysis and Diagnosis 1
Objectives
Upon completion of this module, you will be able to:
References
Alamo Learning Systems AdvantEdge Analysis Program
Introduction
You may be an expert. With the expert approach, you gather data and
use your experience and the experience of others to determine causes.
Fault analysis and diagnosis provides you with a powerful tool to
analyze data and focus on the likely causes of a complex problem or a
problem outside of your immediate experience.
[Figure: fault analysis and diagnosis flowchart. Fault Analysis: 1. State the problem. 3. Identify differences. Diagnosis: 5. Generate likely causes. Test: if No, go to the next likely cause; if Yes, verify the likely cause.]
Given a system problem, identify the object and its defect, and write a
problem statement. A problem statement answers these questions:
● Does the statement state the exact deviation from the norm?
Most bugs that become disasters do so because the original problem
was not described correctly.
The next step in system fault analysis and diagnosis is to describe the
problem in detail.
Questions to Ask
Expand and customize a question list for your own style and
environment.
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
Fact Sources
● Customer complaints
● Dumps
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
Questions to Ask
● What similar object might have this defect but does not?
● What other defect could you see on the problem object but do not?
● Where else on the problem object could you see the defect but do
not?
● When could the defect have been first observed but was not?
● What other time in the object’s life cycle could the defect have
occurred but did not?
● In what other pattern could the defect have occurred but did not?
● How many of the objects might have been defective but are not?
● What other trend could have been observed but was not?
● Are the comparative facts as close and similar to the observed facts
as possible and yet not complete opposites?
Identifying Differences
Use the lists of observed facts and comparative facts to analyze and list
the differences.
● List only the differences that are unique between the observed and
comparative facts.
First you use differences and relevant changes to discover likely causes
of the problem. Then you form a hypothesis about the cause, and
analyze the problem with facts, differences, and relevant changes.
Then you can diagnose the problem.
State your hypothesis in the form of a question and an answer that can
be tested. For example:
How could the fault analysis element have caused this problem?
For the fault analysis element, insert one of the following possibilities:
● A relevant change
● A single difference
You can develop as many hypotheses as you have facts. Use your
experience and judgement to limit, initially, the list to the most logical
and likely cause(s). If your first hypothesis does not prove true, you
can return to this step.
Using the list of likely causes, test each one to determine the most
likely cause. Testing your likely causes increases the certainty that you
will discover the actual cause of the problem before you embark on or
recommend a potentially costly, time-consuming solution.
To test for the most likely cause, eliminate any cause that fails to
explain the observed and comparative facts.
Eliminate a likely cause only when you are certain it cannot be the true
cause of the problem.
Test each likely cause separately using the fault analysis worksheets.
Ask yourself whether the cause can support the facts, and mark a Y for
yes or N for no on the line under the fact number. For example:
Test each likely cause against each relevant fact and mark it Y or N. If
you must make an assumption or have a doubt about an answer, mark
it with a question mark (?). If you simply cannot make a
determination, leave it blank.
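The marking scheme above can be sketched in a few lines of Python. This is only an illustration: the function name, the cause names, and the marks are hypothetical, and ranking the surviving causes by how few open questions they carry is an assumption of this sketch, not a rule from the worksheet. A cause is eliminated as soon as it carries an N for any fact.

```python
def evaluate_causes(matrix):
    """Split likely causes into surviving and eliminated ones.

    matrix maps a cause name to a list of marks, one per fact:
    "Y" (explains the fact), "N" (cannot explain it), "?" (assumption
    or doubt), or "" (no determination).
    """
    open_questions = {}
    eliminated = []
    for cause, marks in matrix.items():
        if "N" in marks:
            # A cause that fails to explain a fact is eliminated.
            eliminated.append(cause)
        else:
            open_questions[cause] = marks.count("?") + marks.count("")
    # Assumption of this sketch: fewer open questions ranks higher.
    ranked = sorted(open_questions, key=open_questions.get)
    return ranked, eliminated


# Hypothetical marks for three made-up causes against four facts.
example = {
    "Loose cable": ["Y", "Y", "?", "Y"],
    "Bad driver": ["Y", "N", "Y", "Y"],
    "Power glitch": ["Y", "Y", "Y", "Y"],
}
ranked, eliminated = evaluate_causes(example)
print("Still likely:", ranked)
print("Eliminated:", eliminated)
```

A "?" mark keeps a cause alive but flags an assumption that still needs checking before the cause is verified.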
Now you are ready to verify, test, and prove that the most likely cause
is the actual cause of the problem.
To verify the most likely cause, use the method that is:
● Least disruptive
● Least expensive
● Least time-consuming
● Most conclusive
Verifying the most likely cause should remove all uncertainty about
the cause of a problem. Three methods that verify the most likely
cause of the problem include:
● Results – Assume, without proof, that the most likely cause you
choose is the actual cause, and take the indicated corrective action.
This is the least conclusive verification, and it can be disruptive,
expensive, and time-consuming, especially if your assumptions
are not correct.
Problem Statement
Window system hangs on systems using the GX+ video frame buffer.
Problem Description: Observed Facts, Comparative Facts, Differences

1. What object (system) is defective?
   Observed facts: Six systems using ss2GX+ video frame buffer
   Comparative facts: Not on other Sun machines on this site, but on other Sun machines elsewhere
   Differences: Location and environment, temperature, humidity, dirt, power, static
2. What exactly is wrong?
   Observed facts: System "hangs" but can remote login
   Comparative facts: System does not crash or freeze; can sometimes fix by removing or inserting mouse
   Differences: Operating system is still running; power cycle of mouse may be related
3. Where is the object (system) located?
   Observed facts: Acme Industries; factory control units in manufacturing plant
   Comparative facts: Not at other Acme sites, other customers, or office environment
   Differences: Environment, network, vibration
4. Where on the object (system) does the defect appear?
   Observed facts: Not guaranteed repeatable, but window system most often affected
   Comparative facts: No documented hang at OBP monitor or single user
   Differences: Window system uses mouse and full resolution of GX+ color
5. When was the defect first observed?
   Observed facts: Call logged 1/18, problem has been ongoing for a while
   Comparative facts: Not right after delivery of systems using GX+ video frame buffers on 12/12
   Differences: Happening more often during busy periods
6. When in the life cycle was the defect noticed?
   Observed facts: Five weeks after installation
   Comparative facts: Not when system was brand new
   Differences: New hardware; bedded in
9. How many objects (systems) are defective?
   Observed facts: One group of six
   Comparative facts: Not all Sun workstations
10. What is the trend?
   Observed facts: Worse, more frequent
   Comparative facts: Not getting better or stable
1. What object (system) is defective? The systems using the GX+ video frame buffer after 12/12
the last quarterly anti-static treatment was
completed.
Likely Causes
Likely Cause 1 2 3 4 5 6 7 8 9 10
1 GX+ video frame buffer design or build fault Y N N ? Y N? Y - Y Y?
2 Environment (static) Y Y Y ? Y Y Y - Y Y
Application
Design
Final Repair
Static in the environment caused the problem.
The instructor is the user, and you can ask the instructor questions
about the problem.
The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating. The hosts were installed after midnight just prior to a
three-day holiday.
Host A
Host B
Host C
ping rlogin
The instructor is the user, and you can ask the instructor questions
about the problem.
The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating.
Host A
Host B
Host C
ping rlogin
The instructor is the user, and you can ask the instructor questions
about the problem.
The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating.
Host A
Host B
Host C
ping rlogin
Instructor Notes
This module is important for users new to the Sun environment, but
experienced personnel should review this module as well. Many
students need this systematic approach to help resolve problems.
The method used to fill in the fault analysis worksheet is the matrix
approach. It is not unique or even scientific, but the method works. For
the two classroom or lab exercises, use the following technique.
● Generate a list of likely causes for the problem, and try to fill in the
likely cause matrix. Using the data within the matrix, confirm or
eliminate supporting evidence within the matrix.
Instructor Notes
Complete the three fault exercises, Exercises 1-1, 1-2, and 1-3, in the
classroom. Before the students even begin to work on the equipment,
they should fill out the fault analysis worksheet and have at least two
action items completed, based on the class discussion.
Host A 192.200.30.10
Host B has the incorrect IP address for host A. This incorrect address is
important. You should replace the “1” with the lowercase “l”.
The last item asks why ping worked and rlogin did not.
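A mistyped address of this kind is hard to spot by eye but easy to catch mechanically. The sketch below, written in Python, scans hosts-style text and reports any entry whose address field is not a valid IPv4 address. The function name, the host names, and the exact position of the swapped character are made up for illustration; only the address 192.200.30.10 comes from the exercise.

```python
import ipaddress


def bad_host_entries(hosts_text):
    """Return (line_number, address) pairs whose address is not valid IPv4."""
    bad = []
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop comments, then split the entry into address and names.
        entry = line.split("#", 1)[0].split()
        if not entry:
            continue
        try:
            ipaddress.IPv4Address(entry[0])
        except ValueError:
            bad.append((number, entry[0]))
    return bad


# Hypothetical hosts entries; the second uses a lowercase "l" for a "1".
sample = """\
192.200.30.10   hostA
192.200.30.l0   hostA-typo
"""
print(bad_host_entries(sample))
```

The check only validates the address syntax; it does not explain why one network command succeeds where another fails, which remains the point of the exercise.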
Instructor Notes
Host A -
Host B
Host C - -
ping rlogin
Divide the class into halves or thirds (if the class is large enough).
Group 1 will use the Exercise 1 worksheet, “Solving Host C.” Group 2
will use the Exercise 2 worksheet, “Solving Host B.”
You can also perform the required operations to replace the blanks in
the chart.
Have the students switch faults and use the abbreviated form to
perform the analysis.
Let them determine which form they will use for the remainder of the
course.
Objectives
Upon completion of this module, you will be able to:
References
The SPARC Architecture Manual - Version 8
Introduction
The lab in this module will bring you back to an early level of
computer understanding and data manipulation – back to the 1’s and
0’s and register-bit mapping. The labs are architecture-dependent.
Error Types
● Bus Errors
● Interrupts
● Resets
● Types of errors
● Software errors
● Hardware-corrected errors
● Recoverable errors
● Fatal errors
● Critical errors
Bus Errors
Bus errors are issued to the processor when the processor makes a reference to
virtual or physical space that cannot be satisfied for hardware reasons.
Some typical bus errors occur:
● Error detected.
Resets
A reset attempts to bring the system to a well-known (deterministic)
state. Types of resets include:
● System
● Power on
● Watchdog
● System software
Type of Errors
Software Errors
Errors that do not originate in the hardware are classified as software
errors. All such errors are detected by the processor and are reported.
Examples of software errors are programming errors or bugs in the
system code.
Hardware-Corrected Errors
For error-logging purposes, hardware-corrected errors are always
signaled by an interrupt. No recovery action is normally required. For
example, a one-bit error from memory is corrected by the error checking
and correcting (ECC) logic and reported in the error log.
Recoverable Errors
Recoverable errors caused by hardware are usually signaled by a bus
error indication to the requesting device and a specified interrupt
(which could broadcast the error). Error recovery is normally handled
by the trap routines, while error logging is done by the interrupt
handler. A nonessential device losing power or becoming inaccessible
is an example of a recoverable error.
Fatal Errors
All fatal errors initiate a system-watchdog reset. Fatal errors
correspond to hardware errors in which proper system operation
cannot be guaranteed. Parity errors on backplanes are an example of a
fatal error.
Critical Errors
Critical errors require immediate system shutdown and power-off.
They are signaled through a high-level broadcast interrupt if at all
possible. Types of critical errors include:
● An AC/DC failure
● Temperature warning
● Fan failure
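The error classes described in this section can be summarized as a small lookup table. The Python sketch below restates only what the text says about how each class is signaled and handled; the dictionary layout, key names, and function name are illustrative and do not correspond to any Sun software interface.

```python
# Summary of the error classes described in this module. The structure
# and key names are illustrative, not part of any Sun software.
ERROR_CLASSES = {
    "software": {
        "signaled_by": "processor detection",
        "response": "reported by the processor",
    },
    "hardware-corrected": {
        "signaled_by": "interrupt",
        "response": "logged; no recovery action normally required",
    },
    "recoverable": {
        "signaled_by": "bus error to the requesting device plus an interrupt",
        "response": "trap routines recover; interrupt handler logs",
    },
    "fatal": {
        "signaled_by": "system watchdog reset",
        "response": "proper system operation cannot be guaranteed",
    },
    "critical": {
        "signaled_by": "high-level broadcast interrupt, if at all possible",
        "response": "immediate system shutdown and power-off",
    },
}


def describe(error_class):
    """Return a one-line summary for the named error class."""
    entry = ERROR_CLASSES[error_class]
    return f"{error_class}: signaled by {entry['signaled_by']}; {entry['response']}"


for name in ERROR_CLASSES:
    print(describe(name))
```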
Primary Buses
Sun Architecture
Architecture    Model
Sun-4           4/330, 4/370, 4/390, 4/470, 4/490
Sun-4m          SS5, SS10, SS20, 630, 670, 690, Classic, ClassicX, SSLX
Sun-4u
[Figure: Sun-4u (Ultra) architecture block diagram. Recoverable labels: Address, UltraSPARC, Serial, Sysio, SBus, Onboard, Bus, UPA, Multiplexor, UPA connector.]
The MMU contains page table entries (PTEs) that are loaded by kernel
code during normal process execution.
● Page caching
A valid PTE indicates that the virtual address has been mapped to a
physical page in memory.
Decimal  Binary  Hexadecimal
0        0000    0
1        0001    1
2        0010    2
3        0011    3
4        0100    4
5        0101    5
6        0110    6
7        0111    7
8        1000    8
9        1001    9
10       1010    a
11       1011    b
12       1100    c
13       1101    d
14       1110    e
15       1111    f
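The same conversions can be generated rather than looked up. The following Python sketch reproduces the chart and expands a hexadecimal register value into 4-bit groups, which is the form needed when reading PTE bits. The function names are invented for this example; d0000002 is used only as a sample 32-bit value.

```python
def conversion_row(value):
    """Return (decimal, 4-bit binary, hex digit) for a value from 0 to 15."""
    return (value, format(value, "04b"), format(value, "x"))


def hex_to_nibbles(hex_string):
    """Expand each hex digit into its 4-bit binary group."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_string)


# Reproduce the conversion chart.
for n in range(16):
    print("%2d  %s  %s" % conversion_row(n))

# Expand a sample 32-bit register value, most significant digit first.
print(hex_to_nibbles("d0000002"))
```

The leftmost group printed for a 32-bit value holds bits 31 through 28.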
● Bit 31 (PTE valid bit) – When set to one (1), the PTE is valid.
● Bit 30 (Write access bit) – When set to one (1), page has write
access.
● Bit 29 (System access bit) – When set to one (1), system access is
enabled for that page.
● Bit 28 (Do not cache bit) – When set to one (1), caching is disabled.
● Bit 25 (Access bit) – When set to one (1), indicates page has been
accessed.
● Bit 24 (Modify bit) – When set to one (1), indicates page has been
modified.
31 30 29 28 27 26 25 24 23 19 18 00
● 0 0 – Main memory
● 0 1 – I/O space
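As a rough illustration of the bit assignments listed above, the following Python sketch decodes a 32-bit Sun-4 PTE. The function name and dictionary keys are invented for this example. Two points are assumptions read off the bit-position ruler in the format diagram rather than statements from this guide: that the two-bit page type sits in bits 27-26, and that the page number occupies bits 18-00. The sample values are the ones entered with the p command in the Sun-4 workshop.

```python
# Page type codes listed in this guide for the Sun-4 PTE.
SUN4_PAGE_TYPES = {0b00: "main memory", 0b01: "I/O space"}


def decode_sun4_pte(pte):
    """Decode the flag bits of a 32-bit Sun-4 page table entry."""
    page_type = (pte >> 26) & 0b11  # assumed position: bits 27-26
    return {
        "valid": bool((pte >> 31) & 1),
        "write": bool((pte >> 30) & 1),
        "system": bool((pte >> 29) & 1),
        "no_cache": bool((pte >> 28) & 1),
        "type": SUN4_PAGE_TYPES.get(page_type, "not listed"),
        "accessed": bool((pte >> 25) & 1),
        "modified": bool((pte >> 24) & 1),
        "page_number": pte & 0x7FFFF,  # assumed position: bits 18-00
    }


for sample in (0xD0000002, 0xA0000002, 0x20000002):
    print(format(sample, "08x"), decode_sun4_pte(sample))
```

Under these assumptions, d0000002 decodes as valid and writable, a0000002 as valid but without write access, and 20000002 as invalid, which matches the behavior the workshop sets out to demonstrate.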
31 30 29 28 27 26 25 24 23 19 18 00
● Bit 31 (PTE valid bit) – When set to one (1), the PTE is valid.
● Bit 30 (Write access bit) – When set to one (1), page has write
access.
● Bit 29 (System access bit) – When set to one (1), system access is
enabled for that page.
● Bit 28 (Do not cache bit) – When set to one (1), caching is disabled.
● Bit 25 (Access bit) – When set to one (1), indicates page has been
accessed.
● Bit 24 (Modify bit) – When set to one (1), indicates page has been
modified.
31 30 29 28 27 26 25 24 23 19 18 00
● 0 1 – I/O physical
● 1 0 – I/O physical
● 1 1 – I/O physical
31 30 29 28 27 26 25 24 23 19 18 00
31 30 08 07 06 05 04 03 02 01 00
31 30 08 07 06 05 04 03 02 01 00
Access Code
ACC      Supervisor   User
1 0 0    - - x        - - x
1 0 1    r w -        r - -
1 1 0    r - x        - - -
1 1 1    r w x        - - -
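As a rough illustration, the access code table can be turned into a lookup. This Python sketch makes two assumptions that the text above does not state: the access code occupies bits 04 through 02 of the PTE, as in the SPARC Reference MMU, and the first permission group in each row applies to the supervisor and the second to the user. Only the four codes shown in the table are included, and all names are hypothetical.

```python
# Access codes 4-7 as listed in the table; each entry is
# (supervisor permissions, user permissions). Which group is supervisor and
# which is user is an assumption based on the SPARC Reference MMU definition.
ACCESS_CODES = {
    0b100: ("--x", "--x"),
    0b101: ("rw-", "r--"),
    0b110: ("r-x", "---"),
    0b111: ("rwx", "---"),
}


def access_code(pte):
    """Extract the 3-bit access code, assumed to sit in bits 4..2 of the PTE."""
    return (pte >> 2) & 0b111


def describe_access(pte):
    """Describe the access code of a PTE using the table above."""
    code = access_code(pte)
    perms = ACCESS_CODES.get(code)
    if perms is None:
        return f"ACC {code:03b}: not listed in the table"
    supervisor, user = perms
    return f"ACC {code:03b}: supervisor {supervisor}, user {user}"


if __name__ == "__main__":
    # 0000039a is the PTE value used in the Sun-4d lab.
    print(describe_access(0x0000039A))
```

Under these assumptions, the lab value 0000039a carries access code 110, which grants no write access.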
31 30 08 07 06 05 04 03 02 01 00
31 30 08 07 06 05 04 03 02 01 00
Access Code
k0
k1
3. Use the p command to open a page map for virtual address 1000
and enable it to be modified if needed.
p 1000
d0000002 is selected.
^t<6> 1000
l 1000
00001000: 00000000? 12345678
00001004: 00000000?
>l 1000 This shows that you wrote to
virtual address 1000. No errors were detected.
00001000: 12345678?
p 1000
Page Map 00000000 [segment: 0000]: F0000000? a0000002
Page Map 00002000 [segment: 0000]: F0000001?
>^t 1000
l 1000
l 1000
00001000: 00000000? 1234
The next error forces an invalid PTE for virtual address 1000. As
you will see, not even a read can be performed. Once again, all
commands are highlighted including the error.
k0
k1
p 1000
Page Map 00000000 [segment: 0000]: D0000000? 20000002
Page Map 00002000 [segment: 0000]: D0000001?
^t 1000
Virtual Address 0x00001000 is mapped to Physical
Address 0x00005000.
Context=0x0, Segment Map=0x0, Page Map=0x20000002.
>l 1000
00001000:
Reset Procedure
To begin this workshop, you must obtain a Sun-4c workstation.
ok reset
(If you see a > monitor prompt, type n, then type reset.)
Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.
Refer to “Page Table Entry – Sun-4c Architecture” for the correct PTE
format for Sun-4c architecture. Console commands are in boldface
type. Use this information not to troubleshoot problems but to
understand the error detection mechanism used by the diagnostics
and operating system software.
Example 1
1. Type the following console command:
1000 map?
1000 map?
1000 20 ab fill
1000 20 dump
Example 1 (Continued)
The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.
1000 map?
9. Type the following console command (to prove you can read):
1000 20 dump
1000 20 11 fill
serr@ .
Example 1 (Continued)
A hex value is returned indicating the type of error that was
detected. Refer to the table below for verification.
Bit Error
15 07 06 05 04 03 02 01 00
1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
8 0 1 0
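When comparing the returned hex value with a bit table, it helps to list which bit positions are set. The following Python sketch does this for the value 8010 shown above. The function name is illustrative, and the sketch does not assign a meaning to any bit.

```python
def set_bits(value, width=16):
    """Return the positions of the bits set in value, highest first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


if __name__ == "__main__":
    value = 0x8010
    print(f"{value:04x} = {value:016b}")
    print("bits set:", set_bits(value))
```

For 8010, the bits set are 15 and 04, which matches the binary pattern in the table.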
Example 2
1. Perform the reset procedure on page 2-22. This resets the system
after the error.
1000 map?
1000 20 dump
serr@ .
Reset Procedure
To begin this workshop, you must obtain a Sun-4m workstation.
ok reset
(If you see a > monitor prompt, type n, then type reset.)
Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.
Example 1
1. Type the following console command:
1000 map?
1000 map?
1000 20 ab fill
1000 20 dump
Example 1 (Continued)
At this point, you have set up a known read/write condition and
ensured that it worked. Now, you will create an error condition.
The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.
1000 map?
10. Type the following console command to prove you can read:
1000 20 dump
1000 20 11 fill
Example 1 (Continued)
12. Type the following console command:
.sfsr
What is the value of the fault type field? Refer to the table below
for verification.
6 Internal error
5 Access bus or time-out
4 Translation error
3 Privilege violation
2 Protection error
1 Invalid address
0 No error
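The fault type table can be expressed as a simple lookup. The Python sketch below assumes that the fault type field occupies bits 4 through 2 of the register, as in the SPARC Reference MMU. The text above does not state the bit positions, so treat that as an assumption. The sample register value is hypothetical.

```python
# Fault type codes as listed in the table above.
FAULT_TYPES = {
    0: "No error",
    1: "Invalid address",
    2: "Protection error",
    3: "Privilege violation",
    4: "Translation error",
    5: "Access bus or time-out",
    6: "Internal error",
}


def fault_type_field(sfsr):
    """Extract the fault type, assuming it occupies bits 4..2 of the register."""
    return (sfsr >> 2) & 0b111


def describe_fault(sfsr):
    """Look up the fault type description for a raw register value."""
    code = fault_type_field(sfsr)
    return FAULT_TYPES.get(code, f"Unlisted fault type {code}")


if __name__ == "__main__":
    # Hypothetical register value, chosen only to exercise the lookup.
    sample = 0x0000008A
    print(f"sfsr {sample:08x}: fault type {fault_type_field(sample)} "
          f"({describe_fault(sample)})")
```

A write to a read-only page would be expected to report fault type 2 (protection error).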
Example 2
1. Perform the reset procedure on page 2-27 to reset the system after
the error.
1000 map?
1000 20 dump
.sfsr
What is the value of the fault type field? Refer to the “sfsr Fault
Types” table for verification.
Reset Procedure
To begin this workshop, you must obtain a Sun-4d workstation. Do
one of the following, depending on the state of your system.
ok reset
Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.
Refer to “Page Table Entry – Sun-4d Architecture” for the correct PTE
format for Sun-4d architecture. Console commands are in boldface
type. Use this information not to troubleshoot problems but to
understand the error detection mechanism used by the diagnostics
and operating system software.
Example 1
1. Type the following console command:
1000 map?
1000 map?
1000 20 ab fill
1000 20 dump
Example 1 (Continued)
At this point, you have set up a known condition (read/write) and
ensured that it worked. Now, you will create an error condition.
The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.
1000 map?
10. Type the following console command to prove you can read:
1000 20 dump
1000 20 11 fill
.sfsr
Example 1 (Continued)
What is the value of the fault type field? Refer to the table below
for verification.
4 Translation error
3 Privilege violation
2 Protection error
1 Invalid address
0 No error
Example 2
1. Perform the reset procedure on page 2-32 to reset the system after
the error.
1000 map?
1000 20 dump
.sfsr
What is the value of the fault type field? Refer to the “sfsr Fault
Types” table for verification.
Skills Checklist
No direct skills are associated with this module. This module and
associated workshops are used only to demonstrate the error-detection
mechanism. A field engineer would not be required to troubleshoot
the equipment with the skills used within the workshop.
Instructor Notes
Primary Buses
The diagram illustrates the various Sun architectures with their
primary buses. Primary buses are defined as the buses that enable data
transfer to take place inside the system. The 600MP is a transitional
machine that enabled the customer to upgrade from the Sun-4
architecture to the Sun-4m architecture. The SBus enabled the customer
to enter the world of SBus options.
The Sun-4u (Ultra) has a Universal Port Architecture (UPA) bus that, in
conjunction with the bus multiplexor, provides a much faster and more
versatile bus architecture for the following reasons:
The workshop will set up a condition that will enable a user to read
and write to a page in memory. Then another condition will be set up
to create an error. Using tables and charts, the student will interpret
the error.
Reference
Sun-4d Technical Reference Manual, Chapters 7, 8, and 9
Instructor Notes
Conversion Review
Decimal Binary Hexadecimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
10 1010 a
11 1011 b
12 1100 c
13 1101 d
14 1110 e
15 1111 f
The following is a selected PTE (0000039a) from the Sun-4d lab. Each
digit in the PTE represents one base-16 (hexadecimal) value. When
converted to base 2, it looks like:
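The conversion can be checked with a short Python sketch that expands each hexadecimal digit of 0000039a into its four-bit group, following the conversion table above. The function name is illustrative.

```python
def hex_to_nibbles(hex_string):
    """Expand each hexadecimal digit into its 4-bit binary group."""
    return " ".join(f"{int(digit, 16):04b}" for digit in hex_string)


if __name__ == "__main__":
    print(hex_to_nibbles("0000039a"))
    # prints: 0000 0000 0000 0000 0000 0011 1001 1010
```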
Instructor Notes
31 00
ebe l at ft fav ow
● <1> fav (Fault address valid bit) – If it is 1, then the content of the
fault address register is valid. The fault address register must be
valid for translation and data errors.
● 1 – Level 1 entry
● 2 – Level 2 entry
● 3 – Level 3 entry
Objectives
Upon completion of this module, you will be able to:
References
Field Engineer Handbook, Volumes 1 and 2, Part Numbers 800-4006 and
800-4247
Diagnostics Overview
*POST    Extended POST    User diagnostics
(User diagnostics: installed as package; requires Solaris operating system)
Diagnostics Overview
● Conduct all hardware bus probes, and save information for the
operating system’s automatic reconfiguration (ok boot -r) and
memory sizing
Note – A deliberate limitation of the boot PROM POST is that the I/O
devices themselves are not tested; only the devices and buses required
to access the boot device are tested.
● Installed as a package
Figure: Boot PROM POST diagnostics. Labels: boot PROM, machine
instructions, CPU (IU) chip, LEDs, test numbers. POST diags run at
power-on or a system reset. (Some desktops only use LEDs on keyboard.)
Serial port A pins (connection to a modem port or ASCII terminal):
Transmit data 2
Receive data 3
Signal ground 7
% tip hardwire
connected
Figure: Broken machine in diagnostic mode (serial port A) connected to
the good machine (serial port A or B).
Machine Information
The information below describes the machine used for this example.
● SPARCstation 5, no keyboard
# tip hardwire
$$$$$ WARNING: No Keyboard Detected! $$$$$
MMU Context Table Reg Test
MMU Context Register Test
MMU TLB Replace Ctrl Reg Tst
MMU Sync Fault Stat Reg Test
MMU Sync Fault Addr Reg Test
MMU TLB RAM NTA Pattern Test
MMU TLB CAM NTA Pattern Test
MMU TLB LCAM NTA Pattern Test
IOMMU SBUS Config Regs Test
IOMMU Control Reg Test
IOMMU Base Address Reg Test
IOMMU TLB Flush Entry Test
IOMMU TLB Flush All Test
SBus Read Timeout Test
EBus Read Timeout Test
D-Cache RAM NTA Test
D-Cache TAG NTA Test
I-Cache RAM NTA Test
I-Cache TAG NTA Test
Memory Address Pattern Test
FPU Register File Test
FPU Misaligned Reg Pair Test
The following example shows the output when using the tip
command. The correct response is connected, followed by the POST
display.
Diagnostics Output
The diagnostics run on all system boards, testing all CPU modules,
buses and memories.
# tip hardwire
connected
0B>
BIST Status = 00000001 Signature - CPU = 6ED695A2
0B>map16 test
0A>
BIST Status = 00000001 Signature - CPU = 6ED695A2
0B>
**** SPARCserver_1000 MP POST Rev 8 ****
The POST results normally scroll past quickly on the display. You can
view the results using the DEMON menu.
0A>total pmem 0x00008000 [pages] 0x008000000 [bytes] in 1 chunks
0A>DRAM chunk 0 base 0x00000000 size 0x00008000
0A> (0=failed,1=passed,blank=untested/unavailable)
(sbus 1=card present,0=card not present,x=failed)
0A>------+---------+------+-------+------+----+-----+----+--------+-------+------+-----+
0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi| mqh0 | mem |sbus |xd0|
0A>------+---------+------+-------+------+----+-----+----+--------+-------+------+-----+
0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |
0A> 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |
0A>------+--------+------+-------+------+----+-----+----+--------+--------+------+-----+
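If the POST output is captured from the tip session, the status table can be scanned for failures. The following Python sketch parses one header line and one row as displayed above and reports the components marked 0 (failed). The function names are illustrative, and the sketch assumes the Slot, mem, and sbus columns do not hold pass/fail flags.

```python
def split_cells(line):
    """Strip the 'NN>' POST prompt and split a table line on '|'."""
    _, _, body = line.partition(">")
    cells = [cell.strip() for cell in body.split("|")]
    # A trailing '|' leaves one empty cell at the end; drop it.
    if cells and cells[-1] == "":
        cells.pop()
    return cells


# Columns assumed not to hold a pass/fail flag: the slot number, the mem
# value, and the SBus card-present digits.
NON_STATUS = {"Slot", "mem", "sbus"}


def failed_components(header_line, row_line):
    """Return the column names whose status in this row is 0 (failed)."""
    names = split_cells(header_line)
    values = split_cells(row_line)
    return [name for name, value in zip(names, values)
            if name not in NON_STATUS and value == "0"]


if __name__ == "__main__":
    header = ("0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi| mqh0 "
              "| mem |sbus |xd0|")
    row = "0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |"
    print(failed_components(header, row) or "no failed components")
```

A list of names is used rather than a dictionary because the header repeats the bw0 column.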
The next area displays the POST DEMON menu. It shows the steps
necessary to view system parameter information. The keys are
considered hot keys. You do not need to press Return after you press a
hot key.
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
Command ==> 0
System Parameters
0A>Select one of the following functions
0A> '0' Set POST Level
0A> '1' Dump Device Table
0A> '2' Display System
0A> '3' Dump Board Registers
0A> '4' Dump Component IDs
0A> '5' Clear Error Logs
0A> '6' Display Simms
0A> '7' Scrub Main Memory
0A> 'r' Return
Command ==> 2
0A> (0=failed,1=passed,blank=untested/unavailable)
(sbus 1=card present,0=card not present,x=failed)
0A>------+-------+-----+-------+------+---+------+----+--------+-------+------+-----+
0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi | mqh0 | mem |sbus |xd0|
0A>-----+-------+------+-------+-----+----+------+----+--------+-------+------+-----+
0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 |0011| 1 |
0A> 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 |0011| 1 |
0A>-----+-------+------+-------+-----+----+------+----+--------+-------+------+-----+
0A>Memory Group Status
(0=failed,1=passed,m=simm missing,c=simm
mismatch,blank=unpopulated/unused)
0A>+----+------+------+------+------+
0A> Slot| g0 | g1 | g2 | g3 |
0A>+---+-------+------+------+------+
0A> 0 | 1 | 1 | | |
0A> 1 | 1 | 1 | | |
0A>+---+-----+-------+-------+------+
0A>Hit any key to continue :
Command ==> r
0A>
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
0A>
Command ==> 5
0A>
-------------- Error Log Analysis for Board 0 --------------
0A>
-------------- Error Log Analysis for Board 1 --------------
0A>
-------------- System Memory Failure Analysis ----------------
0A> No Bad groups found
0A>Hit any key to continue :
0A>
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
0A>
Command ==>r
0A>
ttya initialized
Probing Memory Bank #0 128 Megabytes
SUNW,SPARCserver-1000
Cpu #0 cpu-unit TI,TMS390Z55
Cpu #1 cpu-unit TI,TMS390Z55
Cpu #2 cpu-unit TI,TMS390Z55
Cpu #3 cpu-unit TI,TMS390Z55
mem-unit mem-unit
bif bif
bootbus zs zs eeprom sram leds bootbus zs zs eeprom sram leds
io-unit sbi
Probing /io-unit@f,e0200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e0200000/sbi@0,0 at 1,0 cgsix
Probing /io-unit@f,e0200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e0200000/sbi@0,0 at 3,0 SUNW,soc SUNW,pln SUNW,ssd
SUNW,pln SUNW,ssd
io-unit sbi
Probing /io-unit@f,e1200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 1,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e1200000/sbi@0,0 at 3,0 Nothing there
Probing Memory Bank #0 128 Megabytes
SUNW,SPARCserver-1000
Cpu #0 cpu-unit TI,TMS390Z55
Cpu #1 cpu-unit TI,TMS390Z55
Cpu #2 cpu-unit TI,TMS390Z55
Cpu #3 cpu-unit TI,TMS390Z55
mem-unit mem-unit
bif bif
bootbus zs zs eeprom sram leds bootbus zs zs eeprom sram leds
io-unit sbi
Probing /io-unit@f,e0200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e0200000/sbi@0,0 at 1,0 cgsix
Probing /io-unit@f,e0200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e0200000/sbi@0,0 at 3,0 SUNW,soc SUNW,pln SUNW,ssd
SUNW,pln SUNW,ssd
io-unit sbi
Probing /io-unit@f,e1200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 1,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e1200000/sbi@0,0 at 3,0 Nothing there
Figure: The healthy system is connected to the faulty system through a
null modem cable or a modem.
Note – Before you begin, make sure that the healthy system has the
Solaris operating environment booted to multiuser mode and has a
window system running or available.
5. Turn off the faulty system to prevent blowing the keyboard fuse.
# /usr/openwin/bin/openwin
Note – The hardwire argument says that the tip command expects
9600 baud, 8 data bits, and 1 stop bit at port B on the CPU board, not
an ALM or SPC. It is not a coincidence that these are the parameters
set for Port A when a machine powers up without a keyboard.
8. If port A is the only available port, edit the /etc/remote file for
port A on the “good” system.
● Before edit:
:dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
● After edit:
:dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
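The edit amounts to changing the device named in the dv field. The following Python sketch applies that change to the entry text shown above. It works on a string only and does not touch /etc/remote, and the function name is hypothetical.

```python
def use_port_a(entry):
    """Point a hardwire entry at /dev/term/a instead of /dev/term/b."""
    return entry.replace("dv=/dev/term/b:", "dv=/dev/term/a:")


if __name__ == "__main__":
    before = ":dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D"
    print(use_port_a(before))
```

Only the dv field changes. The baud rate and the remaining fields are left as they were.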
# tip hardwire
12. Why are you getting an error that looks like a “Net” error?
Notes
13. Type ~ Control-d or ~. to end the tip session. (See “POST tip
Commands.”)
You can also display POST tests on nearly any ASCII terminal or
laptop.
~#
~.
Or
~ ^d (tilde Control-d)
~?
For more information on the tip command, refer to the on-line man
pages.
Instructor Notes
POST will test just enough of the electronic circuitry to ensure that the
boot command and instruction execution can be performed.
The purpose of the ST-350 course is not to teach the content of POST
but to enable the students to observe the execution of POST. If POST
fails, they will take the appropriate action as dictated by their
company policy.
The POST examples in the student guide provide visual examples for
students who have never seen POST run.
POSTs either pass or fail. Learning or teaching what each test does is
counterproductive at this time.
Objectives
Upon completion of this module, you will be able to:
● Test devices using the device path, node name, and device
alias.
● Alter any NVRAM setting, display the settings, and reset to the
defaults.
References
Field Engineer Handbook, Volumes 1 and 2, Part Numbers 800-4006 and
800-4247
Features
● Ability to read plug-in device drivers and diagnostics from probed
devices. (Early Sun machines required all boot drivers and
diagnostics to be completely written in the boot PROM.)
● User-callable diagnostics
OpenBoot PROM
OBP
NVRAM
> Limited commands
setenv Variable
OK full FORTH commands printenv system
parameters
FORTH code
POST Battery
Extended POST
User diagnostics
NVRAM defaults
Host ID contains:
● CPU-type code
SPARCstation 20 Workstation
<#2> ok printenv
Parameter Name Value Default Value
<#2> ok
Diagnostic Overview
*POST    Extended POST    User diagnostics
(User diagnostics: installed as package; requires Solaris operating system)
Error
Init system indication Init system
Pass Fail
Error Fail Pass
indication
Boot-device
sunmon-compat? boot-file Error
security-mode? Auto-Boot? indication
Start boot sequence
False True False True
OK >
sunmon-compat? diag-device
security-mode? diag-file
False True Start boot sequence
OK >
ok boot
Execute primary boot (OBP)
Kernel reads /etc/system
Kernel initialized
Execute rc scripts
<#0> ok cd /
<#0> ok ls
ffda476c io-unit@f,e1200000
ffd91c10 io-unit@f,e0200000
ffd8d2f4 mem-unit@f,e1100000
ffd8d210 mem-unit@f,e0100000
ffd8cebc cpu-unit@f,e1800000
ffd8cb68 cpu-unit@f,e1000000
ffd8c814 cpu-unit@f,e0800000
ffd8c4c0 cpu-unit@f,e0000000
ffd839a8 boards
ffd712fc openprom
ffd702bc virtual-memory@0,0
ffd7016c memory@0,0
ffd625cc aliases
ffd6257c options
ffd6252c packages
<#0> ok cd io-unit@f,e1200000
<#0> ok ls
ffda4d20 sbi@0,0
<#0> ok cd sbi
<#0> ok ls
ffdb0ffc lebuffer@1,40000
ffdac1f4 dma@1,81000
ffda9ff4 lebuffer@0,40000
ffda51ec dma@0,81000
<#0> ok cd dma@1,81000
<#0> ok ls
ffdac878 esp@1,80000
<#0> ok cd esp@1,80000
<#0> ok ls
ffdb05b4 st
ffafef4 sd
Target 0
Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BWC
Target 1
Unit 0 Disk CONNER CP30548 SUN0535AEBX93082TZA
Target 2
Unit 0 Disk CONNER CP30548 SUN0535AEBX93082MD4
Target 3
Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BRX
/io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000
Target 0
Unit 0 Disk CONNER CP30548 SUN0535AEB793081TGX
Target 1
Unit 0 Disk CONNER CP30548 SUN0535AEB793081WNL
Target 2
Unit 0 Disk CONNER CP30548 SUN0535AEB793081Q8Z
Target 3
Unit 0 Disk CONNER CP30548 SUN0535AEB7930810A0
/io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000
Target 0
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 1
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 2
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 3
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 4
Unit 0 Removable Tape ARCHIVE Python 28454-XXX4.28
Target 6
Unit 0 Removable Read Only device SONY CD-ROM CDU-8012 3.1e
<#0> ok
<#0> ok show-sbus
Board# 0 SBus slot 0 lebuffer le dma esp
Board# 0 SBus slot 1 cgsix
<#0> ok module-info
CPU# 0 : 50.0 MHz SuperSPARC / SuperCache
CPU# 1 : 50.0 MHz SuperSPARC / SuperCache
CPU# 2 : 50.0 MHz SuperSPARC / SuperCache
CPU# 3 : 50.0 MHz SuperSPARC / SuperCache
<#0> ok print-nvram-stat
Board#0 -- nvram master, Prom Version 2.13
Board#1 -- nvram slave, Prom Version 2.13+0.08
Board#2 -- no board or no Viking module
Board#3 -- no board or no Viking module
<#0> ok show-sbus
SBus slot f SUNW,bpp ledma le espdma esp
SBus slot e SUNW,DBRIe
SBus slot 0
SBus slot 1
SBus slot 2 cgsix
SBus slot 3
<#0> ok probe-scsi
Target 1
Unit 0 Disk QUANTUM P105SS 910-10-94A.1 08/31/89009030144
GENERIC
Target 3
Unit 0 Disk SEAGATE ST31200W SUN1.05872400795741
Copyright (c) 1994 Seagate
All rights reserved 0000
Target 4
Unit 0 Removable Tape ARCHIVE VIPER 150 21531-003 SUN-03.00.00
Target 6
Unit 0 Removable Read Only device TOSHIBA XM-
4101TASUNSLCD108404/18/94
<#0> ok module-info
MBus : 50.00 MHz
SBus : 25.00 MHz
CPU#0 : 50.00 MHz SuperSPARC
CPU#2 : 50.00 MHz SuperSPARC
<#0> ok 2 switch-cpu
<#2> ok 0 switch-cpu
<#0> ok 2 switch-cpu
IMPL:0
<#2> ok 1 switch-cpu
Processor #1 is not present!
Lab 1
In this lab you will test devices using the device path, node name, and
device alias.
Note – Because PROM levels and architectures differ, the syntax for
these labs can vary slightly. Refer to the OBP reference card if
necessary.
2. Use help to list some PROM-level diagnostics, and run them all.
ok help diag
Category: Diag (diagnostic routines)
test device-specifier ( -- ) run selftest method for specified device
Examples:
test /memory - test memory
test /iommu/sbus/ledma@5,8400010/le - test net
te................
...................
ok setenv selftest-#megs 99 (setting up to test 99 megs of memory)
ok test-memory
Testing memory \/
ok test net
Notes
Note – If the ok prompt returns with no message, the self-test
found no errors.
Notes
Lab 2
● Alter any NVRAM setting, display the settings, and reset to the
defaults
You are directed to use selected console commands and observe the
output. Decide for yourself whether the results are useful.
help diag
help watch-tpe
show boot-device
show-hier
show-ttys
show-tapes
show-nets
show-disks
module-info
devalias
show-attrs
show-devs
printenv
printenv diag-switch?
show diag-switch?
set-default diag-switch?
show diag-switch?
set-defaults
Or do the following:
During power-on or after the ok reset, hold down the Stop (L1) and n
keys simultaneously on the Sun keyboard. (There is no equivalent
simple key sequence to reset NVRAM to defaults from a port
connection.)
Optional
The NVRAM settings can also be changed by root from the operating
system:
# /usr/sbin/eeprom
# /usr/sbin/eeprom boot-device=disk1
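Output captured from the eeprom command can be compared against expected settings with a small parser. The Python sketch below assumes the command prints one name=value pair per line. The sample text is hypothetical, not captured from a real system, and the function name is illustrative.

```python
def parse_eeprom(output):
    """Parse 'name=value' lines into a dict; other lines are skipped."""
    settings = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    # Hypothetical sample text, not captured from a real system.
    sample = "boot-device=disk1\nauto-boot?=true\ndiag-switch?=false\n"
    settings = parse_eeprom(sample)
    print("boot-device is", settings.get("boot-device"))
```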
Notes
Lab 3
In this lab you will display and capture the names of the devices in the
system device tree and display their attributes. This is useful for
determining whether a failure of a Sun or third-party device is a
hardware or a software problem.
Note – The lab will take you to one device; if you have time, go out
and display some others.
ok cd /
ok ls
ffd3c184 FMI,MB86904
.........
ok cd iommu@0,10000000
ok ls
ffd2c2c8 sbus@0,10001000
ok cd sbus@0,10001000
ok ls
ffd42504 cgsix@3,0
f.......
ok cd cgsix@3,0
ok ls
ok .attributes
character-set ISO8859-1
intr 00000039 00000000
reg 00000003 00000000 01000000
dblbuf 00000000
v0,64125000,108000000,94500000
chiprev 0000000b
device_type display
model SUNW,501-2325 (look at this, the Sun part #!)
name cgsix
Lab 4
In this lab, you will generate and test a PROM device alias.
With the increased use of storage arrays and other variously addressed
devices, it is important to be able to set a simple name for the device
that the customer can boot from or otherwise use.
Note – If you recreate the tip hardwire session, you can cut and paste
many of the entries in the lab instead of typing them.
2. ok show-disks
a) /obio/SUNW,fdtwo@0,400000
b) /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
q) NO SELECTION
Enter Selection, q to quit: b
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd has
been selected.
Type ^Y ( Control-Y ) to insert it in the command line.
e.g. ok nvalias mydev ^Y
for creating devalias mydev for
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
3. ok nvalias newdisk^Y
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0
4. ok devalias
newdisk
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0
screen /iommu@0,10000000/sbus@0,10001000/cgsix@3,0
ttyb /obio/zs@0,100000:b
5. ok boot newdisk
Note – The boot will probably fail here unless a bootblock has been
placed on the disk. You will be setting up for alternate boots
in a later module.
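A full device path such as the one behind the newdisk alias is a chain of nodes, each written as a node name followed by an optional @ and unit address. The following Python sketch splits a path into those parts. The function name is illustrative, and any trailing argument after a colon (as in the ttyb alias) is left attached to the unit address.

```python
def split_device_path(path):
    """Split an OBP device path into (node name, unit address) pairs."""
    nodes = []
    for component in path.strip("/").split("/"):
        name, _, address = component.partition("@")
        nodes.append((name, address))
    return nodes


if __name__ == "__main__":
    path = ("/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000"
            "/esp@5,8800000/sd@0,0")
    for name, address in split_device_path(path):
        print(f"{name:8s} {address}")
```

Reading the path this way makes it easier to see which node in the tree an alias points to when a boot from that alias fails.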
Hand-edit the nvramrc file using information from the device tree;
then enable its use. (This is currently required for making aliases
for storage array devices or with older PROMs that do not support the
nvalias command.)
ok devalias
ok cd /
ok ls (just to find our way!)
ffd3c184 FMI,MB86904
ffd2d1e0 virtual-memory@0,0
ffd2d124 memory@0,0
ffd2c458 obio
ffd2c184 iommu@0,10000000
ok cd iommu@0,10000000
ok ls
ffd2c2c8 sbus@0,10001000
ok cd sbus@0,10001000
ok ls
ffd4242c cgsix@3,0
ffd423cc power-management@4,a000000
ffd41c80 SUNW,CS4231@4,c000000
ffd40024 ledma@5,8400010
ffd3ff98 SUNW,bpp@5,c800000
ffd3cea4 espdma@5,8400000
ok cd espdma@5,8400000
ok ls
ffd3d280 esp@5,8800000
ok cd esp@5,8800000
ok ls
ffd3f854 st
ffd3f13c sd
ok cd sd
ok pwd
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
ok nvedit
Lab 5 – Optional
In this lab, you will construct, download, and run FORTH macros.
1. Set up the tip command as you did in the POST lab; that is, with one
machine at the ok prompt displayed in another machine’s tip
hardwire session.
4. Because the macros you construct do not survive a power-on
reset, construct a macro in a file that you can download
any time you want.
Create the file on the machine that is currently running the
operating system; then download it to the machine that is at
the ok prompt.
proto2# vi /opt/mapping
: mapping
38e 1000 pgmap!
1000 map?
1000 100 ab fill
1000 100 dump
;
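To make the effect of the fill and dump lines concrete, the following Python sketch mimics them on an ordinary byte buffer. It is only an illustration: the pgmap! and map? steps are not simulated, the dump layout is simplified and does not reproduce the PROM's exact display, and the function names are hypothetical. Numbers typed at the ok prompt are treated as hexadecimal.

```python
def fill(memory, addr, length, byte):
    """Mimic the FORTH 'addr len byte fill' word on a bytearray."""
    memory[addr:addr + length] = bytes([byte]) * length


def dump(memory, addr, length):
    """Return hex dump lines, 16 bytes per line (simplified layout)."""
    lines = []
    end = addr + length
    for offset in range(addr, end, 16):
        chunk = memory[offset:min(offset + 16, end)]
        lines.append(f"{offset:08x}: " + " ".join(f"{b:02x}" for b in chunk))
    return lines


if __name__ == "__main__":
    memory = bytearray(0x2000)          # stand-in for the mapped page
    fill(memory, 0x1000, 0x100, 0xAB)   # 1000 100 ab fill
    for line in dump(memory, 0x1000, 0x20):
        print(line)
```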
Notes
0 : multiply3 <cr>
Notes
Instructor Notes
OBP
This is a simple visual representation of the two primary hardware
components that affect the boot process. The OBP contains the primary
bootstrap program, and NVRAM contains essential system parameters
concerning the boot sequence.
● le – Ethernet port
OBP Diagnostics
Objectives
Upon completion of this module, you will be able to:
References
Solaris User and System Administration Answerbooks
✓ In trial teaches, presenting these tools in seminar fashion has proved very popular and is
perceived by most students as a good use of time consistent with the “Fault Analysis
Workshop” philosophy.
✓ In true guided-seminar fashion, this list will tend to grow as students and other
instructors participate; see the “Open Discussion” section at the end.
Open Discussion
1.
2.
3.
4.
5.
6.
7.
8.
9.
Objectives
Upon completion of this module and lab, you will be able to:
Introduction
The SunVTS tests can be used to stress certain areas of the system as
needed for diagnostic and troubleshooting purposes.
SunVTS application
programming interface
Logs messages
Test interface
SunVTS User-created
hardware tests custom tests
User Interfaces
Kernel
The kernel runs as a background process, a daemon. Upon startup of
the SunVTS software, the SunVTS kernel probes the system kernel for
installed hardware devices. Those devices are displayed on the
SunVTS user interface.
Both the SunVTS kernel and the user interface must be started before
testing can begin.
Hardware Tests
For each supported hardware device, a corresponding hardware test
can validate its operation. Each test is a separate process from the
SunVTS kernel process.
Additional References
For more extensive information and usage of the SunVTS diagnostic
software, see the following publications:
The pkgadd command is used to install the SunVTS software from the
CD-ROM Updates for Solaris Operating Environment 2.5 (Part Number
704-5104-10).
Insert the CD-ROM into the CD-ROM drive, and type the pkgadd
command as root:
# pkgadd -d /cdrom/upd_sol_2_5_smcc/SMCC
View the screen output from the pkgadd application to ensure that the
install completed successfully.
# /opt/SUNWvts/bin/sunvts
System Status panel Performance meter Control panel Tests Selection panel
● The Control panel – A panel that contains the buttons that you
use to control the SunVTS user interface.
● The Tests Selection panel – A panel where you select the tests and
test groups to run; you can also change the options for each test
and test group.
● The Test Option panel – A panel where you choose the global
options for all SunVTS tests.
● The System Status panel – A panel that shows the general testing
status.
● The Test Status panel – A panel that displays pass and error
counts for each test and test group.
The following are the buttons on the control panel and their functions:
Stop Click on the Stop button to halt all active tests. The
test results remain on the Test Status panel after
testing is completed. Click on the Stop button only
once. Some tests do not stop immediately, so the
System Status may slowly change from Stop to Idle.
Quit Using the Quit button, you can terminate the user
interface, the SunVTS kernel, or both.
Sys Config Click on the Sys Config button to display the Sys
Config menu. The menu choices are to display or print
the test system configuration information, or to
reprobe the test system.
Log Files SunVTS saves the status of its progress in three log
files. Use the Log Files button to look at the error
messages, information, or UNIX® messages log files.
From the Test Selection panel, you can select the tests you want to run,
and specify the testing options.
Options can be set globally for all of the SunVTS tests you select. Click
on the Set Options button for the SunVTS Testing Options menu.
Options can also be set for each test group. Press the button of a test
group or test name for the option menu.
The following options can be set to apply to all selected SunVTS tests
or, if applicable, to individual test groups or tests.
group_override
Supersedes the specific test options in favor of the
group options in this window.
group_concurrency
Sets the number of tests you want to run at the same
time in the same group.
num_instances
Specifies the number of instances to run for all tests
that are scalable.
Tests Switch
Three settings are available:
● Default enables the default group of tests. This includes all tests
that do not require intervention.
Option Files
You can save your SunVTS testing selections to a file. This saves
you from having to set the same options again in the future. Test
settings are saved in the /var/adm/sunvtslog/options directory.
To save an option file, type a name for the option file, and click on the
Store button.
Intervention
Certain tests require that you intervene before you can run the test
successfully. These include tests that require media or loopback
connectors.
You cannot select these tests until you enable the intervention mode.
This setting does not change the test function; it just serves as a
reminder that you must intervene before the test can be successfully
completed.
The icons at the top of the Test Status panel enable you to navigate
through the list of tests in case there are more tests running than can
be displayed on the panel.
Errors are also recorded in a log file that you can view by clicking on
the Log Files button on the Control panel.
Log Files
You can use the Log Files menu to view error, information, and UNIX
message log files that are managed by the system.
4. Display the Information and UNIX Msgs files, but do not remove
any files.
# /opt/SUNWvts/bin/vtsk
2. Start the SunVTS TTY User Interface with the vtstty command:
# /opt/SUNWvts/bin/vtstty
Only one panel has focus (selected for keyboard input) at a time. Focus
can be shifted between the three panels by pressing the tab key. The
panel with focus is bordered by asterisks (*).
Selected panel
Control panel
Tests panel
Status panel
Console
Kernel Interface
To test a remote system, the kernel process
/opt/SUNWvts/bin/vtsk must be running on that system.
User Interface
To test a local system, the user interface can be either TTY (teletype) or
graphical.
User Interface
The graphical user interface (GUI) component must have the interface
/opt/SUNWvts/bin/sunvts running as an active process.
You can also connect directly to the remote computer running the
SunVTS kernel when starting the graphical user interface.
# /opt/SUNWvts/bin/sunvts -h remote_hostname
TTY interface
Lab Overview
✓ You will need a copy of the SUNWvts package to put on your classroom’s server so your
students can perform a pkgadd. The package is available on the CD-ROM, SMCC Updates
for Solaris 2.5 (Part Number 704-5104-10).
Lab Objectives
● Install the SunVTS package on a system.
Equipment
To complete this lab, you will need:
Lab Tasks
In this lab, you are going to verify that all hardware on your lab
system is functional. You will need the SunVTS software present on
your system.
# pkgrm SUNWvts
Now that you have a general idea of how the diagnostics work, here
are some steps to try to get more familiar with the features.
6. Kill the SUNWvts kernel process and try the previous two steps
again.
8. Run the audio test. Observe the different selections that are played
depending on the machine you are testing.
a. Auto-start.
c. Kill SunVTS.
11. Find the maximum number of passes allowed for the fputest.
12. Attempt to force an error.
Objectives
Upon completion of this module, you will be able to:
References
● SunSolve Online User’s Guide
Overview
Distribution
Updated CD-ROMs are sent out about ten times a year and have
information regarding all supported software, operating system levels,
and hardware.
● http://sunsolve.Sun.COM
● http://SunSolve1.Sun.COM
● http://www.Sun.Com
2. Click on the Create new account button and answer the questions.
(You must have a SunService Spectrum Account number to
register for a SunSolve Online account.) There is little or no wait in
receiving an account once you submit the form.
Installing SunSolve
Install the SunSolve software and patches on a server and share them
correctly to the network.
✓ Set the lab up to allow multiple installations from one CD-ROM.
a. If this is the first time you have run the share command on
this machine, edit the /etc/dfs/dfstab file and add the
following line:
# vi /etc/dfs/dfstab
share -o ro /cdrom/sunsolve_2_8
# /etc/init.d/nfs.server start
# showmount -e
Or
# dfshares -F nfs servername
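The dfstab edit in step a can be scripted so that it is safe to repeat. The following is a minimal sketch, assuming a Bourne-compatible shell and a POSIX-conformant grep; add_share_line is a hypothetical helper name, not a Solaris command, and the file to edit is passed as an argument so that the function can be tried on a scratch copy first.

```shell
# add_share_line FILE LINE
# Append LINE to FILE only if an identical line is not already present.
# (Hypothetical helper for illustration; not part of Solaris.)
add_share_line() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    if grep -qxF -- "$line" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$line" >> "$file"
}
```

For example, add_share_line /etc/dfs/dfstab "share -o ro /cdrom/sunsolve_2_8" adds the line shown above once, no matter how many times it is run.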
2. Click on Install.
Sharing SunSolve
To share the SunSolve server directory, /opt/SUNWss, perform the
following steps.
Or
# /etc/init.d/nfs.server start
Starting SunSolve
Note – If you are asked if you want to run in a Shell Tool, answer yes.
Search Tool
Configuring SunSolve
To configure the SunSolve software, click on the Properties button in
the SearchTool window. The SearchTool properties window is
displayed.
SearchTool Properties
The SearchTool properties window contains a Category menu button
with the following property types:
Notice that the maximum number of documents to retrieve is set to 100,
the search timeout is set to 60 seconds (make the timeout longer if
searching across a network), and Fuzzy Boolean searching is on
(this helps to find related keywords in searches).
● Viewer – You can specify the text viewer, the PostScript viewer, or
the picture (GIF) viewer.
Suppose you are having printer problems under the Solaris 2.5 operating
environment. You can use the SearchTool window to search for
probable symptoms.
● Patch Descriptions
● printer
● 2.5
● AND – The logical AND means the collections searched must contain
all keywords joined by AND.
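The effect of AND can be illustrated outside SearchTool with ordinary text files. The following is a rough sketch, assuming a POSIX-conformant grep; match_all is a hypothetical helper and says nothing about how SearchTool itself is implemented.

```shell
# match_all FILE KEYWORD...
# Succeeds only if FILE contains every KEYWORD
# (case-insensitive, keywords taken as fixed strings).
match_all() {
    file=$1
    shift
    for kw in "$@"; do
        grep -qiF -- "$kw" "$file" || return 1
    done
    return 0
}
```

A document that mentions printer but not 2.5 fails the test, just as it would be excluded from a search for printer AND 2.5.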
● Early Notifier
● Bug Reports
● Patch Descriptions
● Solaris Q & A
● Info Docs
Patches
1. From the SearchTool window, select only the Info Docs collection
to search.
5. From the Display menu, choose In new viewer. The 2.5 patch
report is displayed.
For lab setup, insert or mount the patches CD-ROM. The File Manager
window displays the following:
1. From the File Manager window shown on the previous page, click
on the patchinstall icon.
Note – If you are not running File Manager or OpenWindows, you can
start the patch install script by changing to the directory where the
patch CD-ROM is mounted and typing ./patchinstall as superuser.
You will see each installpatch script run. You might also see
messages such as Patch already installed: continue? |Y|.
After installing these (or any other patches), reboot the system unless
specifically given other instructions from the install script.
To install the above patch, type the patch ID number (102979 here),
instead of typing suggested when prompted for the patch to install.
Patch to install (patchid, suggested, ?): 102979
2. Find the installed location of the patch (patches are usually installed
in the /var/sadm/patch directory).
# find / -name 102044-01 -print
(output omitted)
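Because patches are usually installed under /var/sadm/patch, a direct check of that directory is quicker than a find from the root. The following is a minimal sketch; patch_installed is a hypothetical helper, and the patch root is an optional argument so that the function can be pointed at any directory.

```shell
# patch_installed PATCHID [PATCHROOT]
# Succeeds if a directory named PATCHID exists under PATCHROOT
# (default: /var/sadm/patch, the usual location named in the text).
patch_installed() {
    patchid=$1
    root=${2:-/var/sadm/patch}
    [ -d "$root/$patchid" ]
}
```

If the check fails, the patch may still be installed elsewhere, so fall back to the find command shown above.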
SunSolve Labs
Note – The question text below matches page headers in this module.
11. Display the current patch report for a given operating system.
This section illustrates the method for conducting some basic searches
of the SunSolve information. It shows how to construct and refine a
search, and displays the results of a sample search.
[Figure: SearchTool window, with the following callouts]
● Choose the collections you want to search.
● Select the area of the document(s) you want to search.
● Enter keywords (and optional operators) that describe the ...
● Click on the Search button to start the ...
What you enter here is the keyword that SearchTool will look for
in the collections. You can also use the optional operators to
further define your search.
The most commonly used area is entire doc, which looks in all
parts of all of the documents of the collections you have selected.
Each collection allows you to define your search by the areas
available in that collection. In some cases, you may know the
document ID number, and might want to search in the document
ID area of All Collections.
Using MultiView
Once you have used SearchTool to locate the documents you want,
you can use MultiView to display or print the document or save the
document to a file. MultiView is the display tool for SearchTool. It is
capable of displaying the full range of document formats available in
the SunSolve collections.
Document Formats
Note – You should set the viewer types to Default unless you are
familiar with other tools that you would like to specify as Custom.
Default viewers have been selected to work with the document
collections.
Text Viewer
Default displays ASCII text files in the system text viewer. The Custom
selection displays ASCII files in a TextEdit window, or another text
window specified.
PostScript Viewer
If you are running on an Xterminal, you should set this to Custom and
enter the name of the PostScript viewer. For example, to use ghostview,
replace the default pageview with ghostview.
Picture Viewer
The scroll list at the bottom of the SearchTool window lists documents
that match your search. You can use MultiView to display, print, email,
or save these documents to a file.
1. Click on the title of the document in the scroll list at the bottom of
the SearchTool window.
MultiView Features
Print Option
The Print option sends the document you are viewing to a printer. You
can specify which pages to print: All (the entire document), This page
(only the page you are presently viewing), or a Range of pages,
delimited by the From and To fields in the window. You can also
specify the name of the printer in the printer field. Click on the Print
button when all your choices are completed.
Save Option
The Save option enables you to save the current document in a file.
Specify the location of the file in the window, and type the name of the
file in the Name field. Click on the Save button when all your choices
are completed.
Email Option
Properties Option
Instructor Notes
You can focus on using the SunSolve system on the CD-ROM within
the classroom. At least one system should always have SunSolve
shared and available.
Objectives
Upon completion of this module, you will be able to:
● Use the adb and crash commands to manipulate core files and to
locate a failing process or file.
● Use the adb and crash commands to isolate the failing processor,
instruction, thread, process, and file on three core dumps and on
one system hang.
References
The SPARC Architecture Manual, SPARC International
Introduction
The UNIX operating system uses assertion checks throughout the kernel
code. Assertion checks are placed at critical points within the software.
When a call is made to the ASSERT() routine, a check is made. If the
condition is not true and the kernel module is compiled with the
DEBUG flag, the system panics. Also, within the code are data
integrity checks. If a data check fails, it calls upon the cmn_err()
routine.
● 17,000 assertions
When the system reboots, this core dump must be saved into files that
can then be passed to adb for analysis. savecore(1M) is used to
perform this function. Normally, the system does not examine the
swap area for core dumps when it boots; savecore must be enabled
in /etc/init.d/sysetup.
Header Files
● /usr/include/sys/proc.h
● /usr/include/sys/thread.h
● /usr/include/sys/klwp.h
● /usr/include/sys/user.h
● /usr/include/sys/cred.h
● /usr/include/vm/as.h
● /usr/include/vm/seg.h
Debuggers
adb
adb is an interactive, general-purpose debugger. It can be used to
examine files, and it provides a controlled environment for the
execution of programs. adb reads commands from the standard input
and displays responses on the standard output. It does not supply a
prompt.
crash
The crash command is used to examine the system memory image of
a running or a crashed system by formatting and printing control
structures, tables, and other information. Command-line arguments to
crash are dump file, name list, and output file.
kadb
kadb is an interactive debugger with a user interface similar to that of
adb(1), the assembly language debugger. kadb must be loaded prior
to the standalone program it is to debug. It runs in the same address
space as the standalone program, thus sharing many resources with
that program. The debugger is cognizant of and able to control
multiple processors if they are present in a system.
Unlike adb, kadb runs in the same supervisor virtual address space as
the program being debugged, although it maintains a separate context.
The debugger runs as a coprocess that cannot be killed (no `:k').
SAVECORE Setup
##
## Default is to not do a savecore
##
#if [ ! -d /var/crash/`uname -n` ]
#then mkdir -p /var/crash/`uname -n`
#fi
# echo 'checking for crash dump...\c '
#savecore /var/crash/`uname -n`
# echo ''
To:
##
## Default is to not do a savecore
##
if [ ! -d /var/crash/`uname -n` ]
then mkdir -p /var/crash/`uname -n`
fi
echo 'checking for crash dump...\c '
savecore /var/crash/`uname -n`
echo ''
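The edit above can also be made with sed instead of by hand. The following is a minimal sketch, not the documented procedure: enable_savecore is a hypothetical helper, and it assumes that the savecore block is the last commented-out block in the file, because every single-# line after the banner comment is uncommented.

```shell
# enable_savecore FILE
# Print FILE with the savecore block uncommented. Only lines after the
# "Default is to not do a savecore" banner are touched, and only lines
# that start with a single "#" (the "##" banner lines stay comments).
# Assumption: nothing after the banner needs to remain commented out.
enable_savecore() {
    sed '/Default is to not do a savecore/,$ s/^#\([^#]\)/\1/' "$1"
}
```

Write the output to a new file, compare it with the original, and only then replace /etc/init.d/sysetup.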
Invoking adb/kadb/crash
adb
# cd crash_directory
# adb -k unix.n vmcore.n
crash
# crash vmcore.n unix.n
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
>
kadb
ok boot disk kadb
adb Commands
If address is omitted, the current location is used. (The dot [.] also
stands for the current location.) The address can be a kernel symbol. If
the count is omitted, it defaults to 1.
x or X Displays in hex.
Examples
v+0
v: 100 examine a symbolic location
v+0/D examine a symbolic location - display content decimal
v:
v: 100
v+0/X examine a symbolic location - display content hex
v:
v: 64
v+0=X Determine VA of symbolic location v
f017255c
f017255c/X examine content of a VA
64
fc63ecbc/i examine a VA for an instruction (disassemble)
backseat_write: sethi %hi(0xfffffc00), %g1
$q - Quit.
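In the examples above, the same location reads 100 with /D and 64 with /X because 100 decimal is 64 hexadecimal. When you compare adb output with values from other sources, the conversion can be checked from the shell. The following is a small sketch; dec_to_hex and hex_to_dec are hypothetical helper names.

```shell
# dec_to_hex N: print decimal N in hexadecimal, the way /X displays it.
dec_to_hex() {
    printf '%x\n' "$1"
}

# hex_to_dec H: print hexadecimal H (given without a 0x prefix) in
# decimal, the way /D displays it.
hex_to_dec() {
    printf '%d\n' "0x$1"
}
```

Here dec_to_hex 100 prints 64, and hex_to_dec 64 prints 100.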
adb Macros
$M Displays built-in macros (kadb).
During the development of the RAM disk driver, the system crashes
with a data fault when running newfs. The savecore command has
been enabled in the sysetup shell script. This enables copies of the
current kernel and core file to be saved when the system reboots.
There are times when the msgbuf variable used by the msgbuf macro
may not be loaded in the dynamic kernel symbol table, in which case
you would use the strings command on the vmcore.n file.
# strings vmcore.0
...
ASC = 0x4 (LUN not ready), ASCQ = 0x2, FRU = 0x0
BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load>
level=3
MMU sfsr=0x326<FAV>
BAD TRAP occurred in module "ramd" due to an illegal access to a user
address.
mkfs: Data fault
kernel read fault at addr=0x30, pte=0x0
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load>
level=3
MMU sfsr=0x326<FAV>
...
Notice that the output contains a lot of information, including the
panic message that the $<msgbuf command returns. The rest of the
panic message is shown on the next page.
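Because strings prints every printable string in the dump, it helps to filter the output down to the panic report. The following is a minimal sketch, assuming a grep that supports -E; panic_lines is a hypothetical helper, and the pattern simply lists phrases that appear in the sample output above.

```shell
# panic_lines: read text on standard input and print only the lines that
# look like part of the panic report shown above.
panic_lines() {
    grep -E 'BAD TRAP|MMU sfsr|fault'
}
```

A typical use would be strings vmcore.0 | panic_lines. Widen the pattern if the panic you are looking for is not a trap.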
The message buffer has been edited, but in the workshop, you will see
the full message buffer. The bold area contains the important
information about the crash. The information located in the BAD TRAP
(1) message informs you of the type of fault detected (<Data fault>),
which CPU detected the fault (cpu_id=2), the register pointer (rp), and
the fault type (ft). The fault type indicates an <Invalid address
error>. The panic message also includes the CPU ID and the thread
(sequence of instructions) executing at the time of the crash.
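The name=value pairs in the BAD TRAP line can be pulled out individually, which is convenient when you record them on a fault form. The following is a minimal sketch; trap_field is a hypothetical helper that expects the BAD TRAP line on standard input.

```shell
# trap_field NAME: read a BAD TRAP line on standard input and print the
# value of the NAME=value pair (for example cpu_id, type, addr, rw, rp).
trap_field() {
    # Put each space-separated token on its own line, then keep only the
    # token that starts with NAME= and strip that prefix.
    tr ' ' '\n' | sed -n "s/^$1=//p"
}
```

With the sample line above, trap_field cpu_id prints 2 and trap_field rp prints e0922ac4.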
The message buffer contains almost all the information you need to
characterize the system crash.
The rest of the crash dump analysis uses adb macros and commands to
navigate through a crash dump to get data that may not be available
through the message buffer (or if the message buffer is not available,
for whatever reasons).
BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
MMU sfsr=0x326<FAV>
BAD TRAP occurred in module "ramd" due to an illegal access to a
user address.
mkfs: Data fault
kernel read fault at addr=0x30, pte=0x0
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
MMU sfsr=0x326<FAV>
ram_write+0x2c, pid=363, pc=0xf06ad304, sp=0xe0922b10, psr=0x400000c4, context=39
g1-g7: ffffff98, 0, e00afac4, 40, f0bb0bd8, 1, f06c16c0
Begin traceback... sp = e0922b10
write+0x190 @ 0xe00afc54, fp=0xe0922b78
args=d80000 e0922bd8 f03b1c18 d8 f0287d48 f06ad2d8
The $c macro (1) displays the stack. Also note that the cmn_err()
routine is called: the fault was determined to be a nonrecoverable
error, which ended in a panic. In Solaris 2.5, the stack trace points
directly at the reason for the fault because it contains ram_write(),
the driver routine that caused the system to go down.
$c
complete_panic(0xe024c800,0x1,0xe0241800,0xf05b2ab8,0x5,0xe024c800) + d0
do_panic(?) + 20
vcmn_err(0xe02496b0,0xe092297c,0xe092297c,0x18,0x18,0x3)
cmn_err(0x3,0xe02496b0,0xe0251fa0,0x0,0x12778,0xdffffad0) + 1c
die(0x9,0xe0922ac4,0x30,0x326,0x1,0xe02496b0) + 120
trap(0x0,0xe0922ac4,0x30,0x326,0x1,0x0) + 498
fault(?) + 7c
Syssize(via getminor)(0x0,0x3ffff,0x20,0x7fffffff,0xf0a5829c,0x315c1813)
ram_write(0xd80000,0xe0922bd8,0xf03b1c18,0xd8,0xf0287d48,0xf06ad2d8) + 1c
write(0x5) + 190
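When a stack trace is long, listing only the routine names makes the call path easier to read. The following is a minimal sketch, assuming the $c output has been saved to a file or is piped in; stack_functions is a hypothetical helper.

```shell
# stack_functions: read adb $c output on standard input and print one
# routine name per line, in the order adb printed the frames.
# A pseudo-frame such as "Syssize(via getminor)(...)" is reported
# simply as "Syssize".
stack_functions() {
    sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)(.*/\1/p'
}
```

Reading the result from the bottom up gives the call path: write called ram_write, which faulted and led to trap, die, cmn_err, and the panic routines.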
Using the value in the pc field, you can determine the instruction that
was executing at the time of the panic with the adb i command (1).
The results of this command indicate that a load instruction was
executing at an address given by the symbol ram_write+0x2c. With
the /i command, you have determined the assembly instruction that
caused the system to go down.
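The symbolic form ram_write+0x2c and the raw pc value are two views of the same address, so subtracting the offset from the pc gives the start of the routine. The following is a small sketch of that arithmetic; symbol_base is a hypothetical helper, and it assumes a shell whose arithmetic handles values above 0x7fffffff (64-bit arithmetic).

```shell
# symbol_base PC OFFSET: print, in hexadecimal, the start address of the
# routine, given the faulting pc and the +offset from the symbolic form.
# Both arguments are written with a 0x prefix.
symbol_base() {
    printf '%x\n' "$(( $1 - $2 ))"
}
```

With the values from the message buffer, symbol_base 0xf06ad304 0x2c prints f06ad2d8, a value that also appears as the last entry on the args= line of the traceback.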
You can use the cpu macro (1) to navigate the CPU data structure to
locate the thread (which you already know from the message buffer).
The macro will open the CPU data structure for the first CPU (id=0).
Since you know from the message buffer that this was not the CPU, use
the content of the next field, which points to the address of the next
CPU data structure. Note also that thread and idle thread (idle_t) are
equal, which indicates that this CPU was idle.
cpu0$<cpu
cpu0:
cpu0: id seqid flags
0 0 1d
cpu0+0xc: thread idle_t pause
e06c1ec0 e06c1ec0 e08a0ec0
cpu0+0x18: lwp callo fpowner
0 0 f06a30c0
cpu0+0x24: next prev next on prev on
f05852d0 f05b2ab8 f05852d0 f05b2d58
cpu0+0x34: lock npri queue limit actmap
0 110 f036e568 f036ea90 f028dbc0
cpu0+0x44: maxrunpri max unb pri nrunnable
-1 -1 0
cpu0+0x50: runrun kprnrn dispthread thread lock
0 0 e06c1ec0 0
cpu0+0x5c: intr_stack on_intr intr_thread intr_actv
e06dffa0 1 e06dcec0 0
cpu0+0x6c: base_spl
0
Follow the boldface type to locate the CPU at the time of the fault.
Note thread and idle thread.
f05852d0$<cpu
0xf05852d0: id seqid flags
1 1 1d
0xf05852dc: thread idle_t pause
e06feec0 e06feec0 e0721ec0
0xf05852e8: lwp callo fpowner
0 0 f06a30c0
0xf05852f4: next prev next on prev on
f0585030 e0251120 f0585030 e0251120
0xf0585304: lock npri queue limit actmap
0 110 f057b580 f057baa8 f028d660
0xf0585314: maxrunpri max unb pri nrunnable
-1 -1 0
0xf0585320: runrun kprnrn dispthread thread lock
0 0 e06feec0 0
0xf058532c: intr_stack on_intr intr_thread intr_actv
e071ffa0 1 e071cec0 0
0xf058533c: base_spl
0
Finally, you have arrived at the correct CPU data structure. Another
key point has been reached. You can display the content of the thread
data structure. You know the thread from the message buffer. Note the
threads.
f0585030$<cpu
0xf0585030: id seqid flags
2 2 1d
0xf058503c: thread idle_t pause
f06c16c0 e0723ec0 e0746ec0
0xf0585048: lwp callo fpowner
0 0 f0a8e810
0xf0585054: next prev next on prev on
f05b2d58 f05852d0 f05b2ab8 f05852d0
0xf0585064: lock npri queue limit actmap
0 110 f057b040 f057b568 f028db10
0xf0585074: maxrunpri max unb pri nrunnable
-1 -1 0
0xf0585080: runrun kprnrn dispthread thread lock
0 0 e0723ec0 0
0xf058508c: intr_stack on_intr intr_thread intr_actv
e0744fa0 1 e0741ec0 0
0xf058509c: base_spl
0
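The idle test used while walking the CPU structures is a plain comparison of the thread and idle_t fields. The following is a minimal sketch of that check; cpu_state is a hypothetical helper, and the field values are typed in by hand from the macro output.

```shell
# cpu_state THREAD IDLE_T: print "idle" if the CPU's current thread is
# its idle thread, otherwise "busy".
cpu_state() {
    if [ "$1" = "$2" ]; then
        echo idle
    else
        echo busy
    fi
}
```

For the three structures above, cpu_state e06c1ec0 e06c1ec0 and cpu_state e06feec0 e06feec0 report idle, while cpu_state f06c16c0 e0723ec0 reports busy, which singles out the CPU with id 2.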
The thread that caused the panic can also be obtained from the
message buffer or from the panic_thread variable that the system
maintains. This variable holds the address of the thread that caused
the system to panic regardless of how many CPUs there are in the
system.
panic_thread/X
panic_thread:
panic_thread: f06c16c0
Use the thread macro. Search the structure for the procp (process
pointer) field.
f06c16c0$<thread
0xf06c16c0:
link stk
0 e0922c08
0xf06c16cc:
bound affcnt bind_cpu
0 0 -1
0xf06c16d4:
flag procflag schedflag state
0 0 11 4
0xf06c16e0: pri epri pc sp
0 0 e004c13c e0922480
0xf06c16ec: wchan0 wchan cid clfuncs
0 0 2 f0371960
0xf06c1700:
cldata ctx lofault onfault
f0594700 0 0 0
0xf06c1710:
nofault swap lock cpu
0 e0921000 ff f05b2ab8
0xf06c1720:
intr delay_cv tid alarmid
0 0 1 0
realitimer
0xf06c1734: interval.sec interval.usec value.sec value.usec
0 0 0 0
0xf06c1744:
itimerid sigqueue sig
0 0 0 0
0xf06c1754:
hold forw back
0 0 f06c16c0 f06c16c0
0xf06c1764:
lwp procp next prev
f0bb0bd8 f0bb8cd0 f0aa0920 f06c1ea0
0xf06c16da:
preempt trace whystop whatstop
1 0 0 0
0xf06c17a4:
kpri_req sysnum astflag pollstate cred
11 4 0 0 f03b1c18
0xf06c178c:
lbolt pctcpu trapret pre_sys post_sys sig_check
1b520 ae 0 0 0 0
0xf06c1794:
lockp oldspl disp queue disp time
f05b2b10 de1 f05b2aec 111899
0xf06c17b8:
mstate waitrq rprof
You can use the last macro, proc2u, to expand the proc structure.
Locate the psargs symbol, which shows the command and its
arguments executing at the time of the panic. You have accomplished
the last key point (locating the process).
f0bb8cd0$<proc2u
0xf0bb8e88:
execid execsz tsize
32581 12e 0
0xf0bb8e94:
dsize start ticks cv
0 315c1813 1b513 0
0xf0bb8ea4:
exdata
0xf0bb8ea4:
vp tsize dsize bsize
0 0 0 0
0xf0bb8eb4:
lsize nshlibs mach mag toffset
0 0 0 10b 0
0xf0bb8ec4:
doffset loffset txtorg datorg
0 0 0 0
0xf0bb8ed4:
entloc
df7d43a8
0xf0bb8ed8: aux vector
7d8 dfffffe1 3 10034
4 20 5 5
9 11b54 7 df7d0000
8 0 6 1000
7d0 0 7d1 0
7d2 1 7d3 1
7d9 7 0 0
0 0 0 0
0 0 0 0
0xf0bb8f68: psargs
mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 16 10 60 2048 t 0 -1 8 -1^@^@^@
0xf0bb8fb8: comm
mkfs^@^@^@^@^@^@^@^@^@^@^@^@^@
0xf0bb8fd8:
sigmask
0xf0bb9050: 0 0 0 0
# cd crash_directory
# ls
bounds unix.0 vmcore.0
# adb -k unix.0 vmcore.0
physmem 1e6e
msgbuf+14/s
symbol not found
$q
Note – If the message symbol not found is returned, exit adb and
use the strings command.
The message buffer has been edited, but in the workshop, you will see
the full message buffer. The bold area contains the important
information about the crash. The BAD TRAP message informs you of the
type of fault detected (<Data fault>) and of the name of the module
that caused the system to panic (ramd).
Notice also that the pc (0xfc479dbc) points to the instruction that was
executing at the time of the crash.
The rest of the crash dump analysis will use adb macros and
commands to navigate you through a crash dump. This would be
necessary if the message buffer did not help or if one was not
available.
BAD TRAP: type=9 rp=f05246f4 addr=30 mmu_fsr=326 rw=1
BAD TRAP: occurred in module "ramd" due to an illegal access to a
user address
mkfs: Data fault
kernel read fault at addr=0x30, pme=0x0
MMU sfsr=326: Invalid Address on supv data fetch at level 3
pid=465, pc=0xfc479dbc, sp=0xf0524740, psr=0x40000c2, context=0
g1-g7: ffffff98, 0, ffffff00, 0, f05249e0, 1, fc2dec00
Begin traceback... sp = f0524740
Called from f00df9b4, fp=f05247a8, args=1a40000 f0524808 fc38fc80 f0154664
0 fc479d90
Called from f0070258, fp=f05248b8, args=200 f0524920 2 0 4 fc2d5b04
Called from f0041aa0, fp=f0524938, args=f0160cf8 f0524eb4 0 f0524e90
fffffffc ffffffff
Called from 15cc0, fp=effffae8, args=4 32400 200 0 0 3fe00
End traceback...
panic: Data fault
# cd crash_directory
# adb -k unix.0 vmcore.0
physmem 1e6e
The $c macro (1) displays the stack. Note the value 9 in the initial trap
handler (2) as it is also displayed in the message buffer. Also note, the
cmn_err() routine is called. This fault was determined to be a
nonrecoverable error ending up in a panic.
$c
complete_panic(0xf026b428,0xfbfab98c,0xf0048ec8,0x6a,0xfbfab818,0xf0279800) + 108
do_panic(?) + 1c
vcmn_err(0xf0266600,0xfbfab98c,0xfbfab98c,0x7,0xffeec000,0x3)
cmn_err(0x3,0xf0266600,0x1,0x21,0x21,0xf025c000) + 1c
die(0x9,0xfbfabac4,0x30,0x326,0x1,0xf0266600) + bc
trap(0xf028a1d8,0xfbfabac4,0x0,0x326,0x1,0x0) + 4f8
fault(?) + 84
Syssize(via getminor)(0x0,0x3ffff,0x20,0x7fffffff,0xf5c4b4bc,0x31585486)
ram_write(0xdc0000,0xfbfabbd8,0xf5a8ed38,0xdc,0xf5970d48,0xf5c54d90) + 1c
write(0x5) + 190
The next step in the core dump analysis is to get the program counter
and, with the disassemble command (i), display the assembly
instruction that caused the system to panic. This will usually display
the name of the device driver routine as part of the label. With this
information, you can pinpoint precisely what device driver caused the
system to go down.
f5c98dbc/i
ram_write+0x2c: ld [%l1 + 0x30], %l2
You may also want to find out what program or command was
running when the system went down. This is additional information
that will point out the bad device driver as well.
Using adb, you do this in two steps: first, display the thread that was
running when the system went down; the thread structure has a
pointer to the process that holds the name of the command running.
Second, display the user structure of this process that has the
command name.
panic_thread/X
panic_thread:
panic_thread: f5c66480
f5c66480$<thread
0xf5c66480:
link stk
0 fbfabc08
0xf5c6648c:
bound affcnt bind_cpu
f026b494 0 -1
0xf5c66494:
flag procflag schedflag state
0 0 11 4
0xf5c664a0: pri epri pc sp
14 0 f0048ec8 fbfab818
0xf5c664ac: wchan0 wchan cid clfuncs
0 0 2 f59a0378
0xf5c664c0:
cldata ctx lofault onfault
f5cb6460 0 0 0
0xf5c664d0:
nofault swap lock cpu
0 fbfaa000 ff f026b494
0xf5c664e0:
intr delay_cv tid alarmid
0 0 1 0
realitimer
0xf5c664f4: interval.sec interval.usec value.sec value.usec
0 0 0 0
0xf5c66504:
itimerid sigqueue sig
0 0 0 0
0xf5c66514:
hold forw back
0 0 f5c66480 f5c66480
0xf5c66524:
lwp procp next prev
f5c11828 f5c0fcc8 f5c665a0 f5c66d80
0xf5c6649a:
preempt trace whystop whatstop
1 0 0 0
0xf5c66564:
kpri_req sysnum astflag pollstate cred
0 4 1 0 f5a8ed38
0xf5c6654c:
lbolt pctcpu trapret pre_sys post_sys sig_check
405b8 fd 0 0 0 0
0xf5c66554:
lockp oldspl disp queue disp time
f026b4ec be1 f026b4c8 263603
0xf5c66578:
mstate waitrq rprof
9 0 0 0
0xf5c66580:
prioinv ts sobj_ops
0 0 0
From the previous thread structure, you can get the address of the
process’s proc structure from the field that is labeled procp. When
you use this in combination with the macro proc2u, you can display
the user structure of the process that has the command or program
name. Take special note also of the arguments that were passed to the
command.
f5c0fcc8$<proc2u
0xf5c0fe80:
execid execsz tsize
32581 12e 0
0xf5c0fe8c:
dsize start ticks cv
0 31585486 405a4 0
0xf5c0fe9c:
exdata
0xf5c0fe9c:
vp tsize dsize bsize
0 0 0 0
0xf5c0feac:
lsize nshlibs mach mag toffset
0 0 0 10b 0
0xf5c0febc:
doffset loffset txtorg datorg
0 0 0 0
0xf5c0fecc:
entloc
ef7d43a8
0xf5c0fed0: aux vector
7d8 efffffe6 3 10034
4 20 5 5
9 11b54 7 ef7d0000
8 0 6 1000
7d0 0 7d1 0
7d2 1 7d3 1
7d9 3 0 0
0 0 0 0
0 0 0 0
0xf5c0ff60: psargs
mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 16 10 60 2048 t 0 -1 8 -1^@^@^@
0xf5c0ffb0: comm
mkfs^@^@^@^@^@^@^@^@^@^@^@^@^@
0xf5c0ffd0:
cdir rdir ttyvp cmask
f5ca82e8 0 0 12
0xf5c0ffe0:
mem systrap ttyp ttyd
4f5 0 0 0
0xf5c0fff0: entrymask
0 0 0 0
0 0 0
0xf5c1000c: exitmask
0 0 0 0
0 0 0
0xf5c10028:
signodefer sigonstack
0 0 0 0
0xf5c10038:
sigresethand sigrestart
0 0 0 0
sigmask
0xf5c10048: 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
signal
0xf5c101a8: 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0
0 0 0 0
0 0 0 0
ru
0xf5c10258:
nshmseg acflag
0 0
0xf5c1025c: rlimit
7fffffff 7fffffff 7fffffff 7fffffff
7ffff000 7ffff000 800000 7ffff000
7fffffff 7fffffff 40 400
7fffffff 7fffffff
flock
0xf5c10294: owner
0
0xf5c10294: lock
0
0xf5c10294: waiters wlock type
0 0 0
0xf5c1029c:
nofiles
24
flist
f5c0c910
0xf5c1029c: ofile pofile refcnt
0xf5c0c910: f5c69758 0 0
0xf5c0c918: f5c69758 0 0
0xf5c0c920: f5c69758 0 0
0xf5c0c928: f5c69218 0 0
0xf5c0c930: f5c695d8 0 0
0xf5c0c938: f5c69188 0 0
0xf5c0c940: f5c697e8 0 0
0xf5c0c948: 0 0 0
0xf5c0c950: 0 1 0
0xf5c0c958: f5c696f8 0 0
0xf5c0c960: f5c69698 0 0
0xf5c0c968: 0 0 0
0xf5c0c970: 0 0 0
0xf5c0c978: 0 0 0
0xf5c0c980: 0 0 0
0xf5c0c988: 0 0 0
0xf5c0c990: 0 0 0
0xf5c0c998: 0 0 0
0xf5c0c9a0: 0 0 0
0xf5c0c9a8: 0 1 0
0xf5c0c9b0: 0 0 0
0xf5c0c9b8: 0 0 0
0xf5c0c9c0: 0 0 0
0xf5c0c9c8: 0 0 0
user (alias: u) Prints the user structure for the designated process.
stack (alias: s) Dumps the stack. The -u option prints the user stack.
The -k option prints the kernel stack. If no arguments
are entered, the kernel stack for the current thread is
printed. Otherwise, the kernel stack for the currently
running thread is printed.
For more information about crash commands, refer to the man pages.
# cd crash_directory
# crash vmcore.0 unix.0
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
> stat
system name: SunOS
release: 5.5
node name: mustang
version: Generic
machine name: sun4d
time of crash: Fri Mar 29 09:04:19 1996
age of system: 18 min.
panicstr: Data fault
panic registers:
pc: e004c13c sp: e0922808
> u
PER PROCESS USER AREA FOR PROCESS 34
PROCESS MISC:
command: mkfs, psargs: mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024
16 10 60 2048 t 0 -1 8 -1
start: Fri Mar 29 09:04:19 1996
mem: 450, type: exec
vnode of current directory: f0bd8688
OPEN FILES, POFILE FLAGS, AND THREAD REFCNT:
[0]: F 0xf06d3db8, 0, 0 [1]: F 0xf06d3db8, 0, 0
[2]: F 0xf06d3db8, 0, 0 [3]: F 0xf06d3938, 0, 0
[4]: F 0xf06d3ae8, 0, 0 [5]: F 0xf06d32a8, 0, 0
[6]: F 0xf06d38d8, 0, 0 [9]: F 0xf06d3878, 0, 0
[10]: F 0xf06d3848, 0, 0
cmask: 0022
RESOURCE LIMITS:
cpu time: unlimited/unlimited
file size: unlimited/unlimited
swap size: 2147479552/2147479552
stack size: 8388608/2147479552
coredump size: unlimited/unlimited
file descriptors: 64/1024
address space: unlimited/unlimited
SIGNAL DISPOSITION:
1: default 2: default 3: default 4: default
5: default 6: default 7: default 8: default
9: default 10: default 11: default 12: default
13: default 14: default 15: default 16: default
17: default 18: default 19: default 20: default
21: default 22: default 23: default 24: default
25: default 26: ignore 27: ignore 28: default
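If the output of a crash session is saved to a file, single fields from the stat report at the start of the session can be extracted for a fault form. The following is a minimal sketch; stat_field is a hypothetical helper, and it assumes each field is printed as "name: value" at the start of a line, as in the session above.

```shell
# stat_field NAME: read crash "stat" output on standard input and print
# the value that follows "NAME:".
stat_field() {
    sed -n "s/^$1: *//p"
}
```

For the session above, stat_field panicstr prints Data fault, and stat_field "node name" prints mustang.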
A RAM disk device driver has just been installed in your system by
your resident device driver writer, who has asked you to test the
driver.
# cd /devices/pseudo
# ls
If the RAM disk has been installed correctly, two entries are in this
directory: ramd@0:0, and ramd@0:0,raw.
5. Save the core dump and use adb to analyze the problem following
the classroom exercise template.
You can prevent this from happening if you back up the root partition.
Then when /etc files such as name_to_major, path_to_inst,
driver_classes, and driver_aliases become corrupted, you can
boot from a backup root partition that has these files intact.
1. Make sure that your system has been installed with a backup root
partition that has exactly the same size as the root partition. If
your root partition is /dev/dsk/c0t3d0s0 with 20983 Kbytes,
then your backup partition could be /dev/dsk/c0t1d0s0 with
20983 Kbytes.
3. # dd if=/dev/dsk/c0t3d0s0 of=/dev/dsk/c0t1d0s0
6. # cd /backup_root
# vi /etc/vfstab
8. Halt your system and then try to boot from the backup_root file
system.
10. If your system becomes corrupted, boot from the backup partition,
and then copy the intact files from the backup over the corrupted
files on the original root partition.
# rem_drv ramd
# cd /usr/kernel/drv
# cp ramd.bad_attach ramd
3. Attach and link the new driver to the kernel. Use the sync
command several times to minimize the file system damage
because of a panic.
4. Save the core dump and use adb to analyze the problem using the
classroom exercise as a template.
Note – You may have to boot with the -a option and not put
/usr/kernel in the module path. This bug may not allow you to save
a core dump because the panic occurs in an auto-configuration routine
that gets called during boot time. When the system panics, the system
will try to reboot; and when it reboots, it will encounter the bad
attach routine and the system will go down again. This is when the
-a option to boot becomes very useful.
After you describe what is wrong with the RAM disk driver, your device
driver writer reports having written another ramd driver and asks
you to test it. Use adb commands to modify a live kernel.
# cd /usr/kernel/drv
# test1
3. Invoke adb.
backseat_write,10/X
backseat_write,10/i
8. Use the sync command several times, then invoke test1 again.
9. Analyze the core dump so that you can tell the device driver
writer what was wrong.
You will use the ps (report process status) command and the kadb
(kernel debugger) utility. This procedure is time-consuming but
interesting. You will select one of the active processes in your system
like init, a Command Tool, more, or vi. You are going to trace
through the various structures that the operating system allocates to
processes starting with the output of the ps -le command. Then you
will use kadb to go through the structures.
Use the man pages and .h files to gain insight into the SunOS 5.x
(Solaris 2.x) operating system and to increase your fault analysis
skills with advanced concepts.
kadb Description
kadb is an interactive debugger with a user interface similar to that of
adb(1), the assembly language debugger. kadb must be loaded prior
to the standalone program it is to debug. It runs in the same address
space as the standalone program, thus sharing many resources with
that program but not able to use the facilities available to the system
(such as the mouse, and access to file systems) because the system is
suspended when kadb is running. Because the kernel is not running
when kadb is active, any system structure examined through kadb
shows its state at the moment the kernel was suspended. The debugger is
cognizant of and able to control multiple processors if they are present
in a system.
Unlike adb, kadb runs in the same supervisor virtual address space as
the program being debugged (although it maintains a separate
context). The debugger runs as a coprocess that cannot be killed (`:k')
or rerun (`:r'). There is no signal control (`:i', `:t', or `$i'),
although the keyboard facilities (Control-c, Control-s, and Control-q)
are simulated.
In the case of the UNIX kernel, the keyboard abort sequence (Stop-a
[L1-a] for console and BREAK for serial line) suspends kernel
operations and breaks into the debugger. The system will also fall into
kadb when it panics, allowing you to do an immediate analysis as to
why the system went down. You would want to use kadb when it is
not possible to save a coredump or if your dump device (swap device)
is too small to save physical memory. kadb gives the prompt kadb[#]
where # is the CPU it is currently executing on.
Note – Running under kadb has proven to be very valuable when very
bad crashes cause the machine to be so ill that it cannot generate a
dump. The analysis is the same as if running adb on a coredump.
# (Stop-a)
Note – To display a list of all kadb macros, type $M at the kadb prompt.
[Figure: Simple Process. Labels: address space structure, process structure, as pointer, tlist pointer, thread structure, lwp pointer, lightweight process.]
1. Boot the system with kadb. Use the ps command to obtain the
starting address of your process.
2. Invoke kadb.
3. Use the process address with the proc macro, for example,
fc363000$<proc. To control the flow of information, use the
Control-q and Control-s key sequences.
Process structure (starting address)
Field    kadb macro
as       [as]$<as
ppid
pidp     [pidp]$<pid
cred     [cred]$<cred
tlist    [tlist]$<thread
The seglast field contains the address of the segment that was last
used. In most cases, when the kernel needs to search a segment, it
starts with the last searched segment.
The as and seg structures are defined in header files as.h and seg.h,
which are in the directory /usr/include/vm. The thread and cred
structures are defined in the header files thread.h and cred.h in the
directory /usr/include/sys.
[Figure: Segment Mapping. Labels: proc, [as], as, three seg structures, and the process image with stack, data, and text segments, each marked with a BASE and a size.]
1. This is an additional exercise using kadb. Make sure that you have
booted the system using kadb. Log in as root, start OpenWindows,
and go to the backseat directory. Copy the backseat_hang driver
into the /usr/kernel/drv directory, but rename it as backseat. If
you have installed the working version of backseat before, do a
rem_drv of this working backseat driver before installing this
defective one into your system.
2. # cp backseat_hang /usr/kernel/drv/backseat
3. # cp backseat.conf /usr/kernel/drv
4. # rem_drv backseat
kadb[0]: $<threadlist
11. You should see a backseat driver routine calling physio() and
physio() calling biowait(); the thread is blocked in biowait(),
which is the reason why test1 is hung. Look at the man pages for
physio() and biowait() to determine what could be wrong
with the device driver. Then look at the source file for
backseat_hang, which should be backseat_hang.c, to find out
what the device driver forgot to do.
The lesson that can be learned here is that a device driver can put
a thread to sleep in such a way that the thread cannot be
awakened by a signal.
12. After this exercise, you can exit kadb and reboot your system
without kadb. You may want to sync your system first. Then press
Stop-a, and issue the $q command and boot from the ok prompt.
Bug Install
1. Ensure that NVRAM watchdog-reboot? is false.
3. Log in as root.
Note – If a watchdog error does not occur, ask the instructor for
assistance.
● .registers
● .locals
● .psr
● ctrace
8. Boot the system to the Solaris operating environment, and save the
output of the following Solaris commands:
● showrev -p
● prtconf -v
● pkginfo
● /usr/ccs/bin/nm /dev/ksyms
● /etc/system
● /var/adm/messages*
Note – You should set up a tip line to the machine that is expected to
get a watchdog reset, as this is the easiest way to save the OBP
command outputs in a file.
Note – Search the SunSolve software for watchdog reset with error
messages similar to yours.
Instead, you can find out where the failing instruction is with respect
to the entire routine so that the assembly language can be matched to
the C code. To do this, the routine is disassembled up to the problem
instruction, which occurs 2c bytes into the routine. Since each
instruction is 4 bytes, 2c/4 or 0xb additional instructions must be
displayed:
ff4eadbc/i (from Determining What Instruction Failed)
ram_write+0x2c: ld [%l1 + 0x30], %l2
ram_write,c/i
ram_write:
ram_write: sethi %hi(0xfffffc00), %g1
add %g1, 0x398, %g1 ! ffffff98
save %sp, %g1, %sp
st %i0, [%fp + 0x44]
st %i1, [%fp + 0x48]
st %i2, [%fp + 0x4c]
ld [%fp + 0x44], %o0
call getminor
nop
st %o0, [%fp - 0x4]
ld [%fp - 0x8], %l1
ld [%l1 + 0x30], %l2
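The count used in the ram_write,c/i command follows directly from the offset arithmetic described above. As a minimal Python sketch (the function name is illustrative and not part of adb), assuming fixed 4-byte SPARC instructions:

```python
SPARC_INSTRUCTION_BYTES = 4  # every SPARC instruction is one 32-bit word


def instructions_to_display(offset_bytes):
    """Instructions from the start of a routine up to and including the one at offset_bytes."""
    if offset_bytes % SPARC_INSTRUCTION_BYTES != 0:
        raise ValueError("offset is not on an instruction boundary")
    # offset/4 instructions precede the failing one; add 1 to include it.
    return offset_bytes // SPARC_INSTRUCTION_BYTES + 1


count = instructions_to_display(0x2C)       # 0x2c/4 = 0xb, plus the failing instruction
adb_command = "ram_write,%x/i" % count      # adb takes the count in hexadecimal
print(adb_command)                          # prints ram_write,c/i
```

The extra 1 is why the listing above shows twelve (0xc) instructions even though the offset corresponds to 0xb additional instructions.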
After examining the ramd.c source file, these lines stand out in
ram_write:
static int
ram_write(dev_t dev, struct uio *uiop, cred_t *credp)
{
int instance;
struct ram_state *rs;
/* Comment this out in order to pass a pointer that has not been
initialized, so that you can cause a data fault and a core dump.
rs = ddi_get_soft_state(statep, instance);
if (rs == NULL) {
cmn_err(CE_NOTE,
"%s: write: could not get state for instance %d.",
RAMDISK_NAME, instance);
return ENXIO;
}
*/
if (uiop->uio_offset >= rs->size)
return EINVAL;
Introduction
The workshop will enable you to trace and identify the processes and
files needed to open windows. You will use the ps (report process
status) command and man pages, within the reference material, to
accomplish this task.
2. Log in as root.
7. Log out.
2. Log in as root.
# /usr/openwin/bin/openwin
7. Log out.
# ps
The next ps command fills the screen with information concerning all
the processes that have been started including your login process.
Some of the processes vary from system to system.
The table (next two pages) provides some of the processes that will be
on all systems with space for other system-dependent processes.
Check the processes you have that are the same as those listed. Use the
ps -ef command to obtain base-level processes.
# ps -ef
Base-Level Processes (1 of 2)
sched*
init*
pageout
fsflush
sac
rpcbind
sctserve
sendmai
keyserv
inetd
ypbind
in.route
kerbd
automoun
statd
lockd
lpsched
syslogd
cron
vold
Base-Level Processes (2 of 2)
# /usr/openwin/bin/openwin
2. Type the ps command and determine the PID for the current
window process. Record the information in the table below:
10. Who is the parent of olwmslave (OpenWindows window manager slave)?
___________
13. Who is the parent of ttsession (ToolTalk message server)?
___________
15. Open another Shell Tool at this time. Trace the family tree.
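Tracing a family tree by hand means following each PPID back to its own ps entry. The Python sketch below shows the same walk. The sample listing is invented for illustration (the PIDs, times, and parentage are not from a real system), and the parser assumes that the STIME column is a single token, which is true only for processes started the same day.

```python
# Invented ps -ef style listing; real output on your system will differ.
SAMPLE_PS = """\
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0 80 08:00:00 ?        0:01 sched
    root     1     0 80 08:00:03 ?        0:02 /etc/init -
    root   300     1 10 08:01:10 console  0:01 -sh
    root   320   300 10 08:02:00 console  0:00 /bin/sh /usr/openwin/bin/openwin
    root   345   320 10 08:02:06 console  0:00 shelltool
    root   350   345 10 08:02:07 pts/1    0:00 /bin/csh
"""


def parse_ps_ef(text):
    """Map each PID to (PPID, command); header and malformed lines are skipped."""
    table = {}
    for line in text.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        table[int(fields[1])] = (int(fields[2]), fields[7])
    return table


def family_tree(table, pid):
    """Return [(pid, command), ...] from pid back to the oldest known ancestor."""
    chain, seen = [], set()
    while pid in table and pid not in seen:
        seen.add(pid)
        ppid, command = table[pid]
        chain.append((pid, command))
        if ppid == pid:          # sched is its own parent
            break
        pid = ppid
    return chain


for pid, command in family_tree(parse_ps_ef(SAMPLE_PS), 350):
    print(pid, command)
```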
OpenWindows Files
# view /tmp/trusstrace
Is truss another tool you can use to trace commands? Remember that
PIDs different from your original chart will be displayed because they
are not reused. The grep command can be useful at this time.
Skills Checklist
Student Instructor
Skill
Initials Initials
Invoke adb to start a kernel core dump analysis.
Invoke crash to start a kernel core dump analysis.
Use the adb string macro to display the message buffer.
Use selected adb macros to determine the process at the time of
the fault.
Use selected crash commands to determine the process at the
time of the fault.
Use the correct commands to properly exit crash or adb.
Instructor Notes
Possible Approach
Introduce the concept of data structures. Refer to the kadb workshop
as a reference for your own knowledge. Inform the student that the
study of kernel dumps is based on the knowledge of obtaining useful
information from a data structure.
[Figure: Detected condition -> Handler -> (pass it) -> Handler -> (pass it) -> Handler -> (pass it) -> Panic]
Instructor Notes
The Sun-4m architecture might have a problem with the loading of the
msgbuf variable into the dynamic kernel symbol table so that the
message buffer becomes inaccessible, although this problem seems to
have been fixed in Solaris 2.5. However, if an incorrect response occurs
with the msgbuf/s command, or $<msgbuf, tell the student to open
another window and perform a strings command on vmcore. For
example:
Then search through the displayed messages for the panic message.
Team Members
________________________________________________________
________________________________________________________
_________________________________________________________
Requirements
● Sun-4 systems
Resources
● AnswerBook
● SunSolve
● Diagnostics
● SunVTS
● Format
System Configurations
● Standalone
● Network
● Client-server
● NIS or NIS+
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
Use the student1 account. The user name is student1, and the
password is student1.
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
After disconnecting the keyboard, and using the ASCII terminal, the
system “hangs” during boot.
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
#!/bin/csh -f
clear
rm -f /tmp/guilty_party
cat > /tmp/guilty_party << Done
#!/bin/csh -f
while (1)
end
Done
chmod 777 /tmp/guilty_party
/usr/bin/priocntl -e -c RT /tmp/guilty_party &
/usr/sbin/psradm -f a
Diagnostic Steps
Use the following procedure to determine what is causing your system
to “hang.”
Note – You may need to press Stop-a several times before the
keyboard interrupt is handled.
cd directory_with_core_dumps
9. Type proc.
11. For each process entry, examine the utime and stime fields. The
combined total of these fields is the total CPU time used by the
process.
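Once the utime and stime values have been written down, finding the culprit is a matter of adding the two fields and picking the largest sum. A minimal Python sketch follows; the process list is hypothetical, and the values are assumed to be in the same unit (clock ticks) for every process.

```python
# Hypothetical values copied by hand from the crash output.
processes = [
    {"pid": 1,   "name": "init",         "utime": 12,    "stime": 30},
    {"pid": 180, "name": "syslogd",      "utime": 8,     "stime": 15},
    {"pid": 412, "name": "guilty_party", "utime": 95210, "stime": 340},
]


def cpu_total(process):
    """Total CPU time: user time plus system time."""
    return process["utime"] + process["stime"]


def busiest(process_list):
    """Return the process entry with the largest combined CPU time."""
    return max(process_list, key=cpu_total)


top = busiest(processes)
print(top["pid"], top["name"], cpu_total(top))
```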
Expected Repair
A workaround is to not run the offending program (guilty_party) until
CPU resources are available, or to run guilty_party as a timesharing
process rather than a real-time process. You also need to determine
whether it is normal behavior for this process to use so much CPU time.
Repair Verification
Rerun the start command to verify that this process is the culprit.
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
# swap -s
3. Type the swap -l command and record the values in Table 2 on page
49 (in the “Before OpenWindows” column).
# swap -l
# mkdir /test
# /usr/openwin/bin/openwin
# /test/SUNWdiag/bin/sundiag
10. Deselect all tests and then select the kmem test.
11. Record the value of swap space indicated in the kmem option box.
13. Total physical memory must also take into account the pages
required by the kernel. The total memory minus the memory used by
the kernel equals the available physical memory. Check the dmesg
output for the size of kernel memory.
14. The total disk swap space minus the available physical memory
equals the memory swap space.
15. Run two passes of kmem tests and record the time required to
complete the tests. This will be the base time.
16. While the test is running, you can monitor the behavior of swap
space, using the swap commands. Record the value of the first
swap commands in Tables 1 and 2 on page 49 (in the “During
SunDiag” column).
17. If the test passes, add fpu and one device for a fstest.
18. Run two passes of new tests and record the time required to
complete them. This will be the loaded base time.
If the test is successful and the virtual and physical links are
functional, the system administrator can add the partition to the
/etc/vfstab using a shortcut method:
# cp /etc/vfstab /etc/vfstab.orig
# mount -p > /etc/vfstab
# umount /test
# init 6
# /usr/openwin/bin/openwin
# /test/SUNWdiag/bin/sundiag
6. Run two passes of kmem tests and record the time to complete.
This will be the new base time.
7. Compare the “new base time” with the original “base time.” Is it
faster, slower, “hung,” stopped, or the same? Why?
___________________________________________________________
___________________________________________________________
8. If the test passes, add fpu and one device for a fstest.
9. Run two passes of new tests and record the time to complete. This
will be the new loaded base time.
10. Compare the “new loaded base time” with the original “loaded
base time.” Is it faster, slower, stopped, “hung,” or the same?
Why?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
Table 1 (swap -s)
Parameter         Before OpenWindows   After OpenWindows   During SunDiag
Bytes allocated
Bytes reserved
Total bytes
Bytes available

Table 2 (swap -l)
Parameter         Before OpenWindows   After OpenWindows   During SunDiag
Current blocks
Free blocks
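The swap -s values can also be pulled out of the command output programmatically. The Python sketch below assumes the usual one-line swap -s summary format; the sample line and its numbers are illustrative, and mapping the "used" figure to the "Total bytes" row is an assumption. Note that swap -s reports kilobytes.

```python
import re

SWAP_S_PATTERN = re.compile(
    r"total:\s*(\d+)k bytes allocated \+ (\d+)k reserved = (\d+)k used, (\d+)k available"
)


def parse_swap_s(line):
    """Split one swap -s summary line into the four table parameters (in Kbytes)."""
    match = SWAP_S_PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized swap -s output: %r" % line)
    allocated, reserved, used, available = (int(g) for g in match.groups())
    return {
        "Bytes allocated": allocated,
        "Bytes reserved": reserved,
        "Total bytes": used,          # assumed to correspond to the "used" figure
        "Bytes available": available,
    }


# Illustrative sample line; record the real values from your own system.
sample = "total: 10492k bytes allocated + 7840k reserved = 18332k used, 21568k available"
row = parse_swap_s(sample)
for name, kbytes in row.items():
    print("%-16s %8dk" % (name, kbytes))
```

Running the parser before OpenWindows, after OpenWindows, and during SunDiag gives the three columns of the first table.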
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
In the steps below, using adb on the live kernel, you will lower the
value of the maximum number of processes allowed per user. Then you
will open various windows (processes) until an error occurs with the
message Resource temporarily unavailable.
v$<v
maxup= _______
nproc/D
nproc = _______
In the next step, you will reduce the value of maxup and of another
variable that controls the maximum number of processes per user
system-wide. The reduced value should be about 5 more than the current
nproc value.
When you deposit the value into the proc field, the value is
entered in base 16 (hexadecimal) notation. The command v+1c/W xx,
where xx is your input value, enables you to change a kernel parameter
using adb.
Example:
Note – Do not change values in the kernel using this method. This is
for an academic learning experience only. But be aware that it can be
done.
9. Calculate the value of nproc for your system, using the Calculator
utility, if necessary. Then replace v+1c with the calculated value.
v$<v
nproc/D
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
Note – To restore maxup to its original value, convert the original
decimal value to base 16 and enter v+1c/W xx, where xx is the
base 16 value. Use the Calculator utility, if necessary. Restore
maxupttl, which is at v+c, in the same way; alternatively, enter
v+c/W0tdd, where dd is the original value of maxup in decimal.
12. Return maxup and maxupttl to their original values, and exit
adb.
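The base conversion in this exercise is where most slips occur, so it can help to generate the adb input rather than compute it by hand. The Python sketch below uses the offsets v+1c and v+c exactly as given in this exercise; the helper names are illustrative, and the offsets should be confirmed against your own v$<v output before anything is written to a kernel.

```python
def reduced_limit(current_nproc, margin=5):
    """Limit suggested by the exercise: about 5 more than the current nproc."""
    return current_nproc + margin


def adb_deposit_hex(location, value):
    """adb write command with the value in base 16 (adb's default input radix)."""
    return "%s/W %x" % (location, value)


def adb_deposit_decimal(location, value):
    """Same write, using adb's 0t prefix so the value can be typed in decimal."""
    return "%s/W0t%d" % (location, value)


limit = reduced_limit(45)                   # 45 is a made-up nproc reading
print(adb_deposit_hex("v+1c", limit))       # prints v+1c/W 32
print(adb_deposit_decimal("v+c", limit))    # prints v+c/W0t50
```

Both forms deposit the same number: 32 in base 16 is 50 in decimal.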
Fault Worksheet #
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
Fault Worksheet #17 - The ps Command Returns Nothing
Note – The next three faults are recommended for use in Module 1
exercises.
Fault Worksheet #
Error Symptoms/Conditions/Messages
Fault Insertion
1. If available, use a bad video cable.
Likely Cause
A defective cable, monitor, or frame buffer; incorrect map in SBus
probe list; or defective system board.
Possible Fix
Install a new cable, or hold Stop-n (L1-n) at power-on to reset the
OpenBoot PROM (OBP) NVRAM parameters to their defaults.
Lesson
The students will learn to use a tip line or an ASCII terminal to observe
the power-on self test (POST), and to use OBP commands and diagnostics
to isolate the fault.
Example Insertion #1
ok show sbus-probe-list
sbus-probe-list 0123
ok show-sbus
SBus slot 0 le esp dma
SBus slot 1
SBus slot 2 cgsix
SBus slot 3
ok setenv sbus-probe-list 0134
ok reset
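The fault works because the new probe list skips a slot that holds a device. That comparison can be sketched in Python as follows; the show-sbus text is the output shown above, the function names are illustrative, and the sketch assumes each character of sbus-probe-list is a single hexadecimal slot number.

```python
SHOW_SBUS = """\
SBus slot 0 le esp dma
SBus slot 1
SBus slot 2 cgsix
SBus slot 3
"""


def occupied_slots(show_sbus_output):
    """Map slot number to the list of devices reported in that slot."""
    slots = {}
    for line in show_sbus_output.splitlines():
        words = line.split()
        if len(words) >= 3 and words[:2] == ["SBus", "slot"]:
            devices = words[3:]
            if devices:                      # empty slots are ignored
                slots[int(words[2])] = devices
    return slots


def unprobed_devices(show_sbus_output, probe_list):
    """Return occupied slots that the given sbus-probe-list never probes."""
    probed = {int(ch, 16) for ch in probe_list}
    return {slot: devs
            for slot, devs in occupied_slots(show_sbus_output).items()
            if slot not in probed}


print(unprobed_devices(SHOW_SBUS, "0123"))   # prints {}
print(unprobed_devices(SHOW_SBUS, "0134"))   # prints {2: ['cgsix']}
```

With the list set to 0134, the cgsix frame buffer in slot 2 is never probed, which matches the video symptom this insertion is meant to produce.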
Error Symptoms/Conditions/Messages
Fault Insertion
● If available, use a bad SCSI cable or a bad desktop storage pack
(DSP).
Likely Cause
A defective cable, terminator, or disk or an incorrect map in NVRAM.
Possible Fix
Install a new cable, terminator, or system board; or reset the
NVRAM parameters.
Lesson
The students will learn to use OBP to determine the fault.
Example Insertion #2
ok show sbus-probe-list
sbus-probe-list 0123
ok show-sbus
SBus slot 0 le esp dma
SBus slot 1
SBus slot 2 cgsix
SBus slot 3
ok setenv sbus-probe-list 1234
ok reset
Error Symptoms/Conditions/Messages
Fault Insertion
1. Select a partition to corrupt; normally the /home or /export/home
directory is used.
Likely Cause
File corruption during crash
Possible Fix
A standard fix is to use the fsck command with the standard backup
superblock, block 32, to repair the file system (if it can be repaired).
Otherwise, use an alternate backup superblock. To locate the alternate
backup superblocks, use the newfs -N command. Then specify an alternate
block in the fsck command. See the AnswerBook software or man pages
for the proper usage on your operating system.
Lesson
In this important lesson, the students will learn how to locate and use
alternate backup superblocks using newfs -N in conjunction with
fsck.
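The two-step repair (list the backup superblocks, then hand one to fsck) can be sketched in Python. The newfs -N text below is an invented sample in the general shape of that command's output, so the block numbers are not from a real disk; the fsck option form (-F ufs -o b=) is the one that newfs itself prints in its hint line.

```python
import re

# Invented sample in the general shape of newfs -N output.
SAMPLE_NEWFS_N = """\
/dev/rdsk/c0t3d0s7: 163296 sectors in 162 cylinders of 14 tracks, 72 sectors
super-block backups (for fsck -F ufs -o b=#) at:
 32, 16240, 32448, 48656, 64864, 81072, 97280, 113488, 129696, 145904,
 162112,
"""


def backup_superblocks(newfs_output):
    """Return the block numbers listed after the 'super-block backups' line."""
    lines = newfs_output.splitlines()
    for index, line in enumerate(lines):
        if "super-block backups" in line:
            tail = " ".join(lines[index + 1:])
            return [int(n) for n in re.findall(r"\d+", tail)]
    return []


def fsck_command(raw_device, block):
    """Build the fsck invocation that uses one specific backup superblock."""
    return "fsck -F ufs -o b=%d %s" % (block, raw_device)


blocks = backup_superblocks(SAMPLE_NEWFS_N)
# Block 32 is the standard backup; the next entry is the first alternate.
print(fsck_command("/dev/rdsk/c0t3d0s7", blocks[1]))
```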
Example Insertion #3
# umount /export/home
# dd if=/dev/rdsk/c0t3d0s0 of=/dev/rdsk/c0t3d0s7 count=35
# halt
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the /etc/system file to reflect a non-existing root device.
# cp /etc/system /etc/system.orig
Likely Cause
Could be a hardware problem. For example, the device with an
incorrect target was selected. The SBus may have been misconfigured.
Possible Fix
Verify system configuration using OBP. Use the boot -a command
and specify /etc/system.orig for the /etc/system file.
Lesson
The students will learn one of the many uses of the /etc/system file.
In this example, the user is using /etc/system to change the root
device after the initial boot and kernel were loaded from a different
device.
Example Insertion #4
rootdev:/sbus@1,f8000000/esp@0,800000/sd@3,0:a
rootdev:/sbus@2,f8000000/esp@0,800000/sd@3,0:a
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the /etc/passwd database to reflect an improper shell.
Likely Cause
Corrupted password file. The problem may indicate a hasty
modification of the passwd file.
Possible Fix
Correct the /etc/passwd file.
Lesson
The students will learn to use an alternate boot device to correct an
error on another partition.
Example Insertion #5
root:x:0:1:0000-Admin(0000):/:/sbin/sh
root:x:0:1:0000-Admin(0000):/:/sbin/csh
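A quick way to confirm this kind of fault is to check whether the shell named in the seventh passwd field actually exists on the affected root file system. The Python sketch below uses the two entries shown above; the function names are illustrative, and the root argument stands for wherever the damaged root partition is mounted after booting from the alternate device (for example, a hypothetical /a).

```python
import os

GOOD_ENTRY = "root:x:0:1:0000-Admin(0000):/:/sbin/sh"
BAD_ENTRY = "root:x:0:1:0000-Admin(0000):/:/sbin/csh"


def login_shell(passwd_line):
    """Return the seventh colon-separated field (the login shell) of a passwd entry."""
    fields = passwd_line.rstrip("\n").split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 fields, found %d" % len(fields))
    return fields[6]


def shell_exists(passwd_line, root="/"):
    """True if the entry's shell is a regular file under the given root directory."""
    shell = login_shell(passwd_line)
    return os.path.isfile(os.path.join(root, shell.lstrip("/")))


print(login_shell(GOOD_ENTRY))   # prints /sbin/sh
print(login_shell(BAD_ENTRY))    # prints /sbin/csh
```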
Error Symptoms/Conditions/Messages
Fault Insertion
Use the touch command to insert some null v files in one or more of
the user search paths.
Likely Cause
Corrupted adb macro library. The search path for adb macros may
have been modified.
Possible Fix
Locate and remove all null v from search paths.
Lesson
A user ran the v$>v macro to create an incorrect v within the user’s
default command search path while using adb on the live kernel. The
correct location is /usr/kvm/lib/adb/v.
Note – This fault workshop can only be given after the students have
learned to use adb on a live kernel.
Example Insertion #6
Fault #7 - Feckless
Error Symptoms/Conditions/Messages
mkdir: Failed to make directory "test"; Operation not
applicable
touch: test cannot create
vi “Operation not applicable”
Fault Insertion
Add directory /feck as an auto_home directory in
/etc/auto_master.
Likely Cause
Permissions (ls -l /feck)
Possible Fix
Lesson
Example Insertion #7
(Before alteration)
# Master map for automounter
#
+auto_master
/net -hosts -nosuid
/home auto_home
/- auto_direct
2. # /usr/sbin/automount
2. # cd /
3. # umount /feck
4. # /usr/sbin/automount
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the /etc/system file and reduce the maximum number of
users to 0.
Likely Cause
● Corrupted kernel
● Careless operator
Possible Fix
Type the boot -a command, and try to bypass files if possible. Correct
the /etc/system file.
Lesson
The students will learn about the /etc/system file. System
administrators can modify kernel parameters during the boot
sequence in the /etc/system file.
Example Insertion #8
set maxusers=40
set maxusers=0
Error Symptoms/Conditions/Messages
The pg command just hangs.
Fault Insertion
Change the major and minor numbers for the tty drivers.
Likely Cause
Bad command.
Bad device.
Possible Fix
Fix tty in devices/pseudo.
Lesson
What can appear to be a very “minor” problem can portend something
quite disastrous.
Example Insertion #9
a. Edit /etc/passwd
su:x:0:1::/usr/su:/sbin/sh
guest1:x:12:10::/export/home/guest1:/bin/csh
b. Edit /etc/shadow
root:YX4pytcVVZF2k:9555::::::
su::9555::::::
guest1:oWU/elsH4pe6E:::::::
The students will eventually reboot and then notice that they
cannot log in. Then you can tell them about the su user and let them
figure out that the su user does not require a password.
Error Symptoms/Conditions/Messages
Fault Insertion
Ensure that the .profile and .login files exist within the root
directory. Modify these files to emulate the above symptoms.
In both the /.profile and /.login for root, place a pound sign (#) in
front of the line that starts the OpenWindows environment. Remove
all echo lines and shorten the sleep time to less than 5 seconds.
Likely Cause
● Hasty operator error
● Possible joker on the system
● Possible file corruption
Possible Fix
● Try a Control-c sequence to stop the login script from executing.
● Boot up on an alternate device to edit the correct files.
Lesson
The students will become acquainted with more files that affect the
login sequence: Bourne shell /bin/sh, C shell /bin/csh, Korn shell
/bin/ksh, and restricted shell /bin/rsh.
% ls -a /
If it is not, type:
# cp /etc/skel/local.profile /.profile
echo ""
echo "Starting OpenWindows in 5 seconds (type Control-C to interrupt)"
sleep 5
$OPENWINHOME/bin/openwin
clear
exit
fi
# echo ""
# echo "Starting OpenWindows in 5 seconds (type Control-C to interrupt)"
sleep 1
# $OPENWINHOME/bin/openwin
# clear
exit
fi
% ls -a /
If it is not, type:
# cp /etc/skel/local.login /.login
# if possible, start the windows system. Give user a chance to bail out
#
if ( `tty` == "/dev/console" && $TERM == "sun" ) then
if ( ${?OPENWINHOME} == 0 ) then
setenv OPENWINHOME /usr/openwin
endif
echo ""
echo -n "Starting OpenWindows in 5 seconds (type Control-C to interrupt)"
sleep 5
echo ""
$OPENWINHOME/bin/openwin
clear # get rid of annoying cursor rectangle
logout # logout after leaving windows system
endif
# if possible, start the windows system. Give user a chance to bail out
#
if ( `tty` == "/dev/console" && $TERM == "sun" ) then
if ( ${?OPENWINHOME} == 0 ) then
setenv OPENWINHOME /usr/openwin
endif
echo ""
echo -n "Starting OpenWindows in 5 seconds (type Control-C to interrupt)"
sleep 1
echo ""
# $OPENWINHOME/bin/openwin
clear # get rid of annoying cursor rectangle
logout # logout after leaving windows system
endif
Error Symptoms/Conditions/Messages
Fault Insertion
Comment out the user datagram protocol (udp) parameter in the
/etc/netconfig file.
Likely Cause
● Network problem
Possible Fix
Correct the /etc/netconfig file.
Lesson
The students will learn to determine the files necessary for proper
network functions.
Error Symptoms/Conditions/Messages
Fault Insertion
Corrupt the /devices/pseudo/conskbd@0:kbd file.
Likely Cause
The user corrupted or installed files incorrectly.
Possible Fix
Reboot the system with the -r argument.
Lesson
This bug can occur when someone uses the SunOS 4.x operating
system (from the CD-ROM in this case) to repair a SunOS 5.x system,
because of the significant differences in the way I/O devices are
addressed and the way the device file is linked in the SunOS 5.x
operating system.
/opt/st350_scripts/infault12
/opt/st350_scripts/outfault12
Error Symptoms/Conditions/Messages
Fault Insertion
During the OpenWindows initialization, a corrupted file will be
encountered. The system goes into shutdown. To make this even
better, add another user on the same system that can open windows.
Likely Cause
A corrupted file
Possible Fix
Locate and remove or replace the corrupted file.
Lesson
The students will learn about many of the OpenWindows startup files
found within user directories.
2. Start Clock and Calculator, and then close both, but do not quit
them.
Error Symptoms/Conditions/Messages
Fault Insertion
Prevent lpsched from running by removing or corrupting the lp entry in
the /etc/passwd file. You can also prevent lpsched from starting up
at boot time by renaming a file in /etc/rc2.d.
Likely Cause
● Printer hardware malfunction
● Printer files
Possible Fix
● New printer or cables
● Correct files
Lesson
The students will become familiar with network printers.
lp:x:71:8:0000-lp(0000):/usr/spool/lp
Error Symptoms/Conditions/Messages
Fault Insertion
Corrupt the boot block, or corrupt the boot file /ufsboot.
Possible Fix
● Replace disk
Lesson
The students will understand the files related to the boot sequence
and be able to restore files if possible.
The 4boot file is from a Sun-4 machine. Use this file on Sun-4m, Sun-
4d, and Sun-4c architecture systems.
# cp /opt/st350_scripts/4boot /ufsboot
# cp /opt/st350_scripts/4mboot /ufsboot
# /usr/sbin/installboot /usr/lib/fs/ufs/bootblk /dev/rdsk/device_name
# halt
ok boot
Error Symptoms/Conditions/Messages
Fault Insertion
Change the run level from 3 to 5 in the /etc/inittab file, or change the
run level from 3 to 6 in the /etc/inittab file.
Likely Cause
Hasty operator error (OPE)
Possible Fix
Start the system as a single user and correct the /etc/inittab file.
Lesson
The students will learn about the /etc/inittab file in the Solaris 2.x
operating environment.
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the options field of the /proc entry in /etc/vfstab.
Likely Cause
● Hasty operator error
● Corrupted software
Possible Fix
● Install new software.
Lesson
The students will learn about other files that are required for system
operations.
● Before alteration:
● After alteration:
# reboot
Error Symptoms/Conditions/Messages
Fault Insertion
Change /etc/nsswitch.conf to the wrong service.
Likely Cause
Network files have been modified.
Possible Fix
Select the correct name service and correct related files.
Lesson
Submitting Engineer: Staff
nsswitch
● Before alteration:
● After alteration:
nsswitch.nisplus
● Before alteration:
● After alteration:
# reboot
Error Symptoms/Conditions/Messages
Fault Insertion
Add a file to /etc/rc3.d.
Likely Cause
The system administrator or programmer forgot the file existed.
Possible Fix
● Network files have been modified.
Lesson
The students will become familiar with the ifconfig command
and rc scripts.
Use the student1 account. The user name is student1, and the
password is student1.
Error Symptoms/Conditions/Messages
Fault Insertion
Change mouse speed in /usr/openwin/lib/Xdefaults.
Likely Cause
● Bad mouse
Possible Fix
Repair the file.
Lesson
The students will learn to check files beyond the GUI.
# vi /usr/openwin/lib/Xdefaults
● Before alteration:
OpenWindows.MouseAcceleration: 2
OpenWindows.MouseThreshold: 5
● After alteration:
OpenWindows.MouseAcceleration: 800
OpenWindows.MouseThreshold: 1
Ensure that the student1 account is used for the first time in the
OpenWindows environment.
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the banner using eeprom commands.
Likely Cause
Unhappy employee
Possible Fix
Use eeprom commands to restore the correct logo. Refer to the fault in
the Student Guide for an acceptable fix.
Lesson
The students will learn to use on-line eeprom commands.
ok banner
There is no outfault.
ok set-default oem-logo?
Error Symptoms/Conditions/Messages
No utmpx.
Cannot write to /var.
Fault Insertion
Change the rcS script to point to vfstap instead of vfstab.
The vfstap file has root mounted read-only.
Likely Cause
Bad vfstab
Bad rc scripts
Possible Fix
Fix rcS.
Lesson
Put echo statements in the rc scripts to determine exactly where
problems show up.
Submitting Engineer: JD
(before edit)
vfstab=/etc/vfstab
(after edit)
vfstab=/etc/vfstap
2. # cp /etc/vfstab /etc/vfstap
# vi /etc/vfstap
before:
(before edit)
/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0 / ufs 1 no -
(after edit)
#/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0 / ufs 1 no ro
4. # cd /etc
# touch -am 01121234 * (important)
(before edit)
vfstab=/etc/vfstap
after:
vfstab=/etc/vfstab
(unbugged)
/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0 / ufs 1 no -
(bugged)
#/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0 / ufs 1 no -
Error Symptoms/Conditions/Messages
Fault Insertion
Insert the fault using the terminal information compiler (tic).
Likely Cause
The software requires recompiling.
Possible Fix
Modify and recompile the new terminal software.
Lesson
The students trace software containing terminal information and use
the tic utility.
● infocmp
● tic
● terminfo
● Parameter definitions
● cuu1 - cursor up
Error Symptoms/Conditions/Messages
Fault Insertion
A change in command search paths. Add modified files to insert
errors.
Likely Cause
● Modified files
● Virus
● Hacker
Possible Fix
Locate and repair all files.
Lesson
The students will learn how to respond when they suspect a hacker
has entered the system.
# cp /etc/skel/local.login /.login
# cp /etc/skel/local.cshrc /.cshrc
# cp /etc/skel/local.profile /.profile
Note – One of the lessons of this bug is that if you are running in the
Bourne shell, openwin automatically prepends /usr/openwin/bin to
your path.
Error Symptoms/Conditions/Messages
Fault Insertion
Move the OpenWindows environment to the lost+found directory as
an inode entry.
Likely Cause
The file was lost or corrupted during a system crash.
Possible Fix
● Check whether the file can be opened in the correct lost+found
directory.
Lesson
The students will learn to locate the correct lost+found directory.
# ls -i /usr/openwin/bin/openwin
46200 /usr/openwin/bin/openwin
# mv /usr/openwin/bin/openwin /usr/lost+found/#46200
# file '#46200'
# mv '#46200' /filename
7. You can also use the Xinit file to create a similar problem.
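A file that is moved into a lost+found directory keeps its inode number, so the number recorded with ls -i can be used to search for it. The following is a minimal sketch, assuming a POSIX find command; the function name find_by_inode is hypothetical and is not part of the course scripts.

```sh
#!/bin/sh
# find_by_inode DIR INUM  (hypothetical helper, not a course script)
# Print every path under DIR, without crossing into other file
# systems, whose inode number is INUM.
find_by_inode() {
    find "$1" -xdev -inum "$2" -print
}
```

With the inode number from the listing above, a call such as find_by_inode /usr 46200 should report the renamed file in /usr/lost+found, provided the file stayed on the same file system.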
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the /etc/default/login file.
Likely Cause
● Malfunctioning ASCII terminal
Possible Fix
Modify /etc/default/login to enable ASCII terminal login.
Lesson
The students will learn about a simple security feature, which is useful
if they require root access from an ASCII port connected to the system.
CONSOLE=/dev/console
#CONSOLE=/dev/console
CONSOLE=/dev/console
After disconnecting the keyboard, and using the ASCII terminal, the
system “hangs” during boot.
Error Symptoms/Conditions/Messages
Fault Insertion
Enable logins to port a and verify that you can log in. There is no fault
to insert. Just use the normal setup for an ASCII terminal as a user
terminal.
Likely Cause
● Conflict of ports during boot
Possible Fix
Modify boot parameters or come up in a different port.
Lesson
The students will learn about using the ASCII terminal as a console
device during boot.
You cannot have two zsmons working on one port: the login one plus the
one you enabled.
1. # admintool
2. Click edit.
3. Edit ports.
4. Add port a as hardwired to your terminal type (if tipping in, use
sun-cmd).
During boot, the ASCII terminal is being used as the console device.
At some point in the boot sequence, the system software tries to
configure the port as an ASCII user terminal. This creates a conflict of
resources, which hangs the system.
Error Symptoms/Conditions/Messages
Fault Insertion
This fault is for networks or hosts on a twisted-pair network that uses
RJ connectors.
Likely Cause
● Network files
● Network hardware
● Network cables
Possible Fix
● Verify network hardware connections.
Lesson
The students will learn to use the diagnostics.
3. Reconnect to workstation.
Fault 29 - Where It Is At
Error Symptoms/Conditions/Messages
“Syncing, done”
Fault Insertion
Start an at process that calls init 5, halt, or reboot.
Likely Cause
Cron job.
“Trojan Horse.”
Faulty rc script.
Faulty at script.
Possible Fix
The at -l command lists the pending at jobs. Once the job is found, its
script can be read and removed.
Lesson
Tracing rc scripts, cron, and at (at -l).
2. # vi tst
#!/bin/csh
at -c -m now + "1minute" < /bin/tst
mail root </bin/tst
sync
init 5
sleep 7
/etc/init 5
sleep 7
/sbin/init 5
sleep 7
/usr/sbin/halt
sleep 7
/etc/halt
sleep 7
/etc/reboot
sleep 7
/usr/sbin/reboot
(save vi session)
3. # chmod +x tst
4. # /bin/tst
2. # cd /bin
Or
# rm /bin/tst
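When at -l shows several pending jobs, it can help to search the job files for commands that stop or restart the system. The following is a minimal sketch, not a course script; the function name list_shutdown_jobs is hypothetical, and the spool directory /var/spool/cron/atjobs mentioned afterward is an assumption about where the system keeps at jobs.

```sh
#!/bin/sh
# list_shutdown_jobs DIR  (hypothetical helper, not a course script)
# Print the names of files in DIR that invoke init with run level
# 0, 5, or 6, or that invoke halt or reboot.
list_shutdown_jobs() {
    grep -l -E '(^|[[:space:]/])(init[[:space:]]+[056]|halt|reboot)([[:space:]]|$)' "$1"/* 2>/dev/null
}
```

Running list_shutdown_jobs /var/spool/cron/atjobs as root would then name the suspect job files, which can be read before they are removed.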
Error Symptoms/Conditions/Messages
ls returns nothing.
Fault Insertion
Share the CD-ROM the wrong way.
Likely Cause
rpcd
mountd
Bad server CD-ROM.
Bad server sharetab.
Bad procedure.
Possible Fix
Share CD-ROM properly (see “Removing the Bug”)
Lesson
Proper way to share CD-ROM files.
4. On the client:
# cd /mnt
# ls (should see something like:)
cdrom0 sunsolve_2_X
# cd cdrom0
Or
# cd sunsolve_2_X
# ls (will show nothing)
Solution 1
Solution 2
1. On the server, use the mount command to find out how the CD-
ROM is mounted.
# mount
/ on /dev/dsk/c0t3d0s0 read/write/setuid on Wed Nov 15 14:35:33 1995
/usr on /dev/dsk/c0t3d0s6 read/write/setuid on Wed Nov 15 14:35:33 1995
/proc on /proc read/write/setuid on Wed Nov 15 14:35:33 1995
/dev/fd on fd read/write/setuid on Wed Nov 15 14:35:33 1995
/tmp on swap read/write on Wed Nov 15 14:35:43 1995
/export/home on /dev/dsk/c0t3d0s7 setuid/read/write on Wed Nov 15 14:35:45 1995
/cdrom/sunsolve_2_7 on /vol/dev/dsk/c0t6/sunsolve_2_7 read only on Wed Nov 15 14:36:34 1995
Error Symptoms/Conditions/Messages
NA
Fault Insertion
Have the students do “Applying the Bug” after the adb and kadb
core dump material.
NA
Likely Cause
NA
Possible Fix
NA
Lesson
Power of adb.
Note – If there is more than 1 instance of more, stop the other one.
9. Near the end of the proc2u output you will see a list of “ofiles.”
These are the virtual addresses of stdin, stdout, stderr. The last
one is the address of the file more is using.
15. ok boot
16. ok xxxx0e 1000 pgmap (where xxxx is the page number you wrote
down in step 13)
You will see the “shadow” file. Do you think this is a security hole?
Error Symptoms/Conditions/Messages
Fault Insertion
Place a space in the CONSOLE entry of the /etc/default/login file.
Likely Cause
An incorrect file was used for login processes.
Possible Fix
Locate and correct the file.
Lesson
The students will learn to check for spaces in all the incorrect places.
CONSOLE=/dev/console
#CONSOLE=/dev/console
CONSOLE= /dev/console
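A stray space next to the equals sign is hard to see in an editor, so a pattern search is more reliable. The following is a minimal sketch, assuming that the file holds NAME=value lines as in the listing above; the function name check_login_defaults is hypothetical.

```sh
#!/bin/sh
# check_login_defaults FILE  (hypothetical helper, not a course script)
# Print, with line numbers, each uncommented NAME=value line that has
# white space on either side of the equals sign.
check_login_defaults() {
    grep -n -E '^[[:space:]]*[A-Za-z_]+([[:space:]]+=|=[[:space:]]+)' "$1"
}
```

Running check_login_defaults /etc/default/login on a faulted system should print the CONSOLE line, and it should print nothing once the file is repaired.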
Error Symptoms/Conditions/Messages
ypwhich “not bound to domain xxx”
rup “hangs”
Fault Insertion
Misdirect /etc/hostname.le0 to point to localhost.
Note – This was a “Great and common” bug in 2.4 and would occur if
there were any extra carriage returns in /etc/hostname.le0; however
that was fixed in 2.5 files.
Likely Cause
Bad interface.
Bad le0 driver.
Bad rc script.
Possible Fix
Manually unplumb and plumb the interface, then ifconfig it in:
# ifconfig le0 unplumb
# ifconfig le0 plumb
# ifconfig le0 192. . ..<address from /etc/hosts> up
Fix hostname.le0
Lesson
Manual ifconfig where the “auto ifconfig happens”
Easy Way
1. # vi /etc/hostname.le0
(before edit)
machine.nodename (loghost from /etc/hosts)
(after edit)
localhost
Better Way
1. # mv /usr/sbin/sync /usr/sbin/sunc
2. # vi /usr/sbin/sync
#!/bin/sh
cp /etc/hostname.le0 /etc/.hostname.le0.orig
cat > /etc/hostname.le0 << FR
localhost
FR
/usr/sbin/sunc
exit
(save vi session)
3. # chmod +x /usr/sbin/sync
4. # vi /etc/rcS
(Or Better)
# cp /usr/sbin/sunc /usr/sbin/sync
Error Symptoms/Conditions/Messages
Fault Insertion
The students will insert the fault.
Likely Cause
● Hardware or software problem
● Lock resource
Possible Fix
Lesson
The students will learn to determine the cause of a hung system.
#!/bin/csh -f
clear
rm -f /tmp/guilty_party
cat > /tmp/guilty_party << Done
#!/bin/csh -f
while (1)
end
Done
chmod 777 /tmp/guilty_party
/usr/bin/priocntl -e -c RT /tmp/guilty_party &
2. Start the program for each processor you have. As soon as the system
stops accepting input or slows down radically, press L1-A (Stop-A)
several times.
Error Symptoms/Conditions/Messages
No shcat
No answer to rarp
Fault Insertion
Likely Cause
shcat
ifconfig
net
Possible Fix
Fix script.
Lesson
Bring up sequence.
Submitting Engineer: JD, staff. Note this bug caught me while I was
working up the netmask.le0 bug.
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the entry in the /etc/nsswitch.conf file.
Likely Cause
● Password or shadow file has been modified.
● Software is corrupted.
Possible Fix
Repair files.
Lesson
This is an important lesson concerning logging in to the system. When
a user cannot log in to the system, the /etc/nsswitch.conf file is
checked first.
# the following two lines obviate the "+" entry in /etc/passwd and /etc/group.
passwd: files nis
group: files nis
netgroup: nis
# the following two lines obviate the "+" entry in /etc/passwd and /etc/group.
passwd: Files nis
group: files nis
netgroup: nis
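The only difference between the two listings is the capital F in Files, which is easy to overlook. The following is a minimal sketch of a check for unrecognized source names; the function name check_nsswitch is hypothetical, and the list of valid sources (files, nis, nisplus, dns, compat) is an assumption that may need adjusting for a given release.

```sh
#!/bin/sh
# check_nsswitch FILE  (hypothetical helper, not a course script)
# Print "line: database: source" for each source name that is not in
# the assumed list of valid sources. Source names are case sensitive.
check_nsswitch() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }            # skip comments and blank lines
        {
            in_action = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break              # ignore a trailing comment
                if ($i ~ /^\[/) in_action = 1     # start of [NOTFOUND=return] style criteria
                if (!in_action && $i !~ /^(files|nis|nisplus|dns|compat)$/)
                    print NR ": " $1 " " $i
                if ($i ~ /\]$/) in_action = 0     # end of the bracketed criteria
            }
        }
    ' "$1"
}
```

Run against the altered file, check_nsswitch /etc/nsswitch.conf should report the passwd line with the source Files.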
Error Symptoms/Conditions/Messages
“no shell”
Fault Insertion
Change permissions in the /usr directory.
Likely Cause
● Hacker
● Kernel
Possible Fix
1. Change vfstab to make the machine a “Dataless” client.
Lesson
“Dataless Client”
Error Symptoms/Conditions/Messages
Fault Insertion
Remove the ftp service from the /etc/inetd.conf file.
Likely Cause
● Network problem
Possible Fix
Restore correct files.
Lesson
The students will learn about the files required to provide network
services.
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the network files /etc/hostname.le0 or .ie0 and
/etc/hosts.
Likely Cause
● Network files were not properly restored after testing
● Network problem
Possible Fix
● Restore files.
Lesson
The students will learn to check the files for proper network
operations.
# cat /etc/hostname.le0
edu8
# cat /etc/hostname.le0
edu81
● /etc/hosts
● /etc/hostname.le0 or .ie0
The devices known as ticlts, ticots, and ticotsord are used for
“loopback providers.” Refer to Solaris manuals for more information.
● /etc/net/ticlts/hosts
● /etc/net/ticots/hosts
● /etc/net/ticotsord/hosts
Error Symptoms/Conditions/Messages
Fault Insertion
Fault insertion is performed by the students during the workshop.
Likely Cause
This fault workshop deals with swap space.
Possible Fix
Add more swap space or memory.
Lesson
The students will learn about swap space and performance
perceptions.
Error Symptoms/Conditions/Messages
Fault Insertion
Modify the network files located in /tftpboot.
Likely Cause
Possible Fix
Lesson
Submitting Engineer: Staff
# ls tftpboot
8196D402.SUN4M inetboot.sun4c.Solaris_2.4
8196D443.SUN4M inetboot.sun4m.Solaris_2.4
8196D445.SUN4C tftpboot
8196D445.SUN4M
Shut down the diskless client before modification. Use the move
command.
# ls tftpboot
8196D403.SUN4M inetboot.sun4c.Solaris_2.4
8196D443.SUN4M inetboot.sun4m.Solaris_2.4
8196D445.SUN4C tftpboot
8196D445.SUN4M
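The file names in /tftpboot appear to follow the usual diskless-client convention: the client IP address written as eight hexadecimal digits, followed by the architecture. Under that assumption, converting a client address to hexadecimal shows which name is wrong. The function name ip_to_hex and the sample address are hypothetical.

```sh
#!/bin/sh
# ip_to_hex A.B.C.D  (hypothetical helper, not a course script)
# Print a dotted-decimal IP address as eight uppercase hexadecimal digits.
ip_to_hex() {
    old_ifs=$IFS
    IFS=.
    set -- $1              # split the address into its four octets
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}
```

For example, ip_to_hex 129.150.212.2 prints 8196D402, so a client with that address looks for 8196D402.SUN4M and cannot boot if the file has been renamed to 8196D403.SUN4M.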
Error Symptoms/Conditions/Messages
Fault Insertion
Set up student account with a restricted shell.
Likely Cause
File access permissions have been changed.
Possible Fix
Change login shell.
Lesson
The students will learn about a Solaris shell that is used to restrict
what system users can do.
Restricted Shell
Restricted shells do not enable these operations:
● Changing directories
● Redirecting output
Error Symptoms/Conditions/Messages
Cannot use keyboard, openwin.
Fault Insertion
Change permissions on /devices.
Likely Cause
● Sysadmin
● Security issues
Possible Fix
Fix /devices.
Lesson
Using diff.
Error Symptoms/Conditions/Messages
OpenWindows hangs.
Fault Insertion
Remove user access to /tmp
Likely Cause
● Bad openwin directory
Possible Fix
1. Mount OpenWindows from another system
3. Fix /tmp
Lesson
Using Sunsolve:
Error Symptoms/Conditions/Messages
% nispasswd
“permission denied”
Fault Insertion
Install the user on the net root server (as root) using admintool &.
Give them a normal password = username
Likely Cause
NIS+
Possible Fix
Change password as root.
Lesson
nispasswd
SunSolve
Install the user on the network root server (as root) using admintool
&. Give them a normal password = username
Error Symptoms/Conditions/Messages
cutest% rlogin -l root proto2
permission denied
cutest% rsh proto2
permission denied
cutest% telnet proto2
Trying 129.150.28.75 ...
Connected to proto2.
Escape character is '^]'.
220 proto2 FTP server (UNIX(r) System V Release 4.0) ready.
500 '': command not understood.
Fault Insertion
On the server, partially reverse the order of procedures in
/etc/inetd.conf.
Likely Cause
● /etc/default/login
● /etc/services
● /etc/netconfig
● /etc/inetd.conf
Possible Fix
Fix the file /etc/inetd.conf.
Lesson
● rpcinfo
● kill -HUP
(before edit)
ftp stream tcp nowait root /usr/sbin/in.ftpd in.ftpd
telnet stream tcp nowait root /usr/sbin/in.telnetd in.telnetd
shell stream tcp nowait root /usr/sbin/in.rshd in.rshd
login stream tcp nowait root /usr/sbin/in.rlogind in.rlogind
(after edit)
ftp stream tcp nowait root /usr/sbin/in.telnetd in.telnetd
telnet stream tcp nowait root /usr/sbin/in.ftpd in.ftpd
shell stream tcp nowait root /usr/sbin/in.rlogind in.rlogind
login stream tcp nowait root /usr/sbin/in.rshd in.rshd
(save vi session)
# kill -1 (pid of inetd from above)
The fact that the files were the same size as those on another machine we
compared with made it really tough! (JD)
# touch -am 01121234 *
Students just reversed the first word on each of these services (that is,
changed the word shell to login, changed the word telnet to ftp)
in order to breakpoint the rpc initialization. (JD)
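Because the swapped entries leave the file size unchanged, a check that pairs each service name with its expected server program finds the fault more directly than a size comparison. The following is a minimal sketch; the function name check_inetd is hypothetical, and the expected pairs are taken from the "before edit" listing above.

```sh
#!/bin/sh
# check_inetd FILE  (hypothetical helper, not a course script)
# Print "service: expected X, found Y" for each of the four services
# whose server program does not match the "before edit" listing.
check_inetd() {
    awk '
        BEGIN {
            want["ftp"]    = "in.ftpd"
            want["telnet"] = "in.telnetd"
            want["shell"]  = "in.rshd"
            want["login"]  = "in.rlogind"
        }
        /^[ \t]*#/ || NF < 7 { next }             # skip comments and short lines
        ($1 in want) {
            n = split($6, part, "/")              # last component of the server path
            if (part[n] != want[$1])
                print $1 ": expected " want[$1] ", found " part[n]
        }
    ' "$1"
}
```

Running check_inetd /etc/inetd.conf on the server should list every swapped service. After the file is corrected, inetd still has to be signaled to reread it.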
Error Symptoms/Conditions/Messages
program not registered: RPC TIMEOUT
Or
Fault Insertion
Bug occurs two ways so put both in (at your option):
On the server:
Likely Cause
Overzealous security administrator.
Possible Fix
Reinstall or fix files.
Lesson
rpc calls
1. # cp /etc/services /etc/.services
2. # vi /etc/services
(before edit)
ftp-data 20/tcp
ftp 21/tcp
(after edit)
# ftp-data 20/tcp
# ftp 21/tcp
Or
1. # cp /etc/nsswitch.conf /etc/.nsswitch.conf
2. # vi /etc/nsswitch.conf
(before edit)
services: nis files
(after edit)
# services: nis files (or delete line altogether)
Note – If you have nis or nis+ running in your lab you must change
the nsswitch.conf file or else the bug does not work (machine picks up
services from the nameserver)!
4. # reboot
Error Symptoms/Conditions/Messages
Tough to see but access through mounts gets slow when several
systems try moving or copying data across mounts.
Fault Insertion
Reduce the number of NFS server daemons spawned.
Likely Cause
● NFS problem
● Network problem
Possible Fix
Spawn more server daemons by hand. On the Server:
# /usr/lib/nfs/nfsd -a 32
Lesson
● ps -l
● nfsd
● SunSolve
● spray
1. # cd /etc/rc3.d
3. # vi S15nfs.server
(before edit)
if grep -s nfs /etc/dfs/sharetab >/dev/null ; then
/usr/lib/nfs/nfsd -a 16
(after edit)
(save vi session )
5. # reboot
JD
Error Symptoms/Conditions/Messages
Fault Insertion
Remove read permissions for users from /dev/pts (all users then
must call kernel threads to read anything).
Likely Cause
boot -p
Possible Fix
Change permissions back.
Lesson
SunSolve
If you change the permissions for /dev/pts so others cannot read, it will
slow the machine down, not stop it.
The solution is to run truss. If you use truss to see what is going on inside
a user process, you will see it.
How did the bug appear? Do boot -p and you have all the trouble in the
world.
Error Symptoms/Conditions/Messages
Hostname not found.
Fault Insertion
Remove hostname A from testing machine’s /etc/hosts file.
Likely Cause
System administrator oversight.
Possible Fix
Fix /etc/hosts file.
Lesson
How to isolate problems using fault analysis techniques.
# cd /etc
# cp hosts .host.orig
# vi hosts
(before edit)
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.182.68 scha
(after edit)
l27.0.0.l localhost
l29.l50.28.39 forward loghost
l29.l50.l82.69 Scha
Error Symptoms/Conditions/Messages
No answer from any ping or rup.
Fault Insertion
Change each digit 1 in /etc/hosts to a lowercase letter l.
Likely Cause
Incorrect install or ifconfig or sys-unconfig.
Possible Fix
sys-unconfig
New ifconfig
Lesson
How to isolate problems using fault analysis techniques.
(before edit)
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.182.68 scha
129.150.182.82 hc-dn
129.150.182.86 ha-dog
(after edit)
l27.0.0.l localhost
l29.l50.28.39 forward loghost
l29.l50.l82.68 scha
l29.l50.l82.82 hc-dn
l29.l50.l82.86 ha-dog
(Use :%s/1/l/g)
# touch -am 01121234 *
# reboot
This bug works well on its own but works best when combined as suggested
in the Instructor Notes for Module 1.
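In many fonts the letter l and the digit 1 look alike, so a pattern check finds this fault faster than reading the file. The following is a minimal sketch; the function name check_hosts is hypothetical, and the check only verifies the dotted-decimal form, not that each number is in the range 0 through 255.

```sh
#!/bin/sh
# check_hosts FILE  (hypothetical helper, not a course script)
# Print "line: address" for each entry whose first field is not four
# dot-separated decimal numbers.
check_hosts() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }            # skip comments and blank lines
        $1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print NR ": " $1 }
    ' "$1"
}
```

Running check_hosts /etc/hosts against the "after edit" file should flag every line, because each address contains at least one letter l.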
Error Symptoms/Conditions/Messages
Machines cannot talk to machine C by name. Machine C cannot talk to
other machines on the same subnet.
Fault Insertion
Have /etc/hostname.le0 point to an incorrect host.
Likely Cause
System administrator oversight.
Possible Fix
Fix /etc/hosts or /etc/hostname.le0 file.
Lesson
How to isolate problems using fault analysis techniques.
# cd /etc
# cp hosts .host.orig
# cp hostname.le0 .hostname.le0
# vi hostname.le0
(before edit)
machinec’s nodename
(after edit)
scha (a hostname that exists in /etc/hosts)
# vi hosts
(before edit)
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.182.68 scha
129.150.182.82 hc-dn
129.150.182.86 ha-dog
129.150.182.85 ha-network
129.150.182.88 ha-bud
129.150.182.89 relo-network
129.150.182.90 relo-dog
(after edit)
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.183.68 scha
129.150.182.82 hc-dn
129.150.182.86 ha-dog
129.150.182.85 ha-network
129.150.182.88 ha-bud
129.150.182.89 relo-network
129.150.182.90 relo-dog
# touch -am 01121234 *
# reboot
Error Symptoms/Conditions/Messages
Fault Insertion
There is no fault to insert.
Likely Cause
Possible Fix
Lesson
Submitting Engineer: Rolando Dizon
No fault to insert.
Fault #
Error Symptoms/Conditions/Messages
Fault Insertion
Likely Cause
Possible Fix
Lesson
Submitting Engineer: Staff
Example Insertion #
Suggested Schedule