Dell PowerEdge Troubleshooting Participant Guide
Dell PowerEdge Troubleshooting-SSP1
Visual Indicators
Left and Right Control Panels
PSU Indicator Codes
Mid-Bay Hard Drive Indicators
System ID Button
System Board LED
System Board Jumper Settings
Knowledge Check - Control Panel
Recovery Options
iDRAC Default Settings
Lifecycle Controller - Part Replacement Configuration
Easy Restore
Export and Import Server Configuration Profile
LAB - Exporting a Server Configuration Profile
Logs
Lifecycle Controller Logs
System Event Logs (SEL)
POST Code, Intrusion, Last Crash Screen
SupportAssist Enterprise Overview
Secure Connect Gateway Overview
Gathering SupportAssist Logs
Memory
Memory Event Logging
GPU
Updating the NVIDIA Drivers
GPU XID Errors
WDDM/VDI Modes for GPU
Undetected GPU
GPU Memory Page Error
Appendix
Support Library
The screen captures show a search of the Dell support library for articles about POST
failures.
The screen capture shows the results of searching for PowerEdge R660 troubleshooting
manuals.
For example, during a server reboot you get the following message during
POST: Memory set to minimum frequency. Searching the Dell support
library shows the knowledge base article that applies to the error.
The screen captures show the Dell support drivers and downloads page for the
PowerEdge R660.
• System BIOS
• System Firmware
• Device Drivers
The screen captures show searching for PowerEdge R760 hardware replacement videos.
The Dell support site > videos provides a suite of "How To Replace" QRL
videos.
Also, the QR codes that are on the supported products provide access to
the commonly referenced videos, document reference materials, technical
support, and sales teams. The Dell Quick Resource Locator (QRL) is a
web page that allows users to quickly get at-the-box videos and
documentation supporting Dell products.
Visual Indicators
Tip: Dell employees can use the Blink tool to identify and
define component indicators, such as LED sequence on
system boards, PSUs, control panels, and so on.
Left and right control panel and the optional Quick Sync 2 control panel.
The Left control panel (LCP) provides system health at a glance. The
system health and system ID indicator are on the left control panel of the
system. When troubleshooting the server, the first indication of a problem
that an administrator may see is a panel indicator that is amber.
The table below provides the description and condition of each LCP
indicator.
1 Indicates that the system is powered on, is healthy, and system ID mode
is not active. Press the system health and system ID button to switch to
system ID mode.
2 Indicates that the system ID mode is active. Press the system health and
Icon/Ports Feature
Consider this scenario: One of the PSUs on the R660 server is replaced.
The diagnostic LED blinks green five times and then stays off. The iDRAC
UI shows that the PSU has failed. After the PSU is reseated, the behavior
remains. Service individuals can use the LED behavior to isolate the
issue. In this scenario, the behavior is due to a mismatched PSU.
Although the server supports different PSUs with different power outputs,
the PSUs in the server need to match.
See the participant guide for the PSU diagnostics indicator definitions.
Off Off
Solid Normal
System ID Button
• If the system stops responding during POST, press and hold the
System ID button for more than five seconds to enter BIOS progress
mode.
• To reset the iDRAC (if not disabled in F2 iDRAC setup) press and hold
the button for more than 15 seconds.
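The hold durations above can be summarized in a short sketch. This is an illustration only, not Dell firmware logic: the function name, its parameters, and the fallback strings are hypothetical, and the short-press behavior is taken from the control panel description (a press switches system ID mode).

```python
def system_id_button_action(hold_seconds, hung_in_post=False,
                            idrac_reset_enabled=True):
    """Map a System ID button press to the action described in the guide.

    All names here are hypothetical; only the 5 s and 15 s limits come
    from the text.
    """
    # More than 15 seconds resets the iDRAC, unless disabled in F2 setup.
    if hold_seconds > 15 and idrac_reset_enabled:
        return "reset iDRAC"
    # More than 5 seconds enters BIOS progress mode when POST has hung.
    if hold_seconds > 5 and hung_in_post:
        return "enter BIOS progress mode"
    # A short press switches system ID mode.
    if hold_seconds <= 5:
        return "switch system ID mode"
    # Assumption: the guide does not describe any other combination.
    return "not described in this guide"
```

For example, `system_id_button_action(6, hung_in_post=True)` returns `"enter BIOS progress mode"`.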
The image details the OmniVu LED codes for Dell PowerEdge XR11 and XR12 power
sequencing.
The BIOS password feature is disabled (pins 4–6). The BIOS password is now
disabled and users are not allowed to set a new password.
The BIOS configuration settings are cleared at system boot (pins 1–3).
Jumper settings on the PowerEdge R760 system board.
PSU.
d. Check the temperature status of the server components to identify
the source of the excessive heat.
Recovery Options
The System Setup utility has three options available to reset iDRAC to
default settings.
• To preserve the iDRAC network settings and user accounts, use the
Reset iDRAC configuration to defaults option.
For example, the service engineer replaces a faulty fPERC. The Part
Replacement Configuration feature updates the part firmware
automatically when the server boots.
Easy Restore
Given the scenario and question: The service engineer replaced the
system board on a PowerEdge server. How is the server information
retained or restored?
The Easy Restore feature automatically restores the service tag, licenses,
UEFI configuration, system configuration settings (BIOS, iDRAC, NIC) and
OEM ID (Personality Module).
Easy Restore Storage is part of the server front panel that can store up to
4 MB of data. All data is backed up in a backup flash device automatically.
If BIOS detects a new system board and the service tag in the backup
flash device, BIOS prompts the user to restore the backup information.
See the participant guide for the steps to restore the service tag using
Easy Restore.
The steps to restore the service tag using Easy Restore are:
1. Turn on the system.
2. If BIOS detects a new system board, and if the service tag is present in
the backup flash device, BIOS displays the service tag, the status of
the license, and the UEFI Diagnostics version. Do one of the following:
a. Press Y to restore the service tag, license, and diagnostics
information.
b. Press N to go to the Lifecycle Controller based restore options.
c. Press <F10> to restore data from a previously created Hardware
Server Profile.
3. After the restore process is complete, BIOS prompts to restore the
system configuration data.
Do one of the following:
The Import operation imports the file from a network share. Import applies
the previously saved or updated configurations that are contained in the
file to a system.
Many of the SCP import fields are similar to the SCP export function.
Users can select a graceful, forced, or no reboot option. Users can also
set a wait time before the server reboots after importing the SCP.
Logs
Logs are a primary tool for assessing system health, isolating errors,
and verifying changes. Typically, when addressing an issue, the logs are
viewed before any action is taken.
Log Activities
Activity Description
System Health Display all alerts that are related to hardware within
the system chassis.
1. Click Maintenance.
2. Click Lifecycle Log.
Users can filter the logs by category, severity, keyword, or date range.
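As an illustration of how these four filters combine, here is a minimal sketch that applies them to log entries already exported from the server. The `LogEntry` fields, the `filter_log` function, and the sample entries are all hypothetical and do not reflect the actual Lifecycle Log export format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LogEntry:
    logged_on: date
    category: str
    severity: str
    message: str


def filter_log(entries, category=None, severity=None, keyword=None,
               start=None, end=None):
    """Return the entries that match every filter that was supplied."""
    matches = []
    for entry in entries:
        if category is not None and entry.category != category:
            continue
        if severity is not None and entry.severity != severity:
            continue
        # Keyword match is case-insensitive and searches the message text.
        if keyword is not None and keyword.lower() not in entry.message.lower():
            continue
        if start is not None and entry.logged_on < start:
            continue
        if end is not None and entry.logged_on > end:
            continue
        matches.append(entry)
    return matches


# Hypothetical entries, invented for illustration only.
SAMPLE_LOG = [
    LogEntry(date(2024, 3, 1), "System Health", "Critical", "PSU 2 has failed"),
    LogEntry(date(2024, 3, 2), "Configuration", "Informational", "BIOS setting changed"),
    LogEntry(date(2024, 3, 9), "System Health", "Warning", "Fan 3 speed is low"),
]
```

Calling `filter_log(SAMPLE_LOG, category="System Health")` returns the two hardware entries. Filters left as `None` are ignored, which mirrors leaving a filter field empty in the UI.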
1. Click the + icon for the required log entry. The Message ID details are
displayed.
2. Enter the comments for the log entry in the Comment box.
This image shows the steps to add comments to the Lifecycle logs.
This image shows the steps to export the Lifecycle Controller logs.
POST Code, Intrusion, and Last Crash Screen are troubleshooting tools
that the iDRAC provides. Each tool automatically provides a report when a
system event occurs. Administrators and engineers can use the
information when escalating issues to technical support.
1: The POST Code option shows the last system POST code (in
hexadecimal) recorded before the operating system of the managed
system. The POST code helps to detect pre-video errors, report fatal
errors, and analyze the system failures during BIOS POST, particularly the
No POST No Video situations. The fatal error codes are used to report all
the fatal POST errors.
3: The Last Crash Screen option provides information about the events
leading to the system crash. This information is saved in the iDRAC
memory and is remotely accessible. The Last Crash Screen feature is
available with iDRAC Express and Enterprise licenses.
The last crash screen capture is only available with the Windows
operating system, and the user must have installed OpenManage Server
Administrator. The last crash screen capture does not work with Linux or
Server Monitoring
the issue. The issue might be that the rack PDU is faulty or simply the
power cable is disconnected.
− Server Manager
− Task Manager
− Resource Monitor
Go to the Dell Support Site and read the "Support for Dell EMC
OpenManage Plug-in for Nagios Core" article to learn more
about the OpenManage plug-in for Nagios Core.
16G uses a new naming scheme. The naming affects the system board,
peripheral devices, and risers. Connector and cable naming is key to
isolating faulty cables and ports. Service engineers must be able to
identify a cable or port based on the nomenclature used in log entries.
Many errors are the result of mis-cabling.
The graphic defines the naming of Mini Cool Edge I/O Connector 5 (MCIO
Connector 5) on a system board.
Cabling example for naming convention of backplane, fPERC, and the system board.
BP-Backplane P-PCIe
CTRL-PERC X-XGMII
If a riser has fewer than four connectors, the unused numbers are
skipped. For example, riser 1 has two connectors, so SL19 and
SL20 are skipped.
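The skipping rule can be sketched as follows. This is an assumption-laden illustration: the function name and parameters are hypothetical, and the starting number 17 for riser 1 is only inferred from the example (four reserved numbers with SL19 and SL20 skipped). It is not a documented value.

```python
def riser_connector_names(first_number, connectors_present,
                          reserved=4, prefix="SL"):
    """Split a riser's reserved connector numbers into used and skipped.

    first_number is the first number reserved for the riser (an input,
    not something this sketch can derive).
    """
    if not 0 <= connectors_present <= reserved:
        raise ValueError("connectors_present must be between 0 and reserved")
    numbers = range(first_number, first_number + reserved)
    # The populated connectors take the lowest numbers in the block.
    used = [f"{prefix}{n}" for n in numbers[:connectors_present]]
    # The remaining numbers in the block are skipped, not reassigned.
    skipped = [f"{prefix}{n}" for n in numbers[connectors_present:]]
    return used, skipped
```

With the inferred starting number, `riser_connector_names(17, 2)` returns `(["SL17", "SL18"], ["SL19", "SL20"])`, which matches the riser 1 example.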
Log Errors
Error Types
Error Message
The boot capture timestamp records the sequence end time. This occurs
when the capture reaches 2 MB in size or the server is rebooted.
iDRAC UI
Boot capture files shown under the Troubleshooting tab in the iDRAC.
The list displays the currently active boot capture file. While the update is
in progress, click Refresh to view the latest timestamp for the boot
capture file.
Users can play the files directly from the iDRAC Enterprise or save them
to a location on their system.
To configure the boot capture video settings, select one of the following
options and click Apply.
iDRAC Diagnostics
Diagnostics Console commands shown under the Maintenance tab in the iDRAC.
Diagnostic commands:
• arp
• ifconfig
• netstat
• ping
• gettracelog
• ping6
Hardware Diagnostics
The Hardware Diagnostics utility can validate the memory, I/O devices,
CPU, physical disk drives, and other peripherals.
Minimum to POST
The table shows the steps to take to help resolve no power, no video, and
no POST issues.
No power
2: Reset the power supply.
a. Verify that the power source is working properly by connecting a
device that draws a similar amount of power.
3: Replace the power supply. The server does not turn on by using the
front ear node.
No video
2: Check the video interface cabling from the system to the monitor.
Servers have two VGA ports. The front VGA port is on the right control
panel and the rear VGA port is on the RIO board. If the system is liquid
cooled, there is no rear VGA port.
3: Run the LCD Built-in Self Test (BIST).
No POST
1: Check the LCD screen or LED indicators for any error messages.
2: Ensure the server is turned on by verifying the power supply LED light
the bar.
Memory
The 15G and 16G memory event logging uses a common set of event
messages to describe the recommended action instead of describing the
underlying event.
When the BIOS generates host visible correctable errors, the events are
logged in the Technical Support Report (TSR) and Serial Presence Detect
(SPD).
The system receives a summary of single bit error correcting data per
DIMM once per day.
1. If the error log values are greater than the existing values, the prior
values are overwritten, subject to a threshold.
2. If the on-DIMM error log is below the reported threshold, it reports zero
errors.
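The two rules above can be expressed as a small sketch. The function names are hypothetical, and the threshold is passed in as a parameter because the guide does not state its value. This models only the logic as described and is not the BIOS or iDRAC implementation.

```python
def reported_errors(on_dimm_count, threshold):
    """Rule 2: a count below the threshold is reported as zero errors."""
    return on_dimm_count if on_dimm_count >= threshold else 0


def update_logged_count(prior, on_dimm_count, threshold):
    """Rule 1: overwrite the prior value only when the new one is greater."""
    current = reported_errors(on_dimm_count, threshold)
    return current if current > prior else prior
```

For example, with a hypothetical threshold of 10, `update_logged_count(5, 12, 10)` returns 12, while `update_logged_count(5, 3, 10)` keeps the prior value of 5.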
In the event of an uncorrectable error (UCE), the device where the error occurred is identified for
replacement. Mirrored memory is resilient to uncorrectable errors because
it has two copies of the data.
GPU
Solution:
1. Direct the customer to the NVIDIA driver reference portal.
2. Guide the customer to input the system details: Product Type, Product
Series, Product, Operating System, and CUDA Toolkit version.
3. Direct the user to click the Search button.
4. Request the user to download the latest version of the driver file and
install in the environment.
Solution:
1. Collect the XID number from the user. In this scenario, the XID number
is 9.
2. Open the NVIDIA GPU Management and Deployment guide.
3. Verify the cause of the issue by referring to the XID number. In this
scenario, the cause is mapped to "driver error". Problems in the core of
the driver cause the driver errors.
4. Direct the customer to update the driver to the latest version. If the
customer has the latest version of the driver, then ask them to install
the previous version of the driver and retest. The latest OS logs enable
the customer to understand if the issue is due to a specific build of the
driver or the GPU combination.
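When the customer supplies raw operating system logs instead of the XID number, the number can be pulled out with a short script. The sketch below assumes the kernel log line has the form `NVRM: Xid (<PCI address>): <number>, ...`. The sample line and all names are hypothetical, and the cause table holds only the one mapping stated in this scenario.

```python
import re

# Assumed log format: "NVRM: Xid (<PCI address>): <number>, <details>".
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

# Only the mapping stated in this guide; look up other XIDs in the NVIDIA guide.
KNOWN_CAUSES = {9: "driver error"}


def parse_xid(line):
    """Return (device, xid_number), or None if the line has no XID."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def describe_xid(line):
    """Build a one-line summary for a log line."""
    parsed = parse_xid(line)
    if parsed is None:
        return "no XID found"
    device, number = parsed
    cause = KNOWN_CAUSES.get(number, "look up in the NVIDIA guide")
    return f"XID {number} on {device}: {cause}"


# Hypothetical sample line, invented for illustration.
SAMPLE_LINE = "NVRM: Xid (PCI:0000:3b:00): 9, pid=1234"
```

The table is deliberately small: adding entries for other XID numbers would require the NVIDIA documentation, not this guide.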
Solution:
1. Direct the customer to verify the compatibility of the GPUs with
WDDM/VDI mode.
2. If the GPUs are compatible with Windows Display Driver Model
(WDDM) and Virtual Desktop Infrastructure (VDI) mode, ask the
customer to boot the server.
3. After booting the server, open a remote terminal session to the server.
− For the system with the latest version of the GPUs, run the
command "gpumodeswitch --gpumode graphics" to enable
WDDM/VDI modes.
− For the system with the previous version of the GPUs, run the
command "nvidia-smi -dm 1" to enable WDDM/VDI modes.
NVIDIA A100 HGX 80/40 GB GPUs support the WDDM 2.0+ based
functionality, only if the virtual GPU (vGPU) software is installed.
Undetected GPU
Solution:
1. Direct the customer to verify that all GPUs and cables are seated correctly.
2. Ask the user to reboot the server. Ensure that the server is running
with the NVIDIA driver (or CUDA).
3. After booting the server, open a terminal session to the server and
type:
− # nvidia-smi --query-gpu=gpu_name,serial,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total,temperature.gpu,power.draw --format=csv
4. Identify the serial numbers of all GPUs from the obtained output.
− In this scenario, the user has installed four GPUs but the query has
displayed the serial number of only three GPUs.
6. The GPU serial number that is not displayed in the command output is
the non-functional GPU on the server.
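The comparison in this procedure can be automated once the CSV output is saved. The sketch below is a minimal illustration: it assumes the output has a header row with a `serial` column, and the serial numbers and GPU names in the sample are invented placeholders.

```python
import csv
import io


def detected_serials(csv_text, column="serial"):
    """Return the set of serial numbers listed in the CSV query output."""
    # skipinitialspace handles the ", " separator used in the sample below.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return {row[column].strip() for row in reader if row.get(column)}


def missing_gpus(installed_serials, csv_text, column="serial"):
    """Serials that are installed but absent from the query output."""
    return sorted(set(installed_serials) - detected_serials(csv_text, column))


# Hypothetical data: four installed GPUs, three reported by the query.
INSTALLED = ["SN-0001", "SN-0002", "SN-0003", "SN-0004"]
SAMPLE_OUTPUT = """name, serial
GPU-A, SN-0001
GPU-A, SN-0002
GPU-A, SN-0004
"""

if __name__ == "__main__":
    print(missing_gpus(INSTALLED, SAMPLE_OUTPUT))
```

The installed list has to come from an independent record, such as the order or asset inventory, because the query only reports GPUs that the driver can see.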
Solution:
1. Direct the customer to reboot the server. Ensure that the server is
running with the NVIDIA driver (or CUDA).
2. After booting the server, open a terminal session to the server and
type: # nvidia-smi -i 0320415054563 -q -d ECC
3. If the Single Bit Error (SBE) value is greater than 1000 or if the Double
Bit Error (DBE) value is greater than 10, it indicates that the GPU is
failing. If so, request the customer to replace the GPUs to avoid further
damage.
a. In this scenario, the output shows that the SBE and DBE values are
within the limits, so the GPUs have no issue.
b. High error rates are commonly caused by physical damage to the
GPUs or by excessive heat for prolonged periods of time.
4. Inform the user that the issue might be due to an unreliable memory
page and that the GPU will retire the page. No further action is needed
in this scenario.
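The threshold check in step 3 can be sketched as a simple function. The limits are the ones stated in this procedure. The function and constant names are hypothetical, and the counts are assumed to have been read manually from the ECC section of the command output.

```python
# Limits stated in this procedure.
SBE_LIMIT = 1000  # Single Bit Errors
DBE_LIMIT = 10    # Double Bit Errors


def gpu_is_failing(sbe_count, dbe_count):
    """Return True when either error count exceeds its limit."""
    return sbe_count > SBE_LIMIT or dbe_count > DBE_LIMIT
```

For example, `gpu_is_failing(1200, 0)` returns `True`, while `gpu_is_failing(1000, 10)` returns `False` because both values are at, not above, the limits.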
PPCM Examples
Backplane
PERC
Riser