Dell PowerEdge Troubleshooting Participant Guide

DELL POWEREDGE
TROUBLESHOOTING
PARTICIPANT GUIDE
Dell PowerEdge Troubleshooting-SSP1
© Copyright 2023 Dell Inc

Table of Contents
Dell PowerEdge Troubleshooting
Guidelines and Resources
Support Library
Server Troubleshooting Guides
Dell Update Package (DUP) and Firmware Updates
How to Replace - Videos
LAB - Downloading and Updating the iDRAC Firmware
Visual Indicators
Left and Right Control Panels
PSU Indicator Codes
Mid-Bay Hard Drive Indicators
System ID Button
System Board LED
System Board Jumper Settings
Knowledge Check - Control Panel
Recovery Options
iDRAC Default Settings
Lifecycle Controller - Part Replacement Configuration
Easy Restore
Export and Import Server Configuration Profile
LAB - Exporting a Server Configuration Profile
Logs
Lifecycle Controller Logs
System Event Logs (SEL)
POST Code, Intrusion, Last Crash Screen
SupportAssist Enterprise Overview
Secure Connect Gateway Overview
Gathering SupportAssist Logs
Server Monitoring
LAB - Performing a SupportAssist Collection
Fault Isolation Tools
Configuration Validation Overview
Connector and Cable Naming
Log Errors
Boot and Crash Capture
iDRAC Diagnostics
Hardware Diagnostics
Minimum to POST
No Power, No Video, No POST
Memory
Memory Event Logging
GPU
Updating the NVIDIA Drivers
GPU XID Errors
WDDM/VDI Modes for GPU
Undetected GPU
GPU Memory Page Error
Appendix

Guidelines and Resources
Support Library
Important: Some resources on the Dell support site are
permission-based and can only be accessed with a
corporate account.
The screen captures show a search of the Dell support library for articles
about POST failures.
A key resource to consider when troubleshooting is the Dell Support
Library. Administrators and service engineers can download resources to
help isolate and resolve issues. The example shows a search of the
support library for POST failure. The library provides users with
knowledge base articles that relate to specific issues.

Server Troubleshooting Guides
The screen capture shows the results of searching for PowerEdge R660 troubleshooting
manuals.
Many of the server support pages provide troubleshooting guides. The
example shows a search for PowerEdge R660 troubleshooting guides.
For example, during a server reboot you get the following message during
POST: Memory set to minimum frequency. Searching the Dell support site
shows the knowledge base article that applies to the error.
Dell Update Package (DUP) and Firmware Updates

The screen captures show the Dell support drivers and downloads page for the
PowerEdge R660.
Using DUPs, engineers can update a wide range of system components
simultaneously and apply scripts to similar sets of Dell systems to bring
the system components to the same version levels. Many issues are
resolved by upgrading to the latest firmware.
The following software components are updated using a DUP:
• System BIOS
• System Firmware
• Device Drivers
Administrators and engineers can download the DUP and firmware
updates on the drivers and downloads page. The most common method of
identifying your product is using the Dell Service Tag. A general search
without using filters can result in hundreds of packages in the list. Use the
filters to narrow the results. The example searches for the operating
system driver package for a Windows Server 2022 deployment.

How to Replace - Videos
Important: The QRL videos do not have scripts or closed
captions.
The screen captures show searching for PowerEdge R760 hardware replacement videos.
The videos section of the Dell support site provides a suite of "How To
Replace" QRL videos.
When isolating an issue such as a cabling error, the component video is a
resource to help locate the cable ports and cable routing.
Also, the QR codes that are on the supported products provide access to
the commonly referenced videos, document reference materials, technical
support, and sales teams. The Dell Quick Resource Locator (QRL) is a
web page that allows users to quickly get at-the-box videos and
documentation supporting Dell products.

LAB - Downloading and Updating the iDRAC Firmware
Lab Exercise: You are investigating an issue on a
PowerEdge R660 server where there are failures when
trying to import a server configuration profile (SCP). You
notice that the server runs an older version of iDRAC
firmware. You consult the iDRAC9 release notes and see
that the SCP import issue is resolved in a later version. Upgrade
the iDRAC firmware to resolve the issue. Complete the
Downloading and Updating the iDRAC Firmware lab
activity.
Tip: If the iDRAC simulator is no longer open in a browser
tab, relaunch the simulator.

Visual Indicators
Left and Right Control Panels
Tip: Dell employees can use the Blink tool to identify and
define component indicators, such as the LED sequence on
system boards, PSUs, control panels, and so on.
Left and right control panel and the optional Quick Sync 2 control panel.
The left control panel (LCP) provides system health at a glance. The
system health and system ID indicator are on the left control panel of the
system. When troubleshooting the server, the first indication of a problem
that an administrator may see is a panel indicator that is amber.
For example, the administrator notices that the temperature indicator is
amber. The amber LED does not isolate the issue, but prompts the
administrator to inspect further by checking the logs and the thermal
outputs of the system components. The administrator may call Dell
support for assistance on actions to take.

See the participant guide for the indicator definitions.
Left Control Panel (LCP)
The table below provides the description and condition of each LCP
indicator.
Description              Condition
System ID                Solid blue [1], blinking blue [2], solid amber [3], or blinking amber [4]
Drive indicator          The indicator turns solid amber when there is a drive error.
Temperature indicator    The indicator turns solid amber when the system experiences a thermal error.
Electrical indicator     The indicator turns solid amber when the system experiences an electrical error.
Memory indicator         The indicator turns solid amber when a memory error occurs.
PCIe indicator           The indicator turns solid amber when a PCIe card experiences an error.
Optional Quick Sync 2    The icon indicates that the panel is the optional Quick Sync 2 panel.
[1] Indicates that the system is powered on, is healthy, and system ID mode
is not active. Press the system health and system ID button to switch to
system ID mode.
[2] Indicates that the system ID mode is active. Press the system health and
system ID button to switch to system health mode.
[3] Indicates that the system is in fail-safe mode.
[4] Indicates that the system is experiencing a fault.
Right Control Panel (RCP)
The RCP provides the following features:
• Power button with integrated power LED.
• USB 2.0 port.
• Micro-USB port for iDRAC Direct.
PSU Indicator Codes
Tip: The Blink tool can be used to identify and define
component indicators.
Power supply unit indicator.
The PSU portfolio includes intelligent features such as dynamically
optimizing efficiency while maintaining availability and redundancy. The
PSUs have diagnostic indicators.
Given the scenario: One of the PSUs is replaced on the R660 server. The
diagnostic LED blinks green five times and then stays off. The iDRAC UI
shows that the PSU has failed. After reseating the PSU, the behavior
remains. Service individuals can use the LED behavior to isolate the
issue. In this scenario, the behavior is due to a mismatched PSU.
Although the server supports different PSUs with different power outputs,
the PSUs in the server need to match.

See the participant guide for the PSU diagnostics indicator definitions.
Caution: All DC power supply unit (PSU) installations
require a qualified electrician. Do not attempt connecting to
DC power or installing grounds. All electrical wiring must
comply with applicable local or national codes and practices.
Server warranty does not include damage due to self
installation. All service must be approved by Dell. Read and
follow all safety instructions that come with the product.
Important: Due to cooling requirements, any open PSU
slots must have a blank installed.
LED color and behavior                         Function description
Solid green                                    PSU functioning
Blinking amber, 2 s on, 1 s off                PSU fault
Blinking green five times and then stays off   PSU mismatch
Blinking green                                 Firmware update in progress
Off                                            PSU power cable removed
Mid-Bay Hard Drive Indicators
XD, or extra disk, servers such as the PowerEdge R760xd2 may have
mid-bay drives with separate indicators. When investigating a disk error
or disk issue, slide the server from the rack to the service position and
view the mid-bay drive indicators.
Caution: The mid-bay should not be in the service position for
longer than five minutes. The Hard Disk Drive (HDD)
temperature LED blinks fast when the temperature is critical.
At this point, close the mid-bay and allow the system to return
to normal temperature.
Mid Hard Drive Indicators of R760xd2.
Mid-bay hard drive LED indicator    Hard drive temperature status
Off                                 Off
Solid                               Normal
Slow blink                          Warning
Fast blink                          Critical
System ID Button
PowerEdge R760 showing the System ID button location.
PowerEdge servers have a rear System ID button that can be used as an
alternate power button if the front power button is inoperable.
Engineers can use the System ID button for troubleshooting in the
following cases:
• If the system stops responding during POST, press and hold the
System ID button for more than five seconds to enter BIOS progress
mode.
• To reset the iDRAC (if not disabled in F2 iDRAC setup), press and hold
the button for more than 15 seconds.
To power on the system using the rear System ID button:
• Remove the top cover to activate the intrusion switch.
• Press and hold the System ID button for at least 16 seconds. This
resets the iDRAC and powers on the server, bypassing the front power
button.
Tip: To manually change the boot partition of the iDRAC in
case of an image corruption, hold the System ID button for
20 seconds, release for 5 seconds, and repeat 3 times. This
sequence marks the stand-by partition as the primary and
reboots the iDRAC.

System Board LED
Dell PowerEdge system board LED Indicator.
Individuals troubleshooting a Power-On Self-Test (POST) or hardware
issue can consult the system board LEDs, also called OmniVu LEDs. The
indicators provide status during the boot process. Each combination of
LEDs indicates a different server status.
PowerEdge servers may have different sequencing codes. See the
participant guide for an example of the OmniVu LED codes for the
PowerEdge XR11 and XR12 power sequencing.
The image details the OmniVu LED codes for Dell PowerEdge XR11 and XR12 power
sequencing.

System Board Jumper Settings
PowerEdge R660 showing the jumper location and default settings.
The software security features of a server include a system password and
a setup password. The password jumper enables or disables password
features and clears any passwords currently in use.
Given the scenario: An administrator cannot set a BIOS password. The
BIOS settings do not allow a password to be set. The administrator
suspects that the jumper may be set to disable the BIOS password
feature. For security, the server needs the BIOS password feature
enabled. The administrator must ensure the jumper is set across pins 2
and 4. View the participant guide for jumper setting definitions.

Important: Use caution when changing the BIOS settings of
a server. The BIOS interface is designed for advanced
users. Any changes in the settings might prevent the system
from starting correctly.
Important: For more information on how to disable a
forgotten password and assign a new system password by
moving the jumper for a physical server, see the
server-specific Installation Service Manual (ISM) available on the
Dell Support Library. Users need a corporate account
to access the FSMs.
Jumper       Setting     Description
PWRD_EN      Pins 2–4    The BIOS password feature is enabled.
PWRD_EN      Pins 4–6    The BIOS password feature is disabled. The BIOS password is now disabled and users are not allowed to set a new password.
NVRAM_CLR    Pins 3–5    The BIOS configuration settings are retained at system boot.
NVRAM_CLR    Pins 1–3    The BIOS configuration settings are cleared at system boot.
Jumper settings on the PowerEdge R760 system board.
Knowledge Check - Control Panel
PowerEdge R660 right control panel.
1. Refer to the graphic. After powering on the PowerEdge R660 server,
you notice that one of the indicators on the right control panel shows
solid amber. What is the next course of action you should take?
a. Check the iDRAC Lifecycle and System Event logs for memory
errors.
b. Check the network connections on the PCIe card.
c. Check the indicator LEDs on the PSUs to determine the faulty
PSU.
d. Check the temperature status of the server components to identify
the source of the excessive heat.

Recovery Options
iDRAC Default Settings
iDRAC default settings options.
The iDRAC is responsible for system profile settings and out-of-band
management. At times, there may be system conditions that can cause
the iDRAC to become unresponsive. When this occurs, resetting the
iDRAC back to factory defaults may help to resolve the issue.
The System Setup utility has three options available to reset iDRAC to
default settings.
• To preserve the iDRAC network settings and user accounts, use the
Reset iDRAC configuration to defaults option.
• To reset the server to factory settings and return the default
username and password to the shipping value on the Service Tag, use
the Reset iDRAC configuration to default all option.
• To reset the server to factory settings and reset the default
username and password to the shipping value of root/calvin, use the
Reset iDRAC configuration to default factory settings option.
Lifecycle Controller - Part Replacement Configuration
Part Replacement Configuration option in the Lifecycle Controller.
The Part Replacement feature in the Lifecycle Controller can automatically
update the firmware version or configuration of a new or replaced part.
For example, the service engineer replaces a faulty fPERC. The Part
Replacement Configuration feature updates the part firmware
automatically when the server boots.

Important: If Collect System Inventory On Restart is
disabled, the cache of system inventory information may
become stale if new components are added without
manually entering Lifecycle Controller after turning the
system on. In manual mode, press <F10> after the part
replacement during a system reset.
It is important to ensure that the following prerequisites are met before
configuring replaced parts.
• Click the Collect System Inventory On Restart option, so that
Lifecycle Controller automatically invokes Part Firmware Update and
Part Configuration Update when the system is started.
• Ensure that the Disabled option under Part Firmware Update and Part
Configuration Update is cleared.
• The previous component and the new device must be identified as the
same part.
• If the current adapter on the system is NPAR enabled and is replaced
with a new adapter, after the host server is turned on, press <F2> and
select System Setup > Device Settings and ensure that the NPAR is
enabled. NPAR must be enabled on the new adapter before using the
Part Replacement feature.
Easy Restore

The graphic shows the information that Easy Restore generates.
Given the scenario and question: The service engineer replaced the
system board on a PowerEdge server. How is the server information
retained or restored?
The Easy Restore feature automatically restores the service tag, licenses,
UEFI configuration, system configuration settings (BIOS, iDRAC, NIC) and
OEM ID (Personality Module).
Easy Restore Storage is part of the server front panel that can store up to
4 MB of data. All data is backed up in a backup flash device automatically.
If BIOS detects a new system board and the service tag in the backup
flash device, BIOS prompts the user to restore the backup information.
After the restore process completes, the system reboots.
See the participant guide for the steps to restore the service tag using
Easy Restore.

The steps to restore the service tag using Easy Restore are:
1. Turn on the system.
2. If BIOS detects a new system board, and if the service tag is present in
the backup flash device, BIOS displays the service tag, the status of
the license, and the UEFI Diagnostics version. Do one of the following:
a. * Press Y to restore the service tag, license, and diagnostics
information.
b. Press N to go to the Lifecycle Controller based restore options.
c. Press <F10> to restore data from a previously created Hardware
Server Profile.
3. After the restore process is complete, BIOS prompts to restore the
system configuration data.
Do one of the following:
a. * Press Y to restore the system configuration data.

b. Press N to use the default configuration settings.
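The prompt sequence above can be sketched as a small decision function. This is an illustrative model only, not BIOS code: the function name easy_restore_plan and its parameters are hypothetical, and it assumes that the configuration prompt in step 3 follows only when Y is pressed in step 2, which the steps imply but do not state.

```python
# Responses at the service tag prompt (step 2), taken from the steps above.
SERVICE_TAG_PROMPT = {
    "Y": "Restore the service tag, license, and diagnostics information",
    "N": "Go to the Lifecycle Controller based restore options",
    "F10": "Restore data from a previously created Hardware Server Profile",
}

# Responses at the system configuration prompt (step 3).
CONFIG_PROMPT = {
    "Y": "Restore the system configuration data",
    "N": "Use the default configuration settings",
}


def easy_restore_plan(new_board_detected, tag_in_backup,
                      tag_key="Y", config_key="Y"):
    """Return the ordered actions for the given key presses (hypothetical)."""
    # BIOS only offers the restore when both conditions are true.
    if not (new_board_detected and tag_in_backup):
        return []
    tag_key = tag_key.upper()
    plan = [SERVICE_TAG_PROMPT[tag_key]]
    # Assumption: the configuration prompt follows only the Y path.
    if tag_key == "Y":
        plan.append(CONFIG_PROMPT[config_key.upper()])
    return plan


if __name__ == "__main__":
    for action in easy_restore_plan(True, True, tag_key="Y", config_key="N"):
        print(action)
```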
Export and Import Server Configuration Profile
The Server Configuration Profile (SCP) enables administrators or service
engineers to import and export a server configuration. SCP files are
typically used to capture a gold server configuration, but can also be
used to recover a server configuration. For example, if the network
configuration of a server is unintentionally deleted, the administrator can
import an SCP to restore the information.

Export and Import
The graphic shows the SCP export page following an export.
Administrators can deploy an SCP to multiple servers, greatly reducing the
time to bring servers online.
The export operation collects the configuration information for BIOS,
iDRAC, RAID, NIC, FC-HBA, System, and Lifecycle Controller. The export
stores the information in a single file that is copied to a network share.
The import operation imports the file from a network share. Import applies
the previously saved or updated configurations that are contained in the
file to a system.

Tip: Users can manage the Server Configuration Profile
feature using the iDRAC UI, RACADM, and Redfish.
The SCP requires administrative privileges to perform an export and
import.
The types of exports are:
• A basic export uses a snapshot of the SCP.
• A replacement export restores to a known baseline.
• A clone export imports the SCP to another server with identical
hardware.
Many of the SCP import fields are similar to the SCP export function.
Users can select a graceful, forced, or no reboot option. Users can also
set a wait time before the server reboots after importing the SCP.
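Because the SCP feature can be driven through Redfish, the export and import choices described above can be expressed as request bodies. The sketch below only builds the requests and does not contact an iDRAC. The action paths, the field names (ExportUse, ShutdownType, TimeToWait, ShareParameters), and the mapping of "basic" to "Default" are assumptions based on the Dell OEM Redfish actions and must be confirmed against the iDRAC Redfish API guide for the installed firmware. The share address and file name are placeholders.

```python
import json

# Assumed Dell OEM Redfish action paths; confirm in the iDRAC Redfish API guide.
BASE = "/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Oem"
EXPORT_ACTION = BASE + "/EID_674_Manager.ExportSystemConfiguration"
IMPORT_ACTION = BASE + "/EID_674_Manager.ImportSystemConfiguration"

# Export types from the text mapped to assumed ExportUse values.
EXPORT_USE = {"basic": "Default", "replacement": "Replace", "clone": "Clone"}
# Reboot options from the text mapped to assumed ShutdownType values.
SHUTDOWN_TYPE = {"graceful": "Graceful", "forced": "Forced", "none": "NoReboot"}


def share_parameters(ip, share_name, file_name, target="ALL"):
    """Describe the network share that holds the SCP file (assumed fields)."""
    return {
        "IPAddress": ip,
        "ShareType": "NFS",
        "ShareName": share_name,
        "FileName": file_name,
        "Target": target,
    }


def build_export_request(export_type, share):
    """Return (path, body) for an SCP export of the given type."""
    body = {
        "ExportFormat": "XML",
        "ExportUse": EXPORT_USE[export_type],
        "ShareParameters": share,
    }
    return EXPORT_ACTION, body


def build_import_request(reboot, wait_seconds, share):
    """Return (path, body) for an SCP import with a reboot option and wait time."""
    body = {
        "ShutdownType": SHUTDOWN_TYPE[reboot],
        "TimeToWait": wait_seconds,
        "ShareParameters": share,
    }
    return IMPORT_ACTION, body


if __name__ == "__main__":
    # Placeholder share details (192.0.2.0/24 is a documentation range).
    share = share_parameters("192.0.2.10", "/exports/scp", "r660-gold.xml")
    path, body = build_export_request("clone", share)
    print(path)
    print(json.dumps(body, indent=2))
```

A script would send each body as an authenticated HTTPS POST to the iDRAC address followed by the returned path, using an account with administrative privileges.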
LAB - Exporting a Server Configuration Profile
Lab Exercise: You are installing four PowerEdge R660
servers. The customer wants the configuration of all four
servers to match the configuration of another R660. You
need to export the server configuration profile that will be
used to set the baseline configuration of the four new
servers. Complete the Exporting a Server Configuration
Profile lab activity.


Logs
Lifecycle Controller Logs
Logs are a primary tool for assessing system health, isolating errors,
and verifying changes. Typically, when addressing an issue, the logs are
viewed before actions are taken.
Lifecycle Controller logs provide the history of changes that relate to
components installed on a managed system. The log is delivered as part
of the iDRAC and embedded Unified Extensible Firmware Interface (UEFI)
applications.
Log Activities
The following events and activities are logged:
Activity         Description
System Health    Display all alerts that are related to hardware within the system chassis.
Storage          Display alerts related to the storage subsystem.
Updates          Display alerts generated due to firmware and driver upgrades and downgrades.
Audit            Display audit logs.
Configuration    Display alerts that relate to hardware, firmware, and software configuration changes.

Viewing Lifecycle log using web interface
To view the Lifecycle Logs:
1. Click Maintenance.
2. Click Lifecycle Log.
This image shows the steps to view the Lifecycle Logs.
Filtering Lifecycle logs
Users can filter the logs by category, severity, keyword, or date range.
1. On the Lifecycle Log page, click Filter.
2. Select the filtering criteria from the drop-down options: Severity,
Log Type, Date Range, and Keyword Search.

This image shows the steps to filter the Lifecycle Logs.
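The same filtering logic can be applied offline to an exported log. The sketch below is a minimal illustration under stated assumptions: the LogEntry fields, the filter_log function, and the sample entries are all hypothetical and do not reflect the actual export format or real log messages.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEntry:
    """Hypothetical shape of one Lifecycle log entry."""
    timestamp: datetime
    severity: str
    category: str
    message: str


def filter_log(entries, severity=None, category=None, keyword=None,
               start=None, end=None):
    """Keep entries that match every criterion that was supplied."""
    matches = []
    for entry in entries:
        if severity and entry.severity.lower() != severity.lower():
            continue
        if category and entry.category.lower() != category.lower():
            continue
        if keyword and keyword.lower() not in entry.message.lower():
            continue
        if start and entry.timestamp < start:
            continue
        if end and entry.timestamp > end:
            continue
        matches.append(entry)
    return matches


# Invented sample data for illustration only.
SAMPLE_LOG = [
    LogEntry(datetime(2023, 5, 1, 9, 0), "Critical", "System Health",
             "Sample: power supply redundancy lost"),
    LogEntry(datetime(2023, 5, 2, 14, 30), "Informational", "Updates",
             "Sample: firmware update completed"),
    LogEntry(datetime(2023, 5, 3, 8, 15), "Warning", "Storage",
             "Sample: drive predictive failure reported"),
]

if __name__ == "__main__":
    for entry in filter_log(SAMPLE_LOG, severity="critical"):
        print(entry.timestamp.isoformat(), entry.category, entry.message)
```

Criteria that are left as None are ignored, which mirrors leaving a filter drop-down unset in the web interface.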
Adding comments to Lifecycle logs
To add comments to the Lifecycle logs:
1. Click the + icon for the required log entry. The Message ID details are
displayed.
2. Enter the comments for the log entry in the Comment box.

This image shows the steps to add comments to the Lifecycle logs.
Exporting Lifecycle Controller logs using web interface
To export the Lifecycle Controller logs for troubleshooting and log
retention purposes:
1. On the Lifecycle Log page, click Export.
2. Select any of the following options:
a. Network Share: Export the Lifecycle Controller logs to a shared
location on the network.
b. Local: Export the Lifecycle Controller logs to a location on the local
system.
This image shows the steps to export the Lifecycle Controller logs.
System Event Logs (SEL)
When a system event occurs, it is recorded in the SEL. Technical support
may ask service engineers or administrators to download the SEL. Much
like the Lifecycle logs, the SEL is one of the first places to check for and
verify issues.
The SEL page displays a system health indicator, a timestamp, and a
description for each event logged.
To view the SEL in the iDRAC web interface:
1. Go to Maintenance.
2. On the Maintenance page, select System Event Log.
This image shows the steps to view the SEL.
POST Code, Intrusion, Last Crash Screen
POST Code, Intrusion, and Last Crash Screen are troubleshooting tools
that the iDRAC provides. Each tool automatically provides a report when a
system event occurs. Administrators and engineers can use the
information when escalating issues to technical support.

Users can access the tools by going to iDRAC Dashboard ->
Maintenance -> Troubleshooting.
1: The POST Code option shows the last system POST code (in
hexadecimal) before booting the operating system of the managed
system. The POST code helps to detect pre-video errors, report fatal
errors, and analyze the system failures during BIOS POST, particularly the
No POST No Video situations. The fatal error codes are used to report all
the fatal POST errors.
2: The Intrusion option is related to the chassis intrusion switch. It
provides information about whether the server cover is removed or not
seated correctly. This issue can lead to the system overheating and
potential shutdown issues.
3: The Last Crash Screen option provides information about the events
leading to the system crash. This information is saved in the iDRAC
memory and is remotely accessible. The Last Crash Screen feature is
available with iDRAC Express and Enterprise licenses.
The last crash screen capture is only available with the Windows
operating system, and the user must have installed OpenManage Server
Administrator. The last crash screen capture does not work with Linux or
ESXi operating systems. The purpose of this feature is to display the blue
screen that appears if the Windows operating system fails.
SupportAssist Enterprise Overview
SupportAssist Enterprise can be used as a stand-alone application or with
OpenManage Essentials (OpenManage Enterprise) or Microsoft System
Center Operations Manager (SCOM). SupportAssist Enterprise can be
downloaded on either a Windows or Linux management server.
Important: After April 2022, SupportAssist Enterprise 2.0.70
capabilities such as device management, case creation, and
alert monitoring are not available. To continue to manage
and monitor devices, users must upgrade to secure connect
gateway.

Dell SupportAssist Enterprise at work monitoring and reacting to a PowerEdge MX7000
Modular System hardware issue.
SupportAssist Enterprise (SAE) [5] monitors hardware issues, including
predictive drive failures, on devices that are being managed using
Microsoft System Center Operations Manager (SCOM) or OpenManage
Enterprise.
• When a hardware issue is detected, SupportAssist Enterprise
automatically opens a support case with Technical Support and sends
an email notification to the user.
• SupportAssist Enterprise automatically collects the system state
information that is required for troubleshooting the issue and sends it
securely to Dell Technologies.
• The collected system information helps Technical Support to provide
an enhanced, personalized, and efficient support experience.
• SupportAssist Enterprise capability also includes a proactive response
from Technical Support to resolve the issue.
[5] SAE is an application that can be installed on a Windows server, as a
virtual appliance, or as a plug-in for OpenManage Enterprise.
Tip: SupportAssist Enterprise monitors up to 15,000 server,
storage, and networking devices.
Secure Connect Gateway Overview
Secure connect gateway monitors devices and proactively detects
hardware issues. The administrator may receive a notification when a
case is generated.
When a hardware issue is detected, the gateway automatically collects the
system state information that is required for troubleshooting the issue.
Secure connect gateway auto-dispatches parts and service engineers to
the site for certain Dell devices and components.
Tip: An adapter can be deployed to monitor devices already
being managed by OpenManage Enterprise.
Deep Dive: Visit the Secure Connect Gateway page to
learn more.
Secure connect gateway 5.x is offered as an application that can be
installed on Windows or Linux, or as a virtual appliance that is deployed
onto either a VMware ESXi or Microsoft Hyper-V virtual infrastructure.
Secure connect gateway discovers Dell devices [6] to provide alert
monitoring, log gathering, and case generation.
Secure connect gateway architecture reacting to a PowerEdge MX7000 modular system
hardware issue.
[6] Supported products include server, storage, chassis, networking, data
protection devices, virtual machines, and converged or hyper converged
appliances.

Gathering SupportAssist Logs
SupportAssist continually monitors the configuration data and usage
information of managed hardware and software. Data that is collected by
SupportAssist includes:
• System Information to include hardware, software, sensor, and
Lifecycle Controller data.
• Storage Logs to capture hard drive inventory, events, and
configuration options related to storage.
• Operating system and Application Data to include OS-related
information. However, operating system data can only be collected
when the iDRAC Service Module (iSM) is installed and running.
Download the iSM from downloads.dell.com.
• Debug Logs to include iDRAC debugging related information.
• Telemetry Reports to include telemetry logs that consist of detailed
parametric data about sensors, thermals, log files, and more.
The iDRAC provides a SupportAssist utility for gathering server
information that enables support services to resolve platform and system
problems. SupportAssist helps monitor the system and data center. As an
example, technical support may ask the administrator or engineer to
generate a SupportAssist package to further analyze data about sensors.
Administrators and service engineers can export a SupportAssist
collection to a location on the host (local) or to a shared network location
such as FTP, HTTP, or file share.

Steps to generate SupportAssist logs
Gathering SupportAssist Logs.
1. To generate the SupportAssist logs, go to Maintenance.
2. On the Maintenance page, select SupportAssist.
3. Click Start a Collection to generate the SupportAssist log.
Tip: A SupportAssist collection takes more than 10 minutes to
complete when performed from the OS or iDRAC while OMSA
10.1.0.0 is running alongside it.
To generate the operating system and application logs,
install and run the iDRAC Service Module on the host
operating system. See the participant guide for information
about the data collected.
Server Monitoring
Technical Support Report (TSR) log generated by SupportAssist on a PowerEdge
R760xa.
Server monitoring reviews and analyzes operation-related processes such
as performance, security, and identifying issues.
As an example, the administrator has email notification configured to send
alerts when issues occur. The administrator receives an email notification of
a PSU issue and uses SupportAssist to gather information. SupportAssist
allows engineers and technical support to view the server without the need
to log in. In this example, the tool identifies the PSU, but not the cause of
the issue. The issue might be that the rack PDU is faulty or simply that the
power cable is disconnected.
Server monitoring tools commonly used:

• Integrated Dell Remote Access Controller (iDRAC)
• OpenManage Enterprise (OME)
• Dell Open Server Manager (OSM)
• SupportAssist
• Dell OpenManage Server Administrator (OMSA)
• Windows-Integrated Monitoring tools:
− Server Manager
− Task Manager
− Resource Monitor
Go to: Dell Support Site and read the Support for Dell EMC
OpenManage Plug-in for Nagios Core article to learn more
about the OpenManage plug-in (Nagios Core).
LAB - Performing a SupportAssist Collection
Lab Exercise: You have installed a PowerEdge R660
server. You notice that there are health and status errors,
but cannot immediately determine the reason for the errors.
Technical support asks that you gather server information
using SupportAssist. Complete the Performing a
SupportAssist Collection lab activity.



Fault Isolation Tools
Configuration Validation Overview
Configuration validation is a vital tool for troubleshooting issues related to
cabling. For example, the service engineer replaces a faulty backplane.
When the server is powered on, Configuration Validation runs and
discovers that a slimline cable is improperly connected to a backplane
cable port. The tool achieves this by comparing the current configuration
with the expected configuration.
Configuration validation compares backplane memory maps against a list
of pre-qualified configurations each time the host powers on.
Pre-qualified validation elements:

• Pre-qualified configurations are defined by the Portfolio Platform
Configuration Matrix.
• Each platform device is stored as a configuration element that includes
information such as riser number, backplane feature, and cable
connector.
• The iDRAC maintains a table of valid configurations.
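The comparison described above can be pictured with a minimal sketch. It assumes that each configuration is a set of element names and that the closest valid configuration is the one that differs in the fewest elements. The element names and the matching rule are hypothetical and do not represent the iDRAC implementation.

```python
from typing import FrozenSet, Iterable, Tuple

Config = FrozenSet[str]


def compare_to_valid(current: Config, valid: Iterable[Config]) -> Tuple[Config, Config, Config]:
    """Return (closest valid configuration, missing elements, unexpected elements)."""
    # Pick the valid configuration that differs from the detected one
    # in the fewest elements (smallest symmetric difference).
    closest = min(valid, key=lambda cfg: len(cfg ^ current))
    return closest, closest - current, current - closest


# Hypothetical element names, for illustration only.
VALID_CONFIGS = [
    frozenset({"riser1", "bp_cable_sa1", "bp_cable_pa1"}),
    frozenset({"riser1", "bp_cable_sa1"}),
]
detected = frozenset({"riser1", "bp_cable_pa1"})

closest, missing, unexpected = compare_to_valid(detected, VALID_CONFIGS)
print("missing:", sorted(missing), "unexpected:", sorted(unexpected))
```

In this sketch, an element that is expected but not detected corresponds loosely to a missing cable or component, and an element that is detected but not expected corresponds loosely to a wrong connection.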
Go to: PowerEdge R660 Portfolio Platform Configuration

Matrix example.

Instructor Note: Mention what config validation is used for
in the real world. Manufacturing teams use it to
check for defects when the systems are assembled in the
factory. It is also used when the systems are shipped to
customers to check that the components or cables have not
come loose in transit. Tech support uses it after a field
service event, especially after system board replacement.
Configuration Validation was introduced with the PowerEdge 15G servers.
Connector and Cable Naming
16G uses a new naming scheme [7]. The naming affects the system board,
peripheral devices, and risers. Connector and cable naming is key to
isolating faulty cables and ports. Service engineers must be able to
identify a cable or port based on the nomenclature used in log entries.
Many errors are the result of mis-cabling.
Select each tab for more information:
Planar Naming Rules
System Board High-Speed I/O (HSIO) connectors connect to devices and
backplanes from the source PCIe, SATA, XGMII, or other HSIO fabrics.
The system board naming scheme includes a basic connector name,
connector number, source device, and port fabric type.
[7] The new scheme accommodates the increase in supported
components.

The graphic defines the naming of Mini Cool Edge I/O Connector 5 (MCIO
Connector 5) on a system board.
PowerEdge C6620 SIL.
Peripheral Device Naming Rule
Peripheral device HSIO connectors follow a format similar to that of the
system board naming rule. For example, consider the nomenclature in the
following error:
HWC8010 The System Configuration Check operation resulted in the
following issue: Config Error: Backplane Cable CTRL_SRC_SA1 and
BP_DST_SA1
Cabling example for naming convention of backplane, fPERC, and the system board.
Where BP_DST_SA1 is:

• BP - Device type
• DST - Direction
• SA1 - Fabric type
Device Types            Fabric Types
BP - Backplane          P - PCIe
CTRL - BOSS card        S - SAS or SATA
CTRL - PERC             X - XGMII
CTRL - Bridge card      U - UPI
CTRL - Cabled Riser     Z - Gen-Z
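As an illustration of the naming rule above, the following minimal sketch splits a peripheral connector name into its three fields. It assumes that the fields are separated by underscores and that the first letter of the third field selects the fabric type from the table. Both points are inferred from the BP_DST_SA1 example and are not a documented parsing rule.

```python
from typing import NamedTuple

# Fabric letters taken from the table above.
FABRIC_TYPES = {"P": "PCIe", "S": "SAS or SATA", "X": "XGMII", "U": "UPI", "Z": "Gen-Z"}


class ConnectorName(NamedTuple):
    device: str        # for example BP or CTRL
    direction: str     # for example SRC or DST
    fabric_field: str  # for example SA1
    fabric: str        # fabric type looked up from the first letter


def parse_connector_name(name: str) -> ConnectorName:
    """Split a name such as BP_DST_SA1 (assumed format: DEVICE_DIRECTION_FABRIC)."""
    device, direction, fabric_field = name.split("_")  # ValueError if not three fields
    fabric = FABRIC_TYPES.get(fabric_field[0], "unknown")
    return ConnectorName(device, direction, fabric_field, fabric)


print(parse_connector_name("BP_DST_SA1"))
```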

Riser Naming Rules
PowerEdge R660 example of cabled riser connections.
Riser HSIO connector naming follows a format similar to that of the
system board naming rule.
Each cabled riser has four designated connector numbers. The riser slot
number determines the numbering. For example, riser 1 uses connectors 17
through 20, and riser 2 uses connectors 21 through 24. Numbering starts
with the lowest connector number.
If a riser has fewer than four connectors, the unused numbers are
skipped. For example, if riser 1 has two connectors, SL19 and SL20 are
skipped.
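The numbering rule can be sketched as follows. The guide gives the ranges only for riser 1 and riser 2, so extending the same pattern to higher riser numbers is an assumption. The SL prefix follows the SL19 and SL20 example above.

```python
from typing import List

CONNECTORS_PER_RISER = 4     # each cabled riser reserves four numbers
FIRST_RISER_CONNECTOR = 17   # riser 1 starts at connector 17


def riser_connector_names(riser: int, connectors_present: int = CONNECTORS_PER_RISER) -> List[str]:
    """List the connector names used by a riser, lowest number first."""
    if riser < 1:
        raise ValueError("riser numbers start at 1")
    if not 0 <= connectors_present <= CONNECTORS_PER_RISER:
        raise ValueError("a riser has at most four connectors")
    # Assumption: every riser reserves the next block of four numbers.
    start = FIRST_RISER_CONNECTOR + (riser - 1) * CONNECTORS_PER_RISER
    return [f"SL{number}" for number in range(start, start + connectors_present)]


print(riser_connector_names(1, connectors_present=2))
print(riser_connector_names(2))
```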

Caution: The connector name font size on the printed circuit
board silkscreen [8] is small and may be unclear. Accidentally
swapping connections, such as SL7 for SL8, can occur.
Troubleshooting includes verifying that the cables are plugged in
properly.
Tip: See the Connector Naming job aid downloadable from
the course resources to learn more.
Log Errors
Configuration validation errors are based on either missing or incorrect
configuration elements. Config Error, Config Missing, and Comm Error
are all part of the HWC8010 error message and appear in the POST
text during boot and in the Lifecycle log and the System Event Log. The
error message can direct service engineers to the suspect cable and
connector.
[8] The silkscreen is the layer of ink on a printed circuit board
component used for identification.

Select each tab for more information:
Error Types
The table shows examples of error types and their descriptions.
Error Type               Description
Config Error             A configuration error may be associated with a
(Configuration Error)    mis-configuration or a wrong configuration.
                         Engineers can use the error output to locate and
                         check the cable and component.
Config Missing           Engineers may see the configuration missing error
                         when the cable is not connected or is damaged.
                         Cables can come loose during shipping and can be
                         overlooked when replacing a faulty component. The
                         output of the error message can help isolate the
                         cable.
Comm Error               Component cables have sideband communication.
(Communication Error)    Sideband cables are common in GPU
                         implementations. The communications error can
                         come from components that exist, but do not
                         communicate. Typically, reseating the component
                         and cable can resolve the issue.
Error Message
The table shows examples of error codes and recommended responses.
Error Code  Example Log Message (LC, SEL, POST)     Initial Action
HWC8010     The system configuration check          Check for proper cable
            operation resulted in the following     connection and component
            issue: Config Error: Backplane Cable    placement. Reseat cables.
            CTRL_SRC_SA1 and BP_DST_SA1.
HWC8011     The system configuration check
            operation resulted in multiple
            backplane cable issues.
HWC8012     Multiple configuration-related issues   No action is required.
            on the device <arg> are resolved.
HWC8013     A configuration-related issue on the    No action is required.
            device <arg> is resolved.
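To show how the HWC8010 example above points to the suspect cable, the following minimal sketch extracts the error type and the connector names from a log message. It assumes that Config Missing and Comm Error messages follow the same layout as the Config Error example and that connector names use only the SRC and DST directions. Neither point is confirmed by this guide.

```python
import re
from typing import List, Optional, Tuple

# Error types named in this guide, followed by the free-text detail.
ISSUE_PATTERN = re.compile(r"(?P<kind>Config Error|Config Missing|Comm Error):\s*(?P<detail>.+)")
# Assumed connector name shape: DEVICE_DIRECTION_FABRIC, for example BP_DST_SA1.
CONNECTOR_PATTERN = re.compile(r"\b[A-Z]+_(?:SRC|DST)_[A-Z]+\d+\b")


def parse_config_issue(message: str) -> Optional[Tuple[str, List[str]]]:
    """Return (error type, connector names), or None if no error type is found."""
    match = ISSUE_PATTERN.search(message)
    if match is None:
        return None
    return match.group("kind"), CONNECTOR_PATTERN.findall(match.group("detail"))


example = (
    "The system configuration check operation resulted in the following "
    "issue: Config Error: Backplane Cable CTRL_SRC_SA1 and BP_DST_SA1."
)
print(parse_config_issue(example))
```

The two names returned for the example identify both ends of the cable to check: the controller source connector and the backplane destination connector.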
Important: Minimum to POST configurations may generate many errors. This
is because a Minimum to POST configuration is only a troubleshooting
step.
Boot and Crash Capture
The boot capture enables administrators and service engineers to view

the video recording of the last three boot cycles. Technical support can
analyze the boot capture to help troubleshoot issues.
The boot capture timestamp records the sequence end time. This occurs
when the capture reaches 2 MB in size or the server is rebooted.

iDRAC UI
Boot capture files shown under the Troubleshooting tab in the iDRAC.
The list displays the currently active boot capture file. While the update is
in progress, click Refresh to view the latest timestamp for the boot
capture file.
Users can play the files directly from the iDRAC Enterprise or save them to
a location on their system.
To configure the boot capture video settings, select one of the following
options and click Apply.
• Disable - Boot capture is disabled.
• Capture until buffer full - Boot sequence is captured until the buffer
is full.
• Capture until end of POST - Boot sequence is captured until the end of
POST.
Video
The How To video demonstrates running a boot capture.
iDRAC Diagnostics
Diagnostics Console command shown under the Maintenance tab in the iDRAC.
The Diagnostics Console command page helps identify issues related to
the iDRAC hardware.
For example, suppose the administrator can no longer manage the
system through the dedicated iDRAC port. Using iDRAC Direct, the
engineer can use the diagnostic console to run commands and inspect the
port settings.
The iDRAC provides a list of available troubleshooting commands that
users can enter into the diagnostic console. These commands provide the
user with data related to troubleshooting.
Diagnostic commands:
• arp
• ifconfig
• netstat

• ping
• gettracelog
• ping6
Hardware Diagnostics
The Hardware Diagnostics utility is part of the Lifecycle Controller.
The utility has a physical (as opposed to logical) view of the
attached hardware, enabling it to identify hardware problems that the
operating system and other online tools cannot identify.
To help identify hardware issues, deployment and service engineers can

run the Hardware Diagnostics utility to validate that the attached hardware
is functioning properly.
The Hardware Diagnostics utility can validate the memory, I/O devices,
CPU, physical disk drives, and other peripherals.

Tip: The ePSA (Pre-boot System Assessment) procedure

depends on the server generation.
Minimum to POST
Troubleshooting a difficult problem may require removing components to
isolate an issue. The server must have a minimum configuration to
achieve Power On Self Test (POST). Service engineers may be asked to
configure the server in a Minimum to POST hardware configuration. After
verifying that the system can achieve POST, the engineer can add the
removed components one at a time to identify the faulty component.
The required components vary based on the server model:
• The typical minimum to POST configuration for rack servers is PSU1,
CPU1, memory module in the A1 slot, RIO, LOM, and the default riser
without expansion cards.
• For tower servers, the typical minimum to POST configuration is PSU1,
CPU1, and memory module in the A1 slot.
• For modular servers, the minimum to POST configuration is CPU1,
Mezz A, and memory module in the A1 slot.
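The add-back procedure described above can be sketched as a simple loop. The can_post callback is hypothetical and stands in for the engineer powering on the server and observing whether it completes POST. The component names in the example are placeholders.

```python
from typing import Callable, Iterable, List, Optional


def find_faulty_component(
    minimum: Iterable[str],
    removed: Iterable[str],
    can_post: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add removed components back one at a time until POST fails."""
    installed = list(minimum)
    if not can_post(installed):
        raise RuntimeError("the minimum configuration itself does not POST")
    for component in removed:
        installed.append(component)
        if not can_post(installed):
            return component  # the last component added is the suspect
    return None  # every component was added back and the server still POSTs


# Simulated check for illustration: the server fails POST whenever GPU2 is installed.
def simulated_post(config: List[str]) -> bool:
    return "GPU2" not in config


rack_minimum = ["PSU1", "CPU1", "DIMM A1", "RIO", "LOM", "default riser"]
suspect = find_faulty_component(rack_minimum, ["riser 2", "GPU1", "GPU2", "NIC"], simulated_post)
print(suspect)
```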
No Power, No Video, No POST
The following steps help resolve no power, no video, and no POST
issues.
No Power
1: Swap the AC power cable with a known-good power cable. If the system
works with a known-good AC power cable, replace the power cable
(optional).
2: Reset the power supply.
a. Verify that the power source is working properly by connecting a
device that draws a similar amount of power.
3: Replace the power supply. The server does not turn on by using the
front ear node.
4: Ensure the proper power is going to the chassis.
5: Ensure all power supplies are firmly seated, power cables are
connected, and both power supplies are operating.
6: Turn on the server by using the power button.
No Video
1: Check the cable connections (power and display) to the monitor.
2: Check the video interface cabling from the system to the monitor.
Servers have two VGA ports. The front VGA port is on the right control
panel and the rear VGA port is on the RIO board. If the system is liquid
cooled, there is no rear VGA port.
3: Run the LCD Built-in Self Test (BIST).
No POST
The Power On Self Test (POST) is a series of diagnostic tests that run
automatically when the system is turned on. POST tests memory, the
keyboard, and the disk drives. If the test is successful, the computer
boots itself; otherwise, the system displays an LED error or an error
message on the LCD panel.
1: Check the LCD screen or LED indicators for any error messages.
2: Ensure the server is turned on by verifying the power supply LED
light bar.
3: Before handling server components or cables, take all the
precautions to avoid ESD damage.
4: Disconnect all the cables from the server including the power cable.
5: Reconnect the power and video cable only.
6: Attempt to POST the server.
7: Disconnect the hard drives, optical drives, and tape drives from the
server and attempt to POST the server.
8: Reseat the control panel connector.
9: Ensure the processors and heat sinks are seated correctly.
10: If the server does not complete the POST, clear the NVRAM using the
jumper.

Memory
Memory Event Logging
When analyzing memory errors, an uncorrectable error generates a

message to replace the DIMM. Correctable errors are typically resolved
when the server reboots. Service engineers use the information in the log
entry to identify and replace the DIMM.
The 15G and 16G memory event logging uses a common set of event
messages [9] to describe the recommended action instead of describing the
underlying event.
PowerEdge servers include a memory error logging feature that provides
error tracing for suspected failing DIMMs.
• Confirmation of an individual memory DIMM with one or more errors.
• Documentation on the error locations (failing DRAM device and cell)
and type of errors.
• Enablement of population-wide statistical data providing individual
DIMM part numbers, and BIOS revisions.
Select each tab to learn about types of error logging changes:
[9] BIOS chooses the severity of the message by selecting the appropriate
IPMI sensor that maps to the severity. Dedicated sensors have identical
event data parameters for each level of severity. Details of what generated
the event (for example, a PPR self-healing failed) are encoded as a
‘debug code’ in the event message body.

Correctable Error (CE) Logging Changes
Correctable Error logging is disabled by default. However, based on

individual customer needs it can be enabled in the BIOS.
When the BIOS generates host visible correctable errors, the events are
logged in the Technical Support Report (TSR) and Serial Presence Detect
(SPD).
The system receives a summary of single bit error correcting data per
DIMM once per day.
Two different error log values:
1. If the error log values are greater than the existing values, the
prior values are overwritten [10]. This is subject to a threshold.
2. If the on-DIMM error log is below the reported threshold, it reports zero
errors.
Uncorrectable Error (UCE) Logging Changes
Two error message codes for uncorrectable errors are:
1. MEM7114 message: The error indicator recommends replacing the
DIMM with a MEM7114 [11] message whenever an uncorrectable error is
detected [12] or consumed [13]. The recommendation is made regardless
of PPR error correction.
2. MEM5100 message: A MEM5100 [14] message is generated when error
events occur in a mirrored memory [15] region. To trigger a Mirror
Failover event, one DRAM must trigger an Uncorrectable Error. The
failed device needs replacement as part of the mirror remediation.
Because of the mirroring remediation, there should be no data loss when
running in Fault Resilient mode.
Tip: Only host (CPU) ECC correctable errors are included in
the TSR memory log. On-die ECC single-bit errors are not.
[10] The log is overwritten with the highest historical value that is
registered to diagnose the error. It is overwritten because of the space
constraint in the SPD rather than in the TSR.
[11] This error message code indicates a critical severity and recommends
the action of contacting support and requesting to replace the parts
(DIMM).
[12] Memory patrol scrub uses the CPU memory controller to periodically
scan DRAM and correct any single-bit errors that it encounters. Demand
Scrub occurs when the memory controller encounters a correctable error
during a regular run-time read transaction and writes back corrected data.
[13] Consumed by reading or writing to the impacted area.
[14] This error message code indicates an informational message and
appears when an uncorrectable error occurs in mirrored memory. It does
not indicate any action, as it is just an informational message.
[15] A mirrored region where data integrity is maintained by a mirror copy.
In the event of a UCE, the device where the error occurred is identified
for replacement. Mirrored memory is resilient to uncorrectable errors
because it has two copies of the data.
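The two uncorrectable error message codes described above can be summarized as a small lookup, shown here as a minimal sketch. The severities and actions paraphrase this section. The sample log entry text is hypothetical and is not actual Lifecycle log output.

```python
import re
from typing import Optional, Tuple

# Severity and action per message code, paraphrased from this section.
MEMORY_EVENTS = {
    "MEM7114": ("critical", "Contact support and request a DIMM replacement."),
    "MEM5100": (
        "informational",
        "Uncorrectable error in a mirrored memory region; "
        "the device where the error occurred is identified for replacement.",
    ),
}
CODE_PATTERN = re.compile(r"\bMEM\d{4}\b")


def triage_memory_event(log_entry: str) -> Optional[Tuple[str, str, str]]:
    """Return (code, severity, action) for a known code, otherwise None."""
    match = CODE_PATTERN.search(log_entry)
    if match is None or match.group(0) not in MEMORY_EVENTS:
        return None
    code = match.group(0)
    severity, action = MEMORY_EVENTS[code]
    return code, severity, action


# Hypothetical log entry text, for illustration only.
print(triage_memory_event("MEM7114: sample entry for the DIMM in slot A1"))
```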

GPU
Updating the NVIDIA Drivers
A common way to resolve issues is to upgrade the component to
the latest firmware revision. When troubleshooting GPU-related issues,
always verify that the server is running the latest firmware version.
Scenario: An administrator contacts a Technical Support Engineer stating

that the graphics processing units (GPUs) are not compatible in their latest
environment due to unknown errors. The administrator wants to know the
information about the latest version of the GPU driver that should be
installed in their environment.
Solution:
1. Direct the customer to the NVIDIA driver reference portal.
2. Guide the customer to input the system details: Product Type, Product
Series, Product, Operating System, and CUDA Toolkit version.
3. Direct the user to click the Search button.
4. Request the user to download the latest version of the driver file and
install in the environment.

Tip: For more information on toolkit and driver installation,
download the NVIDIA GPU CUDA Toolkit & Driver
Installation Procedure from the references.
Important: The GPU drivers available on the Dell support

site are used to validate the GPUs in the chassis. It is
possible that a later version of the driver might be available
in the NVIDIA portal.
GPU XID Errors
Scenario: A customer contacts a Technical Support Engineer with a failed
GPU based on the System Event Logs (SEL). The customer indicates that
XID fault #9 is produced in the SEL. The customer wants to identify the
cause of the issue. This scenario shows how the available resources help
identify the failure before acting on the error.

Solution:
1. Collect the XID number from the user. In this scenario, the XID number
is 9.
2. Open the NVIDIA GPU Management and Deployment guide.
3. Verify the cause of the issue by referring to the XID number. In this
scenario, the cause is mapped to "driver error". Problems in the core of
the driver cause driver errors.
4. Direct the customer to update the driver to the latest version. If the
customer has the latest version of the driver, then ask them to install
the previous version of the driver and retest. The latest OS logs enable
the customer to understand if the issue is due to a specific build of the
driver or the GPU combination.
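Step 1 above can be partly automated when the customer can share the Linux kernel log, where the NVIDIA driver typically reports XID events on lines beginning with "NVRM: Xid". The following minimal sketch extracts the XID number from such a line. The sample line is illustrative, and the cause table contains only the one mapping stated in this scenario. All other XID numbers must be looked up in the NVIDIA documentation.

```python
import re
from typing import Optional, Tuple

# Typical kernel log shape: "NVRM: Xid (PCI:0000:3b:00): 9, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<bus>PCI:[0-9A-Fa-f:.]+)\): (?P<xid>\d+)")

# Only the mapping given in this scenario; not a complete XID table.
XID_CAUSES = {9: "driver error"}


def parse_xid(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (PCI bus ID, XID number, cause) or None if the line has no XID."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    cause = XID_CAUSES.get(xid, "look up in the NVIDIA XID documentation")
    return match.group("bus"), xid, cause


# Illustrative sample line, not captured from a real system.
sample = "NVRM: Xid (PCI:0000:3b:00): 9, pid=1234, sample message text"
print(parse_xid(sample))
```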
WDDM/VDI Modes for GPU
Scenario: A customer contacts technical support. The customer is unable
to use the GPU in WDDM/VDI mode. When identifying configuration
settings, technical support may need to talk the customer through the
process of verifying the proper settings.

Solution:
1. Direct the customer to verify the compatibility of the GPUs with
WDDM/VDI mode.
2. If the GPUs are compatible with Windows Display Driver Model
(WDDM) and Virtual Desktop Infrastructure (VDI) mode, ask the
customer to boot the server.
3. After booting the server, open a remote terminal session to the server.
− For the system with the latest version of the GPUs, run the
command "gpumodeswitch --gpumode graphics" to enable
WDDM/VDI modes.
− For the system with the previous version of the GPUs, run the
command "nvidia-smi -dm 0" to enable WDDM/VDI modes (for the -dm
option, 0 selects WDDM and 1 selects TCC).
NVIDIA A100 HGX 80/40 GB GPUs support the WDDM 2.0+ based
functionality only if the virtual GPU (vGPU) software is installed.

Important: Some GPUs may not have the ability to function

in the WDDM/VDI mode. The ideal way to resolve the issue
is to verify what modes are supported by a specific GPU.
Details about a GPU are available on the NVIDIA technical
support site.
Undetected GPU
Scenario: A customer contacts technical support and states that a GPU

core (A100) is not functional on their Linux server. The customer also
states that four physical GPUs are installed but only three are detected.
The customer wants to identify which GPU is causing the issue. When
analyzing issues such as unrecognized component or missing component,
the most common cause is component or cabling connections. This is
especially true if the technical support engineer sees that the server had
recent repairs.
Solution:
1. Direct the customer to verify that all GPUs and cables are seated
correctly.
2. Ask the user to reboot the server. Ensure that the server is running
with the NVIDIA driver (or CUDA).
3. After booting the server, open a terminal session to the server and
type:
− # nvidia-smi --query-gpu=gpu_name,serial,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total,temperature.gpu,power.draw --format=csv
4. Identify the serial numbers of all GPUs from the obtained output.
− In this scenario, the user has installed four GPUs but the query has
displayed the serial numbers of only three GPUs.
5. Log in to the iDRAC to view the NVIDIA A100 GPU information.
− In the iDRAC, match the slot number of the missing GPU to the
corresponding GPU slot number that is printed on the metal jig
inside the server.
6. The GPU serial number that is not displayed in the command output is
the non-functional GPU on the server.
− In this scenario, GPU 0320415057649 in slot 24 is not displayed in
the command output.
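Steps 4 through 6 amount to comparing two lists of serial numbers. The following minimal sketch compares the serial numbers read from the iDRAC inventory with those in the nvidia-smi CSV output. It assumes that the CSV header names the serial number column "serial". The sample output is illustrative, and every serial number except 0320415057649 is a placeholder.

```python
import csv
import io
from typing import Iterable, List, Set


def detected_serials(smi_csv: str) -> Set[str]:
    """Collect the values of the 'serial' column from nvidia-smi CSV output."""
    rows = [row for row in csv.reader(io.StringIO(smi_csv), skipinitialspace=True) if row]
    header = [name.strip().lower() for name in rows[0]]
    column = header.index("serial")  # ValueError if the column is absent
    return {row[column].strip() for row in rows[1:]}


def missing_serials(expected: Iterable[str], smi_csv: str) -> List[str]:
    """Serial numbers that are installed but absent from the command output."""
    return sorted(set(expected) - detected_serials(smi_csv))


# Illustrative output with three of four GPUs detected (placeholder values).
SAMPLE_OUTPUT = """name, serial, pci.bus_id
NVIDIA A100-SXM4-40GB, 0320415000001, 00000000:07:00.0
NVIDIA A100-SXM4-40GB, 0320415000002, 00000000:0F:00.0
NVIDIA A100-SXM4-40GB, 0320415000003, 00000000:47:00.0
"""
# Serial numbers as they would be read from the iDRAC inventory.
INSTALLED = ["0320415000001", "0320415000002", "0320415000003", "0320415057649"]

print(missing_serials(INSTALLED, SAMPLE_OUTPUT))
```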
GPU Memory Page Error
Scenario: A customer contacts technical support and states that the

server throws a bad memory page error on the GPU with serial number
"0320415054563". The customer wants to understand the reason for the
memory error.

Solution:
1. Direct the customer to reboot the server. Ensure that the server is
running with the NVIDIA driver (or CUDA).
2. After booting the server, open a terminal session to the server and
type: # nvidia-smi -i 0320415054563 -q -d ECC
3. If the Single Bit Error (SBE) value is greater than 1000 or if the Double
Bit Error (DBE) value is greater than 10, it indicates that the GPU is
failing. If so, request the customer to replace the GPUs to avoid further
damage.
a. In this scenario, the output shows that the SBE and DBE values are
within the limits, so the GPUs have no issue.
b. High error rates are commonly caused by physical damage to the
GPUs or by excessive heat for prolonged periods of time.
4. Inform the user that the issue might be due to an unreliable memory
page and that the GPU will retire the page. No further action is needed
in this scenario.
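The decision in step 3 can be written as a short check, shown here as a minimal sketch. The limits are the values given in this guide. The function and constant names are hypothetical, and the counts are assumed to be read manually from the command output.

```python
SBE_LIMIT = 1000  # Single Bit Error limit from step 3
DBE_LIMIT = 10    # Double Bit Error limit from step 3


def gpu_is_failing(sbe_count: int, dbe_count: int) -> bool:
    """Return True when either error count is greater than its limit."""
    return sbe_count > SBE_LIMIT or dbe_count > DBE_LIMIT


# Placeholder counts within the limits, as in this scenario.
print(gpu_is_failing(sbe_count=12, dbe_count=0))
```

Counts that are equal to a limit are treated as within the limit, because step 3 specifies "greater than".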
Important: The GPU memory pages become unreliable

when multiple Single Bit ECC Errors or Double Bit ECC
Errors occur on the same memory page. The GPUs are
internally designed to retire GPU memory pages when they
become unreliable.
A DBE causes the GPU to halt operations. The system
displays a fatal error in logs and provides the details of the
slot or the bus.

Appendix
PPCM Examples
Backplane
The Portfolio Platform Configuration Matrix shows the support and
configuration options.
PERC
Riser


10 Gigabit Media Independent Interface (XGMII)
XGMII connects full duplex 10 GbE ports to each other and to other
devices on a printed circuit board. XGMII is typically used for on-chip
connections.
