DELL POWEREDGE

TROUBLESHOOTING

PARTICIPANT GUIDE

Dell PowerEdge Troubleshooting-SSP1

© Copyright 2023 Dell Inc Page 2


Table of Contents

Dell PowerEdge Troubleshooting

Guidelines and Resources
  Support Library
  Server Troubleshooting Guides
  Dell Update Package (DUP) and Firmware Updates
  How to Replace - Videos
  LAB - Downloading and Updating the iDRAC Firmware

Visual Indicators
  Left and Right Control Panels
  PSU Indicator Codes
  Mid-Bay Hard Drive Indicators
  System ID Button
  System Board LED
  System Board Jumper Settings
  Knowledge Check - Control Panel

Recovery Options
  iDRAC Default Settings
  Lifecycle Controller - Part Replacement Configuration
  Easy Restore
  Export and Import Server Configuration Profile
  LAB - Exporting a Server Configuration Profile

Logs
  Lifecycle Controller Logs
  System Event Logs (SEL)
  POST Code, Intrusion, Last Crash Screen
  SupportAssist Enterprise Overview
  Secure Connect Gateway Overview
  Gathering SupportAssist Logs
  Server Monitoring
  LAB - Performing a SupportAssist Collection

Fault Isolation Tools
  Configuration Validation Overview
  Connector and Cable Naming
  Log Errors
  Boot and Crash Capture
  iDRAC Diagnostics
  Hardware Diagnostics
  Minimum to POST
  No Power, No Video, No POST

Memory
  Memory Event Logging

GPU
  Updating the NVIDIA Drivers
  GPU XID Errors
  WDDM/VDI Modes for GPU
  Undetected GPU
  GPU Memory Page Error

Appendix

Dell PowerEdge Troubleshooting

Guidelines and Resources

Support Library

Important: Some resources on the Dell support site are
permission-based and can only be accessed with a
corporate account.

The screen captures show a search of the Dell support library for articles about POST
failures.

A key resource to consider when troubleshooting is the Dell Support
Library. Administrators and service engineers can download resources to
help isolate and resolve issues. The example shows a search of the support
library for POST failure. The library provides users with
knowledge base articles that relate to specific issues.


Server Troubleshooting Guides

The screen capture shows the results of searching for PowerEdge R660 troubleshooting
manuals.

Many of the server support pages provide troubleshooting guides. The
example shows searching for PowerEdge R660 troubleshooting guides.

For example, during a server reboot you get a message during POST:
Memory set to minimum frequency. Searching the Dell support site
shows the knowledge base article that applies to the error.

Dell Update Package (DUP) and Firmware Updates


The screen captures show the Dell support drivers and downloads page for the
PowerEdge R660.

Using DUPs, engineers can update a wide range of system components
simultaneously and apply scripts to similar sets of Dell systems to bring
the system components to the same version levels. Many issues are
resolved by upgrading to the latest firmware.

The following software components are updated using a DUP:

• System BIOS
• System Firmware
• Device Drivers

Administrators and engineers can download the DUP and firmware
updates on the drivers and downloads page. The most common method of
identifying your product is using the Dell Service Tag. A general search
without using filters can result in hundreds of packages in the list. Use the
filters to narrow the results. The example searches for the operating
system driver package for a Windows Server 2022 deployment.


How to Replace - Videos

Important: The QRL videos do not have scripts or closed
captions.

The screen captures show searching for PowerEdge R760 hardware replacement videos.

The Dell support site > videos provides a suite of "How To Replace" QRL
videos.

When isolating an issue such as a cabling error, the component video is a
resource to help locate the cable ports and cable routing.

Also, the QR codes that are on the supported products provide access to
the commonly referenced videos, document reference materials, technical
support, and sales teams. The Dell Quick Resource Locator (QRL) is a
web page that allows users to quickly get at-the-box videos and
documentation supporting Dell products.


LAB - Downloading and Updating the iDRAC Firmware

Lab Exercise: You are investigating an issue on a
PowerEdge R660 server where there are failures when
trying to import a Server Configuration Profile (SCP). You
notice that the server runs an older version of iDRAC
firmware. You consult the iDRAC9 release notes and see
that the SCP import issue is resolved in a later version. Upgrade
the iDRAC firmware to resolve the issue. Complete the
Downloading and Updating the iDRAC Firmware lab
activity.

Tip: If the iDRAC simulator is no longer open in a browser
tab, relaunch the simulator.


Visual Indicators

Left and Right Control Panels

Tip: Dell employees can use the Blink tool to identify and
define component indicators, such as LED sequence on
system boards, PSUs, control panels, and so on.

Left and right control panel and the optional Quick Sync 2 control panel.

The Left control panel (LCP) provides system health at a glance. The
system health and system ID indicator are on the left control panel of the
system. When troubleshooting the server, the first indication of a problem
that an administrator may see is a panel indicator that is amber.

For example, the administrator notices that the temperature indicator is
amber. The amber LED does not isolate the issue, but prompts the
administrator to inspect further by checking the logs and the thermal
outputs of the system components. The administrator may call Dell
support for assistance on actions to take.


See the participant guide for the indicator definitions.

Left Control Panel (LCP)

The table below provides the description and condition of each LCP
indicator.

Description              Condition

System ID                • Solid blue [1]
                         • Blinking blue [2]
                         • Solid amber [3]
                         • Blinking amber [4]
Drive indicator          The indicator turns solid amber when there is a drive error.
Temperature indicator    The indicator turns solid amber when the system experiences a thermal error.
Electrical indicator     The indicator turns solid amber when the system experiences an electrical error.
Memory indicator         The indicator turns solid amber when a memory error occurs.
PCIe indicator           The indicator turns solid amber when a PCIe card experiences an error.
Optional Quick Sync 2    The icon indicates that the panel is the optional Quick Sync 2 panel.

[1] Indicates that the system is powered on, is healthy, and system ID mode
is not active. Press the system health and system ID button to switch to
system ID mode.
[2] Indicates that the system ID mode is active. Press the system health and
system ID button to switch to system health mode.
[3] Indicates that the system is in fail-safe mode.
[4] Indicates that the system is experiencing a fault.

Right Control Panel (RCP)

The table below provides the feature of each RCP port.

Icon/Ports Feature

Power button with integrated power LED.


USB 2.0 port.

Micro-USB port for iDRAC Direct.

PSU Indicator Codes

Tip: The Blink tool can be used to identify and define
component indicators.

Power supply unit indicator.

The PSU portfolio includes intelligent features such as dynamically optimizing
efficiency while maintaining availability and redundancy. The PSUs have
diagnostic indicators.

Given the scenario: One of the PSUs is replaced on the R660 server. The
diagnostic LED blinks green five times and then stays off. The iDRAC UI
shows that the PSU has failed. After the PSU is reseated, the behavior
remains. Service individuals can use the LED behavior to isolate the
issue. In this scenario, the behavior is due to a mismatched PSU.
Although the server supports different PSUs with different power outputs,
the PSUs in the server need to match.


See the participant guide for the PSU diagnostics indicator definitions.

Caution: All DC power supply unit (PSU) installations
require a qualified electrician. Do not attempt connecting to
DC power or installing grounds. All electrical wiring must
comply with applicable local or national codes and practices.
Server warranty does not include damage due to self
installation. All service must be approved by Dell. Read and
follow all safety instructions that come with the product.

Important: Due to cooling requirements, any open PSU
slots must have a blank installed.

LED color and behavior                        Function description
Solid green                                   PSU functioning
Blinking amber, 2 s on, 1 s off               PSU fault
Blinking green five times, then stays off     PSU mismatch
Blinking green                                Firmware update in progress
Off                                           PSU power cable removed
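
The PSU indicator table can be expressed as a simple lookup, which may be useful when recording LED observations in a script or ticket. The sketch below is illustrative only: `PSU_LED_CODES` and `describe_psu_led` are hypothetical names, and the keys are simply normalized versions of the table text, not values returned by any Dell tool.

```python
# Illustrative lookup built from the PSU indicator table.
# All names are hypothetical; no Dell API is involved.
PSU_LED_CODES = {
    "solid green": "PSU functioning",
    "blinking amber": "PSU fault",
    "blinking green 5 times then off": "PSU mismatch",
    "blinking green": "Firmware update in progress",
    "off": "PSU power cable removed",
}


def describe_psu_led(observed: str) -> str:
    """Return the table's function description for an observed LED behavior."""
    # Normalize case and whitespace so that "  Solid  GREEN" still matches.
    key = " ".join(observed.lower().split())
    return PSU_LED_CODES.get(key, "Unknown pattern: check the PSU documentation")


if __name__ == "__main__":
    print(describe_psu_led("Blinking green 5 times then off"))
```

In the replacement scenario described earlier, the observed pattern (five green blinks, then off) maps to a PSU mismatch rather than a PSU fault.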


Mid-Bay Hard Drive Indicators

XD, or extra disk, servers such as the PowerEdge R760xd2 may have
mid-bay drives with separate indicators. When
investigating a disk error or disk issue, slide the server from the rack to the
service position and view the mid-bay drive indicators.

Caution: The mid-bay should not be in the service position for
longer than five minutes. The hard disk drive (HDD)
temperature LED blinks fast when the temperature is critical.
At that point, close the mid-bay and allow the system to return to
normal temperature.

Mid Hard Drive Indicators of R760xd2.

Mid-bay hard drive LED indicator    Hard drive temperature status
Off                                 Off
Solid                               Normal
Slow blink                          Warning
Fast blink                          Critical


System ID Button

PowerEdge R760 showing the System ID button location.

PowerEdge servers have a rear System ID button that can be used as an
alternate power button if the front power button is inoperable.

Engineers can use the System ID button for troubleshooting in the
following cases:

• If the system stops responding during POST, press and hold the
System ID button for more than five seconds to enter BIOS progress
mode.
• To reset the iDRAC (if not disabled in F2 iDRAC setup), press and hold
the button for more than 15 seconds.

To power on the system using the rear System ID button:

• Remove the top cover to activate the intrusion switch.
• Press and hold the System ID button for at least 16 seconds. This
resets the iDRAC and powers on the server, bypassing the front power
button.

Tip: To manually change the boot partition of the iDRAC in
case of an image corruption, hold the System ID button for
20 seconds, release for 5 seconds, and repeat 3 times. This
sequence marks the standby partition as the primary and
reboots the iDRAC.
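
The hold durations described in this section can be summarized as a small decision function. This is a minimal sketch for study purposes: the function name and return strings are hypothetical, and the thresholds are taken directly from the text above rather than from any firmware interface.

```python
def system_id_hold_action(seconds: float, intrusion_active: bool = False) -> str:
    """Map a rear System ID button hold time to the behavior described above.

    Assumptions (from the text): more than 5 s enters BIOS progress mode when
    the system has stopped responding during POST; more than 15 s resets the
    iDRAC if that function is not disabled in F2 iDRAC setup; at least 16 s
    with the top cover removed (intrusion switch active) also powers on the
    server.
    """
    if seconds >= 16 and intrusion_active:
        return "reset iDRAC and power on the server"
    if seconds > 15:
        return "reset iDRAC"
    if seconds > 5:
        return "enter BIOS progress mode"
    return "no troubleshooting action described"


if __name__ == "__main__":
    for hold, cover_off in [(3, False), (6, False), (15.5, False), (16, True)]:
        print(hold, cover_off, "->", system_id_hold_action(hold, cover_off))
```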


System Board LED

Dell PowerEdge system board LED Indicator.

Individuals troubleshooting a Power-On Self-Test (POST) or hardware
issue can consult the system board LEDs, also called OmniVu LEDs. The
indicators provide status during the boot process. Each combination of
LEDs indicates a different server status.

PowerEdge servers may have different sequencing codes. See the
participant guide for an example of the OmniVu LED codes for the
PowerEdge XR11 and XR12 power sequencing.


The image details the OmniVu LED codes for Dell PowerEdge XR11 and XR12 power
sequencing.


System Board Jumper Settings

PowerEdge R660 showing the jumper location and default settings.

The software security features of a server include a system password and
a setup password. The password jumper enables or disables password
features and clears any passwords currently in use.

Given the scenario: An administrator cannot set a BIOS password. The
BIOS settings do not allow a password to be set. The administrator
suspects that the jumper may be set to disable the BIOS password
feature. For security, the server needs the BIOS password feature
enabled. The administrator must ensure the jumper is set across pins 2
and 4. View the participant guide for jumper setting definitions.


Important: Use caution when changing the BIOS settings of
a server. The BIOS interface is designed for advanced
users. Any changes in the settings might prevent the system
from starting correctly.

Important: For more information on how to disable a
forgotten password and assign a new system password by
moving the jumper for a physical server, see the server-specific
Installation and Service Manual (ISM) available on the
Dell Support Library. Users need a corporate account
to access the FSMs.

Jumper       Setting     Description
PWRD_EN      Pins 2–4    The BIOS password feature is enabled.
PWRD_EN      Pins 4–6    The BIOS password feature is disabled. The BIOS password is
                         now disabled and users are not allowed to set a new password.
NVRAM_CLR    Pins 3–5    The BIOS configuration settings are retained at system boot.
NVRAM_CLR    Pins 1–3    The BIOS configuration settings are cleared at system boot.

Jumper settings on the PowerEdge R760 system board.
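
For the BIOS password scenario above, the jumper table can be read as a lookup from jumper name and pin pair to effect. The following sketch is only a study aid: `JUMPER_SETTINGS` and `describe_jumper` are hypothetical names, and the pin pairs are copied from the table, so they should be checked against the manual for the specific server model.

```python
# Jumper name and pin pair -> effect, copied from the jumper settings table.
JUMPER_SETTINGS = {
    ("PWRD_EN", (2, 4)): "BIOS password feature enabled",
    ("PWRD_EN", (4, 6)): "BIOS password feature disabled; no new password can be set",
    ("NVRAM_CLR", (3, 5)): "BIOS configuration settings retained at system boot",
    ("NVRAM_CLR", (1, 3)): "BIOS configuration settings cleared at system boot",
}


def describe_jumper(name: str, pins) -> str:
    """Return the documented effect of a jumper placed across two pins."""
    # Sort the pins so (4, 2) and (2, 4) are treated as the same position.
    key = (name.upper(), tuple(sorted(pins)))
    return JUMPER_SETTINGS.get(key, "Unknown jumper position")


if __name__ == "__main__":
    # The administrator in the scenario wants the password feature enabled.
    print(describe_jumper("PWRD_EN", (2, 4)))
```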

Knowledge Check - Control Panel

PowerEdge R660 right control panel.

1. Refer to the graphic. After powering on the PowerEdge R660 server,
you notice that one of the indicators on the right control panel shows
solid amber. What is the next course of action you should take?
a. Check the iDRAC Lifecycle and System Event logs for memory
errors.
b. Check the network connections on the PCIe card.
c. Check the indicator LEDs on the PSUs to determine the faulty
PSU.
d. Check the temperature status of the server components to identify
the source of the excessive heat.


Recovery Options

iDRAC Default Settings

iDRAC default settings options.

The iDRAC is responsible for system profile settings and out-of-band
management. At times, system conditions can cause
the iDRAC to become unresponsive. When this occurs, resetting the
iDRAC to factory defaults may help to resolve the issue.

The System Setup utility has three options available to reset iDRAC to
default settings.
• To preserve the iDRAC network settings and user
accounts, use the Reset iDRAC configuration to
defaults option.
• To reset the server to factory settings and
return the default username and password to the shipping value on
the Service Tag, use the Reset iDRAC configuration to default all
option.
• To reset the server to factory settings and reset the default
username and password to the shipping value of root/calvin, use the
Reset iDRAC configuration to default factory settings option.
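
The choice between the three reset options comes down to two questions: whether network settings and user accounts must survive, and which default credentials are wanted afterward. The sketch below captures that decision; the function and parameter names are hypothetical, and the option labels are quoted from the list above.

```python
def choose_idrac_reset_option(preserve_network_and_users: bool,
                              want_root_calvin: bool = False) -> str:
    """Pick the System Setup reset option that matches the stated goal."""
    if preserve_network_and_users:
        # Network settings and user accounts are kept.
        return "Reset iDRAC configuration to defaults"
    if want_root_calvin:
        # Factory settings, credentials reset to root/calvin.
        return "Reset iDRAC configuration to default factory settings"
    # Factory settings, credentials reset to the shipping value on the Service Tag.
    return "Reset iDRAC configuration to default all"


if __name__ == "__main__":
    print(choose_idrac_reset_option(preserve_network_and_users=True))
```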

Lifecycle Controller - Part Replacement Configuration

Part Replacement Configuration option in the Lifecycle Controller.

The Part Replacement feature in the Lifecycle Controller can automatically
update the firmware version or configuration of a new or replaced part.

For example, the service engineer replaces a faulty fPERC. The Part
Replacement Configuration feature updates the part firmware
automatically when the server boots.


Important: If Collect System Inventory On Restart is
disabled, the cache of system inventory information may
become stale if new components are added without
manually entering Lifecycle Controller after turning the
system on. In manual mode, enter Lifecycle Controller after the part
replacement during a system reset.

It is important to ensure that the following prerequisites are met before
configuring replaced parts.
• Click the Collect System Inventory On Restart option, so that
Lifecycle Controller automatically invokes Part Firmware Update and
Part Configuration Update when the system is started.
• Ensure that the Disabled option under Part Firmware Update and Part
Configuration Update is cleared.
• The previous component and the new device must be identified as the
same part.
• If the current adapter on the system is NPAR enabled and is replaced
with a new adapter, after the host server is turned on, press <F2> and
select System Setup > Device Settings and ensure that NPAR is
enabled. NPAR must be enabled on the new adapter before using the
Part Replacement feature.

Easy Restore


The graphic shows the information that Easy Restore generates.

Given the scenario and question: The service engineer replaced the
system board on a PowerEdge server. How is the server information
retained or restored?

The Easy Restore feature automatically restores the service tag, licenses,
UEFI configuration, system configuration settings (BIOS, iDRAC, NIC) and
OEM ID (Personality Module).

Easy Restore Storage is part of the server front panel that can store up to
4 MB of data. All data is backed up in a backup flash device automatically.
If BIOS detects a new system board and the service tag in the backup
flash device, BIOS prompts the user to restore the backup information.

After the restore process completes, the system reboots.

See the participant guide for the steps to restore the service tag using
Easy Restore.


The steps to restore the service tag using Easy Restore are:
1. Turn on the system.
2. If BIOS detects a new system board, and if the service tag is present in
the backup flash device, BIOS displays the service tag, the status of
the license, and the UEFI Diagnostics version. Do one of the following:
a. * Press Y to restore the service tag, license, and diagnostics
information.
b. Press N to go to the Lifecycle Controller based restore options.
c. Press <F10> to restore data from a previously created Hardware
Server Profile.
3. After the restore process is complete, BIOS prompts to restore the
system configuration data.
Do one of the following:
a. * Press Y to restore the system configuration data.
b. Press N to use the default configuration settings.

Export and Import Server Configuration Profile

The Server Configuration Profile (SCP) enables administrators or service
engineers to import and export a server configuration. SCP files are
typically used to hold a gold server configuration, but can also be used to
recover a server configuration. For example, if the network configuration of
a server is unintentionally deleted, the administrator can import an SCP to
restore the information.



Export and Import

The graphic shows the SCP export page following an export.

Administrators can deploy an SCP to multiple servers, greatly reducing the
time to bring servers online.

The export operation collects the configuration information for BIOS,
iDRAC, RAID, NIC, FC-HBA, System, and Lifecycle Controller. The export
stores the information in a single file that is copied to a network share.

The Import operation imports the file from a network share. Import applies
the previously saved or updated configurations that are contained in the
file to a system.

Video

The How To video demonstrates exporting the SCP. The web version of this
content contains the video.


Tip: Users can manage the Server Configuration Profile
feature using the iDRAC UI, RACADM, and Redfish.

The SCP requires administrative privileges to perform an export and
import.

The types of exports are:

• A basic export uses a snapshot of the SCP.
• A replacement export restores to a known baseline.
• A clone export imports the SCP to another server with identical
hardware.

Many of the SCP import fields are similar to the SCP export function.
Users can select a graceful, forced, or no reboot option. Users can also
set a wait time before the server reboots after importing the SCP.
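
Because the SCP feature can be driven through Redfish, an export can in principle be scripted. The sketch below only builds the request and sends nothing. The action path and the `ExportFormat`, `ExportUse`, and `ShareParameters` field names are assumptions based on Dell's published iDRAC9 Redfish examples, and the mapping from the export types above to `ExportUse` values is likewise assumed; verify all of them against the Redfish documentation for the iDRAC firmware in use. The helper name and the host address are hypothetical.

```python
import json

# Assumed OEM action path, based on Dell's published iDRAC9 Redfish examples.
EXPORT_ACTION = (
    "/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Oem/"
    "EID_674_Manager.ExportSystemConfiguration"
)

# Assumed mapping from the export types named in this guide to ExportUse values.
EXPORT_USE = {"basic": "Default", "replacement": "Replace", "clone": "Clone"}


def build_scp_export_request(idrac_host: str, export_type: str = "basic",
                             target: str = "ALL"):
    """Return (url, json_body) for an SCP export request; nothing is sent."""
    if export_type not in EXPORT_USE:
        raise ValueError(f"unknown export type: {export_type}")
    body = {
        "ExportFormat": "XML",
        "ExportUse": EXPORT_USE[export_type],
        "ShareParameters": {"Target": target},
    }
    return f"https://{idrac_host}{EXPORT_ACTION}", json.dumps(body)


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    url, body = build_scp_export_request("192.0.2.10", "clone")
    print(url)
    print(body)
```

The resulting URL and body would be sent as an HTTPS POST by an account with administrative privileges, which matches the privilege requirement stated above.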

LAB - Exporting a Server Configuration Profile

Lab Exercise: You are installing four PowerEdge R660
servers. The customer wants the configuration of all four
servers to match the configuration of another R660. You
need to export the server configuration profile that will be
used to set the baseline configuration of the four new
servers. Complete the Exporting a Server Configuration
Profile lab activity.

Tip: If the iDRAC simulator is no longer open in a browser
tab, relaunch the simulator.


Logs

Lifecycle Controller Logs

Logs are a primary tool for assessing system health,
isolating errors, and verifying changes. Typically, when addressing an
issue, the logs are viewed before actions are taken.

Lifecycle Controller logs provide the history of changes that relate to
components installed on a managed system. The log is delivered as part
of the iDRAC and embedded Unified Extensible Firmware Interface (UEFI)
applications.


Log Activities

The following events and activities are logged:

Activity         Description
System Health    Display all alerts that are related to hardware within the system chassis.
Storage          Display alerts related to the storage subsystem.
Updates          Display alerts generated due to firmware and driver upgrades and downgrades.
Audit            Display audit logs.
Configuration    Display alerts that relate to hardware, firmware, and software configuration changes.


Viewing Lifecycle log using web interface

To view the Lifecycle Logs:

1. Click Maintenance.
2. Click Lifecycle Log.

This image shows the steps to viewing the Lifecycle Logs.

Filtering Lifecycle logs

Users can filter the logs by category, severity, keyword, or date range.

1. On the Lifecycle Log page, click Filter.
2. Set the filtering criteria: Severity, Log Type, Date
Range, and Keyword Search.
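
The same filtering logic can be applied offline to an exported log. The sketch below assumes that the export has already been parsed into simple records; the `LogEntry` structure, its field names, and the example severity and log type values are hypothetical and do not reflect the actual export format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class LogEntry:
    """Hypothetical, simplified view of one Lifecycle Log record."""
    timestamp: datetime
    severity: str   # for example "Critical", "Warning", "Informational"
    log_type: str   # for example "System Health", "Storage", "Updates"
    message: str


def filter_log(
    entries: Iterable[LogEntry],
    severity: Optional[str] = None,
    log_type: Optional[str] = None,
    keyword: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[LogEntry]:
    """Keep only the entries that satisfy every criterion that was supplied."""
    kept = []
    for entry in entries:
        if severity and entry.severity.lower() != severity.lower():
            continue
        if log_type and entry.log_type.lower() != log_type.lower():
            continue
        if keyword and keyword.lower() not in entry.message.lower():
            continue
        if start and entry.timestamp < start:
            continue
        if end and entry.timestamp > end:
            continue
        kept.append(entry)
    return kept
```

Criteria that are left as `None` are ignored, which mirrors leaving a filter field empty in the web interface.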


This image shows the steps to filtering the Lifecycle Logs.

Adding comments to Lifecycle logs

To add comments to the Lifecycle logs:

1. Click the + icon for the required log entry. The Message ID details are
displayed.
2. Enter the comments for the log entry in the Comment box.


This image shows the steps to adding comments to the Lifecycle logs.

Exporting Lifecycle Controller logs using web interface

To export the Lifecycle Controller logs for troubleshooting and log
retention purposes:

1. On the Lifecycle Log page, click Export.
2. Select any of the following options:
a. Network Share: Export the Lifecycle Controller logs to a shared
location on the network.
b. Local: Export the Lifecycle Controller logs to a location on the local
system.


This image shows the steps to export the Lifecycle Controller logs.


How To Video

The How To video demonstrates exporting the Lifecycle Logs. The web
version of this content contains the video.


System Event Logs (SEL)

When a system event occurs, it is recorded in the SEL. Technical support
may ask service engineers or administrators to download the SEL. Much
like the Lifecycle logs, the SEL is one of the first places to check for and
verify issues.

The SEL page displays a system health indicator, a timestamp, and a
description for each event logged.

To view the SEL in the iDRAC Web interface:

1. Go to Maintenance.
2. On the Maintenance page, select System Event Log.

This image shows the steps to view the SEL.

POST Code, Intrusion, Last Crash Screen

POST Code, Intrusion, and Last Crash Screen are troubleshooting tools
that the iDRAC provides. Each tool automatically provides a report when a
system event occurs. Administrators and engineers can use the
information when escalating issues to technical support.


Users can access the tools by going to iDRAC Dashboard ->
Maintenance -> Troubleshooting.

1: The POST Code option shows the last system POST code (in
hexadecimal) before booting the operating system of the managed
system. The POST code helps to detect pre-video errors, report fatal
errors, and analyze system failures during BIOS POST, particularly in
No POST, No Video situations. The fatal error codes are used to report all
the fatal POST errors.

2: The Intrusion option is related to the chassis intrusion switch. It
provides information about whether the server cover is removed or not
seated correctly. This issue can lead to the system overheating and
potential shutdown issues.

3: The Last Crash Screen option provides information about the events
leading to the system crash. This information is saved in the iDRAC
memory and is remotely accessible. The Last Crash Screen feature is
available with iDRAC Express and Enterprise licenses.

The last crash screen capture is only available with the Windows
operating system, and the user must have installed OpenManage Server
Administrator. The last crash screen capture does not work with the Linux or
ESXi operating systems. The purpose of this feature is to display the blue
screen if the Windows operating system fails.

SupportAssist Enterprise Overview

SupportAssist Enterprise can be used as a stand-alone application or with
OpenManage Essentials (OpenManage Enterprise) or Microsoft System
Center Operations Manager (SCOM). SupportAssist Enterprise can be
downloaded on either a Windows or Linux management server.

Important: After April 2022, SupportAssist Enterprise 2.0.70
capabilities such as device management, case creation, and
alert monitoring will not be available. To continue to manage
and monitor devices, users must upgrade to secure connect
gateway.


Dell SupportAssist Enterprise at work monitoring and reacting to a PowerEdge MX7000
Modular System hardware issue.

SupportAssist Enterprise (SAE) [5] monitors hardware issues, including
predictive drive failures, that may occur on devices that are being
managed using Microsoft System Center Operations Manager (SCOM) or
OpenManage Enterprise.

• When a hardware issue is detected, SupportAssist Enterprise
automatically opens a support case with Technical Support and sends
an email notification to the user.
• SupportAssist Enterprise automatically collects the system state
information that is required for troubleshooting the issue and sends it
securely to Dell Technologies.
• The collected system information helps Technical Support to provide
an enhanced, personalized, and efficient support experience.
• SupportAssist Enterprise capability also includes a proactive response
from Technical Support to resolve the issue.

[5] SAE is an application that can be installed on a Windows server, as a
virtual appliance, or as a plug-in for OpenManage Enterprise.

Tip: SupportAssist Enterprise monitors up to 15,000 server,
storage, and networking devices.

Secure Connect Gateway Overview

Secure connect gateway monitors devices and proactively detects
hardware issues. The administrator may receive a notification when a
case is generated.

When a hardware issue is detected, the gateway automatically collects the
system state information that is required for troubleshooting the issue.
Secure connect gateway auto-dispatches parts and service engineers to
the site for certain Dell devices and components.

Tip: An adapter can be deployed to monitor devices already
being managed by OpenManage Enterprise.

Deep Dive: Visit the Secure Connect Gateway page to
learn more.


Secure connect gateway 5.x is offered as an application that can be
installed on Windows, Linux, or as a virtual appliance that is deployed onto
either a VMware ESXi or Microsoft Hyper-V virtual infrastructure.

Secure connect gateway discovers Dell devices [6] to provide alert
monitoring, log gathering, and case generation.

Secure connect gateway architecture reacting to a PowerEdge MX7000 modular system
hardware issue.

[6] Supported products include server, storage, chassis, networking, data
protection devices, virtual machines, and converged or hyper converged
appliances.


Gathering SupportAssist Logs

SupportAssist continually monitors the configuration data and usage
information of managed hardware and software. Data that is collected by
SupportAssist includes:
• System Information to include hardware, software, sensor, and
Lifecycle Controller data.
• Storage Logs to capture hard drive inventory, events, and
configuration options related to storage.
• Operating system and Application Data to include OS-related
information. However, operating system data can only be collected
when the iDRAC Service Module (iSM) is installed and running. Install
the iSM using downloads.dell.com.
• Debug Logs to include iDRAC debugging related information.
• Telemetry Reports to include telemetry logs that consist of detailed
parametric data about sensors, thermals, logfiles and more.
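
One practical consequence of the list above is that a collection request should account for whether the iDRAC Service Module is running. The sketch below models only that dependency; the dictionary and function names are hypothetical, and the category labels are taken from the list rather than from any SupportAssist interface.

```python
# Data categories from the list above. True marks the one category that the
# text says can only be collected when the iDRAC Service Module (iSM) is
# installed and running.
COLLECTION_CATEGORIES = {
    "System Information": False,
    "Storage Logs": False,
    "Operating System and Application Data": True,
    "Debug Logs": False,
    "Telemetry Reports": False,
}


def available_categories(ism_running: bool) -> list:
    """Return the categories that can be collected given the iSM state."""
    return [
        name
        for name, needs_ism in COLLECTION_CATEGORIES.items()
        if ism_running or not needs_ism
    ]


if __name__ == "__main__":
    print(available_categories(ism_running=False))
```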

Select each tab to learn more.

The iDRAC provides a SupportAssist utility for gathering server
information that enables support services to resolve platform and system
problems. SupportAssist helps monitor the system and data center. As an
example, technical support may ask the administrator or engineer to
generate a SupportAssist package to further analyze data about sensors.

Administrators and service engineers can export a SupportAssist
collection to a location on the host (local) or to a shared network location
such as FTP, HTTP, or file share.

Steps to generate SupportAssist logs

Gathering SupportAssist Logs.

1. To generate the SupportAssist logs, go to Maintenance.
2. On the Maintenance page, select SupportAssist.
3. Click Start a Collection to generate the SupportAssist log.

Tip: A SupportAssist collection takes more than 10 minutes to
complete when performed from the OS or iDRAC while OMSA
10.1.0.0 is running.
To generate the operating system and application logs,
install the iDRAC Service Module and run it on the host
operating system. See the participant guide for information
about the data collected.

Server Monitoring

Technical Support Report (TSR) log generated by SupportAssist on a PowerEdge
R760xa.

Server monitoring reviews and analyzes operation-related processes such
as performance, security, and issue identification.

As an example, the administrator has email notification configured to send
alerts when issues occur. The administrator receives an email notification
of a PSU issue and uses SupportAssist to gather information. SupportAssist
allows engineers and technical support to view the server without the need
to log in. In this example, the tool identifies the PSU, but not the cause
of the issue. The issue might be that the rack PDU is faulty or simply that
the power cable is disconnected.

Server monitoring tools commonly used:

• Integrated Dell Remote Access Controller (iDRAC)
• OpenManage Enterprise (OME)
• Dell Open Server Manager (OSM)
• SupportAssist
• Dell OpenManage Server Administrator (OMSA)
• Windows-Integrated Monitoring tools:
− Server Manager
− Task Manager
− Resource Monitor
Go to: Dell Support Site and read the Support for Dell EMC
OpenManage Plug-in for Nagios Core article to learn more
about the OpenManage plug-in (Nagios Core).

LAB - Performing a SupportAssist Collection

Lab Exercise: You have installed a PowerEdge R660
server. You notice that there are health and status errors,
but cannot immediately determine the reason for the errors.
Technical support asks that you gather server information
using SupportAssist. Complete the Performing a
SupportAssist Collection lab activity.

Tip: If the iDRAC simulator is no longer open in a browser
tab, relaunch the simulator.

Fault Isolation Tools

Configuration Validation Overview

Configuration validation is a vital tool for troubleshooting issues related to
cabling. For example, the service engineer replaces a faulty backplane.
When the server is powered on, Configuration Validation runs and
discovers that a slimline cable is improperly connected to a backplane
cable port. The tool achieves this by comparing the current configuration
with the expected configuration.

Configuration validation compares backplane memory maps against a list
of pre-qualified configurations each time the host powers on.

Pre-qualified validation elements:

• Pre-qualified configurations are defined by the Portfolio Platform
Configuration Matrix.
• Each platform device is stored as a configuration element that includes
information such as riser number, backplane feature, and cable
connector.
• The iDRAC maintains a table of valid configurations.
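
The comparison of the current configuration with the expected configuration
can be pictured with a short Python sketch. This is only an illustration
under stated assumptions, not the iDRAC implementation: the dictionary
layout, the function name, and the connector names ending in SB1 are
hypothetical, and the Config Error and Config Missing labels are borrowed
from the HWC8010 error types used in this guide.

```python
from typing import Dict, List


def validate_configuration(expected: Dict[str, str],
                           detected: Dict[str, str]) -> List[str]:
    """Compare detected cable connections against an expected configuration.

    Both arguments map a source connector name to a destination connector
    name. This layout is a simplification chosen for the sketch.
    """
    issues = []
    for source, destination in sorted(expected.items()):
        found = detected.get(source)
        if found is None:
            # The expected cable was not detected at all.
            issues.append(f"Config Missing: {source} -> {destination}")
        elif found != destination:
            # The cable is present but plugged into the wrong port.
            issues.append(f"Config Error: {source} and {found}")
    return issues


# Hypothetical expected table and a detected state with one swapped cable
# and one missing cable.
expected = {"CTRL_SRC_SA1": "BP_DST_SA1", "CTRL_SRC_SB1": "BP_DST_SB1"}
detected = {"CTRL_SRC_SA1": "BP_DST_SB1"}

for issue in validate_configuration(expected, detected):
    print(issue)
```

An empty result list corresponds to a configuration that matches one of the
valid entries.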

Go to: PowerEdge R660 Portfolio Platform Configuration
Matrix example.

Instructor Note: Mention what config validation is used for
in the real world. Manufacturing teams use it to
check for defects when the systems are assembled in the
factory. It is also used when the systems are shipped to
customers to check that the components or cables have not
come loose in transit. Tech support uses it after a field
service event, especially after system board replacement.

Configuration Validation is introduced in the PowerEdge 15G servers.

Connector and Cable Naming

16G uses a new naming scheme [7]. The naming affects the system board,
peripheral devices, and risers. Connector and cable naming is key to
isolating faulty cables and ports. Service engineers must be able to
identify a cable or port based on the nomenclature used in log entries.
Many errors are the result of mis-cabling.

Select each tab for more information:

Planar Naming Rules

System Board High-Speed I/O (HSIO) connectors connect to devices and
backplanes from the source PCIe, SATA, XGMII, or other HSIO fabrics.
The system board naming scheme includes a basic connector name,
connector number, and the source device and port fabric type.

[7] The new scheme accommodates the increase in newly supported
components.

The graphic defines the naming of Mini Cool Edge I/O Connector 5 (MCIO
Connector 5) on a system board.

PowerEdge C6620 SIL.

Peripheral Device Naming Rule

Peripheral device HSIO connectors follow a format similar to that of the
system board naming rule. For example, consider the nomenclature in the
following error:

HWC8010 The System Configuration Check operation resulted in the
following issue: Config Error: Backplane Cable CTRL_SRC_SA1 and
BP_DST_SA1

Cabling example for naming convention of backplane, fPERC, and the system board.

Where BP_DST_SA1 is:

• BP - Device type
• DST - Direction
• SA1 - Fabric type

Device Types            Fabric Types

BP - Backplane          P - PCIe
CTRL - BOSS card        S - SAS or SATA
CTRL - PERC             X - XGMII
CTRL - Bridge card      U - UPI
CTRL - Cabled Riser     Z - Gen-Z
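
When reading log entries, it can help to decode a connector name
mechanically. The following Python sketch splits a name such as BP_DST_SA1
using the device and fabric codes listed in this section. The function and
class names are hypothetical, and the characters that follow the fabric
letter (A1 in the example) are returned as-is because their meaning is not
defined here.

```python
from typing import NamedTuple

# Codes taken from the Device Types and Fabric Types listing in this section.
DEVICE_TYPES = {
    "BP": "Backplane",
    "CTRL": "BOSS card, PERC, Bridge card, or Cabled Riser",
}
FABRIC_TYPES = {
    "P": "PCIe",
    "S": "SAS or SATA",
    "X": "XGMII",
    "U": "UPI",
    "Z": "Gen-Z",
}


class ConnectorName(NamedTuple):
    device: str      # decoded device type
    direction: str   # direction code, for example SRC or DST
    fabric: str      # decoded fabric type
    suffix: str      # remaining characters, left uninterpreted in this sketch


def parse_connector_name(name: str) -> ConnectorName:
    """Split a name such as BP_DST_SA1 into its documented fields."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected connector name: {name!r}")
    device_code, direction, fabric_field = parts
    if device_code not in DEVICE_TYPES or fabric_field[:1] not in FABRIC_TYPES:
        raise ValueError(f"unknown device or fabric code in {name!r}")
    return ConnectorName(
        device=DEVICE_TYPES[device_code],
        direction=direction,
        fabric=FABRIC_TYPES[fabric_field[0]],
        suffix=fabric_field[1:],
    )


print(parse_connector_name("BP_DST_SA1"))
```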

Riser Naming Rules

PowerEdge R660 example of cabled riser connections.

Riser HSIO connectors follow a format similar to that of the system
board naming rule.

Each cabled riser has four designated connector numbers.
The riser slot number determines the numbering. For example, riser 1
uses connectors 17 through 20, and riser 2 uses connectors 21 through 24.
Numbering starts with the lowest connector number.

If a riser has fewer than four connectors, the unused numbers are
skipped. For example, if riser 1 has two connectors, SL19 and
SL20 are skipped.
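
The numbering rule can be expressed as a small calculation. The Python
sketch below reproduces the two documented cases (riser 1 uses 17 through
20, riser 2 uses 21 through 24). Extending the same step of four to higher
riser numbers is an assumption, and the function name is hypothetical.

```python
from typing import List

CONNECTORS_PER_RISER = 4   # each cabled riser reserves four numbers
FIRST_RISER_CONNECTOR = 17  # riser 1 starts at connector 17


def riser_connector_names(riser: int, populated: int = 4) -> List[str]:
    """Return the SL connector names used by a cabled riser.

    Numbering starts at the lowest number reserved for the riser, so a
    riser with fewer than four connectors leaves the highest numbers unused.
    """
    if riser < 1:
        raise ValueError("riser numbers start at 1")
    if not 0 <= populated <= CONNECTORS_PER_RISER:
        raise ValueError("a riser has between 0 and 4 connectors")
    first = FIRST_RISER_CONNECTOR + CONNECTORS_PER_RISER * (riser - 1)
    return [f"SL{first + offset}" for offset in range(populated)]


print(riser_connector_names(1, populated=2))
print(riser_connector_names(2))
```

With two connectors on riser 1, the result contains SL17 and SL18, and SL19
and SL20 are skipped, which matches the example in the text.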

Caution: The connector name font size on the printed circuit
board silkscreen [8] is small, and may be unclear. Accidentally
swapping connections such as SL7 for SL8 can occur.
Troubleshooting includes verifying that the cables are plugged in
properly.

Tip: See the Connector Naming job aid downloadable from
the course resources to learn more.

Log Errors

Configuration validation errors are based on either missing or incorrect
configuration elements. Config Error, Config Missing, and Comm Error
are all variants of the HWC8010 error message. They appear in the POST
text during boot and also in the Lifecycle log and the System Event Log.
The error message can direct service engineers to the suspect cable and
connector.

[8] The silkscreen is the layer of ink on a printed circuit board component
used for identification.

Select each tab for more information:

Error Types

The table shows examples of error types and their descriptions.

Error Type               Description

Config Error             A configuration error may be associated with a
(Configuration Error)    mis-configuration or a wrong configuration.
                         Engineers can use the error output to locate and
                         check the cable and component.

Config Missing           Engineers may see the configuration missing error
                         when the cable is not connected or is damaged.
                         Cables can come loose during shipping and can be
                         overlooked when replacing a faulty component. The
                         output of the error message can help isolate the
                         cable.

Comm Error               Component cables have sideband communication.
(Communication Error)    Sideband cables are common in GPU
                         implementations. The communications error can
                         come from components that exist, but do not
                         communicate. Typically, reseating the component
                         and cable can resolve the issue.

Error Message

The table shows examples of error codes and recommended responses.

Error Code  Example Log Message (LC, SEL, POST)     Initial Action

HWC8010     The system configuration check          Check for proper cable
            operation resulted in the following     connection and
            issue: Config Error: Backplane Cable    component placement.
            CTRL_SRC_SA1 and BP_DST_SA1.            Reseat cables.

HWC8011     The system configuration check
            operation resulted in multiple
            backplane cable issues.

HWC8012     Multiple configuration-related issues   No action is required.
            on the device <arg> are resolved.

HWC8013     A configuration-related issue on the    No action is required.
            device <arg> is resolved.
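
When many log entries must be reviewed, the suspect connectors can be pulled
out of an HWC8010 message with a regular expression. The Python sketch below
works on the example message from this section. The function name is
hypothetical, and it is an assumption that Config Missing and Comm Error
messages follow the same "type: detail" layout as the Config Error example.

```python
import re
from typing import Dict, List, Optional, Union

# Matches the error type and the text that follows it.
ISSUE_PATTERN = re.compile(
    r"(?P<kind>Config Error|Config Missing|Comm Error):\s*(?P<detail>.+)"
)
# Matches connector names such as CTRL_SRC_SA1 or BP_DST_SA1.
CONNECTOR_PATTERN = re.compile(r"\b[A-Z]+_(?:SRC|DST)_[A-Z0-9]+\b")


def parse_hwc8010(message: str) -> Optional[Dict[str, Union[str, List[str]]]]:
    """Extract the error type and connector names from an HWC8010 message."""
    match = ISSUE_PATTERN.search(message)
    if match is None:
        return None
    detail = match.group("detail").strip()
    return {
        "kind": match.group("kind"),
        "detail": detail,
        "connectors": CONNECTOR_PATTERN.findall(detail),
    }


example = (
    "The system configuration check operation resulted in the following "
    "issue: Config Error: Backplane Cable CTRL_SRC_SA1 and BP_DST_SA1."
)
print(parse_hwc8010(example))
```

The returned connector names identify both ends of the cable to inspect and
reseat.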

Important: Minimum to POST configurations may generate
many errors. This is because Minimum to POST
configuration is just a troubleshooting step.

Boot and Crash Capture

The boot capture enables administrators and service engineers to view
the video recording of the last three boot cycles. Technical support can
analyze the boot capture to help troubleshoot issues.

The boot capture timestamp records the sequence end time. This occurs
when the capture reaches 2 MB in size or the server is rebooted.

Select each tab to learn more.

iDRAC UI

Boot capture files reflecting under the Troubleshooting tab in the iDRAC.

The list displays the currently active boot capture file. While the update is
in progress, click Refresh to view the latest timestamp for the boot
capture file.

Users can play the files directly from the iDRAC Enterprise or save them to
a location on their system.

To configure the boot capture video settings, select one of the following
options and click Apply.

• Disable - Boot capture is disabled.
• Capture until buffer full - Boot sequence is captured until the buffer
is full.
• Capture until end of POST - Boot sequence is captured until end of
POST.

iDRAC Diagnostics

Diagnostics Console command reflecting under the Maintenance tab in the iDRAC.

The Diagnostics Console command page helps identify issues related to
the iDRAC hardware.

For example, suppose the administrator can no longer manage the
system through the dedicated iDRAC port. Using iDRAC Direct, the
engineer can use the diagnostic console to run commands and inspect the
port settings.

The iDRAC provides a list of available troubleshooting commands that
users can enter into the diagnostic console. These commands provide the
user with data related to troubleshooting.

Diagnostic commands:
• arp
• ifconfig
• netstat
• ping
• gettracelog
• ping6

Hardware Diagnostics

The Hardware Diagnostics utility is part of the Lifecycle Controller.
The Diagnostics utility has a physical (as opposed to logical) view of the
attached hardware, enabling it to identify hardware problems that the
operating system and other online tools cannot identify.

To help identify hardware issues, deployment and service engineers can
run the Hardware Diagnostics utility to validate that the attached hardware
is functioning properly.

The Hardware Diagnostics utility can validate the memory, I/O devices,
CPU, physical disk drives, and other peripherals.

Tip: The ePSA (Pre-boot System Assessment) procedure
depends on the server generation.

Minimum to POST

Troubleshooting a difficult problem may require removing components to
isolate an issue. The server must have a minimum configuration to
achieve Power On Self Test (POST). Service engineers may be asked to
configure the server in a Minimum to POST hardware configuration. After
verifying that the system can achieve POST, the engineer can add the
removed components one at a time to identify the faulty component.

The required components vary based on the server model:

• The typical minimum to POST configuration for rack servers is PSU1,
CPU1, memory module in the A1 slot, RIO, LOM, and the default riser
without expansion cards.
• For tower servers, the typical minimum to POST configuration is PSU1,
CPU1, and memory module in the A1 slot.
• For modular servers, the minimum to POST configuration is CPU1,
Mezz A, and memory module in the A1 slot.
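
As a planning aid, the minimum configurations can be captured as data and
compared with what is installed. The Python sketch below lists the
components to remove before a Minimum to POST test. The component strings
and the function name are illustrative assumptions, and the actual required
components for a given model should be taken from its documentation.

```python
from typing import Dict, Iterable, List, Set

# Typical minimum to POST components per server type, as listed above.
MINIMUM_TO_POST: Dict[str, Set[str]] = {
    "rack": {
        "PSU1", "CPU1", "memory module in A1 slot",
        "RIO", "LOM", "default riser without expansion cards",
    },
    "tower": {"PSU1", "CPU1", "memory module in A1 slot"},
    "modular": {"CPU1", "Mezz A", "memory module in A1 slot"},
}


def components_to_remove(server_type: str, installed: Iterable[str]) -> List[str]:
    """Return installed components that are not part of the minimum set."""
    required = MINIMUM_TO_POST[server_type]
    return sorted(set(installed) - required)


# Hypothetical tower server inventory.
installed = ["PSU1", "PSU2", "CPU1", "memory module in A1 slot", "GPU in slot 3"]
print(components_to_remove("tower", installed))
```

The removed components can then be added back one at a time, retesting POST
after each addition.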

No Power, No Video, No POST

The following lists show the steps to take to help resolve no power, no
video, and no POST issues.

No Power

1: Swap the AC power cable with a known-good power cable. If the system
   works with a known-good AC power cable, replace the power cable
   (optional).
2: Reset the power supply.
   a. Verify that the power source is working properly by connecting a
      device that draws a similar amount of power.
3: Replace the power supply. The server does not turn on by using the
   front ear node.
4: Ensure the proper power is going to the chassis.
5: Ensure all power supplies are firmly seated, power cables are
   connected, and both power supplies are operating.
6: Turn on the server by using the power button.

No Video

1: Check the cable connections (power and display) to the monitor.
2: Check the video interface cabling from the system to the monitor.
   Servers have two VGA ports. The front VGA port is on the right control
   panel and the rear VGA port is on the RIO board. If the system is
   liquid cooled, there is no rear VGA port.
3: Run the LCD Built-in Self Test (BIST).

No POST

The Power On Self Test (POST) is a series of diagnostic tests that run
automatically when the system is turned on. POST tests memory, keyboard,
and the disk drives. If the test is successful, the computer boots itself;
otherwise, the system displays an LED error or an error message on the LCD
panel.

1: Check the LCD screen or LED indicators for any error messages.
2: Ensure the server is turned on by verifying the power supply LED light
   bar.
3: Before handling server components or cables, take all the precautions
   to avoid ESD damage.
4: Disconnect all the cables from the server including the power cable.
5: Reconnect the power and video cable only.
6: Attempt to POST the server.
7: Disconnect the hard drives, optical drives, and tape drives from the
   server and attempt to POST the server.
8: Reseat the control panel connector.
9: Ensure the processors and heat sinks are seated correctly.
10: If the server does not complete the POST, clear the NVRAM using the
    jumper.

Memory

Memory Event Logging

When analyzing memory errors, an uncorrectable error generates a
message to replace the DIMM. Correctable errors are typically resolved
when the server reboots. Service engineers use the information in the log
entry to identify and replace the DIMM.

The 15G and 16G memory event logging uses a common set of event
messages [9] to describe the recommended action instead of describing the
underlying event.

PowerEdge servers include a memory error logging feature that provides
error tracing for suspected failing DIMMs.
• Confirmation of an individual memory DIMM with one or more errors.
• Documentation on the error locations (failing DRAM device and cell)
and type of errors.
• Enablement of population-wide statistical data providing individual
DIMM part numbers, and BIOS revisions.

Select each tab to learn about types of error logging changes:

[9] BIOS chooses the severity of the message by selecting the appropriate
IPMI sensor that maps to the severity. Dedicated sensors have identical
event data parameters for each level of severity. Details of what generated
the event (for example, a PPR self-healing failed) are encoded as a
'debug code' in the event message body.

Correctable Error (CE) Logging Changes

Correctable Error logging is disabled by default. However, based on
individual customer needs, it can be enabled in the BIOS.

When the BIOS generates host visible correctable errors, the events are
logged in the Technical Support Report (TSR) and Serial Presence Detect
(SPD).

The system receives a summary of single bit error correcting data per
DIMM once per day.

Two different error log values:

1. If the error log values are greater than the existing values, the prior
values are overwritten [10]. It is subject to a threshold.
2. If the on-DIMM error log is below the reported threshold, it reports zero
errors.

[10] It is overwritten with the highest historical value that is registered
to diagnose the error. Because of the space constraint in the SPD rather
than TSR, it is overwritten.

Uncorrectable Error (UCE) Logging Changes

Two error message codes for uncorrectable error are:

1. MEM7114 message: The error indicator recommends replacing the
DIMM with a MEM7114 [11] message whenever an uncorrectable error is
detected [12] or consumed [13]. The recommendation is made regardless of
PPR error correction.
2. MEM5100 message: A MEM5100 [14] message is generated when error
events occur in a mirrored memory [15] region. In order to trigger a Mirror
Failover event, one DRAM must trigger an Uncorrectable Error. The
failed device needs replacement as part of the mirror remediation. Due
to the mirroring remediation, when running in Fault resilient mode there
should be no data loss.

Tip: Only host (CPU) ECC correctable errors are included in
the TSR memory log. On-die ECC single bit errors are not.

[11] This error message code indicates a critical severity and recommends
the action of contacting support and requesting to replace the parts
(DIMM).
[12] Memory patrol scrub uses the CPU memory controller to periodically
scan DRAM and correct any single-bit errors that it encounters. Demand
Scrub occurs when the memory controller encounters a correctable error
during a regular run-time read transaction and writes back corrected data.
[13] Consumed by reading or writing to the impacted area.
[14] This error message code indicates an informational message and
appears when an uncorrectable error occurs in mirrored memory. It does
not indicate any action, as it is just an informational message.
[15] A mirrored region where data integrity is maintained by a mirror copy.
In the event of a UCE, the device where the error occurred is identified for
replacement. Mirrored memory is resilient to uncorrectable errors because
it has two copies of the data.

GPU

Updating the NVIDIA Drivers

A common way to resolve issues is to upgrade the component to
the latest firmware revision. When troubleshooting a GPU-related issue,
always verify that the server is running the latest firmware version.

Scenario: An administrator contacts a Technical Support Engineer stating
that the graphics processing units (GPUs) are not compatible in their latest
environment due to unknown errors. The administrator wants to know
which version of the GPU driver is the latest that should be
installed in their environment.

Solution:
1. Direct the customer to the NVIDIA driver reference portal.
2. Guide the customer to input the system details: Product Type, Product
Series, Product, Operating System, and CUDA Toolkit version.
3. Direct the user to click the Search button.
4. Request the user to download the latest version of the driver file and
install in the environment.

Tip: For more information on toolkit and driver installation,
download the NVIDIA GPU CUDA Toolkit & Driver
Installation Procedure from the references.

Important: The GPU drivers available on the Dell support
site are used to validate the GPUs in the chassis. It is
possible that a later version of the driver might be available
in the NVIDIA portal.

GPU XID Errors

Scenario: A customer contacts a Technical Support Engineer with a failed
GPU based on the System Event Logs (SEL). The customer indicates that
XID fault #9 is produced in the SEL. The customer wants to identify the
cause of the issue. This scenario shows how the available resources help
identify the failure before acting on the error.

Solution:
1. Collect the XID number from the user. In this scenario, the XID number
is 9.
2. Open the NVIDIA GPU Management and Deployment guide.
3. Verify the cause of the issue by referring to the XID number. In this
scenario, the cause is mapped to "driver error". Problems in the core of
the driver cause the driver errors.
4. Direct the customer to update the driver to the latest version. If the
customer has the latest version of the driver, then ask them to install
the previous version of the driver and retest. The latest OS logs enable
the customer to understand if the issue is due to a specific build of the
driver or the GPU combination.
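
When the OS logs are available, XID numbers can be collected from them
instead of being read manually. The Python sketch below assumes that the
NVIDIA kernel driver writes lines of the form "NVRM: Xid (device): number,
..." to the Linux kernel log. The sample line, the function name, and the
lookup table are illustrative. Only XID 9 is mapped here because it is the
only code discussed in this scenario; other codes should be looked up in the
NVIDIA documentation.

```python
import re
from typing import List, Tuple

# Assumed layout of an XID line in the Linux kernel log.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<xid>\d+)")

# Only the code from this scenario is listed.
XID_CAUSES = {9: "driver error"}


def find_xid_events(log_text: str) -> List[Tuple[str, int, str]]:
    """Return (device, XID number, cause) for each XID line in the log."""
    events = []
    for match in XID_PATTERN.finditer(log_text):
        xid = int(match.group("xid"))
        cause = XID_CAUSES.get(xid, "see the NVIDIA XID documentation")
        events.append((match.group("device"), xid, cause))
    return events


# Illustrative log excerpt, not taken from a real system.
sample_log = "kernel: NVRM: Xid (PCI:0000:3b:00): 9, pid=1234, sample text\n"
print(find_xid_events(sample_log))
```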

WDDM/VDI Modes for GPU

Scenario: A customer contacts technical support. The customer is unable
to use the GPU in WDDM/VDI mode. When identifying configuration
settings, technical support may need to talk the customer through the
process of verifying the proper settings.

Solution:
1. Direct the customer to verify the compatibility of the GPUs with
WDDM/VDI mode.
2. If the GPUs are compatible with Windows Display Driver Model
(WDDM) and Virtual Desktop Infrastructure (VDI) mode, ask the
customer to boot the server.
3. After booting the server, open a remote terminal session to the server.

− For a system with the latest version of the GPUs, run the
command "gpumodeswitch --gpumode graphics" to enable
WDDM/VDI modes.

− For a system with the previous version of the GPUs, run the
command "nvidia-smi -dm 0" to enable WDDM/VDI modes. For the -dm
option, 0 selects WDDM and 1 selects TCC.

NVIDIA A100 HGX 80/40 GB GPUs support the WDDM 2.0+ based
functionality, only if the virtual GPU (vGPU) software is installed.

Important: Some GPUs may not have the ability to function
in the WDDM/VDI mode. The ideal way to resolve the issue
is to verify what modes are supported by a specific GPU.
Details about a GPU are available on the NVIDIA technical
support site.

Undetected GPU

Scenario: A customer contacts technical support and states that a GPU
core (A100) is not functional on their Linux server. The customer also
states that four physical GPUs are installed but only three are detected.
The customer wants to identify which GPU is causing the issue. When
analyzing issues such as an unrecognized or missing component,
the most common cause is a component or cabling connection. This is
especially true if the technical support engineer sees that the server had
recent repairs.

Solution:
1. Direct the customer to verify that all GPUs and cables are seated
correctly.
2. Ask the user to reboot the server. Ensure that the server is running
with the NVIDIA driver (or CUDA).
3. After booting the server, open a terminal session to the server and
type the following as a single command line:
− # nvidia-smi --query-gpu=gpu_name,serial,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total,temperature.gpu,power.draw --format=csv
4. Identify the serial numbers of all GPUs from the obtained output.
− In this scenario, the user has installed four GPUs but the query has
displayed the serial number of only three GPUs.

5. Log in to the iDRAC to view the NVIDIA A100 GPU information.

− In the iDRAC, match the missing GPU slot number to the
corresponding GPU slot number that is printed on the metal jig
inside the server.

6. The GPU serial number that is not displayed in the command output is
the non-functional GPU on the server.

− In this scenario, GPU 0320415057649 in slot 24 is not displayed in
the command output.
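
The comparison in steps 4 through 6 can be sketched in Python. The sketch
parses CSV text of the kind that nvidia-smi prints with --format=csv and
reports the inventory entries that were not listed. The function names, the
sample output, and all serial numbers except the one for slot 24 are
illustrative assumptions. The slot-to-serial inventory is assumed to be
copied by hand from the iDRAC.

```python
import csv
import io
from typing import Dict, Set


def detected_serials(csv_text: str) -> Set[str]:
    """Return the serial numbers listed in nvidia-smi CSV output."""
    # nvidia-smi separates fields with a comma followed by a space.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return {row["serial"].strip() for row in reader}


def missing_gpus(installed: Dict[str, str], csv_text: str) -> Dict[str, str]:
    """Return the slot-to-serial entries that the query did not report."""
    found = detected_serials(csv_text)
    return {slot: serial for slot, serial in installed.items()
            if serial not in found}


# Slot-to-serial inventory as read from the iDRAC. Values are illustrative,
# except slot 24, which matches the scenario.
INSTALLED = {
    "21": "0320415000001",
    "22": "0320415000002",
    "23": "0320415000003",
    "24": "0320415057649",
}

# Illustrative CSV output in which only three GPUs are reported.
SAMPLE_OUTPUT = """name, serial, pci.bus_id
NVIDIA A100, 0320415000001, 00000000:17:00.0
NVIDIA A100, 0320415000002, 00000000:31:00.0
NVIDIA A100, 0320415000003, 00000000:B1:00.0
"""

print(missing_gpus(INSTALLED, SAMPLE_OUTPUT))
```

Serial numbers are kept as strings so that leading zeros are preserved.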

GPU Memory Page Error

Scenario: A customer contacts technical support and states that the
server throws a bad memory page error on the GPU with serial number
"0320415054563". The customer wants to understand the reason for the
memory error.

Solution:
1. Direct the customer to reboot the server. Ensure that the server is
running with the NVIDIA driver (or CUDA).
2. After booting the server, open a terminal session to the server and
type: # nvidia-smi -i 0320415054563 -q -d ECC
3. If the Single Bit Error (SBE) value is greater than 1000 or if the Double
Bit Error (DBE) value is greater than 10, it indicates that the GPU is
failing. If so, request the customer to replace the GPUs to avoid further
damage.
a. In this scenario, the output shows that the SBE and DBE values are
within the limits, so the GPUs have no issue.
b. High error rates are commonly caused by physical
damage to the GPUs or excessive heat for prolonged periods of
time.
4. Inform the user that the issue might be due to an unreliable memory
page and that the page will be retired by the GPU. No further
action is needed in this scenario.
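
The threshold check in step 3 amounts to a simple comparison. The Python
sketch below encodes the two limits from this section. The function name and
the sample counts are hypothetical, and the error counts are assumed to have
been read from the ECC section of the nvidia-smi output.

```python
SBE_LIMIT = 1000  # single bit errors above this value indicate a failing GPU
DBE_LIMIT = 10    # double bit errors above this value indicate a failing GPU


def gpu_is_failing(single_bit_errors: int, double_bit_errors: int) -> bool:
    """Apply the SBE and DBE limits from the solution steps."""
    return single_bit_errors > SBE_LIMIT or double_bit_errors > DBE_LIMIT


# Hypothetical counts read from the ECC section of the nvidia-smi output.
print(gpu_is_failing(single_bit_errors=12, double_bit_errors=0))
print(gpu_is_failing(single_bit_errors=1500, double_bit_errors=0))
```

Counts that are exactly at a limit are treated as within the limits, because
the text specifies "greater than" for both values.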

Important: The GPU memory pages become unreliable
when multiple Single Bit ECC Errors or Double Bit ECC
Errors occur on the same memory page. The GPUs are
internally designed to retire GPU memory pages when they
become unreliable.
A DBE causes the GPU to halt operations. The system
displays a fatal error in logs and provides the details of the
slot or the bus.


Appendix

PPCM Examples

Backplane

The Portfolio Platform Configuration Matrix shows the support and
configuration options.

PERC

Riser

10 Gigabit Media Independent Interface (XGMII)
XGMII connects full duplex 10 GbE ports to each other and to other
devices on a printed circuit board. XGMII is typically used for on-chip
connections.
