
A1000's Most Common Problems Troubleshooting Guide

The purpose of this troubleshooting guide is to provide a general approach to solving the A1000 problem areas most often reported by the field in the Radiance database. These problem areas are: the battery FRU, the A1000 controller/HBA, and A1000 LUN access.

Preliminary inspection by visual fault indications and RM6 healthck:

Check the RAID module for any visible indications of failure, such as an amber LED on the controller, drives, power supply, fan modules, or battery, or bent pins. This quick diagnosis can shed some light on the problem. Whether or not any indications of failure are found, run a health check to allow the A1000 to detect the failures and perform an overall root cause analysis, and then follow the instructions provided to replace the component.

Note: See Appendix A for health check procedures.

If the steps in Recovery do not resolve the problem, proceed with the following steps to identify the root cause and the fix for the individual A1000 component.

I. For A1000 battery FRU:

Common problems - battery life expiration or battery failure. The ultimate solution for these common problems is to replace the battery and then reset the battery age.

1. Determine the battery age by running the following RM6 CLI command:

    raidutil -c <device> -B

   For example:

    # /usr/lib/osa/bin/raidutil -c c1t0d0 -B
    LUNs found on c1t0d0.
      LUN 0    RAID 0    10 MB
      LUN 1    RAID 5    1000 MB
    Battery age is between 720 days and 810 days.
    raidutil succeeded!

   Battery age between 630 and 720 days - near expiration
   Battery age greater than 720 days - expired

   The battery should be replaced in both of the above cases.

   NOTE: The A1000 battery is not hot-swappable.

2. If the battery age is less than 630 days and the fault LED or healthck has indicated a failure, it is a battery failure case. Gather the battery support information from the label on the battery canister, and record it in the Radiance case notes with the indication 'battery failed/non-expiration'. This information in the database is valuable for subsequent reliability analysis. The Radiance Support Type should be set to "Hardware On-Site".

   Battery support information example:

    Part number          : 370-3417-01
    Serial number        : 17-digit number
    Date of manufacture  : mm/dd/yy
    Date of installation : mm/dd/yy
    Date of replacement  : mm/dd/yy
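
The age thresholds in steps 1 and 2 can be sketched as a small script that reads the text printed by "raidutil -c <device> -B". This is only an illustration, not part of RM6: the function names are hypothetical, and it assumes that the lower bound of the reported age range is the value to compare against the 630-day and 720-day limits.

```python
import re

# Thresholds in days, taken from step 1 of this guide.
NEAR_EXPIRATION_DAYS = 630
EXPIRED_DAYS = 720

# Matches the age line shown in the raidutil example output.
AGE_LINE = re.compile(r"Battery age is between (\d+) days and (\d+) days")

# Sample text copied from the example in step 1.
SAMPLE_OUTPUT = """LUNs found on c1t0d0.
  LUN 0    RAID 0    10 MB
  LUN 1    RAID 5    1000 MB
Battery age is between 720 days and 810 days.
raidutil succeeded!
"""


def parse_battery_age(raidutil_output):
    """Return the (low, high) age bounds in days, or None if no age line is found."""
    match = AGE_LINE.search(raidutil_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify_battery(low_days, fault_reported=False):
    """Map the lower age bound onto the categories used in this guide.

    Assumption: the lower bound of the reported range decides the category.
    fault_reported stands for a fault LED or healthck failure (step 2).
    """
    if low_days >= EXPIRED_DAYS:
        return "expired"
    if low_days >= NEAR_EXPIRATION_DAYS:
        return "near expiration"
    if fault_reported:
        return "battery failed/non-expiration"
    return "ok"


if __name__ == "__main__":
    low, high = parse_battery_age(SAMPLE_OUTPUT)
    print(low, high, classify_battery(low))
```

Both "expired" and "near expiration" call for battery replacement; "battery failed/non-expiration" is the case that should be recorded in the Radiance case notes.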

After the battery support information has been gathered, replace the battery.
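
Before the case notes are updated, the battery support information can be sanity-checked against the formats given in the example (a 17-digit serial number and mm/dd/yy dates). The following is a minimal sketch only: the dictionary keys and function names are hypothetical, and the serial number and dates in the example record are made-up placeholder values.

```python
import re
from datetime import datetime

SERIAL_RE = re.compile(r"\d{17}")
DATE_FIELDS = (
    "date_of_manufacture",
    "date_of_installation",
    "date_of_replacement",
)

# Placeholder record: only the part number comes from this guide.
EXAMPLE_RECORD = {
    "part_number": "370-3417-01",
    "serial_number": "12345678901234567",
    "date_of_manufacture": "03/15/99",
    "date_of_installation": "06/01/99",
    "date_of_replacement": "02/20/01",
}


def is_mmddyy(text):
    """Return True if text is a valid date in mm/dd/yy form."""
    try:
        datetime.strptime(text, "%m/%d/%y")
    except ValueError:
        return False
    return True


def check_battery_record(record):
    """Return a list of problems found in the record (empty if none)."""
    problems = []
    if not record.get("part_number"):
        problems.append("part_number: missing")
    if not SERIAL_RE.fullmatch(record.get("serial_number", "")):
        problems.append("serial_number: expected a 17-digit number")
    for field in DATE_FIELDS:
        if not is_mmddyy(record.get(field, "")):
            problems.append(field + ": expected mm/dd/yy")
    return problems


if __name__ == "__main__":
    print(check_battery_record(EXAMPLE_RECORD))
```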

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

Battery (A1000 only):

   1. You must turn off the power to replace the battery.
   2. Push down the catch on the outside of the battery.
   3. Pull the battery out.
   4. Slide the new battery into the battery port in the controller board. Be sure the battery is firmly seated.

   NOTE: The A1000 battery is not hot-swappable.

3. After battery replacement, run the following RM6 command to reset the battery age:

    raidutil -c <device> -R

   For example:

    # /usr/lib/osa/bin/raidutil -c c1t0d0 -R
    LUNs found on c1t0d0.
      LUN 0    RAID 0    10 MB
      LUN 1    RAID 5    1000 MB
    raidutil succeeded!

4. Run the RM6 "raidutil -c <device> -B" command again to verify that the battery age has been reset to zero.

5. Run RM6 healthck to make sure the battery problem is fixed.

   Note: See Appendix A for health check procedures.

II. For A1000 controller/HBA:

Common problems - unable to scan/access the controller, unresponsive/dead controller, or offline controller.

1. Connect to the serial port of the A1000 and power cycle the controller. After the boot cycle has completed, check the serial console for the following message:

    "NOTE: Logical Unit 0 is now optimal and online."

   If you do not see this message, there is a problem with the A1000 controller. Otherwise, you can be fairly sure that the problem lies with the host, its device entries, the cable, the terminator, or the HBA. Try running probe-scsi-all at the ok prompt. If you can see the A1000, then the HBA, SCSI cable, and terminator are good.

2. Check for mismatched firmware/NVSRAM versions.

3. Check rmlog to see if any recent failures have been recorded.

4. Use the CLI command "lad" to verify that the controller is visible. This can show whether the problem exists only in the RM6 GUI.

   Note: See Appendix C for "lad" command syntax and example.

5. Check /kernel/drv/sd.conf for proper RM6 entries.

6. Reboot the host to fix any temporary problem with RM6.

7. "Unresponsive/dead controller" can be caused by a power cycle of the controller during the OS's device scan. Rebooting the host after the controller has completed its initialization may fix the problem.

8. "Offline controller" can be caused by a faulty HBA. After the HBA is replaced, the controller must be placed back online.

9. Run RM6 healthck to make sure the controller/HBA problem is fixed.

   Note: See Appendix A for health check procedures.

III. For A1000 LUN access:

Common problems - dead LUN or missing LUN.

1. This can be caused by a failed drive or by replacing the wrong drive. Replacing the failed/wrong drive, formatting the LUN, and restoring the data from backup should fix the problem.

2. This can also be caused by an interrupted write process that has failed. Stopping all I/O to the LUN, formatting the LUN, and restoring the data from backup should fix the problem.

3. LUN 0 is required for normal communication between the host and the controller. Check that LUN 0 exists; if it does not, recreate it.

4. Check the System_MaxLunsPerController parameter in RM6 against the current number of LUNs.

5. Run RM6 healthck to make sure the LUN problem is fixed.

   Note: See Appendix A for health check procedures.

Appendix:

A. Two ways to run healthck:

   CLI: /usr/lib/osa/bin/healthck -a

   Example:

    monty51# /usr/lib/osa/bin/healthck -a
    Health Check Summary Information
    monty51_001: Unable To Scan Module
    healthck succeeded!

   GUI: /usr/lib/osa/bin/rm6 -> Recovery Guru -> stethoscope icon -> Show Procedure button

B. raidutil -c <device> -B

   Example:

    # /usr/lib/osa/bin/raidutil -c c1t0d0 -B
    LUNs found on c1t0d0.
      LUN 0    RAID 0    10 MB
      LUN 1    RAID 5    1000 MB
    Battery age is between 720 days and 810 days.
    raidutil succeeded!

C. lad

   Example:

    # /usr/lib/osa/bin/lad
    c1t0d0 1T71523997 LUNS: 0 1 2
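
The LUN checks in section III, steps 3 and 4, can be illustrated with a short script that reads one line of "lad" output in the form shown in Appendix C. This is a sketch under stated assumptions: the function names are hypothetical, the second field is treated as a controller identifier, the System_MaxLunsPerController value must be supplied by the caller, and "more LUNs than the parameter allows" is assumed to be the condition worth flagging.

```python
# Sample line copied from the lad example in Appendix C.
SAMPLE_LAD_LINE = "c1t0d0 1T71523997 LUNS: 0 1 2"


def parse_lad_line(line):
    """Split one lad output line into (device, controller_id, [LUN numbers]).

    Assumption: the second field identifies the controller.
    """
    head, sep, tail = line.partition("LUNS:")
    fields = head.split()
    if not sep or len(fields) != 2:
        raise ValueError("unrecognized lad line: %r" % line)
    device, controller_id = fields
    luns = [int(token) for token in tail.split()]
    return device, controller_id, luns


def check_lun_access(luns, max_luns_per_controller):
    """Return a list of findings for section III, steps 3 and 4.

    max_luns_per_controller is the System_MaxLunsPerController value,
    read from RM6 by the engineer (not obtained by this sketch).
    """
    findings = []
    if 0 not in luns:
        findings.append("LUN 0 is missing")
    if len(luns) > max_luns_per_controller:
        findings.append("LUN count exceeds System_MaxLunsPerController")
    return findings


if __name__ == "__main__":
    device, controller_id, luns = parse_lad_line(SAMPLE_LAD_LINE)
    # 8 is a placeholder value for the parameter, not a documented default.
    print(device, controller_id, luns, check_lun_access(luns, 8))
```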
