CIR - PRB0055726 - ID - 0002411 - Final
Details
Release Date: 10 December 2020
Outage Start: 07 December 2020 07:23
Incident Description
During this incident, IBM Cloud services were not available because of a cooling problem in a non-IBM-owned building that
houses the IBM Cloud data center in Sao Paulo, Brazil. Note: This is a multi-tenant building that includes other cloud service
providers and other non-IBM co-lo clients. All were impacted by this event.
On 07 December 2020 at 07:13, a local IBM technician noticed an increase in temperature in SAO01. IBM Cloud DC
Operations began to execute its incident response plan and actions. At 07:23, the Data Center Aggregate Routers (DARS)
Investigation by the building owner discovered an issue with the main water supply that feeds the building's cooling towers. In
an attempt to avert service impact, Data Center Specialists deployed fans. The building vendor then worked to move cooling to
a secondary system; however, it was also impacted by the water supply issue.
It took several hours for the building owner to make repairs to the water supply valves, refill the cooling system, and reduce the
interior temperature to a level that would allow IBM Cloud employees back inside the data center. Once back in the building,
the data center infrastructure had to be validated, and brought back online, ensuring there was no loss of data, and that
the initial service impact. Additionally, alarms were not detected by the building vendor to mitigate this issue.
During a building bi-monthly fan coil maintenance on 04 December 2020, two valves for fan coils were left open at the end of
the Method of Procedure (MOP), affecting water supply to the cooling towers. This allowed the dual water tanks to begin to
drain. Following the maintenance, the building vendor did not verify that the valves were closed and did not detect the water
tank level alarms that were triggered. These alarms indicated a continuous loss of water in the dual water tanks.
Research into the alerting issue indicated that alarms remained open and were not properly programmed to send escalating
messages to the building vendor operations team. The building vendor operations and maintenance teams reacted to the event
Timeline
This timeline only reflects the alerting, troubleshooting, and mitigation of the top-level service degradation or disruption.
Additional dependent services or unique customer environments are not reflected here, and these might have experienced
1. 07 December 2020 07:13 - IBM Cloud DC Specialists alerted the IBM Site Manager about temperature alarms. Coordinated
response began.
2. 07 December 2020 07:23 - Internal connectivity to the data center was lost. DARS shut down automatically. Incident started.
3. 07 December 2020 07:23 - IBM Site Manager notified Building Owner of rising temperature alarms. Building Management
informed IBM that they were aware of the issue and were working on it.
4. 07 December 2020 07:30 - IBM Cloud DC Specialists deployed fans in an attempt to reduce the server room temperature.
5. 07 December 2020 09:04 - Building Vendor updated the Site Manager and moved cooling to a secondary system. A brief
6. 07 December 2020 10:20 - Decision was made to start shutting down bare metal servers to reduce cooling load
7. 07 December 2020 10:30 - Water leak was repaired and water trucks dispatched to the site to replenish lost water.
8. 07 December 2020 11:20 - IBM Cloud personnel were evacuated from the building because of elevated temperature.
9. 07 December 2020 11:53 - Building Owner requested authorization to power down SAO01, based on elevated
10. 07 December 2020 13:50 - IBM Data Center personnel were allowed to return to DC (visual inspection started).
11. 07 December 2020 15:30 - Network service team began recovery efforts.
14. 07 December 2020 20:30 - Storage internal management switches were online.
19. 08 December 2020 01:00 - ROKS restoration completed. All Compute services were restored.
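As an illustration only, the elapsed time between the timeline entries above can be worked out with a short script. This is a minimal sketch, not part of the original report: the event labels are shortened paraphrases, the helper names are hypothetical, and only entries that appear in the timeline above are used.

```python
from datetime import datetime

# Timestamp layout used in the timeline (English month names, default locale assumed).
FMT = "%d %B %Y %H:%M"

# Timestamps copied from the timeline; labels are shortened paraphrases.
EVENTS = [
    ("temperature alarms reported", "07 December 2020 07:13"),
    ("incident started", "07 December 2020 07:23"),
    ("personnel left the building", "07 December 2020 11:20"),
    ("personnel returned to the DC", "07 December 2020 13:50"),
    ("compute services restored", "08 December 2020 01:00"),
]


def elapsed_since(events, start_label):
    """Return {label: timedelta relative to the event named start_label}."""
    parsed = {label: datetime.strptime(ts, FMT) for label, ts in events}
    start = parsed[start_label]
    return {label: t - start for label, t in parsed.items()}


def format_delta(delta):
    """Format a timedelta as signed hours and minutes, e.g. '+3h57m'."""
    total = int(delta.total_seconds() // 60)
    sign = "-" if total < 0 else "+"
    hours, minutes = divmod(abs(total), 60)
    return f"{sign}{hours}h{minutes:02d}m"


deltas = elapsed_since(EVENTS, "incident started")
for label, delta in deltas.items():
    print(f"{format_delta(delta):>8}  {label}")
```

Measured this way, the listed restoration of all Compute services falls 17 hours and 37 minutes after the incident start at 07:23.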
Service Restoration
The Building owner hosting the IBM Cloud Data Center worked with the building vendor to close the valves and refill the cooling
tanks. IBM Cloud Specialists then restored Network and Cloud services, mitigating the impact and ending the incident on 08
Enhance checks and verification in the vendor MOP for data center maintenance - 31 December 2020
Validate with the provider that appropriate thresholds and alerting are established and monitored - 31 December 2020
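The idea of an alarm that escalates while it remains open can be sketched as follows. This is a hypothetical illustration, not the building vendor's actual alerting system: the escalation intervals, recipient names, and the `Alarm` and `recipients_due` helpers are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical escalation ladder: (minutes the alarm has been open, who is notified).
ESCALATION_STEPS = [
    (0, "operations on-call"),
    (30, "operations lead"),
    (120, "site manager"),
]


@dataclass
class Alarm:
    name: str
    opened_minute: int          # minute at which the alarm was raised
    acknowledged: bool = False  # set once someone responds to the alarm


def recipients_due(alarm, now_minute):
    """Return everyone who should have been notified by now_minute.

    An acknowledged alarm stops escalating; an open one reaches more
    recipients the longer it stays open.
    """
    if alarm.acknowledged:
        return []
    open_for = now_minute - alarm.opened_minute
    return [who for after, who in ESCALATION_STEPS if open_for >= after]


alarm = Alarm(name="water tank level low", opened_minute=0)
print(recipients_due(alarm, 45))
```

With a ladder like this, an alarm that stays open without acknowledgement reaches progressively wider audiences instead of remaining silently open.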
____________________________________________________________________________________________________
10 December 2020 PRB0055726