Rethinking Data Centers
Chuck Thacker, Microsoft Research, October 2007
"There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things." (Niccolò Machiavelli)
Problems
Today, we need a new order of things. Problem: today, data centers aren't designed as systems. We need to apply systems engineering to:
Packaging Power distribution Computers themselves Management Networking
I'll discuss all of these. Note: data centers are not mini-Internets.
Sun's problems
Servers are off-the-shelf (with fans). They go into the racks sideways, since they're designed for front-to-back airflow. Servers have a lot of packaging (later slide). Cables exit front and back (mostly back). The rack must be extracted to replace a server.
They have a tool, but the procedure is complex.
Packaging
A 40' container holds two rows of 16 racks. Each rack holds 40 1U servers, plus a network switch. Total per container: 1280 servers. If each server draws 200 W, the rack is 8 kW and the container is 256 kW. A 64-container data center is 16 MW, plus inefficiency and cooling. 82K computers. Each container has independent fire suppression, which reduces insurance cost.
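The rack, container, and site figures above follow from straightforward multiplication. A minimal sketch of the arithmetic (assuming the 200 W per-server figure from the slide):

```python
# Back-of-the-envelope packaging and power figures (assumes 200 W per server, as above).
servers_per_rack = 40
racks_per_container = 2 * 16                     # two rows of 16 racks
watts_per_server = 200
containers = 64

servers_per_container = servers_per_rack * racks_per_container        # 1280
rack_kw = servers_per_rack * watts_per_server / 1000                  # 8 kW
container_kw = servers_per_container * watts_per_server / 1000        # 256 kW
site_mw = containers * container_kw / 1000                            # ~16.4 MW, before cooling
total_servers = containers * servers_per_container                    # 81,920 (~82K)

print(rack_kw, container_kw, site_mw, total_servers)
```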
Each rack contains a step-down transformer. Distribution voltage is 480 VAC. A diesel generator supplies backup power, using a flywheel to store energy until the generator comes up.
When engaged, the frequency and voltage drop slightly, but nobody cares.
Distribution chain:
Grid -> 60 Hz motor -> 400 Hz generator -> Transformer -> Servers. Much more efficient (probably > 85%).
The Cray-1 (300 kW, 1976) was powered this way. Extremely high reliability.
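One way to read the "> 85%" estimate is as the product of per-stage efficiencies along that chain. The stage values below are illustrative assumptions for the sketch, not figures from the talk:

```python
# Illustrative end-to-end efficiency of the grid -> motor -> generator -> transformer chain.
# Per-stage values are assumptions for the sketch, not measured numbers.
stage_efficiency = {
    "60 Hz motor": 0.96,
    "400 Hz generator": 0.95,
    "step-down transformer": 0.98,
}

overall = 1.0
for eff in stage_efficiency.values():
    overall *= eff

print(f"end-to-end efficiency ~ {overall:.0%}")   # ~89% with these assumed values
```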
The Computers
We currently use commodity servers designed by HP, Rackable, others. Higher quality and reliability than the average PC, but they still operate in the PC ecosystem.
IBM doesn't.
Maybe one for DB apps. Properties TBD. Maybe Flash memory has a home here.
More on Computers
Use custom motherboards. We design them, optimizing for our needs, not the needs of disparate applications. Use commodity disks. All cabling exits at the front of the rack.
So that it's easy to extract a server from the rack.
Use no power supplies other than on-board inverters (we have these now). Error-correct everything possible, and check what we can't correct. When a component doesn't work to its spec, fix it or get another, rather than just turning the feature off. For the storage SKU, don't necessarily use Intel/AMD processors.
Although both seem to be starting to understand that low power and energy efficiency are important.
Management
Were doing pretty well here.
Automatic management is getting some traction; opex isn't overwhelming.
Networking
A large part of the total cost. Large routers are very expensive and command astronomical margins. They are relatively unreliable; we sometimes see correlated failures in replicated units. Router software is large, old, and incomprehensible, and is frequently the cause of problems. Routers are serviced by the manufacturer, and they're never on site. By designing our own, we save money and improve reliability. We also get exactly what we want, rather than paying for a lot of features we don't need (data centers aren't mini-Internets).
We can distribute applications to servers to distribute load and minimize hot spots.
Basic switches
Type 1: Forty 1 Gb/sec ports, plus two 10 Gb/sec ports.
These are the leaf switches. The center uses 2048 switches, one per rack. Not replicated (if a switch fails, we lose 1/2048 of total capacity).
[Figure: 4x4 sub-cubes joined by bundles of 16 links, plus a group of 4 links]
One container:
Level 2: 2 10-port 10 Gb/sec switches, 16 10 Gb/sec links
Level 1: 8 10-port 10 Gb/sec switches, 64 10 Gb/sec links
Level 0: 32 40-port 1 Gb/sec switches, 1280 1 Gb/sec links
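The level sizes are consistent with the two switch types' port counts. A small sanity-check sketch (switch and port counts from the slides; treating the leftover top-level ports as the links that leave the container is my reading of the diagram):

```python
# Sanity check of the per-container switch levels against the switch port counts.
TYPE1_1G_PORTS, TYPE1_10G_PORTS = 40, 2     # Type 1: forty 1 Gb/sec ports plus two 10 Gb/sec uplinks
TYPE2_10G_PORTS = 10                        # Type 2: ten 10 Gb/sec ports

level0, level1, level2 = 32, 8, 2           # switches per level, from the slide

server_links   = level0 * TYPE1_1G_PORTS                      # 1280 x 1 Gb/sec links to servers
level0_uplinks = level0 * TYPE1_10G_PORTS                     # 64 x 10 Gb/sec links into level 1
level1_uplinks = level1 * TYPE2_10G_PORTS - level0_uplinks    # 16 x 10 Gb/sec links into level 2
spare_ports    = level2 * TYPE2_10G_PORTS - level1_uplinks    # 4 x 10 Gb/sec ports left at the top

print(server_links, level0_uplinks, level1_uplinks, spare_ports)   # 1280 64 16 4
```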
Hypercube properties
Minimum hop count. Even load distribution for all-all communication. Reasonable bisection bandwidth. Can route around switch/link failures. Simple routing:
Outport = f(Dest xor NodeNum). No routing tables.
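A minimal sketch of table-free hypercube routing in the spirit of that rule: the output port is a pure function of Dest xor NodeNum. The particular f (take the lowest differing address bit) and the names below are my illustration; the talk does not specify f:

```python
# Route one hop at a time in an n-dimensional hypercube: each switch forwards on the
# lowest-numbered dimension in which its own address differs from the destination's.
def next_outport(node: int, dest: int) -> int | None:
    diff = node ^ dest                        # bits where current node and destination disagree
    if diff == 0:
        return None                           # already at the destination
    return (diff & -diff).bit_length() - 1    # index of the lowest set bit = outgoing dimension

# Example: route from node 5 to node 12 in a 16-node (4-dimensional) cube.
node, dest = 5, 12
while node != dest:
    port = next_outport(node, dest)
    node ^= 1 << port                         # crossing a link flips that address bit
    print(f"hop out port {port} -> node {node}")
```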
[Figure: 16-node (4-dimensional) hypercube with nodes numbered 0-15 and per-dimension link labels]
[Figure: one container's switching tree; 2 Type 2 switches (10 Gb/sec ports) connected to 8 Type 2 switches by 16 links; the 8 Type 2 switches connected to 32 Type 1 switches by 64 links; each Type 1 switch serves 40 x 1 Gb/sec server links]
Bandwidths
Servers -> network: 82 Tb/sec. Containers -> cube: 2.5 Tb/sec. Average will be lower.
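These aggregates follow from the per-link rates and the link counts above. A quick check (assuming 81,920 servers at 1 Gb/sec each, and four 10 Gb/sec links from each container into the cube, the latter being my reading of the spare level-2 ports):

```python
# Aggregate bandwidth check for the 64-container data center.
containers = 64
servers = containers * 1280                         # 81,920 servers
server_link_gbps = 1                                # one 1 Gb/sec link per server
container_uplinks = 4                               # assumed: four 10 Gb/sec links per container into the cube

servers_to_network_tbps = servers * server_link_gbps / 1000            # ~82 Tb/sec
containers_to_cube_tbps = containers * container_uplinks * 10 / 1000   # ~2.6 Tb/sec

print(f"{servers_to_network_tbps:.1f} Tb/s, {containers_to_cube_tbps:.1f} Tb/s")
```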
Conclusion
By treating data centers as systems, and doing full-system optimization, we can achieve:
Lower cost, both opex and capex. Higher reliability. Incremental scale-out. More rapid innovation as technology improves.
Questions?