
Rethinking data centers

There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things.
Niccolò Machiavelli

Chuck Thacker Microsoft Research October 2007

Problems
Today, we need a new order of things. Problem: today, data centers aren't designed as systems. We need to apply systems engineering to:
Packaging
Power distribution
The computers themselves
Management
Networking

I'll discuss all of these. Note: data centers are not mini-Internets.

System design: Packaging


We build large buildings, load them up with lots of expensive power and cooling, and only then start installing computers. Pretty much need to design for the peak load (power distribution, cooling, and networking). We build them in remote areas.
Near hydro dams, but not near construction workers.

They must be human-friendly, since we have to tinker with them a lot.

Packaging: Another way


Sun's Black Box: use a shipping container and build a parking lot instead of a building.
Doesn't need to be human-friendly. Assembled at one location, computers and all. A global shipping infrastructure already exists.
Sun's version uses a 20-foot box; 40' might be better (they've got other problems, too).
Requires only networking, power, and cooled water, so build near a river. Need a pump, but no chiller (might have environmental problems).
Expand as needed, in sensible increments.
Rackable has a similar system. So (rumored) does Google.

The Black Box

Sun's problems
Servers are off-the-shelf (with fans). They go into the racks sideways, since they're designed for front-to-back airflow. Servers have a lot of packaging (later slide). Cables exit front and back (mostly back). The rack must be extracted to replace a server.
They have a tool, but the procedure is complex.

The rack as a whole is supported on springs. Fixing these problems: later slides.

Packaging
A 40' container holds two rows of 16 racks. Each rack holds 40 1U servers, plus a network switch. Total container: 1280 servers. If each server draws 200 W, a rack is 8 kW and the container is 256 kW. A 64-container data center is about 16 MW, plus inefficiency and cooling, with roughly 82K computers. Each container has independent fire suppression, which reduces insurance cost.
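
As a quick back-of-the-envelope check, a sketch in Python (the 200 W per server and the 64-container count are the assumptions stated on this slide):

```python
# Back-of-the-envelope check of the container and data-center figures above.
racks_per_container = 2 * 16                                      # two rows of 16 racks
servers_per_rack = 40                                             # 1U servers
servers_per_container = racks_per_container * servers_per_rack    # 1280

watts_per_server = 200
rack_power_kw = servers_per_rack * watts_per_server / 1000        # 8 kW
container_power_kw = racks_per_container * rack_power_kw          # 256 kW

containers = 64
total_servers = containers * servers_per_container                # 81,920 (~82K)
total_power_mw = containers * container_power_kw / 1000           # ~16.4 MW before cooling losses

print(total_servers, round(total_power_mw, 1))
```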

System Design: Power distribution


The current picture:
Servers run on 110/220 V, 60 Hz, single-phase AC. Grid -> transformer -> transfer switch -> servers. The transfer switch connects batteries -> UPS / diesel generator. Lots and lots of batteries. Batteries require safety systems (acid, hydrogen). The whole thing nets out at ~40-50% efficient => large power bills.

Power: Another way


Servers run on 12 V, 400 Hz, three-phase AC.
Synchronous rectifiers on motherboard make DC (with little filtering).

The rack contains a step-down transformer; the distribution voltage is 480 VAC. A diesel generator supplies backup power, using a flywheel to store energy until the generator comes up.
When engaged, the frequency and voltage drop slightly, but nobody cares.

Distribution chain:
Grid -> 60 Hz motor -> 400 Hz generator -> transformer -> servers. Much more efficient (probably > 85%).

The Cray 1 (300 kW, 1971) was powered this way. Extremely high reliability.
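
A rough comparison of the grid power drawn to deliver the ~16 MW server load from the packaging slide, using the two efficiency figures quoted above (a sketch, not a measurement):

```python
# Grid draw needed to deliver the same server load under the two distribution chains.
server_load_mw = 16.4

conventional_efficiency = 0.45       # ~40-50% for the AC / UPS / battery chain
motor_generator_efficiency = 0.85    # claimed for the 400 Hz motor-generator chain

grid_draw_conventional = server_load_mw / conventional_efficiency   # ~36 MW
grid_draw_proposed = server_load_mw / motor_generator_efficiency    # ~19 MW

print(round(grid_draw_conventional, 1), round(grid_draw_proposed, 1))
```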

The Computers
We currently use commodity servers designed by HP, Rackable, others. Higher quality and reliability than the average PC, but they still operate in the PC ecosystem.
IBM doesn't.

Why not roll our own?

Designing our own


Minimize SKUs
One for compute: lots of CPU, lots of memory, relatively few disks. One for storage: modest CPU and memory, lots of disks.
Like Petabyte?

Maybe one for DB apps; properties TBD. Maybe flash memory has a home here.

What do they look like?

More on Computers
Use custom motherboards. We design them, optimizing for our needs, not the needs of disparate applications. Use commodity disks. All cabling exits at the front of the rack.
So that it's easy to extract the server from the rack.

Use no power supplies other than on-board inverters (we have these now). Error-correct everything possible, and check what we can't correct. When a component doesn't work to its spec, fix it or get another, rather than just turning the feature off. For the storage SKU, don't necessarily use Intel/AMD processors.
Although both seem to be beginning to understand that low power/energy efficiency is important.

Management
We're doing pretty well here.
Automatic management is getting some traction; opex isn't overwhelming.

Could probably do better


Better management of servers, to provide more fine-grained control than just reboot/reimage. Better sensors to diagnose problems and predict failures. Measuring airflow is very important (not just finding out whether the fan blades turn, since there are no fans in the server). Measure more temperatures than just the CPU.

Modeling to predict failures?


Machine learning is your friend.

Networking
A large part of the total cost. Large routers are very expensive and command astronomical margins. They are relatively unreliable; we sometimes see correlated failures in replicated units. Router software is large, old, and incomprehensible, and is frequently the cause of problems. Routers are serviced by the manufacturer, and they're never on site. By designing our own, we save money and improve reliability. We also get exactly what we want, rather than paying for a lot of features we don't need (data centers aren't mini-Internets).

Data center network differences


We know (and define) the topology. The number of nodes is limited. Broadcast/multicast is not required. Security is simpler.
No malicious attacks, just buggy software and misconfiguration.

We can distribute applications to servers to distribute load and minimize hot spots.

Data center network design goals


Reduce the need for large switches in the core. Simplify the software. Push complexity to the edge of the network. Improve reliability. Reduce capital and operating cost.

One approach to data center networks


Use a 64-node destination-addressed hypercube at the core. In the containers, use source-routed trees. Use standard link technology, but don't need standard protocols.
Mix of optical and copper cables. Short runs are copper, long runs are optical.

Basic switches
Type 1: forty 1 Gb/sec ports, plus two 10 Gb/sec ports.
These are the leaf switches. The center uses 2048 of them, one per rack. Not replicated (if a switch fails, we lose 1/2048 of total capacity).

Type 2: Ten 10 Gb/sec ports


These are the interior switches. There are two variants: one for destination routing, one for tree routing. The hardware is identical; only the FPGA bit streams differ. The hypercube uses 64 switches and the containers use 10 each: 704 switches total (see the tally sketched after this slide).

Both are implemented with Xilinx Virtex-5 FPGAs.


Type 1 also uses Gb Ethernet PHY chips. The FPGAs contain the 10 Gb PHYs, plus buffer memories and logic.

We can prototype both types with BEE3.


Real switches use less expensive FPGAs.
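
The switch counts quoted above tally as follows (a sketch; the 64 containers and 32 racks per container come from the packaging slide):

```python
# Tally of Type 1 and Type 2 switch counts.
containers = 64
racks_per_container = 32

type1_switches = containers * racks_per_container   # 2048 leaf switches, one per rack
type2_switches = 64 + containers * 10               # 64 in the hypercube + 10 per container = 704

print(type1_switches, type2_switches)
```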

Data center network


(Diagram: a 64-switch hypercube core built from four 4x4 sub-cubes joined by 16 links each; 63 * 4 links lead to the other containers, and 4 links drop into the container shown.)

One container:
Level 2: 2 10-port 10 Gb/sec switches, 16 10 Gb/sec links
Level 1: 8 10-port 10 Gb/sec switches, 64 10 Gb/sec links
Level 0: 32 40-port 1 Gb/sec switches, 1280 1 Gb/sec links
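
From these link counts, the aggregate bandwidth at each level of a container's tree can be sketched as below (the 4 uplinks per container are taken from the hypercube diagram; everything else is from this slide):

```python
# Aggregate bandwidth (Gb/sec) at each level of one container, from the stated link counts.
servers_to_level0 = 1280 * 1    # from servers into the 32 leaf switches
level0_to_level1 = 64 * 10      # 640
level1_to_level2 = 16 * 10      # 160
container_to_cube = 4 * 10      # 40 (assumes 4 uplinks per container)

print(servers_to_level0, level0_to_level1, level1_to_level2, container_to_cube)
```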

Hypercube properties
Minimum hop count. Even load distribution for all-to-all communication. Reasonable bisection bandwidth. Can route around switch/link failures. Simple routing:
Outport = f(Dest xor NodeNum); no routing tables (a minimal sketch follows at the end of this slide).

Switches are identical.


Use destination-based input buffering, since a given link carries packets for a limited number of destinations.

Link-by-link flow control to eliminate drops.


Links are short enough that this doesn't cause a buffering problem.
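
A minimal sketch of the table-free routing rule Outport = f(Dest xor NodeNum). Here f() simply picks the lowest differing dimension; the real switches may choose among the differing dimensions differently (e.g. to spread load or route around failures):

```python
def outport(dest, node, dimensions=6):
    """Return the dimension (output port) to forward on, or None if this node is the destination."""
    diff = dest ^ node
    if diff == 0:
        return None                      # packet has arrived
    for d in range(dimensions):          # 6 dimensions for a 64-node cube
        if diff & (1 << d):
            return d                     # correct the lowest differing address bit
    return None

# Example: node 0b010110 forwarding toward destination 0b110100 sends on dimension 1.
print(outport(0b110100, 0b010110))
```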

A 16-node (dimension 4) hypercube

(Diagram: nodes 0-15 arranged as a dimension-4 hypercube, with each of a node's four links labeled by the dimension, 0-3, that it spans.)

Routing within a container


(Diagram: the tree inside one container. Trunk A and Trunk B (two 2-link trunks) lead up to the hypercube; 2 Type 2 switches at level 2 connect via 16 10 Gb/sec links to 8 Type 2 switches at level 1; those connect via 64 10 Gb/sec links to 32 Type 1 switches at level 0, each providing 40 1 Gb/sec ports, for 1280 1 Gb/sec links to the servers.)

Bandwidths
Servers -> network: 82 Tb/sec. Containers -> cube: 2.5 Tb/sec. Average traffic will be lower.
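
Where the two aggregate figures come from, sketched from numbers on earlier slides (the 4 uplinks per container are taken from the hypercube diagram):

```python
# Aggregate bandwidths in Tb/sec.
total_servers = 64 * 32 * 40                         # 81,920 servers at 1 Gb/sec each
servers_to_network_tbps = total_servers * 1 / 1000   # ~82 Tb/sec

uplinks_per_container = 4                            # per the hypercube diagram
containers_to_cube_tbps = 64 * uplinks_per_container * 10 / 1000   # ~2.56 Tb/sec

print(round(servers_to_network_tbps, 1), round(containers_to_cube_tbps, 2))
```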

Objections to this scheme


Commodity hardware is cheaper
This is commodity hardware, even for one center (and we build many). And in the case of the network, it's not cheaper: large switches command very large margins.

Standards are better


Yes, but only if they do what you need, at acceptable cost.

It requires too many different skills


Not as many as you might think. And we would work with engineering/manufacturing partners who would be the ultimate manufacturers. This model has worked before.

If this stuff is so great, why aren't others doing it?


They are.

Conclusion
By treating data centers as systems, and doing full-system optimization, we can achieve:
Lower cost, both opex and capex. Higher reliability. Incremental scale-out. More rapid innovation as technology improves.

Questions?
