Facebook is doing it with the Open Compute Project.

We asked Facebook if they'd come on camera and show us how they're doing their data centers, and they said no, they can't go on camera. So here's the solution: William and I have just spent the last hour going through this entire rack with the Facebook people, and he and I are going to do our best to present it. So I guess, William, we'll start with this: what is the reason for having a new rack? We've done it the same way for years. What makes this better?

For a large hyperscaler, someone like Facebook, they can afford to change up almost all of the rack design to meet the efficiency and cost requirements of their data centers, in a way that the standard nineteen-inch rack maybe can't. For the rest of the industry it made more sense for everyone to have the same standard rack, and because they had to interoperate, they weren't able to really play with the form factors or with different types of sleds and switch and power configurations. Someone who has more budget and more flexibility, and who has their own data centers, now has the ability to play with this stuff.

Facebook has a ton of videos and pictures that people are storing, so they need a lot of processing power to process those videos and a lot of storage to hold all of that data. Amazon, for example, might need mostly processing power; they don't necessarily need much GPU performance, and they don't necessarily need a lot of storage. So this rack can be configured one way for Amazon and a completely different way for Facebook. That's right, yeah, absolutely. Amazon could adopt this, Google could adopt this,
Microsoft could adopt this. All the specs are online at the Open Compute Foundation; they have links to the Facebook stuff. Every one of the designs you see here has PDFs, board schematics, everything.

Now, they've put a lot of thought into how this rack goes together. You'll notice, as we go through it component by component, that all of these things are tool-less, which means the data center techs can get in, get the problem fixed, and get back out. We're also going to talk a little bit about the onboard controllers, the OpenBMC modules, that allow these machines to talk out to the data center technicians. They can actually file tickets and identify problems before a human has even noticed they exist. So if a fan goes out in the back, they don't need a data center tech walking around checking whether the fans are running; it throws an alert, and that OpenBMC controller will file a ticket on its own and say, hey, help me, one of my fans has died.
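As a rough illustration of that kind of automation, here is a minimal sketch of a script that polls a Redfish-capable BMC (OpenBMC exposes a Redfish API) for fan health and opens a ticket when something fails. The BMC address, credentials, chassis ID, and the file_ticket helper are placeholder assumptions, not anything Facebook showed us.

# Toy sketch: poll a BMC's Redfish thermal data and open a ticket on fan failure.
# The BMC address, credentials, chassis ID, and file_ticket() are hypothetical placeholders.
import requests

BMC = "https://10.0.0.50"        # hypothetical BMC address
AUTH = ("monitor", "password")   # hypothetical read-only account

def file_ticket(summary: str) -> None:
    """Placeholder for whatever ticketing system the fleet actually uses."""
    print(f"TICKET: {summary}")

def check_fans() -> None:
    # Standard Redfish thermal resource for a chassis; the chassis ID varies by platform.
    resp = requests.get(f"{BMC}/redfish/v1/Chassis/chassis/Thermal",
                        auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    for fan in resp.json().get("Fans", []):
        health = fan.get("Status", {}).get("Health")
        if health not in (None, "OK"):
            file_ticket(f"Fan '{fan.get('Name')}' reports health {health}, "
                        f"reading {fan.get('Reading')} RPM")

if __name__ == "__main__":
    check_fans()

In a real fleet this would run from central monitoring rather than as a one-off script, but it shows the idea of the machine reporting its own failed fan.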
Now, this rack is probably set up as a demo cluster. Compared to what you'd see in production, a production rack would be more homogeneous. You'll see in our overview that there are a lot of different types of machines in this rack, but you wouldn't normally see that kind of mixed configuration in a Facebook data center. Yeah, they'll separate those things out: you'll have a whole bunch of GPU compute doing one specific set of tasks, a whole bunch of storage, or maybe just a whole bunch of regular CPU compute nodes all in one area.

We'll start with the actual rack design and then go sled by sled and show you what each one of these devices does and how it's put together. Let's take a look at that.

Let's start with the design of the rack itself. The rack is a different dimension than your standard nineteen-inch rack. That's right: if you look at this rack, it's actually 21 inches wide. You'll get a better shot from the front, where you can see they've installed switches that are 19-inch compatible but have little expanders to fit the 21-inch rack.
The rack itself has a backplane, and inside it you can see two silver bars; those are bus bars providing 12-volt power to the entire rack. The power supply unit, which also contains the battery backup for the UPS, supplies DC power to the rack, which eliminates a lot of the power inefficiency you have in a traditional rack, because you're not converting from AC to DC, back to AC, and back to DC again. As we get into the components, you'll see that everything in the rack runs on DC. The backplane of the rack is DC, so we make that conversion once: AC power comes in, gets converted to DC, and that same DC also feeds the battery backup.

Yeah, that's definitely something interesting and different from your normal rack, because in a normal rack you have UPS battery banks going from AC to AC, so you incur the loss of going through the UPS and then incur the loss again when you go into the server. The other nice thing about doing DC in these centralized locations, and you'll see there's one up here and, if I reach down, another one down here, is that you can get more efficiency by having larger rectifiers and larger battery banks, and you're not going back to AC again and hitting all sorts of other points of inefficiency.
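To put a rough number on that reasoning, here is a back-of-the-envelope comparison using illustrative efficiency figures we're assuming, not anything quoted in the video: three conversion stages at roughly 95% each lose noticeably more than a single rectification stage.

# Back-of-the-envelope comparison of conversion losses (illustrative numbers only).
STAGE_EFF = 0.95  # assumed efficiency per conversion stage

# Traditional chain: AC -> DC (UPS battery) -> AC (UPS output) -> DC (server PSU)
traditional = STAGE_EFF ** 3

# Open Compute style: one AC -> DC rectification, then 12 V DC to every sled
single_stage = STAGE_EFF

print(f"three-stage chain: {traditional:.1%} of input power reaches the load")
print(f"single-stage chain: {single_stage:.1%} of input power reaches the load")
# three-stage chain: 85.7% of input power reaches the load
# single-stage chain: 95.0% of input power reaches the load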
We'll go through that as we get back around to the front side. When we pull the sleds out you'll be able to see it: certain sleds have a power rail that comes out toward the front, so some of these sleds can actually stay powered on even while they're being serviced.

But we'll start with, I guess, William, why don't you show us these fan designs, a modular way to replace fans quickly. They've put the fans on the back here, and they're not actually attached to the trays that pull out the front. All you have to do is unscrew the fan and it pops right out, so if a fan breaks you just unscrew it from the back, no tools needed. There is a screw on here, but it's a thumbscrew, and you can just plug a new one right back in and screw it down. They've actually optimized this in their newer designs: you'll see down here, if I pull on this fan unit, they've added a handle and a lever so you don't even need a thumbscrew anymore. It's got both of the large fans on it, they plug in through the backplane, and you get the same metrics, fan speed and all of that, right to the BMC.

All the designs incorporated into this rack are meant to be cost-efficient, so they've removed the pieces they don't need. Everything is very industrial, and I wouldn't call it unfinished-looking, but it's just not fancy like some of the enterprise hardware you see, where they have these weird, cool, whimsical designs going on. This is all very functional and robust, and it's cheap and easy for techs to replace, which is everything they want in a data center where they're managing thousands of these machines. And the standards are open, so anybody can implement this. Yep, and you can't necessarily tell whose it is unless you've seen it before. Right, you couldn't know this is a Facebook server without having asked Facebook or people familiar with the matter; it doesn't say Facebook on it or anything.
Let's dig into the actual components that go into the rack and show you how that all works. We'll start at the top of the rack. Again, like William said, you can see these are standard 19-inch switches, but they have adapters that expand them out to the full width of the rack. Now the truth is, they're not quite standard 19-inch switches; these are actually something pretty special. Yes, so what's interesting is they've adapted the power supply modules so that they pass the DC from the back of the rack, from the power supplies we just saw, straight through into the switch, without inverting to AC and then rectifying back down at the switch level. And the switches aren't just normal switches; they're white-label switches designed to open standards as well. That's right, so there's a company called Barefoot Networks, one of the vendors who will make these and sell them to you retail. And I believe, if I'm not mistaken, having looked at the designs in the past, they're built around Broadcom switch ASICs on the inside. There's an x86 board in there based around an Atom or a Xeon D platform, and then a BMC to go with it, so it fits into their typical server design model.
All right, William, let's take a look at this very first one; go ahead and pull it out. So here is a disk tray that is just that: a tray of disks and a controller so you can get SAS out to a machine above it. Let's start with that. In a traditional server you would have a processor, disks, maybe a graphics card, and a network card. In this case, the entire tray is basically just a big disk enclosure. It's a JBOD, just a bunch of disks. You've got a controller here for your external SAS connection, and it probably has a switch chip so it can handle more SAS devices than the port alone would suggest, and you would connect that up to one of the machines above it as a host, which would presumably serve all the disks out to the network.

All right, now let's pull out the other tray, the one that actually has a host attached to it, because that's one of the options with this design. You'll see on this one the memory modules on top, with the CPU contained underneath. This host is attached to the same exact tray of disks, but now you get a network connection out, so you can connect the disks directly to the network with a presumably fairly low-power CPU like a Xeon D or an Atom. And you get all 15 disks of capacity in one Open Compute rack unit of space.
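As a small aside on what that host node actually sees, here is a rough sketch, assuming a generic Linux host, of listing the SAS drives a JBOD like this presents; the device names and the transport filter are illustrative, not taken from the video.

# Rough sketch: list the block devices a Linux host sees from an attached JBOD.
# Assumes a generic Linux host; nothing here is specific to the hardware in the video.
import json
import subprocess

def list_sas_disks():
    # lsblk can emit JSON with the transport (sata/sas/nvme) for each device.
    out = subprocess.run(
        ["lsblk", "--json", "-d", "-o", "NAME,SIZE,TRAN,MODEL"],
        capture_output=True, text=True, check=True,
    ).stdout
    for dev in json.loads(out)["blockdevices"]:
        if dev.get("tran") == "sas":          # keep only SAS-attached drives
            print(f"/dev/{dev['name']}  {dev['size']}  {dev.get('model')}")

if __name__ == "__main__":
    list_sas_disks()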
All right, we're back at the front of the rack. Let's take a look at the power supply units, because like I said, this is pretty interesting: it's handling both DC and AC. This, I assume, is where the battery units would go, and then above it is where the actual power supply units are. Tell me about those. Yeah, that's right, you can see they haven't included the battery units here, but they are required to run the rack, and there are actually labels on the front, though it may be too far away to read; the batteries go below the power supplies. We can take these power supplies out without any tools, like everything else in the rack. It's a giant monster of a power supply that, between all three of these units, supplies power for maybe 15 to 20 machines. On the back they have hefty connections, these huge metal bars that go to the 12-volt rails in the rack, and you can see some of the inputs there as well. The average power draw, the power budget as they put it, is, I think you said, ten to twelve kilowatts. That's right: you have three of these up here and three below, so between all six you're doing ten to twelve kilowatts.
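Just to make those numbers concrete, here is a quick split of that power budget across the power supplies and machines mentioned; it's only arithmetic on the figures quoted in conversation, not a published spec.

# Back-of-the-envelope split of the quoted rack power budget (arithmetic only).
rack_budget_kw = 12    # upper end of the quoted 10-12 kW budget
power_supplies = 6     # three up top, three below
machines = 20          # upper end of the "15 to 20 machines" estimate

print(f"~{rack_budget_kw / power_supplies:.1f} kW per power supply at full budget")
print(f"~{rack_budget_kw * 1000 / machines:.0f} W average per machine")
# ~2.0 kW per power supply at full budget
# ~600 W average per machine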
Now, the batteries are required, like you said, to run the unit, but that's not actually a physical requirement; they've enforced it in firmware. That's right. If you were able to look at the back of this unit, there's a management controller that can tell you the status of these power supply units and of the batteries. It can run tests on the batteries to make sure they're performing optimally, drain them down, bring them back up, make sure they don't explode, that sort of thing. So it can be testing all the time and ensuring everything is running optimally.

What's this next sled that we have down here? Yes, so the next sled is similar to the first one we talked about. That one was hard disks, all SAS or SATA; this one is all NVMe, so it's your newer tier of storage. If we pull this guy out, it'll look fairly similar to the one above: five drives wide and three deep, 15 total. You'll see it has a similar but newer connector for the drives, and there's a lot of metal going on here. If we take one of these out, which apparently we are unqualified to do, you'll see it's just a metal tray. It hooks up with your typical U.2-style connector on the back, and then we can take off the top piece to expose the insides, and it just fits standard M.2 drives inside. So whereas the other sled was doing regular hard disks, this one is designed for flash storage, and by the looks of it, for very hot-running flash storage at that.
Something you might not think about in a traditional server is a lot of graphics power, but that's something that Facebook needs. Let's take a look at this graphics sled. William, show me what we've got going on. This sled is called Big Basin; it's what they put eight graphics processing units in. If you look here, we've got all the power delivery in the back, which is an insane amount of power delivery, and this is based on NVIDIA's reference design for their P100 platform. You can see we have eight GPUs inside the tray, and they all connect over PCIe; in NVIDIA's case they also have NVLink connections between some of them. In the back we have the PCIe switches that interconnect all of this, and it exports PCIe out the front via these little cards, which in some cases have retimers and in other cases are just really simple, dumb pass-throughs.

What's interesting about that is that a lot of the Open Compute rack is designed around commodity hardware and standard connections, so this is just a standard PCIe slot. In this case they're using it as an interconnect to the rest of the sleds, but you could use it to add a network card or, I assume, even another graphics card. Yeah. And what's really interesting is that this tray only has GPUs. There's a management controller in the back to make sure the fans are running, make sure the GPUs are up and online, and communicate with the rest of the health-status network, but there are no CPUs in here, so they actually use PCI Express over these connectors to the host machines in the rack in order to do any useful work.
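To give a sense of what that looks like from the host side, here is a minimal sketch, assuming a Linux host node, of walking sysfs to find NVIDIA devices and report their PCIe link status. It illustrates ordinary PCIe enumeration in general, not Facebook's own tooling.

# Minimal sketch: enumerate NVIDIA PCIe devices on a Linux host via sysfs.
# Illustrative only; assumes a Linux host with the GPUs visible over PCIe.
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"

def list_nvidia_devices():
    for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
        vendor = (dev / "vendor").read_text().strip()
        if vendor != NVIDIA_VENDOR_ID:
            continue
        # Link speed/width show how each device is attached through the PCIe switches.
        speed = (dev / "current_link_speed").read_text().strip()
        width = (dev / "current_link_width").read_text().strip()
        print(f"{dev.name}: link {speed} x{width}")

if __name__ == "__main__":
    list_nvidia_devices()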
So we're making our way down the rack. We've shown you the 15-disk regular hard disk sled and the flash sled. William, let's go ahead and open this one up; this is a monster, a 72-disk hard disk sled. William, tell me about this. Yeah, this is what they call Bryce Canyon version two, and it holds 72 independent SAS hard disks. You can see they're all stacked vertically, the backplane is down in the back, and it has two different machines controlling it: one machine controls the front half of the disks and a separate machine controls the back half. They've just put it all in one enclosure so that it's easy to maintain.

In this center area we've got the power cabling, and you'll see that if you push the sled in and out, the cable unrolls in this little plastic guide they've got. We also have these trays with very interesting-looking connectors. The connectors feed SAS or SATA, depending on what type of disks you have, to the disks, and they feed PCIe from the SAS controllers located on this board back to the host machines in the front part of the unit. Now, the interesting thing about this cable unit that rolls up and out is that the sled can actually stay powered on while it's being serviced. We told you earlier about the power rail that runs underneath some of these units; that wouldn't work in this particular case because of the sheer power required to run 72 disks, so they run separate cabling inside here so the unit can stay powered on as it's being pulled out of the rack. Yeah, that's right. One of the goals of all the equipment here is to be as maintainable as possible while preserving uptime. In almost all cases you can pull the units almost completely out of the rack and service individual parts, swapping disks or even PCIe cards, without taking the unit offline. Or, where you have multiple machines in one sled, which we'll show you later, you can pull one machine out of the tray without powering off the other machines in that same tray.
William, let's go ahead and dig into this; I want to see what's underneath these pieces here. Yeah, so under here we have both of the management computers. These are x86 machines, and you'll see later that they're taken from a similar design that's meant purely for compute. On each of these you get a single CPU, most likely a Xeon D; we're looking at some kind of BGA product under there, with four DIMM slots, probably two DIMMs per channel and two memory channels, that sort of thing. Right now it's populated, as you can see, with two sticks of memory, and it also has support for an M.2 boot drive. But this is essentially an entire computer in one unit. That's right, it's an entire computer: if you just provided power and maybe networking, you could boot this and use it for something without the rest of the disks. So it's acting as the host so you can get these disks onto the network; it plugs right into the backplane and then hooks up to the SAS controllers over PCIe. And there's essentially a backplane that connects all of these together, isn't that right? That's right. If you look underneath these modules there's a little baseboard with a management controller and a NIC hooked up, and that connects all of these computers to the network using their centralized infrastructure.
out.
Something to note here in the rack: you can see, from the unit that was taken out, that there's a power rail running all the way from the back to the front of the machine. All of that is solid, so you don't have an extra cable wiggling around, and it provides power to these machines even when the unit is pulled all the way out of the rack like this.

So when we pull it out this way it still stays on, and you could pull an individual machine out and service it without taking the other ones offline or even disconnecting them from the network.

Okay, so we've pulled one of these compute sleds out and we now have it sitting on a table. Let's go ahead and pull out one of these computers and tell me what's inside. Yeah, so if you look, this again is fully tool-less: we can grab the machine and pull it right out by opening these latches, and that allows us to service the individual unit. You'll see this unit is very similar in design to the one we pulled out of the storage server. It's got that individual machine flanked by eight DIMMs; this is actually one of the newer ones, and you'll see it's all self-contained. We don't have storage on the front this time, but if we flip it over and go to the back, you'll see these metal shrouds where you can put M.2 cards, and in this case it has three of them. Again, tool-less: you can pop these right off by pulling on the tabs, which exposes the M.2 connector underneath; each card is seated in the same PCIe-style connectors, and then it all goes right back in the way it came out.

So each one of these sleds essentially contains four independent servers, each with storage, processing, and memory, the whole nine yards, all contained in one of these units, and then it connects to a backplane. Now, it's interesting that these devices are actually sharing a single NIC. That's right, there's going to be one network cable coming into the one NIC on the front; you can see there's a single port that serves the BMC and all four nodes, and they'll roughly share about ten gigabits per node. I think this might be a 40 or 50 gigabit NIC in here.
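A quick sanity check on that per-node figure, using only the numbers quoted in passing (the NIC speed is the hosts' guess, not a confirmed spec):

# Quick sanity check on the shared-NIC bandwidth estimate (quoted figures only).
for nic_gbps in (40, 50):        # the hosts guessed a 40 or 50 gigabit NIC
    per_node = nic_gbps / 4      # four compute nodes share the one NIC
    print(f"{nic_gbps} Gb NIC -> ~{per_node:.1f} Gb/s per node")
# 40 Gb NIC -> ~10.0 Gb/s per node
# 50 Gb NIC -> ~12.5 Gb/s per node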
This is absolutely a better way to build a mousetrap. This was a very cool piece of technology, and a huge congratulations not only to the Open Compute Project, who are designing this, but to companies like Facebook and Google and all of the others putting these in their data centers, because they are fundamentally changing the way we do servers and data centers. I mean, what do you think, William? Pretty cool, right?
