
HIGH BANDWIDTH MEMORY

HBM
Introduction
HBM (High Bandwidth Memory) is a newer type of CPU/GPU memory (i.e. "RAM") in which multiple DRAM dies are stacked together and packaged alongside the processor, forming a large-capacity, high-bit-width DRAM array.

Image Source – AMD


The die in the middle is the GPU/CPU, and the four small dies to its left and right are stacks of DRAM chips. The dies in a stack generally come in one of three heights: 2, 4, or 8 layers; the stacks pictured here are four dies high.
HBM used as GPU memory is no longer uncommon. Many people know that HBM is expensive, so even though it is not rare, you will only see it on high-end products, such as Nvidia's data-center GPUs. AMD's use of HBM on consumer GPUs is one of the few exceptions.
Some gamers will know that HBM is a high-speed memory whose bandwidth far exceeds DDR/GDDR, and that internally it uses 3D-stacked DRAM. Some PC users have wondered whether HBM could be used in ordinary desktop and notebook products. The cost is high, but this industry has plenty of deep-pocketed buyers, and GPUs already use HBM, so why not CPUs?
Can HBM be paired with a CPU?
CPUs paired with HBM do in fact exist. The A64FX chip used in Fujitsu's supercomputer Fugaku is paired with HBM2 memory, Intel's Sapphire Rapids Xeon processors also come in an HBM-equipped version, and there is the NEC SX-Aurora TSUBASA as well.
So a CPU with HBM is at least feasible (although, strictly speaking, chips such as A64FX go beyond the scope of an ordinary CPU), but these products still target data-center or HPC applications. Is it just because HBM is expensive that it has not trickled down to the consumer market? That may be an important reason, or at least close to the root cause. In this article, we take the opportunity of discussing HBM to look at this memory's characteristics and usage scenarios, and at whether it will someday replace the DDR memory that is so common in computers.

Image source – Fujitsu


In its common form, HBM appears on the package as a few small dies sitting very close to the main chip (the CPU or GPU). In the picture above, for example, that is what A64FX looks like: the four surrounding packages are all HBM memory. This arrangement is quite different from ordinary DDR memory.
One of HBM's characteristics is that it achieves higher transmission bandwidth than DDR/GDDR in a smaller footprint and (in part) with higher efficiency. Each HBM package is in fact a stack of multiple DRAM dies, so it is a 3D structure. The DRAM dies are connected by TSVs (Through-Silicon Vias) and microbumps. Below the stacked DRAM dies sits a logic die containing the HBM controller, and the bottom layer is then interconnected with the CPU/GPU through a base die (for example, a silicon interposer).
Image Source - AMD
From this structure it is not hard to see that the interconnect width is much larger than that of DDR/GDDR: the number of contacts beneath the stack can far exceed the number of traces connecting a DDR DIMM to the CPU. The implementation scale of HBM2's PHY interface is not on the same level as a DDR interface; its connection density is far higher. In terms of transfer width, each DRAM die provides two 128-bit channels, so a stack four dies high is 1024 bits wide in total. Many GPUs and CPUs surround themselves with four such HBM stacks, for a total width of 4096 bits.
For comparison, each GDDR5 channel is 32 bits wide, and 16 channels total 512 bits. The current mainstream second generation, HBM2, can stack up to 8 DRAM dies per stack, improving both capacity and speed. Each HBM2 stack supports up to 1024 data pins, and each pin can reach a transfer rate of 2000 Mbit/s, for a total bandwidth of 256 GByte/s. At 2400 Mbit/s per pin, one HBM2 stack package delivers 307 GByte/s.
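
The arithmetic behind these bandwidth figures is simple width-times-rate multiplication. A minimal Python sketch (the function name is ours; the numbers are those quoted above):

    def stack_bandwidth_gbyte_s(pins, rate_mbit_per_pin):
        # Peak bandwidth of one HBM stack in GByte/s.
        total_mbit_s = pins * rate_mbit_per_pin  # aggregate Mbit/s across all pins
        return total_mbit_s / 8 / 1000           # bits -> bytes, MByte -> GByte

    print(stack_bandwidth_gbyte_s(1024, 2000))   # 256.0 GByte/s
    print(stack_bandwidth_gbyte_s(1024, 2400))   # 307.2 GByte/s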

Image source – Synopsys


The picture above is a comparison of DDR, LPDDR, GDDR, and HBM given by Synopsys. In the Max I/F BW column you can see that the other contenders are not in the same order of magnitude as HBM2. With such high bandwidth, applications such as highly parallel computing, scientific computing, computer vision, and AI simply fly. And intuitively, because HBM sits so close to the main chip, higher transmission efficiency can be obtained in theory (in terms of energy consumed per bit transferred, HBM2 does have a big advantage). If it were not for HBM's cost and limited total capacity when used as main memory in a personal computer, wouldn't it be perfect?

Advantages of HBM memory


Following are the benefits or advantages of HBM memory:

• It offers lower power consumption than GDDR5 and GDDR6.
• It has a small form factor, and hence offers higher density than GDDR memory.
• It offers higher memory bandwidth.
• It supports higher capacity per package.
• It does not heat up as much.
• It can be placed close to the GPU die, which results in lower latency.
• It offers better performance; first-generation HBM operates at 1 Gbps per pin.

Disadvantages of HBM
Poor flexibility
HBM was first initiated by AMD in 2008, with the original intention of changing the power consumption and physical size of computer memory. In the following years, AMD worked to solve the technical problems of die stacking, and later found industry partners with experience in stacked storage media, including SK Hynix, as well as manufacturers in the interposer and packaging fields.
HBM was first manufactured by SK Hynix in 2013, the same year it was adopted as JEDEC's JESD235 standard. The first GPU to use HBM was AMD's Fiji (Radeon R9 Fury X) in 2015. The following year, Samsung began mass production of HBM2, and NVIDIA's Tesla P100 became the first GPU to use HBM2.
From HBM's physical form it is not hard to find its first shortcoming: a lack of flexibility in system configuration. For PCs, expanding memory capacity has long been a routine capability. But HBM is packaged together with the main chip, so there is no possibility of capacity expansion; the specification is fixed at the factory. This also differs from today's notebooks with DDR memory soldered to the motherboard: HBM is integrated by the chip manufacturer at the package level, so its flexibility is weaker still, especially for OEMs.
For most chip manufacturers pushing processors at the mass market (including the infrastructure market), launching SKUs with many different memory capacities is unlikely for cost and other reasons. These manufacturers already ship many configuration models (consider the numerous models of Intel Core processors); if each also had to be subdivided by memory capacity, the manufacturing cost might become unsustainable.
Capacity is too small
The second problem with HBM is that its capacity is more limited than DDR. A single HBM package can stack 8 DRAM dies, and at 8 Gbit per die, 8 dies make 8 GByte. A supercomputing chip like A64FX provides 4 HBM interfaces, that is, 4 HBM stack packages, so a single chip carries 32 GByte in total.
Such a capacity is still small compared with DDR. It is very common for ordinary consumer PCs to carry more than 32 GByte of memory. Not only do PC and server motherboards offer large numbers of expandable memory slots, some DDR4/5 DIMMs also stack DRAM dies. Using relatively high-end stacked DRAM dies, 2-rank RDIMMs (registered DIMMs) can reach 128 GByte per module; with the 96 DIMM slots of a high-end server, that is up to 12 TByte.
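
The capacity figures above follow directly from the die and slot counts. A minimal sketch of the arithmetic (variable names are ours; the figures are those quoted above):

    GBIT_PER_DIE = 8        # 8 Gbit per DRAM die, as above
    DIES_PER_STACK = 8      # maximum HBM2 stack height
    STACKS = 4              # e.g. A64FX exposes 4 HBM interfaces

    stack_gbyte = GBIT_PER_DIE * DIES_PER_STACK / 8  # 8 GByte per stack
    print(stack_gbyte * STACKS)                      # 32 GByte on package

    # Versus DDR: 128-GByte RDIMMs across 96 server DIMM slots
    print(128 * 96 / 1024)                           # 12 TByte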

HBM DRAM die (Image source: Wikipedia)


Of course, as mentioned, HBM and DDR can be mixed: HBM2 provides high bandwidth at small capacity, while DDR4 provides slightly lower bandwidth at large capacity. From a system design perspective, the HBM2 memory in such a processor behaves more like an L4 cache.
High access latency
For PCs, an important reason why HBM has not been applied to CPU main memory is its high latency. On this question, although many popular-science articles say its latency is good, and Xilinx described the latency of its HBM-equipped FPGAs as similar to DDR, the "latency" in these articles may not refer to the same thing.
Contemporary DDR memory is generally rated with a CL value (CAS latency, the number of clock cycles required for column addressing, which indicates the length of the read latency). The CAS latency discussed here is the waiting time between the read command (the Column Address Strobe) being issued and the data being ready.
After the memory controller tells the memory which location it needs to access, it takes several cycles to reach that location and execute the controller's command. CL is the most important parameter in memory latency. To express the latency as a length of time, the number of cycles must be multiplied by the duration of each cycle (the higher the operating frequency, the shorter each cycle).
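
To illustrate with hypothetical module ratings (not figures from this article), converting CL to nanoseconds is just cycles times cycle time:

    def cas_latency_ns(cl, data_rate_mts):
        # DDR transfers data twice per clock, so the clock in MHz is
        # data_rate / 2, and one cycle lasts 2000 / data_rate nanoseconds.
        cycle_ns = 2000.0 / data_rate_mts
        return cl * cycle_ns

    # Hypothetical DDR4-3200 modules rated CL16 and CL22:
    print(cas_latency_ns(16, 3200))  # 10.0 ns
    print(cas_latency_ns(22, 3200))  # 13.75 ns

Note how a higher data rate shrinks each cycle, so a larger CL number does not necessarily mean a longer wait in nanoseconds.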

GDDR5 and HBM


For HBM, one of the characteristics mentioned earlier is its ultra-wide interconnect, and that width dictates that HBM's transfer frequency cannot be too high: the total power consumption and heat would otherwise be unsupportable, and with so many pins a high frequency is not needed to reach the target total bandwidth anyway.
HBM's frequency is indeed much lower than that of DDR/GDDR. Samsung's earlier Flarebolt HBM2 memory transfers 2 Gbit/s per pin, which corresponds to a clock of roughly 1 GHz; later products raised the frequency to 1.2 GHz. Samsung mentioned that doing so also required reducing parallel clock interference among the more than 5000 TSVs, and increasing the number of heat-dissipation bumps between the DRAM dies to alleviate heating. In the figure above, AMD lists the frequency of (first-generation) HBM as only 500 MHz.
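
The clock and per-pin figures are consistent once you account for double-data-rate signaling (two transfers per clock); a quick sketch:

    def pin_rate_gbps(clock_ghz, transfers_per_clock=2):
        # Per-pin data rate implied by an I/O clock under DDR signaling.
        return clock_ghz * transfers_per_clock

    print(pin_rate_gbps(1.0))  # 2.0 Gbit/s per pin, Samsung's HBM2 figure
    print(pin_rate_gbps(0.5))  # 1.0 Gbit/s per pin, the HBM figure AMD lists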
How suitable is HBM for PC Memory?
The combination of high bandwidth and high latency makes HBM very suitable as GPU memory, because games and graphics processing are themselves highly predictable, highly concurrent workloads. Such loads demand high bandwidth and are not very sensitive to latency, which is why HBM appears on high-end GPU products. By the same reasoning, HBM is also very suitable for HPC and AI computing; that is why A64FX and next-generation Xeon processors, although CPUs, also choose HBM as memory.
But for personal computers, the tasks a CPU must process are extremely unpredictable, requiring all kinds of random memory accesses, and are inherently more sensitive to latency; the requirement for low latency usually outweighs the requirement for high bandwidth, to say nothing of HBM's high cost. This means that, at least in the short term, HBM is unlikely to replace DDR on PCs. The question is similar to asking whether GDDR could serve as PC main memory.
In the long run, though, no one can predict. As mentioned above, a hybrid solution can be considered, and storage resources at the various levels of the hierarchy are undergoing significant change. For example, not long ago we wrote about AMD stacking a processor's L3 cache up to 192 MB. On-die cache exists to hide external memory latency, so as the cache on the processor grows larger and larger, the latency demands placed on system memory will ease.

HBM3
From the PC era to the mobile and AI era, chip architecture has moved from being CPU-centric to data-centric. The challenge posed by AI concerns not only compute power but also memory bandwidth. Even though DDR and GDDR rates are relatively high, many AI algorithms and neural networks repeatedly run into memory bandwidth limits. HBM, which focuses on sheer bandwidth, has therefore become the preferred DRAM for high-performance chips.
Before JEDEC issued the final HBM3 standard, the IP vendors participating in the standardization work had already made their preparations: Rambus was the first to announce a memory subsystem supporting HBM3, and Synopsys later announced the industry's first complete HBM3 IP and verification solution.
As early as the beginning of 2021, SK Hynix gave a forward-looking outlook on HBM3 performance, saying its bandwidth would exceed 665 GB/s with an I/O speed above 5.2 Gbps, though these were only transitional figures. Also in 2021, data released by IP vendors raised the ceiling further: Rambus, for example, announced an HBM3 memory subsystem with an I/O speed of up to 8.4 Gbps and memory bandwidth of up to 1.075 TB/s.
In June 2022, Taiwan's Creative Electronics released an AI/HPC/networking platform based on TSMC's CoWoS technology, equipped with an HBM3 controller and PHY IP with an I/O speed of up to 7.2 Gbps. Creative Electronics is also applying for an interposer wiring patent that supports zigzag wiring at any angle and allows the HBM3 IP to be split between two SoCs.
The complete HBM3 IP solution announced by Synopsys provides controller, PHY, and verification IP for 2.5D multi-die package systems; the company says designers can use it to give an SoC memory with lower power consumption and greater bandwidth. Synopsys's DesignWare HBM3 controller and PHY IP build on its silicon-proven HBM2E IP, with the HBM3 PHY IP implemented in a 5 nm process. Each pin can reach 7200 Mbps, raising memory bandwidth to 921 GB/s.
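
All of these vendor numbers follow from the same interface arithmetic used earlier, assuming the standard 1024-bit HBM interface; a quick sketch to check them:

    def hbm3_bandwidth_gbyte_s(pin_rate_gbps, pins=1024):
        # Stack bandwidth in GByte/s for a 1024-bit HBM interface.
        return pin_rate_gbps * pins / 8

    print(hbm3_bandwidth_gbyte_s(8.4))  # 1075.2, i.e. Rambus's 1.075 TB/s
    print(hbm3_bandwidth_gbyte_s(7.2))  # 921.6, i.e. Synopsys's ~921 GB/s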
At present, Micron, Samsung, SK Hynix, and other memory manufacturers are already following this new DRAM standard. SoC designer Socionext has cooperated with Synopsys to introduce HBM3 into its multi-die designs. Beyond the x86 ecosystem, Arm's Neoverse N2 platform also plans to support HBM3, and SiFive's RISC-V SoCs have added HBM3 IP as well. JEDEC, for its part, did not keep the industry waiting: the official HBM3 standard (JESD238) was published in January 2022.


References
www.utmel.com
www.jedec.org
