
GOOGLE DEVELOPS WEBM VIDEO DECOMPRESSION HARDWARE IP USING TECHNOLOGY-INDEPENDENT SOURCES AND HIGH-LEVEL SYNTHESIS

WHITE PAPER: HIGH-LEVEL SYNTHESIS

www.mentor.com

INTRODUCTION
The WebM project (www.webmproject.org) defines an open file format designed for the distribution of
compressed media content across the web. Google is a major contributor to the WebM project, having recently
undertaken the design and development of the first hardware decoder IP for WebM, otherwise known as the VP9
G2 decoder. This royalty-free hardware IP enables companies developing multimedia system-on-chip (SoC) designs
to deliver next-generation performance and power efficiency, enabling up to 4K (2160p 60FPS) resolution playback
on consumer devices such as smart TVs, tablet computers and mobile telephones, as well as traditional personal
computers and laptops. The VP9 G2 IP has been implemented with a completely new hardware architecture, coded
and verified primarily in standard C++, and synthesized to Register Transfer Level (RTL) logic for different target
technologies and performance points using Catapult® High-Level Synthesis (HLS).

This paper presents the HLS methodology used to develop the VP9 G2 hardware decoder and explains how it
supports the goals and strategies of the WebM project. It explains why the HLS approach makes design
implementation and verification 50% faster than a traditional RTL design flow, and how it enables design teams
with different end products to collaborate and contribute to the same IP.

Figure 1. VP9 G2 Decoder Hardware

This paper will also describe the actual use of Catapult HLS by the WebM team in the successful implementation of
the G2 VP9 and share results and impressions. Figure 1 shows the hardware, including both the HLS generated and
hand-written RTL blocks. For reference, this hardware is approximately two million gates in a 65nm TSMC
technology and supports 4K (2160p) @ 60fps.


THE WEBM PROJECT AND G2 VP9


The primary purpose of Google’s WebM project is to improve the end user’s web video experience. The quality of
the video that users receive is largely determined by the compression formats used, which, unfortunately, are
evolving at a glacial pace compared to consumer expectations from online video. For example, the High Efficiency
Video Coding standard (HEVC, otherwise known as H.265) took ten years to evolve from H.264. Allowing a further year for the IP designers to write and verify the RTL brings the total to 11 years to advance the available hardware by a single generation.

The WebM project targets much shorter codec design cycles and plans to update the open-format video codec every few years.

Among the major benefits of the WebM project are fostering hardware IP collaboration and accelerating the
innovation and deployment of new and better video compression standards. In the WebM model, Google provides
fully functional base IP to their semiconductor partners, who are encouraged to enhance the IP and share those
improvements with Google, allowing the IP to evolve rapidly.

VP9 decode support is currently available in more than one billion endpoints, including the Chrome browser,
Android, FFmpeg and Firefox. The VP9 hardware encoder and decoder developed using Catapult C can be
requested from the WebM website (http://www.webmproject.org/hardware/). The website contains details about
the performance of both the encoder and decoder.

CHALLENGES OF RAPID HARDWARE INNOVATION


Near the start of the VP9 G2 hardware project, the WebM team realized that a new hardware design methodology was needed to support rapid innovation. Ideally, the initial hardware and software would be delivered on the same day as the specification, which meant designers needed the ability to easily adapt and update the code as the specification evolved.

The VP9 G2 video hardware had doubled in complexity over the previous hardware generation. This meant that
simulation runs would be too long, with the total verification effort expected to consume months. Also, in this
particular domain, the number of test vectors is immense. The increased complexity would not only impact total
verification time, but also what could be tested in a reasonable time frame. With RTL simulation, the team could not
achieve the level of testing required without a major increase in testing resources.

Merging RTL changes from multiple design teams and companies targeting different product variations or
applications would also not be practical. IP written as RTL code contains a level of implementation specific detail
that significantly reduces the re-usability of the IP. If a designer wants to re-use that code on a different ASIC
technology, run it at a higher clock speed, or change the throughput, they have to do substantial rewrites of the
RTL or accept sub-optimal power, performance or area.

The WebM team started evaluating a number of higher abstraction level tool flows, and found the best match for
their needs in Catapult C.


ADVANTAGES OF USING C++ VS. RTL


C++ is a familiar language to most micro-electronics engineers; it is commonly used in both hardware and software
engineering. C++ descriptions represent a more abstract coding style than RTL, giving a description of an
algorithm and architecture, rather than cycle-accurate signal and register behavior. Much like the case with VHDL
and Verilog, there is a synthesizable subset of C++ that can be used for modelling and designing hardware. The
overwhelming majority of the standard C++ constructs and methods can be used, with a few exceptions which
rely on an underlying software processor architecture to execute (for example “malloc”), which would make no
sense in a hardware implementation. Compared to RTL, a synthesizable C++ representation has, on average, 80%
fewer lines of code, with reduced verbosity allowing it to be more meaningful to humans, making it easier and
quicker to debug. Like-for-like functionality usually simulates 50-1000x faster in C++ than in RTL, and hardware
developers using C++ are able to design and debug complex hardware in about half the time.

C++ models are often developed as golden reference models against which the hardware design is verified. These
reference models are used as starting points for the hardware implementation. If the C++ model is itself
synthesizable, a manual rewrite into RTL can be avoided, and a smooth handoff between software or algorithm
engineers and the hardware team is enabled. This reduces the opportunity for errors due to ambiguous
specifications and misinterpretations. During the hardware design process, both system architects and hardware
design engineers can work with the same shared code base. Sharing executable code in this way means that
concepts are easily communicated, ideas are engineered with no risk of misinterpretation, and a single specification
can be shared and unambiguously used by all, from concept to implementation.

Sharing the same code across many different groups also requires a standardized environment. Functional
verification of a design modelled in C++ can be performed in any one of a number of industry standard compilers
and debug tools, such as gcc, or MSVisual C++. There are many other tools available for analyzing the source code,
for version controlled source code databases and for merging C++ changes from multiple developers.

In the case of WebM team’s hardware IP development, the C++ code and standardized development environment
could be shared with their IP partners while it was still under development. In turn, the partners were able to share
both insights and bug fixes during this process so production hardware could be delivered at virtually the same
time as the standard was finalized and issued.

An important aspect of being able to accurately describe hardware in C++ is the use of bit-accurate datatypes,
using C++ class libraries to allow execution in any standard C++ environment. Other hardware design features, such
as clock frequencies, concurrency, register and component sharing and many other micro-architecture details are
not written into the C++ code. Only algorithm and functional behavior is described in the C++ model. This means
that the same C++ representation can be easily re-targeted between different micro-architectures or performance
points and to different implementation technologies (both ASIC & FPGA) by simply modifying the commands and
constraints used to drive the synthesis tool.
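The effect of bit-accurate datatypes can be sketched in plain C++. The real flow uses the Algorithmic C class library (e.g. ac_int<W, false>); the simplified stand-in below uses masking only to illustrate the key semantic, namely that arithmetic wraps at W bits exactly as a W-bit hardware register would.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a bit-accurate unsigned type (illustrative
// only; the production flow uses the Algorithmic C ac_int library).
// All values and arithmetic results are truncated to W bits.
template <int W>
struct uint_w {
  uint32_t v;
  explicit uint_w(uint32_t x = 0) : v(x & mask()) {}
  static uint32_t mask() { return (W < 32) ? ((1u << W) - 1u) : ~0u; }
  uint_w operator+(uint_w o) const { return uint_w(v + o.v); }
};
```

For example, a 10-bit accumulator holding 1000 that adds 100 wraps to 76, matching the behavior of the eventual hardware register rather than native 32-bit C arithmetic.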

In summary, a C-based HLS flow dramatically reduces the overall RTL verification effort because it allows
engineering teams to more rapidly test each change to the source code and to share code across different
hardware and software teams. Less low level implementation detail in the source code enables faster simulation
and quicker debug and modification. Higher simulation performance means more tests can be run to more fully
exercise the source code, with industry standard tools used to monitor and check the functional coverage provided
by the test sets. Modifying and re-verifying C++ models to perform a series of what-if evaluations of alternative
algorithms and architectures is fast and efficient, providing the ability to choose the optimum implementation
based on real power, performance and area, rather than theoretical estimation. This is how C++ supports Google’s
rapid innovation goals.


RESULTS AND IMPRESSIONS


Figure 2. G2 VP9 design and verification process (flow stages: Specification; Bit-Accurate Architecture C++ Reference (Block IO, Dataflow); Distribute blocks to team members; Stitch in RTL; Simulate and FPGA Prototype; Standard RTL Synthesis Flow)

To manage risk on their first HLS project, the WebM team decided to implement each block of their video decoder as a separate project. This flow allows multiple blocks to be optimized in parallel by different engineers while the top-level interconnect model is written by hand. It also allows Catapult to be used where it provides the most benefit: on algorithmic blocks where some exploration is needed to settle on the optimal architecture. Other blocks that contained SoC-integration-critical parts (e.g., clock gating, the SRAM container) were implemented in RTL. The result is the hardware partitioning shown in Figure 1, and the resulting design and verification process is shown in Figure 2.

The WebM engineers needed training to get started with HLS, but soon realized that writing in C++ has the same
“feel” as writing in VHDL or Verilog. Just like in RTL, a designer needs to start by visualizing the hardware they want
to build and then write code to target that hardware. The main thing to learn is what C++ code to write.

By moving to the C++-based HLS flow for the VP9 G2 hardware, the team realized several benefits:

1. The 14-block design totaled approximately 69,000 lines of code. The hardware design team estimates
that an RTL-based approach would have needed roughly 300,000 lines of code to describe the same blocks.
2. Simulation runs were more than 50 times faster in C++ than RTL. This greatly reduced the “tail” on the
verification effort because developers could work on code all day, start regressions when they left in the
evening and have results the next morning for every single test in the suite.
3. Using C++ enabled IP collaboration by allowing multiple contributors to share enhancements to the same file
and by supporting standard tools and processes to merge the changes.
4. HLS could run on each block in about an hour. This allowed fast exploration of different architectures for the
blocks, either by modifying the C code or re-constraining the tool. The total effort to design and verify the
hardware took about six months vs. an estimated one year to write the code by hand in RTL.


EXAMPLE OF USING CATAPULT HLS FLOW FOR


IMPLEMENTATION OF G2 VP9 DECODER IP
The inter prediction block is one of the most complex pieces of the VP9 hardware. It is shown in Figure 1 as part of
the Inter/Intra Prediction block. This block is responsible for calculating the pixel predictor between consecutive
frames of video.

Figure 3. High-level block diagram for the inter prediction block

As shown in Figure 3, there are eight processes in the inter prediction block. The block is controlled in three ways. First, memory-mapped registers are used to configure the block. Second, a stream of commands is sent to the control process, which then issues commands to the core engines in the inter prediction block. A separate pre-fetch control process also monitors the state of the reference memories in the design and issues pre-fetches for new data. Finally, a stream of data is fed to the Write Ref Mem process using a flow-control handshake.

This architecture allows the inter prediction block to pre-fetch data as soon as there is enough memory available in
the reference memories. The core Ref and Pred engines are then controlled to process the data as quickly as
possible before the forward and backward prediction data is combined in an output stream. The inter prediction
block then forwards the data and commands to the next block in the sub-system, which is the deblocking filter.

In hardware, each block must run in parallel and often at a different rate. However, the source code is sequential
C++, so the Algorithmic C (AC) ac_channel datatype is used to model the parallelism and different rates between
the blocks. The following code is a simplified version of the source code for the top level of the inter prediction
block.


#pragma hls_design top

void pred_inter(SwRegisters &sw_registers,
                ac_channel<ControlVect> &control_in,
                ac_channel<RefData> &ref_data,
                RefMemType &ref_mem,
                ac_channel<ControlVect> &control_out,
                ac_channel<PredInterData> &pred_inter_data_out,
                uac1 &prefetch) {

  /* Local declarations of channels are not shown.
     They are declared as static ac_channels:
       static ac_channel<data_type> channel_name; */

  // Control processes
  ReadReferenceControl(sw_registers,
                       control_in,
                       control_out,
                       core_engine_control,
                       combine_control);

  SendIOToken(sw_registers,
              io_token_request,
              prefetch);

  // Data processes
  WriteRefMem(sw_registers,
              ref_data,
              ref_mem);

  // This hierarchy contains Read Ref Forward, Read Ref Backward,
  // Pred Forward and Pred Backward
  CoreEngine(sw_registers,
             core_engine_control,
             io_token_request,
             pred_inter_fwd,
             pred_inter_bwd);

  Combine(sw_registers,
          combine_control,
          pred_inter_fwd,
          pred_inter_bwd,
          pred_inter_data_out);
}
The inter prediction block is about 8000 lines of C++ code, including all of the associated header files. The structure
of the design is described using ac_channel and function calls. Each function is then mapped to a process or
additional level of hierarchy. The direction of the ac_channel is determined by how it is used in the C code and
Catapult checks that each channel is only written by one process.


In C++ simulation, the ac_channel is a FIFO with unlimited depth, which allows simple hierarchical sub-systems to
simulate as if the functions run in parallel. The source code is then simulated inside the full sub-system in C++ to
confirm that both the hardware and software are working correctly. In order to test this block, each function call
contains a loop that iterates until all input data is consumed.

Next, the pred_inter function is synthesized into RTL. During synthesis, the functions are converted into parallel
processes with fixed depth FIFOs between the blocks. Catapult generates a SystemC wrapper for the RTL, which
allows the original test environment to be used to confirm the RTL is functioning correctly.

The RTL for pred_inter then needs to be integrated with the other generated RTL blocks, as shown in Figure 2. For
the VP9, this integration is done by hand in Verilog. Figure 1 shows the blocks in this sub-system and the hand-
coded FIFOs that were used to connect them together. This sub-system is then exercised using the same
simulation vectors used to test the original synthesizable C++ sub-system.

Finally, the RTL is targeted for either an FPGA for prototyping or an ASIC for final implementation. The C++ code is
the same for both targets, allowing easy FPGA prototyping without sacrificing the quality of the final ASIC
implementation.

Like the rest of the VP9 sub-system, this block was originally developed only for VP9 and then was enhanced to
support H.265. Depending on a compile-time switch, the whole sub-system can be re-configured and re-optimized
to either support only VP9 or to support both VP9 and H.265.
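A compile-time switch of this kind can be sketched as follows; the macro and function names are hypothetical, not taken from the VP9 source. Setting the switch at build time lets HLS re-optimize the whole sub-system for the smaller VP9-only configuration or the combined VP9/H.265 one from the same code base.

```cpp
#include <cassert>

// Hypothetical compile-time configuration switch (illustrative names).
// Build with -DSUPPORT_H265=0 for a VP9-only configuration.
#ifndef SUPPORT_H265
#define SUPPORT_H265 1
#endif

enum Codec { VP9, H265 };

bool codec_supported(Codec c) {
  if (c == VP9) return true;   // VP9 support is always compiled in
#if SUPPORT_H265
  if (c == H265) return true;  // H.265 path present only when enabled
#endif
  return false;
}
```

Because the switch is resolved before synthesis, the disabled code path contributes no gates to the VP9-only build.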

SUMMARY AND CONCLUSION


Like many cutting-edge hardware design teams, the WebM hardware team needed to find a better way to build their hardware. Verification was expected to consume a large part of the development effort, and the hardware was difficult to re-use. Most importantly, this meant the team could not spend the time they wanted on building the smallest, fastest and most power-efficient hardware possible.

Now that the Google engineers have finished their first project, they have learned to “see hardware” while they
write synthesizable C++. They have also learned what kind of code to write for a specific kind of hardware. As of
March 2015, the team has also completed the VP9 hardware encoder. This hardware can also be found on the
WebM website mentioned in the introduction.

For the latest product information, call us or visit: www.mentor.com


©2015 Mentor Graphics Corporation, all rights reserved. This document contains information that is proprietary to Mentor Graphics Corporation and may
be duplicated in whole or in part by the original recipient for internal business purposes only, provided that this entire notice appears in all copies.
In accepting this document, the recipient agrees to make every reasonable effort to prevent unauthorized use of this information. All trademarks
mentioned in this document are the trademarks of their respective owners.

MGC 01-16 1033800-w
