Professional Documents
Culture Documents
An MPEG 2 Decoder Case Study As A Driver
An MPEG 2 Decoder Case Study As A Driver
abstract level. We felt that early feedback on this new Application explore
methodology was essential for the fur- ther development of Kahn model
concepts and tools. For example, does the designer get relevant Application C-code
Mapping
data on the performance of proposed archi- tectures, can Performance
analysis
simulations be performed efficiently, etc. Therefore, we decided to Architecture Architecture
specification model
perform an industrially relevant case study with the newly
developed prototype tools. We selected an ATSC compliant explore
Architecture
MPEG-2 video decoder as benchmark application. MPEG-2 video blocks
decoding is a high performance signal processing application that
exhibits dynamic, scene dependent behavior. For this application
Figure 1: The SPADE flow for executing the case study.
vld_prop_slice
Toutput
predict_pr
as a Kahn Process Network.
writeMB_pr
predictR
isiq_prop_pic
TpredictRD
predict_mv
predict_data
Tpredict predict_ref_mem_id
Port Tstore
store_data
writeMB_prop_seq memMan_prop_seq
MPEG-2 idct_prop_seq add_prop_seq
Video idct_prop_mb add_prop_mb Tadd writeMB_prop_mb TwriteMB memMan_cmd TmemMan
Tisiq Tidct mb_d writeMB_mem_id
Elementary mb_QFS mb_F mb_f
Stream
read write Figure 3: MPEG-2 video decoder modeled as Kahn Process Net-
Trace work.
execute
Highway 32 Highway 64
X Y
64 Architecture model
bytes
Start
Int
32 bytes each
MBH
64 bytes RL symbols Figure 5: Trace driven simulation: the execution of the architec-
MCIpre-fetch
packets control parameters ture model is driven by traces from the execution of the
Iscan_Core
8X8X 16 PFU
IQ_Core
blocks, parameter values can be assigned. For example, for the
interface blocks the number of buffers and their sizes can be
IQDat
a
72 bit data
cmd buffers
specified. A small example architecture model is depicted in
Idct_Core
Idct Fifo
PRU
MB buffers
SU
Figure 6.
Trace
Processor Port
Figure 4: MPEG-2 video decoder architecture.
Exec Unit
In this architecture the VLD unit parses an MPEG stream un-
der control of the CPU. The MPEG stream may have been stored
in Main Memory by the Video-In unit. The VLD alternately pro- I/F
duces a Macro Block Header (MBH) for the MVU and Run
Length symbols with coefficient data for the data driven Iscan / IQ Bus
/ IDCT pipeline. The MVU drives the PFU to fetch prediction
data from memory. The PRU combines the prediction data with
the IDCT output and passes it to the SU in order to be stored in Figure 6: Small example architecture model. The model consists
Main Mem- ory. The Video-Out unit may subsequently retrieve of two processors (the dashed boxes), that are each composed of a
the data from Main Memory. The CPU synchronizes the operation trace driven execution unit and two interface blocks, and a bus.
of the Video- In, VLD, SU and Video-Out units. A relevant
question for this ar- chitecture is whether the latency of the Iscan / For each instantiated processor a repertoire of symbolic
IQ / IDCT pipeline is properly balanced with the latency of the instruc- tions and associated latency values must be specified. An
prefetching of the pre- diction data. applica- tion process can be mapped onto a processor only if the
A methodology for architecture exploration must allow archi- processor can execute all symbolic instructions that may be issued
tecture models to be defined quickly and conveniently. SPADE by the ap- plication process. A specified latency value represents
aims to facilitate the construction of architecture models by the time that the processor needs to execute the fragment of the
offering a li- brary of generic architecture building blocks from application that has been annotated with the corresponding
which architec- ture models can be composed. These building symbolic instruction. These latency values may be retrieved from
blocks are generic in that they are non-functional blocks that can databooks, may be es- timated by expert designers, or may be
be used to model a broad class of programmable or dedicated obtained via off-line com- pilation, simulation or synthesis
components. Upon ex- ecution, the blocks account for the activities. For example, one may run a code fragment through a
consumption of time without processing any real data. This is compiler to determine the latency associated with the execution of
further explained in section 6. that code fragment on a particular embedded processor core.
The generic building block approach is enabled by the trace For the MPEG case study, we obtained the latency values for
driven simulation technique employed by SPADE. With this tech- most of the symbolic instructions from the databook. For the other
nique the application model is executed on top of the architecture symbolic instructions we defined ranges of latency values that we
model, to actually drive the architecture model with data- wanted to explore. Remember that the objective of this work was
dependent traces. As a consequence, abstract non-functional to validate the SPADE methodology and not to validate the TM-
architecture mod- els, built from the generic building blocks, can 2000 MPEG decoder design.
be used for per- formance analysis of architectures running data
dependent appli- cations. The traces transfer information on 5 Mapping
communication and computation operations performed by the
application processes. Specifically, a trace contains an entry for The next step was to define a mapping from the application model
each read, write or ex- ecute call performed by an application (Figure 3) onto the architecture model. Each process from the
process. The principle of trace driven simulation in SPADE is application model is mapped onto a processor in the architecture
illustrated in Figure 5. model. Process ports from the application model are mapped onto
An architecture model is composed from the building blocks. the processor ports to the interface blocks. The mapping of the
This modeling involves the definition of a netlist and requires no pro- cess ports determines the mapping of the channels onto the
additional programming. In SPADE, a processor, be it programma- commu- nication structures in the architecture.
ble or dedicated, is modeled by a single trace driven execution
unit
For the MPEG case the mapping is rather straightforward. The the additional processor. The flat part of the figure is the part
process Tinput is mapped to Video-In and the process Toutput is where all deadlines are met; for a high bus load (small period) and
mapped to Video-Out. Both Thdr and TmemMan are mapped to for a high frame rate deadlines are missed.
the CPU. The remaining processes are mapped one-to-one onto
proces- sors in the dedicated MPEG decoder.
SPADE does not require all channels of an application model to
be mapped. In general, if a channel imposes only a small commu-
nication workload and does not relate to relevant synchronization
delays at the architectural level, then the mapping of this channel
may be omitted without a serious effect on the performance mea-
surements. For the application model of Figure 3, we concluded
for most of the channels that carry sequence or picture properties
that the above condition was satisfied. Hence, these channels were
not mapped. The remaining channels impose a significant
commu- nication workload or contribute to relevant
synchronization among processors, and hence were mapped.
6 Performance Analysis
Several system level architecture simulation frameworks have [7] A.D. Pimentel and L.O. Hertzberger, “An architecture work-
been developed. The Scenic framework [4] allows a designer to bench for multicomputers,” in Proceedings of the 11th Inter-
use C++ to model and simulate mixed hardware-software national Parallel Processing Symposium, Geneva, Apr. 1997.
systems. TSS (Tool for System Simulation) [5] is the Philips in-
house architec- ture modeling and simulation framework. In TSS
an architecture is a network of interconnected modules, which are
modeled using C.