Design of Graphics Processing Framework On FPGA


IEEE International Conference On Recent Trends In Electronics Information Communication Technology, May 20-21, 2016, India

Ramanathan S G, Pradeep Kumar B, C M Ananda, Jayakumar E P

Ramanathan S G: Aerospace Electronics and Systems Division, CSIR National Aerospace Laboratories, Bangalore, India (ram_ald@nal.res.in)
Pradeep Kumar B: Aerospace Electronics and Systems Division, CSIR National Aerospace Laboratories, Bangalore, India (pradeepkumar@nal.res.in)
C M Ananda: Aerospace Electronics and Systems Division, CSIR National Aerospace Laboratories, Bangalore, India (ananda_cm@nal.res.in)
Jayakumar E P: Dept. of Electronics & Communication Engg., National Institute of Technology, Calicut, Kerala, India (jay@nitc.ac.in)

978-1-5090-0774-5/16/$31.00 © 2016 IEEE

Abstract— High-performance graphics processing is a challenging task in the embedded systems domain. The Field Programmable Gate Array (FPGA) attracts great interest as a platform for building a Graphics Processing Unit (GPU) framework because of its feasibility and configurable resources. This paper provides design perspectives of a GPU framework and its building blocks to realize GPU functionality on an FPGA to draw base primitives. In this proposal, triple video display buffers have been designed and implemented for buffering entire video frames in a cyclic manner. The GPU framework IP core for the proposed approach has been designed and developed using Xilinx tools and has been implemented and tested on a Xilinx SP605 FPGA evaluation board. The IP core is commanded from a host system. The hardware description language VHDL has been used for the design of the GPU framework. This paper provides design and implementation details of the basic building blocks required to achieve GPU functionality on an FPGA.

Keywords— GPU, Video Frame Buffers, FPGA, DDR3 Memory Controller, VHDL

I. INTRODUCTION

A GPU is a unit that processes the graphical information required by display systems, such as primitives, images and videos. GPUs are very similar to CPUs, but the cores in a GPU are optimized to handle image and video information. The computation involves complex mathematics on the given input parameters to display graphics information.

A Graphics Processing Unit (GPU) is designed to perform heavy mathematical computation on multiple graphical input data in parallel. The input data are graphics primitives, such as polygon structures, represented in floating-point form. The processing mostly consists of applying trigonometric functions to the vertices of the input primitives. The computation inside a GPU is highly parallel and complex. Real-time processing of motion pictures requires accurate and precise algorithms to produce smooth transitions between frames.

GPU functionality involves mathematical calculation beyond the basic mathematics that a CPU performs well. A GPU provides much higher computational processing capability than a CPU. Thus there is a large difference in computational capability between a CPU and a GPU when processing graphical parameters to generate 2D/3D effects on projected displays. GPU design is highly optimized to perform this complex graphics processing on 3D polygons to generate motion pictures on screen.

In past generations, graphics processing had less importance in computer systems. CPUs handled the system's internal processing along with graphics processing. The technology now mandates dedicated graphics processing, and the CPU has been offloaded from graphics processing to boost overall system performance.

The GPU evolved to take graphics processing off the CPU, avoiding the loss of system speed caused by complex graphics processing. Its instruction set is optimized to handle graphical data and processing, so the CPU and GPU differ entirely in instruction set architecture. This helps accelerate video processing.

Video and image processing such as motion pictures, video editing, gaming, animation, or anything that deals with the graphics portion of system processing is handled by the GPU. It is integrated with video RAM, which buffers the computed output of the GPU and transfers it to the screen. The CPU and GPU work together to make graphics processing much simpler. The GPU acts as a co-processor on the motherboard to accelerate the graphics processing capabilities of the CPU.

The GPU was initially assigned to render polygons, rectangles, lines, etc., and to populate character pixels on the screen. Later, its functionality was extended to operations such as movement of objects, sequences of graphics operations and filling of objects with colors, and a 2D accelerator was implemented in the GPU using a fixed-function graphics pipeline. Later still, a 3D hardware accelerator was implemented with a dynamic-function graphics pipeline to meet the demands of the graphics animation and gaming console markets.

An advanced GPU comprises many cores that perform compute-intensive functions, whereas a CPU has multiple cores that perform sequential operations; together they form a heterogeneous system. A GPU processes many vertices in parallel using the same program, i.e., SIMD (Single Instruction Multiple Data). In turn, it processes the given primitives at the same time to create a scene on the screen.

II. BACKGROUND

A. Graphics Pipeline

A number of studies have attempted to understand the underlying GPU architecture [1] in order to develop the design framework of a GPU on FPGA. A typical graphics pipeline [4] of a GPU is represented in Fig. 1. The GPU functions lie between the input commands and the frame memory buffer. The efficiency of any GPU depends on the algorithms developed for the following functions [2]:

• Geometry Engine
• Rasterization Process (Lighting, Shadows, Colors, etc.)
• Texturing
• Fragmentation

The Geometry Engine [3] works on the vertices of primitives to transform 3D points into the 2D projected space of the screen. Primitive coordinates that fall outside the projected screen area are clipped before the rasterization process. The rasterization process computes the pixel mapping that illuminates pixels for the given primitive vertices. In addition to rendering pixels, it also handles lighting, shadowing and color of the objects, which requires additional properties of the objects to be drawn. Texturing is the process of wrapping a 3D object with 2D patterns or color shades on top of the given object. Fragmentation is the process of generating the pixels that are needed to draw the primitives into the frame buffers. The graphics pipeline executes the same instruction on different data at various instants of time.

[Fig. 1 block diagram. Main blocks: Commands from CPU, Geometry Engine, Rasterization Process, Texturing, Fragmentation, Frame Buffers, Display Screen. Other labels: Evaluators, Transform, Vertex Program, Lighting, Text Coordinates, Clipping, Graphics Pipeline.]
Fig. 1: Graphics Pipeline Blocks

Advanced graphics processors are designed with 'n' graphics pipelines as parallel cores to handle multiple data at the same time. While processing a video frame, multiple objects and primitives with different coordinates are handled in parallel. An efficient graphics processing unit processes all such inputs in parallel and generates a minimum of 24 frames per second to write into the frame buffers.

III. GPU BLOCKS ON FPGA

The basic building blocks of the GPU framework design are the rasterization logic, the memory controller, the tri-display buffers in DDR3 and the video graphics controller design. A host application drives the GPU blocks on the FPGA by generating the instructions. The GPU building blocks implemented inside the FPGA are represented in Fig. 2 and explained below:

[Fig. 2 block diagram. Blocks: HOST Application on CPU, Rendering/Rasterization Logic, DDR3 Memory Controller, Flush Buffer Logic, Read Latest Display Buffer, Video Graphics Controller Design, Display Station (XGA Resolution). The DDR3 memory holds the video frame buffers: Operational Buffer, Display Buffer#1, Display Buffer#2, Display Buffer#3. A FLUSH Command signal is also shown.]
Fig. 2: GPU Blocks on FPGA

A. Host Application

The host application runs on the CPU and provides commands to the GPU framework on the FPGA to draw the primitives and their color. Instructions from the host application are screen coordinates, color, clear screen, flush and length of data. The flush command flushes the entire current frame to the next frame. Whenever the drawing of an entire frame is completed, the flush command is issued from the host machine to perform the flush operation. When the flush command is issued depends on the requirements of the application that creates the commands to display graphics primitives. The FPGA decodes these instructions before processing. The packet structure of an instruction from the host application to the FPGA is shown in Fig. 3:

| Screen_X_Coordinate | Screen_Y_Coordinate | Color | Length of Data | Flush/CLR |
Fig. 3: Frame Structure of Host's Instruction to FPGA

B. Rasterization Logic

The rasterization logic works on the primitive's vertices available in the instruction packet structure. The rendering/rasterization logic computes the DDR3 memory address. In turn, the memory addresses are mapped [8] to the projected screen location. The screen resolution used in this design is XGA (1024 x 768). If the provided coordinates are outside the projected screen resolution, the design logic ensures that all those regions are clipped. The mapping between screen coordinates and absolute screen location in the X and Y directions is represented in Fig. 4:

Screen coordinate -> absolute location
(-1, +1) -> (0, 0)      (0, +1) -> (511, 0)      (+1, +1) -> (1023, 0)
(-1,  0) -> (0, 383)    (0,  0) -> (511, 383)    (+1,  0) -> (1023, 383)
(-1, -1) -> (0, 767)    (0, -1) -> (511, 767)    (+1, -1) -> (1023, 767)
Fig. 4: Coordinate and Absolute Location in Screen

The formulae used to compute the absolute screen location from the given screen coordinates at the design resolution are given in equations (1) and (2), which relate the absolute X and Y locations to the X and Y screen coordinates.

[Equation (1): absolute X location as a function of the screen X coordinate]
[Equation (2): absolute Y location as a function of the screen Y coordinate]

The absolute screen location has to be offset in case the required primitive's size is more than two, so that the primitive appears centered on the given screen coordinates. The formulae for offsetting the computed absolute screen location are given
in equations (3) and (4), which yield the offset absolute locations from the absolute locations and the given dot size to draw.

[Equation (3): offset absolute X location]
[Equation (4): offset absolute Y location]

The DDR3 memory address computation for the offset screen location is given in equation (5), which yields the DDR3 memory location.

[Equation (5): DDR3 memory location as a function of the offset absolute X and Y locations]

The rasterization process provides the address and color data to draw into the operational memory buffer. The same color is used to illuminate pixels on the screen.

C. DDR3 Memory Controller

The memory controller core [11] accesses the DDR3 memory on the SP605 evaluation board for read and write operations, along with arbitration logic on the configured ports. All pixels of the screen resolution are mapped into DDR3 memory locations for video buffering. The entire XGA resolution is mapped to a 3072 KB DDR3 memory region (1024 * 768 * 4 bytes), in which each pixel's data occupies 4 bytes. A pixel color is stored in DDR3 for pixel illumination. The DDR3 memory mapping is shown in Fig. 5.

[Fig. 5 illustration. Labels: A Pixel in the Screen, RGB.]
Fig. 5: Display Resolution mapped with DDR3 Memory

The rasterization design updates each pixel's memory location. Each pixel comprises the three primary colors (RGB). The interface commands used for the memory controller are the number of words to read/write, the instruction for read/write and the memory address for the read/write operation. The flow charts for the read and write operations are shown in Fig. 6 and Fig. 8 respectively. A DDR3 memory read operation [12] captured in Xilinx ChipScope is shown in Fig. 7; note that the command is followed by data in a regular sequence, as depicted in Fig. 6. The DDR3 memory controller performs a memory calibration operation at the beginning of both read and write operations.

The difference between the read and write operations is the sequence of data and command, as shown in Fig. 6 and Fig. 8. While reading, the design logic sends the read command first, before the data. While writing, the design logic sends the write command after writing the data. In turn, the memory controller takes care of the writing/reading process through the assigned ports. The Full status and Empty status of the controller's data FIFO are checked before reading and writing respectively. A DDR3 write operation [12] captured in Xilinx ChipScope is shown in Fig. 9; note that data is followed by the command in a regular sequence, as depicted in the Fig. 8 flow chart.

[Fig. 6 flow chart. Steps in order: START; Initialize all Counters/Register; Memory Calibration Status (branches: Not Completed / Completed); Send Command to the Controller for Reading Data; Controller's Data FIFO Full (branches: Not Full / Full); Read Data FIFO Content; Increment Address to Next Location; Read Finish Status (branches: Not Finished / Finished); STOP.]
Fig. 6: DDR3 Memory Read Operation Flow Chart

Fig. 7: DDR3 Memory Read Operation Waveform

D. Triple Display Buffer Design

The buffers are memory regions (each of size 3072 KB) in DDR3. The full frame content is buffered in 3 display buffers using a triple buffer mechanism [5] before being displayed on screen. The rasterization logic draws directly into the operational buffer based on the instructions from the host application.
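The buffering scheme of this section can be modelled in a few lines of Python. This is a minimal behavioural sketch, not the VHDL design: the class and method names are illustrative, the back-to-back base addresses are an assumption, and each buffer is modelled as a sparse dictionary rather than a DDR3 region. The sizes (3072 KB per buffer, one operational plus three display buffers, 12 MB in total), the cyclic choice of display buffer on flush and the read of the latest flushed buffer come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

WIDTH, HEIGHT, BYTES_PER_PIXEL = 1024, 768, 4       # XGA, 4 bytes per pixel
BUFFER_BYTES = WIDTH * HEIGHT * BYTES_PER_PIXEL     # 3,145,728 bytes = 3072 KB
NUM_DISPLAY_BUFFERS = 3

# Assumed layout: buffers packed back to back starting at address 0.
OPERATIONAL_BASE = 0
DISPLAY_BASES = [BUFFER_BYTES * (i + 1) for i in range(NUM_DISPLAY_BUFFERS)]
TOTAL_BYTES = BUFFER_BYTES * (1 + NUM_DISPLAY_BUFFERS)


@dataclass
class TripleBuffer:
    # Sparse pixel stores: pixel index -> color word.
    operational: Dict[int, int] = field(default_factory=dict)
    display: List[Dict[int, int]] = field(
        default_factory=lambda: [{} for _ in range(NUM_DISPLAY_BUFFERS)])
    next_flush: int = 0             # display buffer the next flush will fill
    latest: Optional[int] = None    # most recently flushed display buffer

    def draw(self, pixel_index: int, color: int) -> None:
        """Rendering side: draw only into the operational buffer."""
        self.operational[pixel_index] = color

    def flush(self) -> None:
        """Copy the operational buffer into the next display buffer, cyclically."""
        self.display[self.next_flush] = dict(self.operational)
        self.latest = self.next_flush
        self.next_flush = (self.next_flush + 1) % NUM_DISPLAY_BUFFERS

    def read_latest(self) -> Dict[int, int]:
        """Display side: read the most recently flushed display buffer."""
        return {} if self.latest is None else self.display[self.latest]
```

In this model the fourth flush wraps around and overwrites the first display buffer, while the read side always sees the most recently flushed frame.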
[Fig. 8 flow chart. Steps in order: START; Initialize all Counters/Register; Memory Calibration Status (branches: Not Completed / Completed); Controller's Data FIFO Empty (branches: Not Empty / Empty); Write into Data FIFO; Send Command to Controller for writing to DDR3; Increment Address to Next Location; Write Finish Status (branches: Not Finished / Finished); STOP.]
Fig. 8: DDR3 Memory Write Operation Flow Chart

Fig. 9: DDR3 Memory Write Operation Waveform

The tri-buffer operational blocks are shown in Fig. 10. Each DDR3 buffer region is referred to by its base address for read and write operations. The various ports are configured using the MIG tool. The rendering logic draws the frame contents only into the operational buffer, through PORT A. PORT B is used to read the content of the operational buffer and PORT C is used to write that content into a display buffer when the host application issues the FLUSH command. The flush operation using PORT B and PORT C is carried out by the Flush Buffer Logic in the design. The display buffer that receives the flushed copy of the operational buffer is chosen in a cyclic manner by design.

The Read Latest Display Buffer block reads a display buffer's frame content through PORT D and pushes it to the output FIFO after receiving the start-of-frame sync pulse from the video controller. The display buffer to read is the latest updated display buffer, as reported by the Flush Buffer Logic. Drawing a frame into the operational buffer, flushing the entire frame content into the display buffers and displaying the latest display buffer's content form a continuous, fast, cyclic sequence of operations that displays the motion picture on the screen.

The triple display buffer [8] is implemented in this GPU framework design to improve the efficiency of graphics processing through higher frames per second.

[Fig. 10 block diagram. Blocks: Rendering Logic, Operational Buffer Memory Region, Flush Buffer Logic, Read Latest Display Buffer, Display Buffer1, Display Buffer2, Display Buffer3. Ports: Port A, Port B, Port C, Port D. Triggers: On Flush Trigger, On Vsync_Trigger.]
Fig. 10: Tri-Display Buffer Blocks

The total buffer memory required to build this buffering scheme is 12 MBytes of DDR3 memory, which includes one operational buffer and three display buffers.

E. Video Graphics Controller Design

The horizontal synchronization, vertical synchronization and data enable control signals, along with the RGB color data, are generated by the video graphics controller design [7] for XGA resolution. The pixel clock frequency [6] depends mainly on the refresh rate of the display system and its resolution. In this design, the monitor refresh rate is 60 Hz. The standard pulse counts for generation of the control pulses as per the VESA standard [9] are shown in Table I:

TABLE I. STANDARD PULSE COUNTS FOR XGA RESOLUTION

Scan Line/Frame | Visible Area | Front Porch | Sync Pulse | Back Porch | Whole Line/Frame
Pixels          | 1024         | 24          | 136        | 160        | 1344
Lines           | 768          | 3           | 6          | 29         | 806

Equation (6) shows the pixel clock frequency computation formula, where $f_{pixel}$ is the pixel clock (expressed in MHz), $N_{H}$ is the number of pixels in a whole horizontal scan line, $N_{V}$ is the number of vertical lines in a whole frame and $f_{refresh}$ is the display refresh rate.

$f_{pixel} = N_{H} \times N_{V} \times f_{refresh}$ (6)

65 MHz (1344 * 806 * 60 = 64,995,840 Hz) is the pixel clock required for the video graphics controller design. The pixel colors stored in the buffer memory locations are read sequentially by the video controller design and populated on the visible area of the screen.

IV. ADDRESS CALCULATION FOR DRAWING DOT IN SCREEN

Two dot sizes are considered below as samples to show the memory address computation, followed by the experimental set-up and the on-screen results, which display dots of different sizes at different screen locations as shown in Fig. 12. Case 1 and Case 2 give the computation involved in rendering dot primitives of size 2 and 5 respectively. The numerical values are derived by simple substitution of the given dot size and the screen coordinates (0, 0) into equations (1) through (5).

A. Case 1: DOT Size = 2:

The computed absolute coordinates are (511, 383). For the given dot size of 2, the starting addresses of two rows are computed and shown in Table II. The same is represented in Fig. 11, where four pixels are illuminated.
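The substitution can be checked with a short Python sketch. The formulas below are an assumed form of equations (1) to (5), chosen because they reproduce the mapping in Fig. 4 and the row addresses reported for dot sizes 2 and 5; the function names are illustrative and are not taken from the VHDL design. Note that the resulting addresses index pixels (row * 1024 + column), not bytes, and that clipping of out-of-range coordinates is left out.

```python
WIDTH, HEIGHT = 1024, 768  # XGA resolution used in the design


def to_absolute(x, y):
    """Assumed form of eqs. (1)-(2): map screen coordinates in [-1, +1]
    to an absolute pixel location, truncating fractions."""
    x_abs = int((x + 1) * (WIDTH - 1) / 2)
    y_abs = int((1 - y) * (HEIGHT - 1) / 2)
    return x_abs, y_abs


def dot_row_addresses(x, y, size):
    """Assumed form of eqs. (3)-(5): offset the absolute location by half
    the dot size, then compute the start address of every row of the dot.

    Returns a list of (x_offset, y_offset, address) tuples, one per row.
    """
    x_abs, y_abs = to_absolute(x, y)
    x_off = x_abs - size // 2          # assumed eq. (3)
    y_off = y_abs - size // 2          # assumed eq. (4)
    return [
        (x_off, y_off + row, (y_off + row) * WIDTH + x_off)  # assumed eq. (5)
        for row in range(size)
    ]


for row in dot_row_addresses(0, 0, 2):
    print(row)
```

For a dot of size 2 at screen coordinates (0, 0) this yields the two row start addresses 391678 and 392702; the same call with size 5 yields five rows starting at address 390653.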
TABLE II. COMPUTED NUMERIC FOR DOT SIZE = 2

Dot size | Offset X location | Offset Y location | DDR3 memory address
2        | 510               | 382               | 391678
         | 510               | 383               | 392702

[Fig. 11 illustration. A 1024 * 768 display with the two row start addresses 391678 and 392702 marked.]
Fig. 11: DOT Size = 2 in Display Unit

B. Case 2: DOT Size = 5:

The computed absolute coordinates are (511, 383). For the given dot size of 5, the starting addresses of five rows are computed and shown in Table III.

TABLE III. COMPUTED NUMERIC FOR DOT SIZE = 5

Dot size | Offset X location | Offset Y location | DDR3 memory address
5        | 509               | 381               | 390653
         | 509               | 382               | 391677
         | 509               | 383               | 392701
         | 509               | 384               | 393725
         | 509               | 385               | 394749

V. EXPERIMENTAL SETUP AND RESULTS

The GPU framework design blocks have been designed, integrated and implemented on a Spartan-6 XC6SLX45T FPGA [10] and evaluated on the SP605 evaluation board [13]. The design uses Xilinx IP blocks such as FIFO, memory controller and clock generator in the Xilinx ISE 14.7 tool. The evaluation board provides a differential clock of 200 MHz to the FPGA. The clock generator IP blocks derive the clocks required for the operation of the internal blocks.

[Fig. 12 photograph. Labels: Host System, FPGA SP605 Evaluation Board, Display driven by FPGA.]
Fig. 12: Evaluation Board Set-Up and Results

As shown in Fig. 12, the host system is interfaced with the FPGA SP605 evaluation board, and the GPU design block drives the LCD display. The host system issues commands to draw a number of dots of different sizes at different locations of the screen in a single frame, followed by a flush command.

All dots are drawn at different locations of the operational buffer and flushed into a display buffer when the flush command is issued from the host machine. The flushed content is read and displayed on the screen. The display output with various dots on the screen is shown in Fig. 12.

VI. SUMMARY/CONCLUSIONS

The design of basic GPU framework blocks on FPGA has been presented in this paper, where rendering and frame buffering on the input side and the video graphics controller design on the output side were addressed. At present, the dot primitive is addressed as part of the rendering logic on the input side. This design prototype shall become a baseline framework on which to build the rest of the GPU functionality. Implementation of the triple display buffer in this platform shall improve the overall efficiency of the GPU system. VHDL has been used for the overall design and implementation of this prototype system. In future, the same platform shall be used for addressing anti-aliasing, blending, overlay of additional information, multi-layering concepts, etc.

ACKNOWLEDGMENT

This work was supported in part by the Council for Scientific & Industrial Research (CSIR), National Aerospace Laboratories (NAL). The authors gratefully acknowledge Mr. Shyam Chetty, Director, CSIR-NAL, Bangalore for continuous support and motivation.

REFERENCES

[1] J. George Cherian Panappally, Dhanesh M.S, "Design of Graphics Processing Unit for Image Processing," IEEE'14, Dec. 2014.
[2] Ahmed Al Maashri, Guangyu Sun, Xiangyu Dong, Vijay Narayanan and Yuan Xie, "3D GPU Architecture using Cache Stacking: Performance, Cost, Power and Thermal analysis," IEEE'09.
[3] Virginie Fresse, Dominique Houzet and Christophe Gravier, "GPU architecture evaluation for multispectral and hyperspectral image analysis," IEEE'10.
[4] Matt Pharr, Randima Fernando, "GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation," Addison-Wesley Professional, 2005.
[5] Shujjat Khan, Donald Bailey and Gourab Sen Gupta, "Simulation of Triple Buffer Scheme," IEEE'09.
[6] Van-Huan Tran, Xuan-Tu Tran, "An Efficient Architecture Design for VGA Monitor Controller," IEEE'11.
[7] Guohui Wang, Yong Guan, "Designing of VGA Character String Display Module Base on FPGA," IEEE'09.
[8] Jong Won Park, "An Efficient Buffer Memory System for Subarray Access," IEEE'01.
[9] VESA and Industry Standards and Guidelines for Computer Display Monitor Timing, VESA Standard DMT v1.0, r11, 2007.
[10] Xilinx, "Spartan-6 FPGA Family: Complete Data Sheet," www.xilinx.com.
[11] Xilinx, Using Memory Controller in Spartan-6 FPGAs, Four-Port Memory Controller Core v3.92. www.xilinx.com.
[12] Xilinx, Using Memory Controller in Spartan-6 FPGAs, Memory Controller User Guide UG388 v2.3. www.xilinx.com.
[13] Xilinx, Using SP605 Hardware User Guide, UG526 v1.8. www.xilinx.com.