CPU-GPU System
SECTION: 7
SUBMITTED BY:
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
Motivation:
We were motivated to choose the CPU-GPU system as our project topic by a desire to understand how modern machines exploit heterogeneity and how an APU works. In addition, our instructor's examples on the topic and our interest in further research in our upcoming courses inspired us to select it.
Heterogeneous Computing: Here to Stay
Introduction:
Heterogeneous computing has become popular in recent years because of its flexibility. Homogeneous computing provided adequate performance for many applications in the past, yet many of those applications contain more than one type of parallelism. Over the past few years, most applications with more than one type of embedded parallelism have been considered for parallel implementation. Heterogeneous computing is well suited to high-performance machines, providing very fast processing for computationally demanding tasks with diverse computing needs. Homogeneous systems, on the other hand, cannot deliver the desired speedups, and as a result homogeneous parallelism in applications is decreasing.
The best strategy: Making the best use of heterogeneous computing is not easy. Committing to a single language for further development is a hard decision, and introducing a new language is harder still. OpenCL and MPI are good candidates for heterogeneous computing, but reliability remains a concern for both. Compilers must be modified to exploit heterogeneous nodes, and operating systems and computer architects must learn new tricks to adapt. At the algorithm level, computation is now cheaper than memory access and data movement, so minimizing data movement is a promising way to achieve the desired performance.
We use a set of benchmark tests to characterize the performance of the AMD Fusion architecture. These benchmarks indicate that the Fusion architecture can relieve the PCIe bottleneck associated with a discrete GPU, though not in every case. For the first-generation AMD Fusion, the APU, a fused CPU+GPU combination, provides better data transfer performance than a discrete CPU+GPU combination: our empirical findings show that the APU improves data transfer rates by 1.7- to 6.0-fold over the discrete CPU+GPU mix. For one specific benchmark, however, the Fusion APU's overall run time is 3.5 times higher than on the discrete GPU, because the latter has 20 times more GPU cores, and more powerful ones. Improving data transfer times decreases the parallel overhead, thereby exposing more parallelism in the program.
In this architecture, the processing unit is known as the SIMD engine and comprises multiple thread processors, each containing four stream cores, along with a special-purpose core and a branch execution unit. The special-purpose (or T-stream) core executes certain mathematical functions in hardware. In addition, the compute cores are vector processors, so AMD GPUs can accelerate code written with vector types. Consequently, many threads must be launched to keep all GPU cores fully occupied. To run that many threads, however, the number of registers required per thread must be kept to a minimum: all registers used by a thread must be allocated in the register file, so the total number of threads that can be scheduled is limited by the size of the register file, a generous 256 KB in the current generation of AMD GPUs. A distinctive architectural feature of AMD GPUs is the inclusion of a rasterizer for handling two-dimensional thread and data matrices.
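As a concrete illustration of this point, here is a minimal OpenCL C sketch, assuming a hypothetical kernel of our own (scale_float4 and its arguments are not from the source): using the float4 vector type lets a single arithmetic operation feed the four stream cores of a VLIW thread processor.

```c
// Illustrative only: each work-item handles one float4, so a single
// vector multiply maps onto the four stream cores of a thread processor.
// Kernel and argument names (scale_float4, in, out, alpha) are ours.
__kernel void scale_float4(__global const float4 *in,
                           __global float4 *out,
                           const float alpha)
{
    size_t gid = get_global_id(0);  // one float4 element per work-item
    out[gid] = alpha * in[gid];     // four multiplies issued as one vector op
}
```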
The APU (or fused CPU+GPU) does help reduce the parallel overhead. However, because its SIMD cores are computationally less powerful, the execution time of the parallel part is longer than on the discrete GPU.
Evaluation of successive CPUs/APUs/GPUs
based on an OpenCL finite difference stencil
INTRODUCTION:
The APU memory system differs from that of a discrete GPU. In an integrated GPU system, not only can data be explicitly copied from system memory to the GPU partition and vice versa, but memory objects, known as zero-copy objects, can also be shared between the CPU and the integrated GPU. The part of the GPU memory that can be exposed to the CPU is called the GPU persistent memory.
GPU read-only zero-copy memory objects are stored in USWC (Uncacheable Speculative Write Combine) memory, an uncached memory region whose data is not stored in the CPU caches and which relies on contiguous CPU write operations, through the write-combine buffers, to achieve high memory throughput.

In this segment, the performance of two successive AMD APUs (family codenames Llano and Trinity) is evaluated and compared. The different memory locations of the APUs are referred to by lower-case letters: c refers to the regular cacheable CPU memory used for efficient CPU-GPU data transfers, z to the cacheable zero-copy memory objects, u to the uncached zero-copy memory objects (USWC), g to the regular GPU memory, and p to the GPU persistent memory. To test the performance of read and write accesses to these buffers, an OpenCL data placement benchmark was built. It makes the integrated GPU copy data from an input buffer to an output buffer stored in two (possibly different) APU memory locations. For both the input and output buffers, cg (respectively gc) refers to an explicit data copy from the CPU partition c to the GPU partition g (respectively from the GPU partition g to the CPU partition c); z, u and p refer to the corresponding zero-copy buffers.

The results shown in Fig. 1 cover both Llano and Trinity, where init is the input buffer initialization time, iwrite is the time to transfer the input buffer to GPU memory, kernel is the execution time of the OpenCL kernel that copies data from the input buffer to the output buffer, oread is the time to transfer the output buffer back to the CPU, and obackup is the time of an extra copy from the output buffer to a temporary buffer in CPU memory, used to measure the cost of reading from the memory location in which the output buffer resides. All tests are run up to 40 times after the devices have warmed up. First, GPU reads from USWC are as fast as GPU reads from GPU memory, while CPU writes to GPU persistent memory are fast but reads are very slow, as shown by obackup in Fig. 1. Second, contiguous CPU writes to USWC (u) offer the highest bandwidth for init. Finally, GPU memory accesses to z are slower than accesses to u and g.
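To make the placement letters concrete, the following host-side sketch shows how such buffers could be created through the standard OpenCL API. The flag-to-placement mapping is an assumption on our part, based on AMD's APP SDK documentation for that driver generation; CL_MEM_USE_PERSISTENT_MEM_AMD is an AMD-specific extension flag from cl_ext.h, and the function and variable names are ours.

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* AMD extension flags, e.g. CL_MEM_USE_PERSISTENT_MEM_AMD */

/* Allocate one buffer per placement discussed above (error checks omitted).
   How flags map to physical placements is runtime-specific and assumed here. */
void make_placements(cl_context ctx, size_t bytes,
                     cl_mem *g, cl_mem *u, cl_mem *p)
{
    cl_int err;

    /* g: regular GPU memory, reached through explicit copies (cg/gc paths). */
    *g = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* u: zero-copy object in system memory; with CL_MEM_READ_ONLY the AMD
       runtime may place it in uncached USWC memory (fast contiguous CPU
       writes, fast GPU reads, slow CPU reads). */
    *u = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                        bytes, NULL, &err);

    /* p: GPU persistent memory exposed to the CPU (AMD-specific flag). */
    *p = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                        bytes, NULL, &err);
}
```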
In this section, performance results are shown for the OpenCL kernels on each tested device. In addition, the impact of data placement strategies on APU performance is highlighted. Finally, the performance of the integrated GPUs is compared against that of the CPU and the discrete GPUs.
1) Device performance:
First, the performance numbers of each tested device are reported based on the kernel execution time only. For this purpose, 3D domains of size NxNx32 (with N ranging between 64 and 1024) are used. Figure 2(a) summarizes the performance of the different OpenCL implementations on the CPU, which we compare against an OpenMP Fortran 90 code (compiled and vectorized with the Intel Fortran Compiler).
On the CPU, the OpenCL vectorized implementation outperforms both the OpenCL scalar version and the local vectorized implementation, delivering the best performance: it is faster than or as fast as the OpenMP implementation. Figure 2(c) shows the performance of Cayman, where the local vectorized implementation is more efficient than the vectorized one. Figure 2(d) shows the performance of Tahiti, where the local vectorized implementation is again more efficient than the vectorized one. Figure 2(e) shows the performance of Llano, where the local vectorized and vectorized implementations perform almost identically, with the local vectorized one slightly more efficient. Figure 2(f) shows the performance of Trinity, where the local vectorized implementation is considerably faster than the vectorized one. An important observation is that for discrete GPUs, as well as for integrated GPUs, the local vectorized implementation is the most efficient and the scalar implementation is the slowest. The performance of Tahiti reaches up to 484 Gflop/s, the fastest of all tested devices.
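For reference, here is a minimal sketch of what the scalar variant of such a stencil kernel could look like in OpenCL C; the 7-point stencil shape, the coefficients c0 and c1, and the kernel name are our illustrative assumptions, not the paper's exact kernel. The vectorized variants differ mainly by processing float4 values, so each work-item updates four points at once.

```c
// Illustrative scalar 3D stencil (boundary handling omitted; launch over
// interior points only). Stencil shape and names are our assumptions.
__kernel void stencil_scalar(__global const float *in, __global float *out,
                             const int nx, const int ny,
                             const float c0, const float c1)
{
    int x = get_global_id(0), y = get_global_id(1), z = get_global_id(2);
    int i = (z * ny + y) * nx + x;          // linear index in the 3D grid
    out[i] = c0 * in[i]
           + c1 * (in[i - 1]       + in[i + 1]          // x neighbours
                 + in[i - nx]      + in[i + nx]         // y neighbours
                 + in[i - nx * ny] + in[i + nx * ny]);  // z neighbours
}
```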
2) Impact of data placement strategies on performance:
Stencil computations are usually used in iterative methods, where between two subsequent iterations the data may need to be resident in CPU memory in order to be used in further operations. We call this process data snapshotting. The frequency of data snapshotting, which is related to temporal blocking, can also be an important performance factor. Considering both the data placement strategies and the frequency of data snapshotting, we run the vectorized and local vectorized implementations of the stencil OpenCL kernel on the APUs, using one input buffer and one output buffer. Figures 2(g) and 2(h) show the performance of the kernel on a 1024x1024x32 grid, on Llano and Trinity respectively, as a function of the number of stencil computation passes performed before each snapshot (the frequency of data snapshotting). On Llano, the vectorized implementation performs close to the local vectorized implementation; on Trinity, the local vectorized implementation is considerably faster than the vectorized one. We conclude that, to obtain the best stencil performance on APUs, the local vectorized implementation should be coupled with the cggc data placement strategy, as in the host-side sketch below.
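A minimal host-side sketch of that strategy, assuming an already-built queue, kernel, and two device buffers (all function and variable names are ours): the data is copied in once (c to g), `passes` stencil launches ping-pong between the two buffers, and the result is read back (g to c) for the snapshot.

```c
#include <CL/cl.h>

/* Run `passes` stencil launches between two snapshots (error checks omitted). */
void run_with_snapshot(cl_command_queue q, cl_kernel k,
                       cl_mem in, cl_mem out, float *host, size_t bytes,
                       const size_t gws[3], int passes)
{
    /* c -> g: explicit copy of the input data to GPU memory. */
    clEnqueueWriteBuffer(q, in, CL_FALSE, 0, bytes, host, 0, NULL, NULL);

    for (int i = 0; i < passes; ++i) {
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(q, k, 3, NULL, gws, NULL, 0, NULL, NULL);
        cl_mem tmp = in; in = out; out = tmp;  /* ping-pong the buffers */
    }

    /* g -> c: after the final swap, `in` holds the latest result. */
    clEnqueueReadBuffer(q, in, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
}
```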
3) Performance comparison:
Finally, Fig. 2(b) illustrates a performance comparison of the tested devices as a function of the frequency of data snapshotting, using the best implementation on each. The domain size of the grid is 1024x1024x32. The best performance is always obtained on the Tahiti discrete GPU. A Trinity APU with a unified memory (trinity (comp-only)) could match Tahiti's performance, and exceed Cayman's, when a snapshot is retrieved after every stencil computation pass. It is also noticeable that the integrated GPUs always outperform the CPU implementation.
CONCLUSION:
This paper has discussed the relevance of the APU's integrated GPU for high-performance scientific computing by providing a comparative study of CPU/APU/GPU performance on an OpenCL finite difference 3D stencil kernel. The implementations take advantage of the hardware improvements across successive GPU generations. The new Tahiti discrete GPU delivers impressive performance (up to 500 Gflop/s). The two integrated GPUs outperform the CPU for all data snapshotting frequencies, but only the integrated GPU of the latest APU (Trinity) can match discrete GPUs for problems with high communication requirements. In terms of both compute power and internal memory bandwidth, there is still a large gap between the latest discrete GPUs and the integrated GPUs in the latest APUs. For future APUs to be competitive for this kind of application (i.e., to outperform discrete GPUs), they would need more powerful integrated GPUs and a faster memory system. We also point out that APUs are low-power chips compared to discrete GPUs. The removal of the PCI Express interconnect in the APUs encourages the use of hybrid CPU plus integrated GPU OpenCL implementations for this application kernel. Overall, the tests show that discrete GPUs have more compute power and more internal memory bandwidth than the integrated GPUs in APUs, and that very good performance can be achieved on each tested device with a single set of OpenCL implementations.
CONVOLUTIONAL NEURAL NETWORKS FOR SELF-DRIVING CARS ON GPU
GPU, short for Graphics Processing Unit, is the heart of deep learning, a branch of artificial intelligence. It is a single-chip processor used for extensive graphical and mathematical computations, which frees up CPU cycles for other jobs. Each level in a deep learning network transforms its input data into a slightly more abstract and composite representation.
The speed of a GPU relative to a CPU depends on the type of computation being carried out: computations that can be performed in parallel are the best fit for a GPU. GPUs are mostly faster at computing than CPUs because of their memory bandwidth; the best CPUs offer around 50 GB/s, while the best GPUs reach 750 GB/s. A standalone GPU comes with its own VRAM, so transferring large chunks of data from the CPU to the GPU is a major challenge. Large and complex workloads take many CPU clock cycles, because the CPU takes up jobs sequentially and has fewer cores than its counterpart, the GPU. But although GPUs are faster, the time taken to transfer huge amounts of data from CPU to GPU can lead to higher overhead, depending on the architecture of the processors; the sketch below shows one way to measure that cost.
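That transfer cost can be measured directly. Below is a minimal sketch using OpenCL event profiling (we stay with OpenCL since the rest of this report uses it); it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and the function name is ours.

```c
#include <CL/cl.h>

/* Time one blocking host-to-device copy in milliseconds (error checks omitted). */
double transfer_time_ms(cl_command_queue q, cl_mem dev,
                        const void *src, size_t bytes)
{
    cl_event ev;
    cl_ulong t0, t1;

    clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, bytes, src, 0, NULL, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof t1, &t1, NULL);
    clReleaseEvent(ev);

    return (double)(t1 - t0) * 1e-6;  /* OpenCL timestamps are nanoseconds */
}
```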
Self-driving vehicles are one of the most common ideas in the development of technology. They can in fact be far better than human-driven cars, considering how systematically they are designed to work. A self-driving car is a vehicle that runs on its own, without any need for human interaction or help, and is expected to accomplish any task that a human-driven car can. Using deep learning, a vehicle is taught how to drive from the visual data collected by the cameras installed in the machine. The problem, however, lies in processing the data in time, for which a CNN (convolutional neural network) is used to train the model; CNNs are run on GPUs to make this possible.
Convolutional neural networks have been used for commercial purposes for over twenty years. On massively parallel graphics processing units (GPUs), CNN learning algorithms can now be deployed, speeding up learning and performance considerably.
For data collection, a person drives a car fitted with cameras. The result is a set of frames, each with the value of the steering angle at that time. Data is obtained under various weather conditions, lighting, and times of day, and sorted on the basis of these conditions. We then extend the dataset with artificial disturbances to train the network to get out of bad situations.
However, the problem lies in processing the data in a short period of time. This is where a GPU is chosen over a CPU, as it is better suited to machine learning thanks to technical features that let it process large amounts of data quickly. GPU memory, however, is not large, so a heterogeneous environment is used: the graphics accelerators carry out the calculations during training and testing, and the analysis is then done on the CPU. A graphics accelerator is a computer microelectronics component (a chipset attached to a video board) to which a computer program can offload the sending and refreshing of images, including the special effects typical of 2-D and 3-D images on the display monitor.
GPUs are currently being described as the new CPUs in this time of cutting-edge technologies. The GPU has emerged as the dominant chip architecture for self-driving technology because of its cost efficiency and popularity. Nvidia, the world's leading GPU manufacturer, has been scoring major wins in creating GPU-powered AI platforms and teaming up with well-known automotive giants. After the launch of its original AI-based supercomputer platform Drive PX in 2015, the US-based Nvidia recently launched a new version of the platform called Pegasus, which can be used to power Level 5 autonomy: completely autonomous cars without pedals, steering wheels, or mirrors can be supported by this new platform. It is built on Nvidia's CUDA GPUs, which have increased its computing speed by 10 times and lowered its power consumption by 16 times. Given these new products from GPU manufacturers, particularly Nvidia, it can certainly be argued that GPU-powered AI platforms are critical to the efficient implementation of autonomous vehicles. Many more such technologies are in the works and will accelerate the production of big data systems powered by AI, in which GPUs will play a key role.
Final conclusion:
To achieve high performance from CPU-GPU processors, numerous steps have been taken in light of architectural advances in design, computational methods, and optimization. In this report we first looked at the state of the art of CPU-GPU architectures and tried to understand CPU-GPU design. Next, we focused on significant developments in each area of CPU-GPU processors as indicated by research studies. We then examined the interaction between CPU and GPU and their execution behavior. Finally, we considered the future of CPU-GPU systems.