CPU-GPU System
SECTION: 7
SUBMITTED BY:
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
Motivation:
We were motivated to choose the CPU-GPU system as our project topic by a desire to understand how modern machines exploit heterogeneity and how an APU works. In addition, our instructor's examples on the topic and our interest in further research in our upcoming courses inspired us to select it.
Heterogeneous Computing: Here to Stay
Introduction:
Heterogeneous computing has become popular in recent years because of its flexibility. Homogeneous computing provided adequate performance for many applications in the past, yet many of those applications contain more than one type of parallelism. Over the past few years, most applications with more than one type of embedded parallelism have been considered for parallel implementation. Heterogeneous computing is well suited to high-performance machines, providing very fast processing for computationally demanding tasks with diverse computing needs. Homogeneous systems, on the other hand, cannot deliver the desired speedups, and as a result homogeneous parallelism in applications is decreasing.
The best strategy: Making the best use of heterogeneous computing is not easy. Committing to a single language for further development is a hard decision, and introducing a new language is harder still. OpenCL and MPI are good candidates for heterogeneous computing, but reliability remains a concern for both. Compilers must be modified to exploit heterogeneous nodes, and operating systems and computer architects must learn new tricks to adapt. At the algorithm level, computation is now cheaper than memory access and data movement, so minimizing data movement is a promising way to achieve the desired performance.
We use a set of benchmark tests to characterize the performance of the AMD Fusion architecture. These benchmarks indicate that the Fusion architecture can relieve the PCIe bottleneck associated with a discrete GPU, though not in every case. For the first-generation AMD Fusion, the APU, a fused CPU+GPU combination, provides better data transfer performance than a discrete CPU+GPU combination: our empirical findings show that the APU improves data transfer rates by 1.7- to 6.0-fold over the discrete CPU+GPU mix. For one specific benchmark, however, the Fusion APU's overall run time is 3.5 times higher than on the discrete GPU, because the latter has 20 times more GPU cores, and more powerful ones. Improving data transfer times decreases the parallel overhead, thereby exposing more parallelism in the program.
In this architecture, the processing unit is known as the SIMD engine and comprises multiple thread processors, each containing four stream cores, along with a special-purpose core and a branch execution unit. The special-purpose (or T-stream) core executes certain mathematical functions in hardware. In addition, the compute cores are vector processors, so AMD GPUs can accelerate code written with vector types. Consequently, many threads must be launched to keep all GPU cores fully occupied. To run that many threads, however, the number of registers required per thread must be kept to a minimum: all registers used by a thread must be allocated in the register file, so the total number of threads that can be scheduled is limited by the size of the register file, a generous 256 KB in the current generation of AMD GPUs. A distinctive architectural feature of AMD GPUs is the inclusion of a rasterizer for handling two-dimensional thread and data matrices.
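As a concrete illustration of this point, here is a minimal OpenCL C sketch, assuming a hypothetical kernel of our own (scale_float4 and its arguments are not from the source): using the float4 vector type lets a single arithmetic operation feed the four stream cores of a VLIW thread processor.

```c
// Illustrative only: each work-item handles one float4, so a single
// vector multiply maps onto the four stream cores of a thread processor.
// Kernel and argument names (scale_float4, in, out, alpha) are ours.
__kernel void scale_float4(__global const float4 *in,
                           __global float4 *out,
                           const float alpha)
{
    size_t gid = get_global_id(0);  // one float4 element per work-item
    out[gid] = alpha * in[gid];     // four multiplies issued as one vector op
}
```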
The APU (or fused CPU+GPU) does help reduce the parallel overhead. However, because its SIMD cores are computationally less powerful, the execution time of the parallel part is longer than on the discrete GPU.
Evaluation of successive CPUs/APUs/GPUs
based on an OpenCL finite difference stencil
INTRODUCTION:
The APU memory system differs from that of a discrete GPU. In an integrated GPU system, not only can data be explicitly copied from system memory to the GPU partition and vice versa, but memory objects, known as zero-copy objects, can also be shared between the CPU and the integrated GPU. The part of the GPU memory that can be exposed to the CPU is called the GPU persistent memory.
GPU read-only zero-copy memory objects are stored in USWC (Uncacheable Speculative Write Combine) memory, an uncached memory region whose data is not stored in the CPU caches and which relies on contiguous CPU write operations, through the write-combine buffers, to achieve high memory throughput.

In this segment, the performance of two successive AMD APUs (family codenames Llano and Trinity) is evaluated and compared. The different memory locations of the APUs are referred to by lower-case letters: c refers to the regular cacheable CPU memory used for efficient CPU-GPU data transfers, z to the cacheable zero-copy memory objects, u to the uncached zero-copy memory objects (USWC), g to the regular GPU memory, and p to the GPU persistent memory. To test the performance of read and write accesses to these buffers, an OpenCL data placement benchmark was built. It makes the integrated GPU copy data from an input buffer to an output buffer stored in two (possibly different) APU memory locations. For both the input and output buffers, cg (respectively gc) refers to an explicit data copy from the CPU partition c to the GPU partition g (respectively from the GPU partition g to the CPU partition c); z, u and p refer to the corresponding zero-copy buffers.

The results shown in Fig. 1 cover both Llano and Trinity, where init is the input buffer initialization time, iwrite is the time to transfer the input buffer to GPU memory, kernel is the execution time of the OpenCL kernel that copies data from the input buffer to the output buffer, oread is the time to transfer the output buffer back to the CPU, and obackup is the time of an extra copy from the output buffer to a temporary buffer in CPU memory, used to measure the cost of reading from the memory location in which the output buffer resides. All tests are run up to 40 times after the devices have warmed up. First, GPU reads from USWC are as fast as GPU reads from GPU memory, while CPU writes to GPU persistent memory are fast but reads are very slow, as shown by obackup in Fig. 1. Second, contiguous CPU writes to USWC (u) offer the highest bandwidth for init. Finally, GPU memory accesses to z are slower than accesses to u and g.
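To make the placement letters concrete, the following host-side sketch shows how such buffers could be created through the standard OpenCL API. The flag-to-placement mapping is an assumption on our part, based on AMD's APP SDK documentation for that driver generation; CL_MEM_USE_PERSISTENT_MEM_AMD is an AMD-specific extension flag from cl_ext.h, and the function and variable names are ours.

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* AMD extension flags, e.g. CL_MEM_USE_PERSISTENT_MEM_AMD */

/* Allocate one buffer per placement discussed above (error checks omitted).
   How flags map to physical placements is runtime-specific and assumed here. */
void make_placements(cl_context ctx, size_t bytes,
                     cl_mem *g, cl_mem *u, cl_mem *p)
{
    cl_int err;

    /* g: regular GPU memory, reached through explicit copies (cg/gc paths). */
    *g = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* u: zero-copy object in system memory; with CL_MEM_READ_ONLY the AMD
       runtime may place it in uncached USWC memory (fast contiguous CPU
       writes, fast GPU reads, slow CPU reads). */
    *u = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                        bytes, NULL, &err);

    /* p: GPU persistent memory exposed to the CPU (AMD-specific flag). */
    *p = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                        bytes, NULL, &err);
}
```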
In this section, performance results are shown for the OpenCL kernels on each tested device. In addition, the impact of data placement strategies on APU performance is highlighted. Finally, the performance of the integrated GPUs is compared against that of the CPU and the discrete GPUs.
1) Device performance:
First, the performance numbers of each tested device are reported based on the kernel execution time only. For this purpose, 3D domains of size NxNx32 (with N ranging between 64 and 1024) are used. Figure 2(a) summarizes the performance of the different OpenCL implementations on the CPU, which we compare against an OpenMP Fortran 90 code (compiled and vectorized with the Intel Fortran Compiler).
On the CPU, the OpenCL vectorized implementation outperforms both the OpenCL scalar version and the local vectorized implementation, delivering the best performance: it is faster than or as fast as the OpenMP implementation. Figure 2(c) shows the performance of Cayman, where the local vectorized implementation is more efficient than the vectorized one. Figure 2(d) shows the performance of Tahiti, where the local vectorized implementation is again more efficient than the vectorized one. Figure 2(e) shows the performance of Llano, where the local vectorized and vectorized implementations perform almost identically, with the local vectorized one slightly more efficient. Figure 2(f) shows the performance of Trinity, where the local vectorized implementation is considerably faster than the vectorized one. An important observation is that for discrete GPUs, as well as for integrated GPUs, the local vectorized implementation is the most efficient and the scalar implementation is the slowest. The performance of Tahiti reaches up to 484 Gflop/s, the fastest of all tested devices.
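For reference, here is a minimal sketch of what the scalar variant of such a stencil kernel could look like in OpenCL C; the 7-point stencil shape, the coefficients c0 and c1, and the kernel name are our illustrative assumptions, not the paper's exact kernel. The vectorized variants differ mainly by processing float4 values, so each work-item updates four points at once.

```c
// Illustrative scalar 3D stencil (boundary handling omitted; launch over
// interior points only). Stencil shape and names are our assumptions.
__kernel void stencil_scalar(__global const float *in, __global float *out,
                             const int nx, const int ny,
                             const float c0, const float c1)
{
    int x = get_global_id(0), y = get_global_id(1), z = get_global_id(2);
    int i = (z * ny + y) * nx + x;          // linear index in the 3D grid
    out[i] = c0 * in[i]
           + c1 * (in[i - 1]       + in[i + 1]          // x neighbours
                 + in[i - nx]      + in[i + nx]         // y neighbours
                 + in[i - nx * ny] + in[i + nx * ny]);  // z neighbours
}
```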
2) Impact of data placement strategies on performance:
Stencil computations are usually used in iterative methods, where between two subsequent iterations the data may need to be resident in CPU memory in order to be used in further operations. We call this process data snapshotting. The frequency of data snapshotting, which is related to temporal blocking, can also be an important performance factor. Considering both the data placement strategies and the frequency of data snapshotting, we run the vectorized and local vectorized implementations of the stencil OpenCL kernel on the APUs, using one input buffer and one output buffer. Figures 2(g) and 2(h) show the performance of the kernel on a 1024x1024x32 grid, on Llano and Trinity respectively, as a function of the number of stencil computation passes performed before each snapshot (the frequency of data snapshotting). On Llano, the vectorized implementation performs close to the local vectorized implementation; on Trinity, the local vectorized implementation is considerably faster than the vectorized one. We conclude that, to obtain the best stencil performance on APUs, the local vectorized implementation should be coupled with the cggc data placement strategy, as in the host-side sketch below.
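A minimal host-side sketch of that strategy, assuming an already-built queue, kernel, and two device buffers (all function and variable names are ours): the data is copied in once (c to g), `passes` stencil launches ping-pong between the two buffers, and the result is read back (g to c) for the snapshot.

```c
#include <CL/cl.h>

/* Run `passes` stencil launches between two snapshots (error checks omitted). */
void run_with_snapshot(cl_command_queue q, cl_kernel k,
                       cl_mem in, cl_mem out, float *host, size_t bytes,
                       const size_t gws[3], int passes)
{
    /* c -> g: explicit copy of the input data to GPU memory. */
    clEnqueueWriteBuffer(q, in, CL_FALSE, 0, bytes, host, 0, NULL, NULL);

    for (int i = 0; i < passes; ++i) {
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(q, k, 3, NULL, gws, NULL, 0, NULL, NULL);
        cl_mem tmp = in; in = out; out = tmp;  /* ping-pong the buffers */
    }

    /* g -> c: after the final swap, `in` holds the latest result. */
    clEnqueueReadBuffer(q, in, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
}
```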
3) Performance comparison:
Finally, Fig. 2(b) illustrates a performance comparison of the tested devices as a function of the frequency of data snapshotting, using the best implementation on each. The domain size of the grid is 1024x1024x32. The best performance is always obtained on the Tahiti discrete GPU. A Trinity APU with a unified memory (trinity (comp-only)) could match Tahiti's performance, and exceed Cayman's, when a snapshot is retrieved after every stencil computation pass. It is also noticeable that the integrated GPUs always outperform the CPU implementation.
CONCLUSION:
This paper has discussed the relevance of the APU's integrated GPU for high-performance scientific computing by providing a comparative study of CPU/APU/GPU performance on an OpenCL finite difference 3D stencil kernel. The implementations take advantage of the hardware improvements across successive GPU generations. The new Tahiti discrete GPU delivers impressive performance (up to 500 Gflop/s). The two integrated GPUs outperform the CPU for all data snapshotting frequencies, but only the integrated GPU of the latest APU (Trinity) can match discrete GPUs for problems with high communication requirements. In terms of both compute power and internal memory bandwidth, there is still a large gap between the latest discrete GPUs and the integrated GPUs in the latest APUs. For future APUs to be competitive for this kind of application (i.e., to outperform discrete GPUs), they would need more powerful integrated GPUs and a faster memory system. We also point out that APUs are low-power chips compared to discrete GPUs. The removal of the PCI Express interconnect in the APUs encourages the use of hybrid CPU plus integrated GPU OpenCL implementations for this application kernel. Overall, the tests show that discrete GPUs have more compute power and more internal memory bandwidth than the integrated GPUs in APUs, and that very good performance can be achieved on each tested device with a single set of OpenCL implementations.
CONVOLUTIONAL NEURAL NETWORKS FOR SELF-DRIVING CARS ON GPU
GPU, short for Graphics Processing Unit, is the heart of deep learning, a branch of artificial intelligence. It is a single-chip processor used for extensive graphical and mathematical computations, which frees up CPU cycles for other jobs. Each level in a deep learning network transforms its input data into a slightly more abstract and composite representation.
The speed of a GPU relative to a CPU depends on the type of computation being carried out: computations that can be performed in parallel are the best fit for a GPU. GPUs are mostly faster at computing than CPUs because of their memory bandwidth; the best CPUs offer around 50 GB/s, while the best GPUs reach 750 GB/s. A standalone GPU comes with its own VRAM, so transferring large chunks of data from the CPU to the GPU is a major challenge. Large and complex workloads take many CPU clock cycles, because the CPU takes up jobs sequentially and has fewer cores than its counterpart, the GPU. But although GPUs are faster, the time taken to transfer huge amounts of data from CPU to GPU can lead to higher overhead, depending on the architecture of the processors; the sketch below shows one way to measure that cost.
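That transfer cost can be measured directly. Below is a minimal sketch using OpenCL event profiling (we stay with OpenCL since the rest of this report uses it); it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and the function name is ours.

```c
#include <CL/cl.h>

/* Time one blocking host-to-device copy in milliseconds (error checks omitted). */
double transfer_time_ms(cl_command_queue q, cl_mem dev,
                        const void *src, size_t bytes)
{
    cl_event ev;
    cl_ulong t0, t1;

    clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, bytes, src, 0, NULL, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof t1, &t1, NULL);
    clReleaseEvent(ev);

    return (double)(t1 - t0) * 1e-6;  /* OpenCL timestamps are nanoseconds */
}
```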
Self-driving vehicles are one of the most common ideas in the development of technology. They can in fact be far better than human-driven cars, considering how systematically they are designed to work. A self-driving car is a vehicle that runs on its own, without any need for human interaction or help, and is expected to accomplish any task that a human-driven car can. Using deep learning, a vehicle is taught how to drive from the visual data collected by the cameras installed in the machine. The problem, however, lies in processing the data in time, for which a CNN (convolutional neural network) is used to train the model; CNNs are run on GPUs to make this possible.
Convolutional neural networks have been used for commercial purposes for over twenty years. On massively parallel graphics processing units (GPUs), CNN learning algorithms can now be deployed, speeding up learning and performance considerably.
For data collection, a person drives a car fitted with cameras. The result is a set of frames, each with the value of the steering angle at that time. Data is obtained under various weather conditions, lighting, and times of day, and sorted on the basis of these conditions. We then extend the dataset with artificial disturbances to train the network to get out of bad situations.
However, the problem lies in processing the data in a short period of time. This is where a GPU is chosen over a CPU, as it is better suited to machine learning thanks to technical features that let it process large amounts of data quickly. GPU memory, however, is not large, so a heterogeneous environment is used: the graphics accelerators carry out the calculations during training and testing, and the analysis is then done on the CPU. A graphics accelerator is a computer microelectronics component (a chipset attached to a video board) to which a computer program can offload the sending and refreshing of images, including the special effects typical of 2-D and 3-D images on the display monitor.
GPUs are currently being described as the new CPUs in this time of cutting-edge technologies. The GPU has emerged as the dominant chip architecture for self-driving technology because of its cost efficiency and popularity. Nvidia, the world's leading GPU manufacturer, has been scoring major wins in creating GPU-powered AI platforms and teaming up with well-known automotive giants. After the launch of its original AI-based supercomputer platform Drive PX in 2015, the US-based Nvidia recently launched a new version of the platform called Pegasus, which can be used to power Level 5 autonomy: completely autonomous cars without pedals, steering wheels, or mirrors can be supported by this new platform. It is built on Nvidia's CUDA GPUs, which have increased its computing speed by 10 times and lowered its power consumption by 16 times. Given these new products from GPU manufacturers, particularly Nvidia, it can certainly be argued that GPU-powered AI platforms are critical to the efficient implementation of autonomous vehicles. Many more such technologies are in the works and will accelerate the production of big data systems powered by AI, in which GPUs will play a key role.
Final conclusion:
To achieve high performance from CPU-GPU processors, numerous steps have been taken in light of architectural advances in design, computational methods, and optimization. In this report we first looked at the state of the art of CPU-GPU architectures and tried to understand CPU-GPU design. Next, we focused on significant developments in each area of CPU-GPU processors as indicated by research studies. We then examined the interaction between CPU and GPU and their execution behavior. Finally, we considered the future of CPU-GPU systems.