
Advanced Computer Architecture

CSD-411

Department of Computer Science and Engineering


National Institute of Technology Hamirpur
Hamirpur, Himachal Pradesh - 177005
ABOUT ME: DR. MOHAMMAD AHSAN

• PhD – National Institute of Technology Hamirpur (H.P.)
• M.Tech – National Institute of Technology Hamirpur (H.P.)
• Qualified UGC NET June-2015 and UGC NET Nov-2017 for Assistant Professor
• Qualified GATE 2012, GATE 2013, and GATE 2021
• Experience: NIT Hamirpur and NIT Andhra Pradesh

Vector Processor
• Vector architectures grab sets of data elements scattered in memory, place
them into large, sequential register files, operate on data in those register
files, and then disperse the results back into memory.
• They exploit data-level parallelism by applying a single instruction to a
collection of data elements in parallel.
• Components of VMIPS are:
• Vector registers
• Vector functional units
• Vector load/store unit
• A set of scalar registers

Example
• Y = a × X + Y
• X and Y are vectors, initially resident in memory, and a is a scalar.
• VMIPS code for DAXPY:
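A minimal sketch of the VMIPS sequence, assuming the scalar a sits at address
a and the starting addresses of X and Y are held in Rx and Ry:

    L.D      F0,a        ; load scalar a
    LV       V1,Rx       ; load vector X
    MULVS.D  V2,V1,F0    ; vector-scalar multiply
    LV       V3,Ry       ; load vector Y
    ADDVV.D  V4,V2,V3    ; add the two vectors
    SV       V4,Ry       ; store the result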

MIPS Code
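A sketch of the corresponding scalar MIPS loop, assuming 64 elements of 8
bytes each and the same register conventions as above:

        L.D      F0,a          ; load scalar a
        DADDIU   R4,Rx,#512    ; last address to load (64 × 8 bytes)
Loop:   L.D      F2,0(Rx)      ; load X[i]
        MUL.D    F2,F2,F0      ; a × X[i]
        L.D      F4,0(Ry)      ; load Y[i]
        ADD.D    F4,F4,F2      ; a × X[i] + Y[i]
        S.D      F4,0(Ry)      ; store into Y[i]
        DADDIU   Rx,Rx,#8      ; increment index to X
        DADDIU   Ry,Ry,#8      ; increment index to Y
        DSUBU    R20,R4,Rx     ; compute bound
        BNEZ     R20,Loop      ; check if done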

• The reduction in the number of instructions occurs because the vector
operations work on 64 elements at a time.
• In the MIPS code, every ADD.D must wait for a MUL.D, and every S.D must
wait for the ADD.D. On the vector processor, each vector instruction will
only stall for the first element in each vector.
• The pipeline stall frequency on MIPS is therefore about 64 times higher
than it is on VMIPS.

Vector-Length
• The number of elements in each vector register.
• This length, which is 64 for VMIPS, is unlikely to match the real vector
length in a program.
• Example:
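A sketch of such an operation in C, where the bound n is not known until
runtime:

    for (i = 0; i < n; i = i+1)
        Y[i] = a * X[i] + Y[i];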

• The size of the vector operations depends on n, which might be subject to
change during execution.
• Solution: the vector-length register (VLR).

• VLR controls the length of any vector operation, including a vector load or
store.
• The value in the VLR cannot be greater than the length of the vector
registers.
• This solves the problem as long as the real length is less than or equal to
the maximum vector length (MVL).

What if n > MVL?


• The strip-mining technique is used.
• Strip mining is the generation of code such that each vector operation is
done for a size less than or equal to the MVL.

• Create one loop to handle any number of iterations that is a multiple of
the MVL and another loop to handle the remaining iterations.
• Strip-mined version of the DAXPY loop:
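A sketch in C, assuming MVL holds the maximum vector length; the first pass
through the inner loop handles the odd-size piece (n mod MVL), and all later
passes run at the full MVL:

    low = 1;
    VL = (n % MVL);                       /* find the odd-size piece first */
    for (j = 0; j <= (n / MVL); j = j+1) {
        for (i = low; i < (low + VL); i = i+1)
            Y[i] = a * X[i] + Y[i];       /* main operation, length VL */
        low = low + VL;                   /* start of the next vector */
        VL = MVL;                         /* reset the length to the maximum */
    }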

Stride: Handling Multidimensional Arrays in Vector Architectures


• The position of adjacent elements in a vector may not be sequential in
memory.
• Example: multiplication of each row of B with each column of D (a sketch of
the loop nest appears after this list).


• When an array is allocated memory, it is stored in either row-major (as in
C) or column-major (as in Fortran) order.
• The elements of D that are accessed by iterations of the inner loop are
separated by the row size times 8 (the number of bytes per entry).
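A sketch of the loop nest in C, assuming 100 × 100 matrices; with row-major
storage, the inner k loop walks down a column of D, touching every 100th
double word:

    for (i = 0; i < 100; i = i+1)
        for (j = 0; j < 100; j = j+1) {
            A[i][j] = 0.0;                            /* clear accumulator */
            for (k = 0; k < 100; k = k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }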

• This distance separating elements to be gathered into a single register is
called the stride.
• In the example, matrix D has a stride of 100 double words (800 bytes), and
matrix B has a stride of 1 double word (8 bytes).
• Once a vector is loaded into a vector register, it acts as if it had logically
adjacent elements.
• This ability to access nonsequential memory locations and to reshape them
into a dense structure is one of the major advantages of a vector processor.

Enhancing Vector Performance


• Convoy
• Set of vector instructions that can be executed together.
• The instructions in a convoy must not contain any structural hazards; if such
hazards are present, the instructions need to be serialized and initiated in
different convoys.
• Assumption: A convoy of instructions must complete execution before any other
instructions (scalar or vector) can begin execution.
• Chaining
• It allows a vector operation to start as soon as the individual elements of its vector
source operand become available: the results from the first functional unit in the
chain are “forwarded” to the second functional unit.
Enhancing Vector Performance
• Chime
• The unit of time taken to execute one convoy.
• A vector sequence that consists of m convoys executes in m chimes.
• Using the chime measurement rather than clock cycles per result indicates that
we are ignoring certain overheads.
• Overheads:
1. Limitation on initiating multiple vector instructions in a single clock cycle.
2. Start-up time – it is principally determined by the pipelining latency of the
vector functional unit. The pipeline depths are:
▪ for floating-point add – 6 clock cycles
▪ for floating-point multiply – 7 clock cycles
▪ for floating-point divide – 20 clock cycles, and
▪ for vector load – 12 clock cycles.
Enhancing Vector Performance
• How many chimes will this vector sequence take?
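A representative DAXPY-style sequence (register names assumed as before):

    LV       V1,Rx       ; load vector X
    MULVS.D  V2,V1,F0    ; vector-scalar multiply
    LV       V3,Ry       ; load vector Y
    ADDVV.D  V4,V2,V3    ; add the two vectors
    SV       V4,Ry       ; store the sum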

• The first convoy starts with the first LV instruction. The MULVS.D is dependent on
the first LV, but chaining allows it to be in the same convoy.
• The second LV instruction is in a separate convoy, as there is a structural hazard on
the load/store unit with the prior LV instruction.
• The SV is in a third convoy, as it has a structural hazard on the LV in the second
convoy. The sequence therefore takes three convoys, or 3 chimes.

• The chime approximation is reasonably accurate for long vectors.
• For the example on the previous slide, with 64-element vectors the time in
chimes is 3, so the sequence takes about 64 × 3 = 192 clock cycles. The
overhead of issuing convoys in two separate clock cycles is small.

Chaining
• Even though a pair of operations depends on one another, chaining allows
the operations to proceed in parallel on separate elements of the vector.

Effectiveness of Compiler Vectorization


• Two factors affect the success with which a program can be run in vector
mode:
1. Structure of the program itself: do the loops have true data dependences, or can
they be restructured so as not to have such dependences?
2. The capability of the compiler. While no compiler can vectorize a loop where no
parallelism among the loop iterations exists, there is a tremendous variation in
how well different compilers do in vectorizing programs.

• Performance equation for the execution time of a vector loop with n
elements:
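A sketch of the usual form of this equation, where T_loop is the loop
overhead per strip-mined iteration, T_start is the vector start-up cost, and
T_chime is the number of chimes for the sequence (symbol names assumed):

    T_n = ceil(n / MVL) × (T_loop + T_start) + n × T_chime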

• A 500 MHz VMIPS would run this loop at 333 MFLOPS assuming no
strip-mining or start-up overhead: the loop performs 2 FLOPs per element in
3 chimes, i.e., 1.5 clock cycles per FLOP, and 500 MHz / 1.5 ≈ 333 MFLOPS.

• Ways to improve the performance:


1. Add additional vector load-store units, allow convoys to overlap to reduce the
impact of start-up overheads, and
2. Decrease the number of loads required by vector-register allocation.

Hardware and Software for VLIW and EPIC


• VLIW: Very Long Instruction Word
• EPIC: Explicitly Parallel Instruction Computing
Exploiting ILP Statically
• The core concepts that we exploit in statically based techniques are finding
parallelism, reducing control and data dependences, and using speculation.
• These techniques are applied at compile time by the compiler rather than at
runtime by the hardware.

• Advantages of compile-time techniques:
i. they do not burden runtime execution with any inefficiency, and
ii. they can take into account a wider range of the program than a runtime approach
might be able to incorporate. For example, a compiler might determine that an
entire loop can be executed in parallel, while hardware techniques might or might
not be able to find such parallelism.
• Disadvantage:
• they can use only compile-time information. Without runtime information,
compile-time techniques must often be conservative and assume the worst case.

Detecting and Exploiting Loop-Level Parallelism


• This analysis focuses on determining whether data accesses in later
iterations are dependent on data values produced in earlier iterations.
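A loop of the kind being analyzed, sketched in C (s is a scalar):

    for (i = 999; i >= 0; i = i-1)
        x[i] = x[i] + s;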

• In this loop, the dependence between the two uses of x[i] is within a single
iteration and is not loop carried.
• Successive uses of i in different iterations do form a loop-carried dependence,
but i is an induction variable that the compiler can recognize and eliminate.



• The analysis of loop-level parallelism involves recognizing structures such
as loops, array references, and induction variable computations.
• The compiler can do this analysis more easily at or near the source level
than at the machine-code level.
• Example: What are the dependences between S1 and S2? Is this loop
parallel? If not, show how to make it parallel.
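A sketch of the loop in question (array names assumed):

    for (i = 1; i <= 100; i = i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }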

• Statement S1 uses the value assigned in the previous iteration by statement
S2 (the value B[i]).
• Despite this loop-carried dependence, the loop can be made parallel. A
loop is parallel if it can be written without a cycle in its dependences.
• This dependence is not circular: neither statement depends on itself, and,
although S1 depends on S2, S2 does not depend on S1.

• Two observations allow us to transform the code:
• Interchanging the two statements will not affect the execution of S2, as there is no
dependence from S1 to S2.
• On the first iteration of the loop, statement S1 depends on the value of B[1]
computed prior to initiating the loop.
• Transformed code:
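A sketch of the transformed loop: the first computation of A is peeled off
before the loop and the last computation of B is peeled off after it, so
within each iteration S2 now feeds S1 directly:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];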

• The dependence between the two statements is no longer loop carried, so
iterations of the loop may be overlapped, provided the statements in each
iteration are kept in order.

How does the compiler detect dependences?


• To determine whether there is a dependence between two references to the
same array in a loop, assume that the array indices are affine, i.e., of the
form a × i + b for constants a and b. A dependence exists if two conditions hold:
• There are two iteration indices, j and k, both within the limits of the for loop.
• The loop stores into an array element indexed by a × j + b and later fetches from
the same array element when it is indexed by c × k + d.
• In general, we cannot determine whether a dependence exists at compile
time. For example, the values of a, b, c, and d may not be known (they
could be values in other arrays), making it impossible to tell if a
dependence exists.
• Many programs contain simple indices where a, b, c, and d are all
constants. For these cases, it is possible to devise reasonable compile time
tests for dependence.
• A simple and sufficient test for the absence of a dependence is the greatest
common divisor (GCD) test.
• If a loop-carried dependence exists, then GCD(c,a) must divide (d – b).

• Example: Use the GCD test to determine whether dependences exist in the
following loop:
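A representative loop for this test (the index expressions are chosen for
illustration):

    for (i = 1; i <= 100; i = i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }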

• Solution:
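For the loop sketched above: the store index is 2i + 3, so a = 2 and b = 3;
the fetch index is 2i, so c = 2 and d = 0. Then GCD(a, c) = 2 and
d − b = −3. Since 2 does not divide −3, no dependence is possible.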

Eliminating Dependent Computations


• Compilers can reduce the impact of dependent computations so as to
achieve more ILP.
• Examples:
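One common case is copy propagation on dependent address arithmetic; a sketch
in MIPS-style assembly:

    DADDUI  R1,R2,#4    ; R1 = R2 + 4
    DADDUI  R1,R1,#4    ; R1 = R1 + 4, depends on the instruction above

    ; after copy propagation, a single independent instruction:
    DADDUI  R1,R2,#8    ; R1 = R2 + 8

Another is tree-height reduction, which restructures a serial summation chain
such as sum = sum + x[0] + x[1] + ... into balanced partial sums that can be
computed in parallel.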

• When loops are unrolled, this sort of optimization is important for reducing
the impact of dependences arising from recurrences.

Scheduling and Structuring Code for Parallelism


• Techniques developed for this purpose:
i. Software pipelining, and
ii. Trace scheduling

Software Pipelining: Symbolic Loop Unrolling


• Each iteration in the software-pipelined code is made from instructions
chosen from different iterations of the original loop.
• By choosing instructions from different iterations, dependent computations
are separated from one another by an entire loop body, increasing the
possibility that the unrolled loop can be scheduled without stalls.
• A software pipelined loop interleaves instructions from different iterations
without unrolling the loop.
• This technique is the software counterpart to what Tomasulo’s algorithm
does in hardware.

• Loop example:
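A sketch of the kind of loop typically used here, which adds the scalar in F2
to each array element (register roles assumed: R1 points at the current
element, R2 at the end of the array):

    Loop:  L.D     F0,0(R1)     ; load array element
           ADD.D   F4,F0,F2     ; add scalar in F2
           S.D     F4,0(R1)     ; store result
           DADDUI  R1,R1,#-8    ; decrement pointer by 8 bytes
           BNE     R1,R2,Loop   ; branch until pointers are equal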

• The body of the unrolled loop without overhead instructions:
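With the overhead instructions (pointer update and branch) removed, three
symbolically unrolled iterations look like:

    L.D     F0,0(R1)
    ADD.D   F4,F0,F2
    S.D     F4,0(R1)      ; iteration i
    L.D     F0,-8(R1)
    ADD.D   F4,F0,F2
    S.D     F4,-8(R1)     ; iteration i+1
    L.D     F0,-16(R1)
    ADD.D   F4,F0,F2
    S.D     F4,-16(R1)    ; iteration i+2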



• Selected instructions from different iterations are then put together in the
loop with the loop control instructions:

• The execution pattern for a software-pipelined loop consists of start-up code
to fill the pipeline, a steady-state kernel, and finish-up code to drain it.



Hardware Support for Exposing Parallelism: Predicated Instructions


• Loop unrolling and software pipelining can be used to increase the amount
of parallelism available when the behavior of branches is fairly predictable
at compile time.
• When the behavior of branches is not well known, compiler techniques
alone may not be able to uncover much ILP. In such cases, the control
dependences may severely limit the amount of parallelism that can be
exploited.
• To overcome these problems, an architect can include conditional or
predicated instructions.

Predicated instruction
• Concept: An instruction refers to a condition, which is evaluated as part of
the instruction execution. If the condition is true, the instruction is
executed normally; if the condition is false, the execution continues as if
the instruction were a no-op.
• These instructions can be used to eliminate branches, converting a control
dependence into a data dependence and potentially improving
performance.
• Example: a conditional move instruction moves a value from one register to
another if the condition is true.
• It can be used to completely eliminate a branch in simple sequences.
• Example: consider the statement if (A == 0) S = T;
• Assume that registers R1, R2, and R3 hold the values of A, S, and T,
respectively.
• Code using a branch:
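A sketch in MIPS-style assembly:

        BNEZ   R1,L          ; skip the move if A != 0
        ADDU   R2,R3,R0      ; S = T
    L: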

• Using a conditional move:
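        CMOVZ  R2,R3,R1      ; if (R1 == 0) then R2 = R3

The control dependence on the branch becomes a data dependence on R1, and
the branch disappears.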


Hardware Support for Compiler Speculation
• In many cases, we would like to move speculated instructions not only before
the branch but also before the condition evaluation, and predication cannot
achieve this.
• To speculate ambitiously requires three capabilities:
i. The ability of the compiler to find instructions that, with the possible use of
register renaming, can be speculatively moved and not affect the program data
flow.
ii. The ability to ignore exceptions in speculated instructions, until we know that
such exceptions should really occur.
iii. The ability to speculatively interchange loads and stores, or stores and stores,
which may have address conflicts.

Advanced Topics in Disk Storage


• Improvement in disk capacity is expressed as improvement in areal
density, measured in bits per square inch:
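A sketch of the usual definition:

    Areal density = (Tracks/inch on a disk surface) × (Bits/inch on a track)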

• DRAM latency is about 100,000 times lower than disk latency, but this
performance advantage costs 30 to 150 times more per gigabyte for DRAM.

Disk Power
• Power is an increasing concern for disks as well as for processors.
• A typical ATA disk in 2011 might use 9 watts when idle, 11 watts when
reading or writing, and 13 watts when seeking.
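A commonly cited approximation relates motor power to platter geometry and
rotation speed (diameter in inches; treat this as a rough rule of thumb):

    Power ≈ Diameter^4.6 × RPM^2.8 × Number of platters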

• Smaller platters, slower rotation, and fewer platters all help in reducing the
disk motor power.

RAID
• It stands for either Redundant Array of Independent Disks or Redundant
Array of Inexpensive Disks.
• It is a technology that is used to increase the performance and/or reliability
of data storage.

RAID 0 – Striping
• In a RAID 0 system, data are split up into blocks that are written across
all the drives in the array.
• RAID 0 offers great performance, both in read and write operations.
• RAID 0 does not provide redundancy or fault tolerance.

RAID 1 – Mirroring
• Data are stored twice by writing them to both the data drive (or set of data
drives) and a mirror drive (or set of drives).
• If a drive fails, the controller uses either the data drive or the mirror drive
for data recovery and continuous operation.
• You need at least 2 drives for a RAID 1 array.
• RAID 1 is ideal for mission-critical storage, for instance for accounting
systems.

RAID 2
• It is an original RAID level but is rarely used today.
• It is a striping technology that stripes at the bit level instead of the block
level, and uses a complex type of error correcting code that takes the place
of parity.

RAID 3
• It uses byte-level striping with parity, and stores the parity calculations
on a dedicated disk.

RAID 4
• It stripes data at the block level and dedicates one disk to parity.

RAID 5 – Striping with parity


• RAID 5 is the most common secure RAID level.
• It requires at least 3 drives but can work with up to 16.
• Data blocks are striped across the drives and on one drive a parity
checksum of all the block data is written.
• The parity data are not written to a fixed drive; they are spread across all
drives.
• Read data transactions are very fast, while write data transactions are
somewhat slower (due to the parity that has to be calculated).
• It is ideal for file and application servers that have a limited number of data
drives.

RAID 6 – Striping with double parity


• Parity data are written to two drives.
• It requires at least 4 drives and can withstand 2 drives dying
simultaneously.
• If two drives fail, you still have access to all data, even while the failed
drives are being replaced. So RAID 6 is more secure than RAID 5.

RAID 10 – Combining Mirroring and Striping


• It is possible to combine the advantages of RAID 0 and RAID 1 in one
single system.
• This is a nested or hybrid RAID configuration.
• It provides security by mirroring all data on secondary drives while using
striping across each set of drives to speed up data transfers.
