Chapter 4 (Processors and Memory Hierarchy)


Advanced Computer Architecture

Kai Hwang & Naresh Jotwani

Chapter Four
Processors and Memory Hierarchy
Advanced Processor Technology
• Instruction Pipelines
A typical instruction passes through four phases: fetch, decode, execute and write-back. These phases are often
executed by an instruction pipeline, as depicted in Fig 4.1. A pipeline cycle is the time required for each phase to
complete its operation, assuming equal delay in all phases. Some terms associated with pipelining are:
 Instruction pipeline cycle: The clock period of the instruction pipeline.
 Instruction issue latency: The time (in cycles) required between issuing two adjacent instructions.
 Instruction issue rate: The number of instructions issued per cycle, also called the degree of a superscalar processor.
 Simple operation latency: The latency, measured in cycles, of simple operations such as add, load, store,
branch and move.
 Resource conflicts: The situation in which two or more instructions demand the same
functional unit at the same time.
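These definitions can be made concrete with a small timing sketch in Python (not from the text; the cycle counts follow the ideal pipeline model just described):

```python
def pipeline_cycles(n_instructions, stages=4, issue_latency=1):
    """Total cycles to finish n_instructions on an ideal instruction pipeline.

    The first instruction needs `stages` cycles to flow through fetch,
    decode, execute and write-back; each later instruction completes
    `issue_latency` cycles after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) * issue_latency

print(pipeline_cycles(8))                   # 11 cycles: pipeline fully utilized
print(pipeline_cycles(8, issue_latency=2))  # 18 cycles: issue every 2nd cycle
```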

Fig 4.1: Execution in a base scalar processor


As depicted in Fig 4.1, a base scalar processor issues one instruction per cycle, with a one-cycle latency
for a simple operation and a one-cycle latency between instruction issues. If successive instructions can enter
the pipeline at the rate of one per cycle, the pipeline is fully utilized.
Fig 4.2 shows that an instruction may take more than one cycle to issue. In this case, one instruction is
issued every two cycles, so the pipeline is underutilized. Fig 4.3 shows another underpipelined situation, in
which the cycle time is doubled by combining two or more stages into one.

Fig 4.2: Two cycles per instruction (underpipelined)

Fig 4.3: Cycle time doubled per instruction (underpipelined)
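The three scenarios of Figs 4.1-4.3 can be compared with a short throughput sketch (illustrative Python, assuming an ideal pipeline with no stalls):

```python
def throughput(n, stages, issue_latency, cycle_time=1.0):
    """Sustained instructions completed per unit time for n instructions."""
    total_cycles = stages + (n - 1) * issue_latency
    return n / (total_cycles * cycle_time)

n = 1000
base = throughput(n, stages=4, issue_latency=1)                    # Fig 4.1
slow_issue = throughput(n, stages=4, issue_latency=2)              # Fig 4.2
merged = throughput(n, stages=2, issue_latency=1, cycle_time=2.0)  # Fig 4.3
print(base, slow_issue, merged)
```

The base scalar case approaches one instruction per cycle; both underpipelined variants deliver roughly half that rate, which is why they are called underutilized.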


• Instruction-Set Architectures
 CISC (Complex Instruction Set Computing) processors: In the early days, instruction sets were kept
simple because hardware was expensive. As hardware costs dropped, more and more functions were built into
the hardware, and instruction sets grew large and complex.
 RISC (Reduced Instruction Set Computing) processors: When it was observed that only about 25% of the
instructions in a complex instruction set were actually used, the idea arose of implementing the elaborate,
low-frequency instructions in software rather than hardware. This freed valuable chip area for
building more efficient scalar processors.
 Architecture Distinctions: Although recent processors are designed with features from both
CISC and RISC, the fundamental distinctions remain, as depicted in Fig 4.4 and Fig 4.5.

Fig 4.4: Microprogrammed control with unified cache (CISC) Fig 4.5: Hardwired control with split cache (RISC)
• Characteristics of Typical CISC and RISC Architectures:
• A CISC processor architecture (VAX 8600 processor)
 The VAX 8600 contains two separate functional
units (one for integer and one for floating-point
instructions) that execute concurrently.
 A unified cache memory holds
both instructions and data.
 The instruction unit fetches and decodes
instructions, handles branches and feeds
operands to the two functional units.
 The TLB (Translation Lookaside Buffer)
is used to generate a physical address from
a virtual one.
 Both the integer and floating-point units are
pipelined; performance relies on a high
cache hit ratio and minimal disruption of
the pipeline flow by branches.
• A RISC processor architecture
(Intel i860 processor)
 The Intel i860 processor has nine functional
units interconnected by multiple data paths.
 All address buses are 32 bits wide and all
data buses are 64 bits wide.
 Instruction cache: A 4-Kbyte cache
organized as a two-way set-associative
memory, transferring 64 bits per clock cycle.
 Memory management unit (MMU): Translates
instruction and data addresses for further
processing.
 Data cache: Also a two-way set-associative
memory, of 8 Kbytes, transferring 128 bits per
clock cycle.
 Bus control unit: Coordinates 64-bit data
transfers between the chip and the external
environment.
 Integer unit (IU): Executes load, store, integer
and control instructions, and also fetches
instructions for the FPU.
 Floating-point unit (FPU): Coordinates
two other basic units, the multiplier unit
and the adder unit.
• Multiplier and Adder units: Under the supervision of the FPU, these two units operate simultaneously.
Special dual floating-point instructions such as add-and-multiply and subtract-and-multiply use both the
adder and the multiplier units.
• Graphics unit: Works with 8-, 16- or 32-bit pixel data types and supports three-dimensional
drawing with color intensity and shading.
Superscalar and Vector processors
• Pipelining in Superscalar processor:
 As mentioned earlier, a base scalar processor, whether RISC or CISC, has m = 1. To
exploit a higher degree of instruction-level parallelism, a superscalar processor of degree m must
issue m instructions per cycle.
 The simple operation latency should be only one cycle, as in the base scalar processor.
 A typical superscalar processor issues two to five instructions per cycle.
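Under ideal conditions, a k-stage pipeline of degree m finishes N instructions in about k + (N − m)/m cycles (a simplified model, not from the text; it assumes N is a multiple of m and no stalls). A short sketch of the resulting speedup over the base scalar case:

```python
def superscalar_cycles(n, k=4, m=1):
    """Ideal cycle count: k cycles for the first group of m instructions,
    then one additional cycle for each later group of m."""
    assert n % m == 0, "sketch assumes n is a multiple of the issue width m"
    return k + (n - m) // m

n = 120
base = superscalar_cycles(n, m=1)  # base scalar processor
s3 = superscalar_cycles(n, m=3)    # degree m = 3, as in Fig 4.9
print(base, s3)                    # 123 vs 43 cycles
print(base / s3)                   # speedup approaches m = 3 for large n
```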

Fig 4.9: Degree m=3 superscalar processor


• VLIW (Very Long Instruction Word) pipelining
 A typical VLIW machine has instruction words hundreds of bits in length.
 Each long instruction word encodes several basic operations, which are issued together and executed at the
same speed.
 Multiple functional units work concurrently in a VLIW processor.
 The functional units share a common large register file, as shown in Fig 4.10.
 The operations packed into one instruction word (256-1024 bits) are executed simultaneously by these
functional units in a synchronized way.
 Conventional short instruction words (e.g. 32 bits) are compacted together into a VLIW instruction word.
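The idea of compacting several short operations into one long word can be sketched with simple bit packing (a hypothetical encoding, not the layout of any real VLIW machine):

```python
def pack_vliw(ops, op_bits=32):
    """Compact several short operation encodings into one long word."""
    word = 0
    for i, op in enumerate(ops):
        assert 0 <= op < (1 << op_bits), "each operation must fit in op_bits"
        word |= op << (i * op_bits)
    return word

def unpack_vliw(word, n_ops, op_bits=32):
    """Split a long instruction word back into its operation fields."""
    mask = (1 << op_bits) - 1
    return [(word >> (i * op_bits)) & mask for i in range(n_ops)]

ops = [0xDEADBEEF, 0x12345678, 0x0BADF00D]  # three arbitrary 32-bit encodings
word = pack_vliw(ops)
print(word.bit_length() <= 96)      # True: three 32-bit ops fit in 96 bits
print(unpack_vliw(word, 3) == ops)  # True: the fields are recovered intact
```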

Fig 4.10: Typical VLIW processor with degree m=3 Fig 4.11: VLIW processor pipelining with degree m=3
• Hierarchical Memory Technology
• Storage devices are organized as a hierarchy, as shown in
the figure on the right.
• The memory technology and storage organization at each
level are characterized by five parameters:
 Access time (𝑡𝑖 ): The round-trip time from the CPU to
the ith-level memory.
 Memory size (𝑠𝑖 ): The size, in bytes or words, of the
ith-level memory.
 Cost per byte (𝑐𝑖 ): The cost per byte of the ith-level
memory; the total cost of the level is the product 𝑐𝑖 𝑠𝑖 .
 Transfer bandwidth (𝑏𝑖 ): The rate at which
data are transferred between adjacent levels.
 Unit of transfer (𝑥𝑖 ): The grain size for data
transfer between two adjacent levels.

Where 𝑖 = 1,2,3 … 𝑛
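The characteristic ordering of these parameters (access time and size increase while cost per byte decreases toward the outer levels) can be checked with invented, purely illustrative values:

```python
# Invented parameter values for a three-level hierarchy (cache, main
# memory, disk): t in ns, s in bytes, c in dollars per byte.
hierarchy = [
    {"name": "cache",       "t": 10,         "s": 512 * 1024,      "c": 1e-4},
    {"name": "main memory", "t": 100,        "s": 512 * 1024**2,   "c": 1e-6},
    {"name": "disk",        "t": 10_000_000, "s": 100 * 1024**3,   "c": 1e-9},
]

# Going outward, access time and size grow while cost per byte falls.
ordered = all(a["t"] < b["t"] and a["s"] < b["s"] and a["c"] > b["c"]
              for a, b in zip(hierarchy, hierarchy[1:]))
print(ordered)  # True
```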
• Inclusion property
• The inclusion property states that 𝑀1 ⊂ 𝑀2 ⊂ 𝑀3 ⊂ ⋯ ⊂ 𝑀𝑛 . This implies that all information
is stored in the outermost level 𝑀𝑛 , so that subsets of 𝑀𝑛 are copied into 𝑀𝑛−1 , subsets of
𝑀𝑛−1 are copied into 𝑀𝑛−2 , and so on.
• Information transfer between CPU and cache is in terms of words.
• The cache is divided into cache blocks (typically 32 bytes or 8 words).
• Information transfer between cache and main memory is in terms of blocks.
• Main memory is divided into pages (typically 4 kbytes), each of which consists of a series of
blocks.
• Pages are the units of data transfer between main memory and disks or other external
drives.
• Pages are organized as segments in the disk memory.
Fig 4.13: Inclusion property and data transfer between memory hierarchy
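The units of transfer above imply fixed ratios between levels; for the typical sizes quoted (32-byte blocks of 8 words, 4-Kbyte pages), a quick check:

```python
WORD = 4         # bytes per word (assuming 32-bit words)
BLOCK = 32       # bytes per cache block (8 words, as in the text)
PAGE = 4 * 1024  # bytes per page

words_per_block = BLOCK // WORD
blocks_per_page = PAGE // BLOCK
print(words_per_block, blocks_per_page)  # 8 words per block, 128 blocks per page
```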
• Coherence property
 The coherence property requires that copies of the same information in successive memory levels be consistent.
 If a word is modified in the cache, copies of that word must eventually be updated at all higher (outer) levels.
 There are two ways to maintain the coherence in memory hierarchy:
o Write-through (WT): Immediate updating in level 𝑀𝑖+1 if there is any modification in level 𝑀𝑖 .
o Write-back (WB): Delaying the updating in level 𝑀𝑖+1 until the modified data is replaced or
removed in level 𝑀𝑖 .
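The two policies can be sketched with toy cache classes (a minimal illustration, not a full cache model; addresses and values are arbitrary):

```python
class WriteThroughCache:
    """Write-through: every write is propagated to the next level at once."""
    def __init__(self, memory):
        self.memory = memory  # stands in for level M_{i+1}
        self.lines = {}
    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value        # immediate update of M_{i+1}

class WriteBackCache:
    """Write-back: the next level is updated only when a line is evicted."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()
    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)             # delayed update of M_{i+1}
    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

mem_wt, mem_wb = {}, {}
WriteThroughCache(mem_wt).write(0x10, 42)
print(mem_wt)    # memory consistent immediately after the write

wb = WriteBackCache(mem_wb)
wb.write(0x10, 42)
print(mem_wb)    # {}: memory stale until the line is replaced
wb.evict(0x10)
print(mem_wb)    # now consistent
```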

• Locality of references
Also known as the principle of locality: the tendency of a processor to access the same set of
memory locations repeatedly over a short period of time. There are three dimensions
of locality:
o Temporal: Recently referenced items tend to be accessed again in the near future e.g.
instructions in a loop, local variables, subroutines.
o Spatial: Access tends to be clustered e.g. array or program segments that are situated in
neighboring addresses.
o Sequential: Instructions tend to be executed sequentially unless an out-of-order instruction (e.g.
branch) arrives.
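Spatial locality can be illustrated with a toy single-block "cache" that counts how often consecutive accesses fall in the same block (a deliberately simplified model, not from the text):

```python
def block_hit_ratio(addresses, block_size=8):
    """Fraction of accesses landing in the same block as the previous one,
    using a toy cache that remembers only the last block touched."""
    last_block, hits = None, 0
    for a in addresses:
        block = a // block_size
        if block == last_block:
            hits += 1
        last_block = block
    return hits / len(addresses)

sequential = list(range(64))          # neighboring addresses: spatial locality
strided = [8 * i for i in range(64)]  # every access lands in a new block
print(block_hit_ratio(sequential))    # 0.875
print(block_hit_ratio(strided))       # 0.0
```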
• Memory Capacity Planning
 Hit ratio: The hit ratio is defined between any two adjacent memory levels. When a piece of information
is found in 𝑀𝑖 (where 𝑖 = 1, 2, … , 𝑛), we call it a hit; otherwise, a miss.
 The hit ratio (ℎ𝑖 ) at 𝑀𝑖 is the probability that the information is found at 𝑀𝑖 . The miss ratio at 𝑀𝑖 is
defined as (1 − ℎ𝑖 ).
 Since the hit ratio is a probability, and in an n-level hierarchy the outermost memory 𝑀𝑛 always
holds the information, ℎ𝑛 = 1.
 Access frequency: The access frequency at 𝑀𝑖 is defined as
𝑓𝑖 = (1 − ℎ1 )(1 − ℎ2 ) ⋯ (1 − ℎ𝑖−1 )ℎ𝑖
That is, a hit occurs at 𝑀𝑖 only after misses at all 𝑖 − 1 lower levels. Since every access is
eventually a hit at some level,
𝑓1 + 𝑓2 + ⋯ + 𝑓𝑛 = 1
and 𝑓1 = ℎ1 when there is only one level in the memory hierarchy.
 Effective access time (𝑇𝑒𝑓𝑓 ): The hit ratio at each level 𝑀𝑖 should be as high as possible, since there
is a time penalty for every miss at every level. Taking the miss penalties into account, the effective
access time is defined as
𝑇𝑒𝑓𝑓 = 𝑓1 𝑡1 + 𝑓2 𝑡2 + ⋯ + 𝑓𝑛 𝑡𝑛
= ℎ1 𝑡1 + (1 − ℎ1 )ℎ2 𝑡2 + (1 − ℎ1 )(1 − ℎ2 )ℎ3 𝑡3 + ⋯ + (1 − ℎ1 )(1 − ℎ2 ) ⋯ (1 − ℎ𝑛−1 )ℎ𝑛 𝑡𝑛
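The access-frequency and effective-access-time formulas can be evaluated for a hypothetical three-level hierarchy (the hit ratios and access times below are invented for illustration):

```python
def access_frequencies(h):
    """f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i for each level i."""
    f, miss_so_far = [], 1.0
    for hi in h:
        f.append(miss_so_far * hi)
        miss_so_far *= 1.0 - hi
    return f

def effective_access_time(h, t):
    """T_eff = sum of f_i * t_i over all n levels."""
    return sum(fi * ti for fi, ti in zip(access_frequencies(h), t))

h = [0.95, 0.99, 1.0]      # hit ratios; h_n = 1 at the outermost level
t = [10, 100, 10_000_000]  # access times in ns (invented)
f = access_frequencies(h)
print(round(sum(f), 10))                      # 1.0: frequencies sum to one
print(round(effective_access_time(h, t), 2))  # 5014.45 ns
```

Even with a 0.05% chance of going all the way to disk, the disk term dominates, which is why high hit ratios at the inner levels matter so much.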
 Hierarchy optimization: The total cost (𝐶𝑡𝑜𝑡𝑎𝑙 ) of the memory hierarchy is evaluated as
𝐶𝑡𝑜𝑡𝑎𝑙 = 𝑐1 𝑠1 + 𝑐2 𝑠2 + ⋯ + 𝑐𝑛 𝑠𝑛
i.e. the cost is distributed over the n memory levels. Since 𝑐1 > 𝑐2 > 𝑐3 > ⋯ > 𝑐𝑛 , we have
to choose 𝑠1 < 𝑠2 < 𝑠3 < ⋯ < 𝑠𝑛 .
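A quick numeric check of the cost formula, with invented per-byte costs and sizes that respect the orderings above:

```python
c = [1e-4, 1e-6, 1e-9]                          # cost per byte, decreasing
s = [512 * 1024, 512 * 1024**2, 100 * 1024**3]  # sizes in bytes, increasing
assert all(ci > cj for ci, cj in zip(c, c[1:]))  # c1 > c2 > ... > cn
assert all(si < sj for si, sj in zip(s, s[1:]))  # s1 < s2 < ... < sn

c_total = sum(ci * si for ci, si in zip(c, s))   # C_total = sum of c_i * s_i
print(round(c_total, 2))  # 696.67: main memory dominates in this example
```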

Math Problems
