Ee6304 MJ Lec 7

EE (CE) 6304 Computer Architecture Lecture #7 (9/18/13)
Myoungsoo Jung Assistant Professor Department of Electrical Engineering University of Texas at Dallas
Virtual Memory Review
Views of Memory
Real machines have limited amounts of memory
  640KB? A few GB? (This laptop = 2GB)

Programmer doesn't want to be bothered
  Do you think, "oh, this computer only has 128MB so I'll write my code this way"? What happens if you run on a different machine?
Programmer's View

Example 32-bit memory
  When programming, you don't care about how much real memory there is
  Even if you use a lot, memory can always be paged to disk
[Figure: address-space layout with regions Kernel, Text, Data, Heap, Stack; labels 0-2GB and 4GB]
A.K.A. Virtual Addresses
Programmer's View

Really "Program's View"
  Each program/process gets its own 4GB space
  Or much, much more with a 64-bit processor
[Figure: three per-process address spaces, each with Kernel, Text, Data, Heap, Stack]
CPU's View

At some point, the CPU is going to have to load-from/store-to memory
  All it knows is the real, A.K.A. physical, memory
  which unfortunately is often < 4GB, is almost never 4GB per process, and is never 16 exabytes per process
Pages
Memory is divided into pages, which are nothing more than fixed sized and aligned regions of memory
Typical size: 4KB/page (but not always)

  Page 0: 0-4095
  Page 1: 4096-8191
  Page 2: 8192-12287
  Page 3: 12288-16383
Page Table
Map from virtual addresses to physical locations
[Figure: virtual addresses 0K-12K mapped through the page table to physical addresses 0K-28K]
Page Table implements this V→P mapping
Entry includes permissions (e.g., read-only)
Physical Location may include hard-disk
Need for Translation

Virtual Address 0xFC51908B = Virtual Page Number 0xFC519 + Page Offset 0x08B
Page Table maps virtual page 0xFC519 to physical page 0x00152
Physical Address 0x0015208B goes to Main Memory
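The translation on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes 4KB pages (12 offset bits) and uses a plain dictionary as a stand-in for the page table; the function names are illustrative, not from the lecture.

```python
PAGE_OFFSET_BITS = 12  # 4KB pages, as on the slides


def split_virtual_address(va):
    """Return (virtual page number, page offset) for a 32-bit address."""
    vpn = va >> PAGE_OFFSET_BITS
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    return vpn, offset


def translate(va, page_table):
    """Map a virtual address to a physical address.

    page_table is a hypothetical dict: virtual page number -> physical page number.
    A missing entry raises KeyError, which stands in for a page fault here.
    """
    vpn, offset = split_virtual_address(va)
    pfn = page_table[vpn]
    return (pfn << PAGE_OFFSET_BITS) | offset


page_table = {0xFC519: 0x00152}
print(hex(translate(0xFC51908B, page_table)))  # 0x15208b, i.e. 0x0015208B
```

Only the page number is translated; the 12-bit offset passes through unchanged.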
Page Tables
[Figure: page tables mapping virtual pages (0K-12K) onto physical memory (0K-28K)]
What is in a Page Table Entry (PTE)?
  Pointer to actual page
  Permission bits: valid, read-only, read-write, write-only
Example: Intel x86 architecture PTE
  Address: same format as previous slide (10, 10, 12-bit offset)
  Intermediate page tables called "Directories"
  PTE layout:
    Bits 31-12: Page Frame Number (Physical Page Number)
    Bits 11-9:  Free (OS)
    Bit 8:      0
    Bit 7:      L
    Bit 6:      D
    Bit 5:      A
    Bit 4:      PCD
    Bit 3:      PWT
    Bit 2:      U
    Bit 1:      W
    Bit 0:      P
  P: Present (same as "valid" bit in other architectures)
  W: Writeable
  U: User accessible
  PWT: Page write transparent: external cache write-through
  PCD: Page cache disabled (page cannot be cached)
  A: Accessed: page has been accessed recently
  D: Dirty (PTE only): page has been modified recently
  L: L=1 means 4MB page (directory only). Bottom 22 bits of virtual address serve as offset
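To make the bit layout concrete, here is a small decoding sketch. The bit positions follow the x86 PTE layout on this slide; the function name, the returned dictionary shape, and the sample value 0x00152067 are illustrative assumptions.

```python
# Bit position of each flag in a 32-bit x86 PTE, per the slide.
PTE_FLAG_BITS = {"P": 0, "W": 1, "U": 2, "PWT": 3, "PCD": 4, "A": 5, "D": 6, "L": 7}


def decode_pte(pte):
    """Split a 32-bit PTE into frame number, OS-free bits, and flag booleans."""
    flags = {name: bool((pte >> bit) & 1) for name, bit in PTE_FLAG_BITS.items()}
    return {
        "pfn": (pte >> 12) & 0xFFFFF,   # bits 31-12: page frame number
        "os_bits": (pte >> 9) & 0x7,    # bits 11-9: free for OS use
        "flags": flags,
    }


# Hypothetical entry: frame 0x152, present, writeable, user, accessed, dirty.
entry = decode_pte(0x00152067)
print(hex(entry["pfn"]), [name for name, on in entry["flags"].items() if on])
```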
Three Advantages of Virtual Memory

Translation:
  Program can be given consistent view of memory, even though physical memory is scrambled
  Makes multithreading reasonable (now used a lot!)
  Only the most important part of program ("Working Set") must be in physical memory
  Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later
Protection:
  Different threads (or processes) protected from each other
  Different pages can be given special behavior (Read Only, Invisible to user programs, etc.)
  Kernel data protected from User programs
  Very important for protection from malicious programs
Sharing:
  Can map same physical page to multiple users ("Shared memory")
Large Address Space Support
[Figure: two-level page table. PageTablePtr points to the first-level table; the P1 index selects a 4-byte entry that points to a second-level table; the P2 index selects a 4-byte entry holding the physical page number; pages are 4KB]
Virtual Address: Virtual P1 index (10 bits) | Virtual P2 index (10 bits) | Offset (12 bits)
Physical Address: Physical Page # | Offset
Single-Level Page Table
  Large
  4KB pages for a 32-bit address means 1M entries
  Each process needs own page table!
Multi-Level Page Table
  Can allow sparseness of page table
  Portions of table can be swapped to disk
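The 10/10/12 split and the two-level lookup can be sketched as follows. Nested dictionaries stand in for the first-level and second-level tables (an assumption made for brevity; real tables are 4KB arrays of 4-byte entries), and a missing entry at either level models an unmapped region.

```python
def split_10_10_12(va):
    """Return (P1 index, P2 index, offset) for a 32-bit virtual address."""
    p1 = (va >> 22) & 0x3FF
    p2 = (va >> 12) & 0x3FF
    offset = va & 0xFFF
    return p1, p2, offset


def walk(va, first_level):
    """Two-level walk. Returns the physical address, or None if unmapped."""
    p1, p2, offset = split_10_10_12(va)
    second_level = first_level.get(p1)
    if second_level is None:          # whole 4MB region absent: table is sparse
        return None
    pfn = second_level.get(p2)
    if pfn is None:
        return None
    return (pfn << 12) | offset


# One second-level table is enough to map the example page.
first_level = {0x3F1: {0x119: 0x00152}}
print(hex(walk(0xFC51908B, first_level)))

# Size of a flat single-level table: 2**20 entries of 4 bytes each.
print(2**20 * 4 // 2**20, "MB per process")
```

The sparseness benefit shows up in `first_level`: only regions actually in use need a second-level table.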
TLB Review
Translation Look-Aside Buffers (TLB)
  Cache on translations
  Fully Associative, Set Associative, or Direct Mapped
[Figure: translation with a TLB. The CPU sends a VA to the TLB; on a hit the PA goes to the cache, and on a cache miss to main memory; on a TLB miss the translation is performed first]
TLBs are:
  Small: typically not more than 128-256 entries
  Fully Associative
Caching Applied to Address Translation
[Figure: the CPU issues a virtual address to the TLB. Cached? Yes: the physical address goes straight to physical memory. No: the MMU translates first. Data reads and writes are untranslated]
Question is one of page locality: does it exist?
  Instruction accesses spend a lot of time on the same page (since accesses sequential)
  Stack accesses have definite locality of reference
  Data accesses have less page locality, but still some
Can we have a TLB hierarchy?
  Sure: multiple levels at different sizes/speeds
What Actually Happens on a TLB Miss?

Hardware traversed page tables:
  On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    If PTE valid, hardware fills TLB and processor never knows
    If PTE marked as invalid, causes Page Fault, after which kernel decides what to do
Software traversed page tables (like MIPS):
  On TLB miss, processor receives TLB fault
  Kernel traverses page table to find PTE
    If PTE valid, fills TLB and returns from fault
    If PTE marked as invalid, internally calls Page Fault handler
Most chip sets provide hardware traversal
  Modern operating systems tend to have more TLB faults since they use translation for many things
  Examples: shared segments, user-level portions of an operating system
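The software-traversed case can be sketched as a small model. Everything here is an illustrative assumption: the class and exception names, the flat dictionary page table holding (frame, valid) pairs, and the FIFO eviction used when the TLB is full (the slides do not specify a TLB replacement policy).

```python
from collections import OrderedDict


class PageFault(Exception):
    """Raised when the PTE for a virtual page is missing or marked invalid."""


class SoftwareManagedTLB:
    def __init__(self, page_table, capacity=64):
        self.page_table = page_table      # vpn -> (pfn, valid)
        self.capacity = capacity
        self.entries = OrderedDict()      # vpn -> pfn, fully associative
        self.misses = 0

    def translate(self, va):
        vpn, offset = va >> 12, va & 0xFFF
        if vpn not in self.entries:       # TLB miss: processor takes a TLB fault
            self.misses += 1
            self._tlb_fault_handler(vpn)
        return (self.entries[vpn] << 12) | offset

    def _tlb_fault_handler(self, vpn):
        """Kernel side: traverse the page table, then fill the TLB or fault."""
        pfn, valid = self.page_table.get(vpn, (None, False))
        if not valid:
            raise PageFault(hex(vpn))     # hand off to the page fault handler
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # FIFO eviction (assumption)
        self.entries[vpn] = pfn


tlb = SoftwareManagedTLB({0xFC519: (0x152, True), 0x00001: (0x007, False)})
tlb.translate(0xFC51908B)   # miss, filled by the handler
tlb.translate(0xFC519FFF)   # hit on the same page
print(tlb.misses)
```

The second access lands on the same page and hits, which is the page locality the previous slide asks about.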
Implementing LRU
Have LRU counter for each line in a set
When line accessed
  Get old value X of its counter
  Set its counter to max value
  For every other line in the set
    If counter larger than X, decrement it
When replacement needed
  Select line whose counter is 0
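The counter scheme above can be sketched directly. One assumption is added that the slide leaves implicit: counters start as the distinct values 0..ways-1, so that exactly one line holds 0 at any time. The class name is illustrative.

```python
class LRUSet:
    """Per-set LRU tracking with one counter per line (0 = least recently used)."""

    def __init__(self, ways):
        # Assumption: start with distinct counters so a unique victim always exists.
        self.counters = list(range(ways))

    def access(self, line):
        old = self.counters[line]                 # old value X
        for other in range(len(self.counters)):
            if other != line and self.counters[other] > old:
                self.counters[other] -= 1         # lines used more recently age by one
        self.counters[line] = len(self.counters) - 1   # max value: most recent

    def victim(self):
        return self.counters.index(0)             # line whose counter is 0


s = LRUSet(4)
for line in (0, 1, 0):
    s.access(line)
print(s.counters, s.victim())
```

Because only counters above X are decremented, the counters remain a permutation of 0..ways-1 after every access.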
Clock Algorithm: Not Recently Used

Clock Algorithm: Approximate LRU (approx to approx to MIN)
  Replace an old page, not the oldest page
[Figure: a single clock hand sweeps over the set of all pages in memory, checking for pages not used recently and marking pages as not used recently; the page table keeps "dirty" and "used" bits per page]
Single Clock Hand: Advances only on page fault!
Details:
  Hardware "use" bit per physical page:
    Hardware sets use bit on each reference
    If use bit isn't set, means not referenced in a long time
  On page fault:
    Advance clock hand (not real time)
    Check use bit: 1 means used recently, so clear it and leave the page alone; 0 means selected candidate for replacement
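A minimal sketch of the clock hand, assuming a fixed number of physical frames and a software-visible list of use bits (in hardware the bit is set by the MMU on each reference). The class and method names are illustrative.

```python
class ClockReplacer:
    def __init__(self, num_frames):
        self.use = [0] * num_frames   # one "use" bit per physical page
        self.hand = 0                 # advances only on page fault

    def reference(self, frame):
        """Models the hardware setting the use bit on a reference."""
        self.use[frame] = 1

    def pick_victim(self):
        """Called on a page fault: sweep until a frame with use bit 0 is found."""
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % len(self.use)
            if self.use[frame]:
                self.use[frame] = 0   # used recently: clear and leave alone
            else:
                return frame          # not used since last sweep: replace it


clock = ClockReplacer(4)
for frame in (0, 1, 3):
    clock.reference(frame)
print(clock.pick_victim())
```

The loop always terminates: after at most one full sweep every use bit has been cleared, so some frame is selected.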
Example: R3000 pipeline

MIPS R3000 Pipeline
  Stages: Inst Fetch | Dcd/Reg | ALU / E.A. | Memory | Write Reg
  Units shown in the figure: TLB, I-Cache, RF, Operation, E.A., TLB, D-Cache, WB
TLB
  64 entry, on-chip, fully associative, software TLB fault handler
Virtual Address Space
  ASID (6 bits) | V. Page Number (20 bits) | Offset (12 bits)
  0xx User segment (caching based on PT/TLB entry)
  100 Kernel physical space, cached
  101 Kernel physical space, uncached
  11x Kernel virtual space
Allows context switching among 64 user processes without TLB flush
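The segment table above is keyed on the top three bits of the 32-bit virtual address. A small sketch of that decode follows; the function name and the returned strings are illustrative. The 64-process figure follows from the 6-bit ASID, since 2**6 = 64.

```python
def r3000_segment(va):
    """Classify a 32-bit virtual address by its top three bits."""
    top3 = (va >> 29) & 0x7
    if top3 < 0b100:                 # 0xx
        return "user"
    if top3 == 0b100:
        return "kernel physical, cached"
    if top3 == 0b101:
        return "kernel physical, uncached"
    return "kernel virtual"          # 11x


for va in (0x00400000, 0x80000000, 0xA0000000, 0xC0000000):
    print(hex(va), r3000_segment(va))

print(2**6, "address space IDs")
```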
Reducing translation time further
As described, TLB lookup is in serial with cache lookup:
[Figure: Virtual Address (V page no. | 10-bit offset) goes through TLB Lookup (V, Access Rights, PA) to give Physical Address (P page no. | 10-bit offset)]
Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  Works because offset available early
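Overlapping TLB & Cache Access
Here is how this might work with a 4K cache:
[Figure: the 32-bit address splits into a 20-bit page #, a 10-bit index, and a 2-bit displacement (00). The page # goes to an associative TLB lookup while the index selects one of 1K 4-byte entries in the 4K cache; the frame number (FN) from the TLB is compared with the FN stored in the cache to give Hit/Miss]
What if cache size is increased to 8KB?
  Overlap not complete
  Need to do something else
Another option: Virtual Caches
  Tags in cache are virtual addresses
  Translation only happens on cache misses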
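The overlap works only while the cache index and displacement bits fit inside the page offset, because those bits are identical in the virtual and physical address. A quick check of that condition, assuming 4KB pages; the function name and the `ways` parameter are illustrative, and raising associativity is mentioned only as one commonly used remedy, not as the slide's prescribed fix.

```python
def index_fits_in_page_offset(cache_bytes, ways=1, page_bytes=4096):
    """True if index + displacement bits come entirely from the page offset.

    Each way of the cache is indexed by log2(cache_bytes / ways) address bits,
    so the condition is simply that one way is no larger than a page.
    """
    return cache_bytes // ways <= page_bytes


print(index_fits_in_page_offset(4 * 1024))           # 4K direct-mapped
print(index_fits_in_page_offset(8 * 1024))           # 8K direct-mapped
print(index_fits_in_page_offset(8 * 1024, ways=2))   # 8K, 2-way set associative
```

With an 8KB direct-mapped cache the index needs one bit from the page number, which is not known until translation finishes; that is why the overlap is no longer complete.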
Summary: TLB, Virtual Memory
Page tables map virtual address to physical address
TLBs are important for fast translation
  TLB misses are significant in processor performance
  Most systems can't access all of 2nd level cache without TLB misses!
Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:
  1) Where can block be placed?
  2) How is block found?
  3) What block is replaced on miss?
  4) How are writes handled?
Today VM allows many processes to share single memory without having to swap all processes to disk
Exceptions: Traps and Interrupts (Hardware)
Exception vs. Interrupt

Exception: An unusual event happens to an instruction during its execution
  Examples: divide by zero, undefined opcode
Interrupt: Hardware signal to switch the processor to a new instruction stream
  Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
Problems with Pipelining

Problem: It must appear that the exception or interrupt occurs between 2 instructions (Ii and Ii+1)
  The effect of all instructions up to and including Ii is complete
  No effect of any instruction after Ii can take place
The interrupt (exception) handler either aborts the program or restarts at instruction Ii+1
Example: Device Interrupt

(Say, arrival of network message)
Interrupted program:
    add  r1,r2,r3
    subi r4,r1,#4
    slli r4,r4,#2
      <-- External Interrupt: Hiccup(!)
    lw   r2,0(r4)
    lw   r3,4(r4)
    add  r2,r2,r3
    sw   8(r4),r2
Interrupt Handler:
    Raise priority
    Reenable All Ints
    Save registers
    lw   r1,20(r0)
    lw   r2,0(r1)
    addi r3,r0,#5
    sw   0(r1),r3
    Restore registers
    Clear current Int
    Disable All Ints
    Restore priority
    RTE
Alternative: Polling
(again, for arrival of network message)
[Figure label: External Interrupt]
    Disable Network Intr
    subi r4,r1,#4
    slli r4,r4,#2
    lw   r2,0(r4)
    lw   r3,4(r4)
    add  r2,r2,r3
    sw   8(r4),r2
    lw   r1,12(r0)       Polling Point (check device register)
    beq  r1,no_mess
    lw   r1,20(r0)       Handler
    lw   r2,0(r1)
    addi r3,r0,#5
    sw   0(r1),r3
    Clear Network Intr
  no_mess:
Polling is faster/slower than Interrupts.

Polling is faster than interrupts because
  Compiler knows which registers are in use at polling point. Hence, do not need to save and restore registers (or not as many).
  Other interrupt overhead avoided (pipeline flush, trap priorities, etc).
Polling is slower than interrupts because
  Overhead of polling instructions is incurred regardless of whether or not handler is run. This could add to inner-loop delay.
  Device may have to wait for service for a long time.
When to use one or the other?
  Multi-axis tradeoff
    Frequent/regular events good for polling, as long as device can be controlled at user level.
    Interrupts good for infrequent/irregular events
    Interrupts good for ensuring regular/predictable service of events.
Trap/Interrupt classifications
Traps: relevant to the current process
  Faults, arithmetic traps, and synchronous traps
  Invoke software on behalf of the currently executing process
Interrupts: caused by asynchronous, outside events
  I/O devices requiring service (DISK, network)
  Clock interrupts (real time scheduling)
Machine Checks: caused by serious hardware failure
  Not always restartable
  Indicate that bad things have happened.
    Non-recoverable ECC error
    Machine room fire
    Power outage
A related classification: Synchronous vs. Asynchronous

Synchronous: means related to the instruction stream, i.e. during the execution of an instruction
  Must stop an instruction that is currently executing
  Page fault on load or store instruction
  Arithmetic exception
  Software Trap Instructions
Asynchronous: means unrelated to the instruction stream, i.e. caused by an outside event.
  Does not have to disrupt instructions that are already executing
  Interrupts are asynchronous
  Machine checks are asynchronous
SemiSynchronous (or high-availability interrupts):
  Caused by external event but may have to disrupt current instructions in order to guarantee service
Interrupt Priorities Must be Handled

Interrupted program:
    add  r1,r2,r3
    subi r4,r1,#4
    slli r4,r4,#2
      <-- Network Interrupt
Network interrupt handler (could be interrupted by disk):
    Raise priority
    Reenable All Ints
    Save registers
    lw   r1,20(r0)
    lw   r2,0(r1)
    addi r3,r0,#5
    sw   0(r1),r3
    Restore registers
    Clear current Int
    Disable All Ints
    Restore priority
    RTE
Note that priority must be raised to avoid recursive interrupts!
Interrupt Controller
[Figure: interrupt lines from devices (Timer, Network, Software Interrupt) pass through the Interrupt Mask into the Priority Encoder, which delivers Interrupt and IntID to the CPU; the CPU has an internal Int Disable flag and a Control path to the controller, plus a separate NMI line]
Interrupts invoked with interrupt lines from devices
Interrupt controller chooses interrupt request to honor
  Mask enables/disables interrupts
  Priority encoder picks highest enabled interrupt
  Software Interrupt Set/Cleared by Software
  Interrupt identity specified with ID line
CPU can disable all interrupts with internal flag
Non-maskable interrupt line (NMI) can't be disabled
Interrupt controller hardware and mask levels

Operating system constructs a hierarchy of masks that reflects some form of interrupt priority. For instance:
  Priority   Examples
  0          Software interrupts
  2          Network Interrupts
  4          Sound card
  5          Disk Interrupt
  6          Real Time clock
             Non-Maskable Ints (power)
This reflects an order of urgency to interrupts
  For instance, this ordering says that disk events can interrupt the interrupt handlers for network interrupts.
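The mask and priority encoder behavior can be sketched with bit masks. The assumptions are: bit i of `pending` and `mask` corresponds to priority level i from the table above, a higher bit means higher priority, and NMI bypasses both the mask and the CPU's disable flag. The function names and the "NMI" return value are illustrative.

```python
NMI = "NMI"


def select_interrupt(pending, mask, all_disabled=False, nmi=False):
    """Return the IntID the controller would present, or None if nothing fires."""
    if nmi:
        return NMI                       # cannot be masked or disabled
    if all_disabled:
        return None                      # CPU's internal disable flag
    enabled = pending & mask             # mask enables/disables request lines
    if enabled == 0:
        return None
    return enabled.bit_length() - 1      # priority encoder: highest set bit


def mask_while_handling(level, width=7):
    """Mask that lets only strictly higher-priority levels interrupt a handler."""
    return ((1 << width) - 1) & ~((1 << (level + 1)) - 1)


NETWORK, DISK = 2, 5
pending = (1 << NETWORK) | (1 << DISK)
print(select_interrupt(pending, mask=0b1111111))                    # disk wins
print(select_interrupt(1 << DISK, mask_while_handling(NETWORK)))    # disk preempts network
print(select_interrupt(1 << 0, mask_while_handling(NETWORK)))       # software int waits
```

Raising the mask on handler entry is what prevents a second network interrupt from re-entering the network handler.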
Can we have fast interrupts?

Interrupted program:
    add  r1,r2,r3
    subi r4,r1,#4
    slli r4,r4,#2
      <-- Fine Grain Interrupt
Interrupt handler (could be interrupted by disk):
    Raise priority
    Reenable All Ints
    Save registers
    lw   r1,20(r0)
    lw   r2,0(r1)
    addi r3,r0,#5
    sw   0(r1),r3
    Restore registers
    Clear current Int
    Disable All Ints
    Restore priority
    RTE
Pipeline Drain: Can be very Expensive
Priority Manipulations
Register Save/Restore
  128 registers + cache misses + etc.
Precise Interrupts/Exceptions
An interrupt or exception is considered precise if there is a single instruction (or interrupt point) for which:
  All instructions before that have committed their state
  No following instructions (including the interrupting instruction) have modified any state.
This means that you can restart execution at the interrupt point and "get the right answer"
  Implicit in our previous example of a device interrupt:
    Interrupt point is at first lw instruction
    add  r1,r2,r3
    subi r4,r1,#4
    slli r4,r4,#2
      <-- External Interrupt (Int handler runs here)
    lw   r2,0(r4)
    lw   r3,4(r4)
    add  r2,r2,r3
    sw   8(r4),r2
Precise Exceptions in Static Pipelines
Key observation: architected state only changes in memory and register write stages.
Precise interrupt point may require multiple PCs

            addi r4,r3,#4
            sub  r1,r2,r3
    PC:     bne  r1,there
    PC+4:   and  r2,r3,r5
            <other insts>
  Interrupt point described as <PC,PC+4>

            addi r4,r3,#4
            sub  r1,r2,r3
    PC:     bne  r1,there
    PC+4:   and  r2,r3,r5
            <other insts>
  Interrupt point described as: <PC+4,there> (branch was taken) or <PC+4,PC+8> (branch was not taken)
On SPARC, interrupt hardware produces "pc" and "npc" (next pc)
On MIPS, only "pc": must fix point in software
Why are precise interrupts desirable?

Many types of interrupts/exceptions need to be restartable. Easier to figure out what actually happened:
  I.e. TLB faults. Need to fix translation, then restart load/store
  IEEE gradual underflow, illegal operation, etc:
    e.g. Suppose you are computing: $f(x) = \frac{\sin(x)}{x}$
    Then, for $x \to 0$: $f(0) = \frac{0}{0} \Rightarrow \text{NaN} + \text{illegal\_operation}$
    Want to take exception, replace NaN with 1, then restart.
Restartability doesn't require preciseness. However, preciseness makes it a lot easier to restart.
Simplify the task of the operating system a lot
  Less state needs to be saved away if unloading process.
  Quick to restart (making for fast interrupts)
Precise Exceptions in simple 5-stage pipeline:

Exceptions may occur at different stages in pipeline (I.e. out of order):
  Arithmetic exceptions occur in execution stage
  TLB faults can occur in instruction fetch or memory stage
What about interrupts? The doctor's mandate of "do no harm" applies here: try to interrupt the pipeline as little as possible
All of this solved by tagging instructions in pipeline as "cause exception or not" and wait until end of memory stage to flag exception
  Interrupts become marked NOPs (like bubbles) that are placed into pipeline instead of an instruction.
  Assume that interrupt condition persists in case NOP flushed
  Clever instruction fetch might start fetching instructions from interrupt vector, but this is complicated by need for supervisor mode switch, saving of one or more PCs, etc
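A toy model shows why deferring the flag to the end of the memory stage restores program order. It assumes an idealized 5-stage pipeline in which instruction i enters IF in cycle i with no stalls; the function names and the sample fault pattern are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_order(fault_stage):
    """Order in which faults are first detected, by cycle.

    fault_stage[i] is the stage in which instruction i faults, or None.
    """
    events = [(i + STAGES.index(stage), i)
              for i, stage in enumerate(fault_stage) if stage is not None]
    return [i for _, i in sorted(events)]


def flag_order(fault_stage):
    """Order in which faults are flagged if each is held until end of MEM."""
    mem = STAGES.index("MEM")
    events = [(i + mem, i)
              for i, stage in enumerate(fault_stage) if stage is not None]
    return [i for _, i in sorted(events)]


# Instruction 0 takes a TLB fault in MEM, instruction 1 in IF,
# instruction 3 takes an arithmetic exception in EX.
faults = ["MEM", "IF", None, "EX"]
print(detection_order(faults))  # [1, 0, 3]: detected out of program order
print(flag_order(faults))       # [0, 1, 3]: flagged in program order
```

Since every instruction reaches the end of MEM in program order, the first flag raised always belongs to the earliest faulting instruction.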
Summary: Interrupts
Interrupts and Exceptions either interrupt the current instruction or happen between instructions
Possibly large quantities of state must be saved before interrupting

Machines with precise exceptions provide one single point in the program to restart execution
  All instructions before that point have completed
  No instructions after or including that point have completed
