
Module 1

Overview: Introduction to
Computer Architecture and
Organization
Architecture & Organization 1
 Architecture is those attributes visible to the programmer
 Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques
 e.g. Is there a multiply instruction?
 Organization is how architecture features are implemented
 Control signals, interfaces, memory technology
 e.g. Is there a hardware multiply unit or is it done by repeated addition?
Architecture & Organization 2
 All members of the Intel x86 family share the same basic architecture
 The IBM System/370 family share the
same basic architecture
 This gives code compatibility
 At least backwards
 Organization differs between different
versions
Structure & Function
 Structure is the way in which
components relate to each other
 Function is the operation of individual
components as part of the structure
Function
 All basic computer functions are:
 Data processing – process data in a variety of forms and requirements
 Data storage – short- and long-term data storage for retrieval and update
 Data movement – move data between the computer and the outside world; devices serve as the source and destination of data – this process is known as I/O
 Control – control the processing, movement and storage of data using instructions
Functional View
 Operation (a) Data movement device operation
 Operation (b) Data storage device operation – read and write
 Operation (c) Processing from/to storage
 Operation (d) Processing from storage to I/O
Structure - Top Level - 1

 Peripherals
 Computer: Central Processing Unit, Main Memory, Systems Interconnection, Input/Output
 Communication lines
Structure - Top Level - 2
 Central Processing Unit (CPU) – controls
operation of the computer and performs data
processing functions
 Main memory – stores data
 I/O – moves data between computer and
external environment
 System interconnection – provides
communication between CPU, main memory
and I/O
Structure - The CPU - 1
 CPU components: Registers, Arithmetic and Logic Unit, Internal CPU Interconnection, Control Unit
 Connected to the rest of the computer (I/O, Memory) via the system bus
Structure - The CPU - 2
 Control unit (CU) – controls the operation of the CPU
 Arithmetic and logic unit (ALU) – performs data processing functions
 Registers – provide internal storage to the CPU
 CPU interconnection – provides communication between the control unit (CU), ALU and registers
Structure - The Control Unit
 Control unit components: Sequencing Logic, Control Unit Registers and Decoders, Control Memory
 Connected to the CPU's ALU and registers via the internal bus
 Implementation of control unit – microprogrammed implementation
Module 1
Overview: Von Neumann Machine
and Computer Evolution
First Generation – Vacuum Tubes
ENIAC - background
 Electronic Numerical Integrator And Computer
 Eckert and Mauchly
 University of Pennsylvania
 Trajectory tables for weapons (range and
trajectory)
 Started 1943
 Finished 1946
 Too late for the war effort, but used to determine the feasibility of the hydrogen bomb
 Used until 1955
ENIAC - diagram
ENIAC - details
 Decimal (not binary)
 20 accumulators of 10 digits
 Programmed manually by switches
 18,000 vacuum tubes
 30 tons
 15,000 square feet
 140 kW power consumption
 5,000 additions per second
Von Neumann Machine/Turing
 Stored program concept
 Main memory storing programs and data – programs can be changed in memory, making programming easier
 ALU operating on binary data
 Control unit interpreting instructions from memory and executing them
 Input and output (I/O) equipment operated by the control unit
 Princeton Institute for Advanced Studies
 IAS computer
 Completed 1952
Structure of von Neumann
machine
Von Neumann Architecture
 Data and instruction are stored in a
single read-write memory
 The contents of this memory are addressable by location
 Execution occurs in a sequential fashion
from one instruction to the next
IAS – details-1
 1000 x 40 bit words
 Binary number – number word
 2 x 20 bit instructions – instruction word
IAS – details-2
 Set of registers (storage in CPU)
 Memory Buffer Register (MBR) – contains a word to be stored in memory or sent to the I/O unit, or receives a word from memory or from the I/O unit
 Memory Address Register (MAR) – specifies the address in memory of the word for the MBR
 Instruction Register (IR) – contains the opcode being executed
 Instruction Buffer Register (IBR) – temporary storage for an instruction
 Program Counter (PC) – contains the address of the next instruction to be fetched from memory
 Accumulator (AC) and Multiplier Quotient (MQ) – temporary storage for operands and results of ALU operations
Structure of IAS – detail-3
 Control unit – fetches instructions from memory and executes them one by one
Commercial Computers
 1947 - Eckert-Mauchly Computer Corporation - formed to manufacture computers commercially
 UNIVAC I (Universal Automatic Computer) - first commercial computer
 Used for the US Bureau of the Census 1950 calculations
 Became part of Sperry-Rand Corporation
 Late 1950s - UNIVAC II
 Faster

 More memory
IBM
 Punched-card processing equipment
 1953 - the 701
 IBM’s first stored program computer
 Scientific calculations
 1955 - the 702
 Business applications
 Led to 700/7000 series
Second Generation:
Transistors
 Replaced vacuum tubes
 Smaller
 Cheaper
 Less heat dissipation
 Solid State device
 Made from Silicon (Sand)
 Invented 1947 at Bell Labs
 William Shockley et al.
Transistor Based Computers
 Second generation machines
 More complex arithmetic and logic unit (ALU) and control unit (CU)
 Use of high level programming languages
 NCR & RCA produced small transistor
machines
 IBM 7000
 DEC - 1957
 Produced PDP-1
Transistor Based Computers
The Third Generation: Integrated Circuits
Microelectronics
 Literally - “small electronics”
 A computer is made up of gates, memory cells and
interconnections
 Data storage-provided by memory cells
 Data processing-provided by gates
 Data movement-the paths between components that are
used to move data
 Control-the paths between components that carry control signals
 These can be manufactured on a semiconductor
 e.g. silicon wafer
Integrated Circuits

 Early integrated circuits – known as small scale integration (SSI)
Moore’s Law
 Increased density of components on chip-refer chart
 Gordon Moore – co-founder of Intel
 Predicted that the number of transistors on a chip would double every year-refer chart
 Since the 1970s development has slowed a little
 Number of transistors doubles every 18 months
 Cost of a chip has remained almost unchanged
 Higher packing density means shorter electrical paths, giving higher performance
 Smaller size gives increased flexibility
 Reduced power and cooling requirements
 Fewer interconnections increase reliability
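As a rough illustration (not from the slides), the 18-month doubling rule can be turned into a projection formula; the starting count and time span below are arbitrary examples:

```python
# Illustrative sketch of the doubling rule: N(t) = N0 * 2**(t / T),
# where T is the doubling period in years (18 months = 1.5 years).
def projected_transistors(n0, years, doubling_period=1.5):
    """Project a transistor count forward under steady doubling."""
    return n0 * 2 ** (years / doubling_period)

# Starting from an arbitrary 2,300 transistors, after 15 years
# (ten doubling periods) the count grows by a factor of 1024:
print(round(projected_transistors(2300, 15)))  # 2355200
```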
Growth in CPU Transistor
Count
IBM 360 series
 1964
 Replaced (& not compatible with) 7000 series
 First planned “family” of computers
 Similar or identical instruction sets
 Similar or identical O/S
 Increasing speed
 Increasing number of I/O ports (i.e. more terminals)
 Increased memory size
 Increased cost
 Multiplexed switch structure-multiplexor
IBM 360 - images
DEC PDP-8
 1964
 First minicomputer (after
miniskirt!)
 Did not need air conditioned
room
 Small enough to sit on a lab
bench
 $16,000
 $100k++ for IBM 360

 Embedded applications & OEM-


integrate into a total system
with other manufacturers
DEC - PDP-8 Bus Structure

 Universal bus structure (Omnibus) - separate signal paths to carry control, data and address signals
 Common bus structure - controlled by the CPU
Later Generations:
Semiconductor Memory
 1950s to 1960s - memory made of rings of ferromagnetic material called cores - fast, but expensive, bulky, and with destructive read
 Semiconductor memory - by Fairchild, 1970
 Size of a single core
 i.e. 1 bit of magnetic core storage
 Holds 256 bits
 Non-destructive read
 Much faster than core
 Capacity approximately doubles each year
Generations of Computer
 Vacuum tube - 1946-1957
 Transistor - 1958-1964
 Small scale integration (SSI) - 1965 on
 Up to 100 devices on a chip
 Medium scale integration (MSI) - to 1971
 100 - 3,000 devices on a chip
 Large scale integration (LSI) - 1971-1977
 3,000 - 100,000 devices on a chip
 Very large scale integration (VLSI) - 1978-1991
 100,000 - 100,000,000 devices on a chip
 Ultra large scale integration (ULSI) - 1991 -
 Over 100,000,000 devices on a chip
Microprocessors Intel
 1971 - 4004
 First microprocessor

 All CPU components on a single chip

 4 bit design

 Followed in 1972 by 8008


 8 bit

 Both designed for specific applications

 1974 - 8080
 Intel’s first general purpose microprocessor

 Design to be CPU of a general purpose microcomputer


Pentium Evolution - 1
 8080
 first general purpose microprocessor
 8 bit data path
 Used in first personal computer – Altair
 8086
 much more powerful
 16 bit
 instruction cache, prefetches a few instructions
 8088 (8 bit external bus) used in first IBM PC
 80286
 16 MB memory addressable
 80386
 First 32 bit design
 Support for multitasking- run multiple programs at the same time
Pentium Evolution - 2
 80486
 sophisticated powerful cache and instruction pipelining

 built in maths co-processor

 Pentium
 Superscalar technique-multiple instructions executed in

parallel
 Pentium Pro
 Increased superscalar organization

 Aggressive register renaming

 branch prediction

 data flow analysis

 speculative execution
Pentium Evolution - 3
 Pentium II
 MMX technology
 graphics, video & audio processing
 Pentium III
 Additional floating point instructions for 3D graphics
 Pentium 4
 Note Arabic rather than Roman numerals
 Further floating point and multimedia enhancements
 Itanium
 64 bit
 see chapter 15
 Itanium 2
 Hardware enhancements to increase speed
 See Intel web pages for detailed information on processors
Module 1
Computer System: Designing and
Understanding Performance
Designing for Performance
Microprocessor Speed
 A microprocessor achieves its full potential speed only if it is fed a constant stream of data and instructions. Techniques used:
 Branch prediction-processor looks ahead in the instruction code
and predicts which branches (group of instructions) are likely to
be processed next
 Data flow analysis-processor analyzes instructions which are
dependent on each other’s result or data to create an optimized
schedule of instructions
 Speculative execution-use branch prediction and data flow
analysis, processor speculatively executes instructions ahead of
their program execution, holding the result in temporary
locations.
Performance Balance – Processor
and Memory
 Performance balance - adjusting the organization and architecture to compensate for the mismatch in capabilities of the various components (e.g. processor vs. memory)
 Processor speed increased
 Memory capacity increased
 Memory speed lags behind processor
speed
Performance Balance – Processor and
Memory (Performance Gap)
Performance Balance – Processor
and Memory - Solutions
 Increase number of bits retrieved at one time
 Make DRAM “wider” rather than “deeper”
 Change DRAM interface
 Include cache in DRAM chip
 Reduce frequency of memory access
 More complex and efficient cache between processor and memory
 Cache on chip/processor
 Increase interconnection bandwidth between processor and memory
 High speed buses
 Hierarchy of buses
Performance Balance: I/O
Devices
 Peripherals with intensive I/O demands-refer chart
 Large data throughput demands-refer chart
 Processors can handle this I/O processing, but the problem is moving data between processor and devices
 Solutions:
 Caching
 Buffering
 Higher-speed interconnection buses
 More elaborate bus structures
 Multiple-processor configurations
Performance Balance: I/O
Devices
Key is Balance : Designers
 2 factors:
 The rate at which performance is changing in the
various technology areas (processor, busses,
memory, peripherals) differs greatly from one type
of element to another
 New applications and new peripheral devices constantly change the nature of the demand on the system in terms of typical instruction profile and data access patterns
Improvements in Chip
Organization and Architecture
 Increase hardware speed of processor
 Fundamentally due to shrinking logic gate size
 More gates, packed more tightly, increasing clock rate
 Propagation time for signals reduced
 Increase size and speed of caches
 Dedicating part of processor chip to cache
 Cache access times drop significantly
 Change processor organization and architecture
 Increase effective speed of execution
 Parallelism
Problems with Clock Speed
and Logic Density
 Power
 Power density increases with density of logic and clock speed
 Difficulty of dissipating heat
 RC delay
 Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them
 Delay increases as the RC product increases
 Wire interconnects become thinner, increasing resistance
 Wires get closer together, increasing capacitance
 Memory latency
 Memory speeds lag processor speeds
 Solution:
 More emphasis on organizational and architectural approaches
Intel Microprocessor
Performance
Approach 1: Increased Cache
Capacity
 Typically two or three levels of cache between
processor and main memory (L1, L2, L3)
 Chip density increased
 More cache memory on chip
 Faster cache access
 Pentium chip devoted about 10% of chip area
to cache
 Pentium 4 devotes about 50%
Approach 2: More Complex
Execution Logic
 Enable parallel execution of instructions
 Two approaches introduced:
 Pipelining
 Superscalar
 (covered later on)
Diminishing Returns from
Approach 1 and Approach 2
 Internal organization of processors is very complex
 Can get a great deal of parallelism
 Further significant increases likely to be relatively modest
 Benefits from cache are reaching a limit
 Increasing clock rate runs into the power dissipation problem
 Some fundamental physical limits are being reached
New Approach – Multiple
Cores
 Multiple processors on single chip
 With large shared cache
 Within a processor, increase in performance proportional
to square root of increase in complexity
 If software can use multiple processors, doubling number
of processors almost doubles performance
 So, use two simpler processors on the chip rather than
one more complex processor
 With two processors, larger caches are justified
 Power consumption of memory logic less than processing logic
 Example: IBM POWER4
 Two cores based on PowerPC
POWER4 Chip Organization
Module 1
Computer System: Designing and
Understanding Performance
(Book: Computer Organization and Design, 3rd ed., David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers)
Introduction
 Hardware performance is often key to the effectiveness of an entire
system of hardware and software
 For different types of applications, different performance metrics may be appropriate, and different aspects of a computer system may be the most significant in determining overall performance
 Understanding how best to measure performance and limitations of
performance is important when selecting a computer system
 To understand the issues of assessing performance.
 Why a piece of software performs as it does?
 Why one instruction set can be implemented to perform better than another?
 How some hardware feature affects performance?
Defining Performance - 1

•How do we say one computer has better performance than another?
•Performance based on speed
•To take a single passenger from one point to another in the least time – Concorde
•Performance based on passenger throughput
•To transport 450 passengers from one point to another – 747
Response Time and Throughput - 2
 As an individual computer user, you are interested in reducing response time (or execution time)
 The time between the start and completion of a task
 Data center managers are often interested in increasing throughput
 The total amount of work done in a given time
 Example: What is improved with the following changes?
 Replacing the processor in a computer with a faster version – will improve response time and throughput
 Adding additional processors to a system that uses multiple processors for separate tasks (e.g., searching the web) – will improve throughput
Performance and Execution
Time - 3
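The equation this slide presents (standard in Patterson & Hennessy, and used on the following slides) relates performance to execution time:

```latex
\text{Performance}_X = \frac{1}{\text{Execution time}_X},
\qquad
\frac{\text{Performance}_A}{\text{Performance}_B}
  = \frac{\text{Execution time}_B}{\text{Execution time}_A} = n
```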
Relative Performance - 4
 If computer A runs a program in 10 seconds and computer B runs the same
program in 15 seconds, how much faster is A than B?
 We know A is n times faster than B
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = n
 Performance ratio: 15 / 10 = 1.5
 We can say
 A is 1.5 times faster than B
 B is 1.5 times slower than A

 To avoid the potential confusion between the terms increasing and decreasing,
we usually say “improve performance” or “improve execution time”
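The ratio above can be sketched in a few lines of Python (the function name is mine, not from the slides):

```python
# Performance is the reciprocal of execution time, so the speedup of
# A over B is execution_time_B / execution_time_A.
def speedup(exec_time_a, exec_time_b):
    """Return n such that A is n times faster than B."""
    return exec_time_b / exec_time_a

print(speedup(10, 15))  # 1.5 -> A is 1.5 times faster than B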
Measuring Performance
 Time – measure of computer performance
 Definition of time - Wall-clock time, response time, or elapsed time
 CPU execution time (or CPU time)
 The time the CPU spends computing for this task and does not include
time spent waiting for I/O or running other programs
 User CPU time vs. system CPU time
 Clock cycles (e.g., 0.25 ns) vs. clock rate (e.g., 4 GHz)
 Different applications are sensitive to different aspects of the performance of
a computer system
 Many applications, especially those running on servers, depend as much on I/O performance as on CPU performance, so total elapsed time measured by a wall clock is of interest
 In some application environments, the user may care about throughput,
response time, or a complex combination of the two (e.g., maximum
throughput with a worst-case response time)
CPU Execution Time
 A simple formula relates the most basic metrics (i.e., clock cycles and clock cycle time) to CPU time:
CPU Time = CPU Clock Cycles x Clock Cycle Time = CPU Clock Cycles / Clock Rate
Improving Performance
 Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We want computer B to run this program in 6 seconds, but computer B requires 1.2 times as many clock cycles as computer A for this program. What is computer B’s clock rate?
 Clock cycles for the program on A:
CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)
10 s = CPU Clock Cycles(A) / 4 GHz
10 s = CPU Clock Cycles(A) / 4 x 10^9 Hz
CPU Clock Cycles(A) = 40 x 10^9 cycles
 Clock rate for B:
CPU Time(B) = 1.2 x CPU Clock Cycles(A) / Clock Rate(B)
6 s = 1.2 x 40 x 10^9 cycles / Clock Rate(B)
Clock Rate(B) = 48 x 10^9 cycles / 6 seconds
Clock Rate(B) = 8 x 10^9 cycles / second = 8 GHz
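The worked example can be checked numerically; this sketch just replays the slide's arithmetic with illustrative variable names:

```python
clock_rate_a = 4e9                      # computer A: 4 GHz clock
cpu_time_a = 10.0                       # seconds
cycles_a = cpu_time_a * clock_rate_a    # 40e9 cycles for the program
cycles_b = 1.2 * cycles_a               # B needs 1.2x as many cycles
cpu_time_b = 6.0                        # target time on B, seconds
clock_rate_b = cycles_b / cpu_time_b    # required clock rate for B
print(clock_rate_b / 1e9)               # 8.0 -> an 8 GHz clock
```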
Clock Cycles Per Instruction (CPI)
 The execution time must depend on the number of instructions in a program and the average time per instruction
CPU Clock Cycles = Instructions for a program x Average clock cycles per instruction
 Clock cycles per instruction (CPI)
 Average number of clock cycles each instruction takes to execute
 CPI can provide one way of comparing two different implementations of the same instruction set architecture
Using Performance Equation-1
 Suppose we have two implementations of the same instruction set architecture and
for the same program. Which computer is faster and by how much?
 Computer A: clock cycle time=250 ps and CPI=2.0
 Computer B: clock cycle time=500 ps and CPI=1.2
 Say I = number of instructions for the program, find number of clock cycles for A
and B
CPU Clock Cycles(A) = I X CPI(A)
CPU Clock Cycles(A) = I X 2.0
CPU Clock Cycles(B) = I X CPI(B)
CPU Clock Cycles(B) = I X 1.2

 Compute CPU Time for A and B


 CPU Time(A) = CPU Clock Cycles(A) X Clock Cycle Time(A)
 CPU Time(A) = I X 2.0 X 250 ps = I X 500 ps
 CPU Time(B) = CPU Clock Cycles(B) X Clock Cycle Time(B)
 CPU Time(B) = I X 1.2 X 500 ps = I X 600 ps
Using Performance Equation-2
 Clearly A is faster. The amount faster is the ratio of execution times:
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = (I x 600 ps) / (I x 500 ps) = 1.2
 We can conclude that A is 1.2 times faster than B for this program
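Since the instruction count I cancels out, the comparison can be replayed per instruction; the variable names below are mine:

```python
# Time per instruction = CPI * clock cycle time (in picoseconds).
cpi_a, cycle_time_a = 2.0, 250   # computer A
cpi_b, cycle_time_b = 1.2, 500   # computer B
time_a = cpi_a * cycle_time_a    # 500 ps per instruction
time_b = cpi_b * cycle_time_b    # 600 ps per instruction
print(time_b / time_a)           # 1.2 -> A is 1.2 times faster
```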
Basic Performance Equation
 Basic performance equation:
CPU Time = Instruction Count x CPI x Clock Cycle Time = Instruction Count x CPI / Clock Rate
 Instruction count can be measured by using software tools that profile the execution or by using a simulator of the architecture
 Hardware counters, which are included on many processors, can alternatively be used to record a variety of measurements
 Number of instructions executed
 Average CPI
 Sources of performance loss
Basic Components of Performance
Measuring the CPI
 Sometimes it is possible to compute the CPU clock cycles by looking at the different types of instructions and using their individual clock cycle counts:
CPU Clock Cycles = sum over i = 1 to n of (CPIi x Ci)
 Ci = count of the number of instructions of class i executed
 CPIi = average number of cycles per instruction for that instruction class
 n = number of instruction classes
 Remember that the overall CPI for a program will depend on both the number of cycles for each instruction type and the frequency of each instruction type in the program execution
Factors Affecting CPU Performance
Comparing Code Segments-1
 A compiler designer is trying to decide between 2 code sequences for a particular computer. Instruction classes A, B and C have CPIs of 1, 2 and 3 respectively; Sequence 1 uses 2, 1 and 2 instructions of classes A, B and C, while Sequence 2 uses 4, 1 and 1
 Which code sequence executes the most instructions?
 Which will be faster? What is the CPI for each sequence?
Comparing Code Segments-2
 Sequence 1: Instruction Count(1) = 2+1+2 = 5 instructions
 Sequence 2: Instruction Count(2) = 4+1+1 = 6 instructions
 CPU Clock Cycles(1) = (2x1)+(1x2)+(2x3) = 10 cycles
 CPU Clock Cycles(2) = (4x1)+(1x2)+(1x3) = 9 cycles
 So code Sequence 2 is faster, even though it executes 1 extra instruction
 Code Sequence 2 uses fewer clock cycles, so it must have a lower CPI
 CPI = CPU Clock Cycles / Instruction Count
 CPI(1) = CPU Clock Cycles(1) / Instruction Count(1) = 10/5 = 2
 CPI(2) = CPU Clock Cycles(2) / Instruction Count(2) = 9/6 = 1.5
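The per-class computation can be sketched as follows (the helper function and its name are my own, not from the slides):

```python
# CPU clock cycles = sum over classes of (CPI_i * C_i).
def clock_cycles(counts, cpis):
    """Total cycles given per-class instruction counts and CPIs."""
    return sum(c * cpi for c, cpi in zip(counts, cpis))

cpis = [1, 2, 3]                       # classes A, B, C
seq1, seq2 = [2, 1, 2], [4, 1, 1]      # instruction counts per class
cycles1 = clock_cycles(seq1, cpis)     # 10 cycles, 5 instructions
cycles2 = clock_cycles(seq2, cpis)     # 9 cycles, 6 instructions
print(cycles1 / sum(seq1))             # 2.0  (CPI of sequence 1)
print(cycles2 / sum(seq2))             # 1.5  (CPI of sequence 2)
```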
Module 1
Computer System: Computer
Components and Bus
Interconnection Structure
Von Neumann Architecture
 Data and instruction are stored in a
single read-write memory
 The contents of this memory are addressable by location
 Execution occurs in a sequential fashion
from one instruction to the next
Program Concept
 Hardwired program-connecting/combining
various logic components to store data and
perform arithmetic and logic operations
 Hardwired systems are inflexible
 General purpose hardware can do different
tasks, given correct control signals
 Instead of re-wiring, supply a new set of
control signals
What is a program?
 A sequence of steps
 For each step, an arithmetic or logical
operation is done
 For each operation, a different/new set of
control signals is needed
 For each operation a unique code is provided
 e.g. ADD, MOVE
 A hardware segment accepts the code and
issues the control signals
Hardware and Software
Approaches
Components
 The Control Unit and the Arithmetic and Logic
Unit constitute the Central Processing Unit
 Data and instructions need to get into the
system and results out
 Input/output
 Temporary storage of code and results is
needed
 Main memory
Computer Components:
Top Level View
Interconnection Structures
 All the units (processor, memory and
I/O components) must be connected –
interconnection structure
 Different type of connection for
different type of unit
 Memory
 Input/Output
 CPU
Computer Modules
Memory Connection
 Receives and sends data
 Receives addresses (of locations)
 Receives control signals
 Read
 Write
 Timing
Input/Output Connection(1)
 Similar to memory from computer’s
viewpoint
 Output
 Receive data from computer
 Send data to peripheral
 Input
 Receive data from peripheral
 Send data to computer
Input/Output Connection(2)
 Receive control signals from computer
 Send control signals to peripherals
 e.g. spin disk
 Receive addresses from computer
 e.g. port number to identify peripheral
 Send interrupt signals (control)
CPU Connection
 Reads instruction and data
 Writes out data (after processing)
 Sends control signals to other units
 Receives (& acts on) interrupts
Buses Interconnection
 There are a number of possible
interconnection systems
 Single and multiple BUS structures are
most common
 e.g. Control/Address/Data bus (PC)
 e.g. Unibus (DEC-PDP)
What is a Bus?
 A communication pathway connecting two
or more devices
 Usually broadcast/shared medium
 Often grouped
 A number of channels in one bus
 e.g. 32 bit data bus is 32 separate single bit
channels
 Power lines may not be shown
Bus Interconnection Scheme
Data Bus
 Carries data
 Remember that there is no difference
between “data” and “instruction” at this
level
 Width is a key determinant of
performance
 8, 16, 32, 64 bit
Address bus
 Identify the source or destination of data
 e.g. CPU needs to read an instruction
(data) from a given location in memory
 Bus width determines maximum memory
capacity of system
 e.g. 8080 has 16 bit address bus giving 64k
address space
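The relationship between address bus width and address space is a power of two; a tiny sketch (the function name is illustrative):

```python
def address_space(bus_width_bits):
    """Number of distinct locations a bus of this width can address."""
    return 2 ** bus_width_bits

print(address_space(16))  # 65536, i.e. the 8080's 64K address space
```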
Control Bus
 Control and timing information for the data and address lines
 Typical control lines:
 Memory read/write signal
 Interrupt request
 Clock signals
 Bus grant/request
Big and Yellow?
 What do buses look like?
 Parallel lines on circuit boards
 Ribbon cables
 Strip connectors on mother boards
 e.g. PCI
 Sets of wires
Physical Realization of Bus
Architecture
Single Bus Problems
 Lots of devices on one bus leads to:
 Propagation delays
 Long data paths mean that co-ordination of
bus use can adversely affect performance
 If aggregate data transfer approaches bus
capacity
 Most systems use multiple buses to
overcome these problems
Traditional (ISA)
(with cache)
High Performance Bus
PCI Bus
 Peripheral Component Interconnect
 Intel released to public domain
 32 or 64 bit
 64 data lines
 High bandwidth, processor independent
bus
PCI Bus
