Advanced Computer Architecture (ACA) /lecture: LV64-446, Module MV5.1


SS 2011
Prof. Dr.-Ing. Dietmar P. F. Möller
Lecture Organization

1. Local and Global Concepts for Processors (Module MV 5.1)
1.1 Digital Computers History
1.2 Introduction
1.3 Computer Programs and Instructions
1.4 Processor Fetch, Decode and Execute Cycle
1.5 VHDL Processor Model
1.6 Simulation of the VHDL Processor Model
1.7 More
1.1 Digital Computers History

Development of digital computers: the development of processor-based digital computers was a stepwise process that became technologically more and more costly and led from the 1st generation of 4-bit processors to today's 64-bit processor generation.

1st Generation Processors: In 1971 Intel (Intel = Integrated Electronics) introduced the 4004, a 4-bit CPU that can be viewed as the origin of this development. This 4-bit CPU had an ALU and a bi-directional data bus 4 bits wide, while the address bus was 12 bits wide. The instruction set encompassed 45 commands, and the arithmetic was carried out in BCD (Binary Coded Decimal), since the primary application area for these processors was desktop calculators.

The 1st generation of microprocessors dealt with the integration of functions on one chip.

The Intel 4004 was the 1st µP. Other examples are the RCA 1802 and the MOS Technology 6502, which was based on Motorola's 6800.
1st Generation Processors:
INTEL 4004: a 4-bit µP containing about 2,300 transistors, capable of processing 4 bits of information at a time at a rate of ~0.06 MIPS; market launch 1971. Two patents cover Intel's µP 4004:
- Patent # 3,821,715, Memory System for a Multi-Chip Digital Computer, in the names of Ted Hoff, Stan Mazor and Federico Faggin
- Patent # 3,753,01, Power supply settable, bi-stable circuit, in the name of Federico Faggin
1st Generation Processor Intel 4004:
[Figure: Intel 4004 in white/gold CerDIP, 16-pin, and its architecture]
[Figure: Intel 4004 microchip]
Source: http://www.intel.com/museum/archives/4004.htm
2nd Generation Processors:
Intel 8008: the 1st 8-bit µP, fabricated in PMOS technology and architecturally different from the 4004, with a 14-bit-wide address range and 8-bit-wide data words, capable of processing 8 bits of information at a time at a rate of ~0.8 MHz; market launch 1972. The instruction set encompassed 48 commands. The reason for the 8008 was the necessity to develop a keyboard controller that handled 7 or 8 bits, which was twice the 4004's processing range.

Scopes of the 2nd generation are:
- increase of the address space up to 16 bits (65,536 memory locations),
- optimizing the clock cycles of the instructions,
- adding interrupt facilities through interrupt requests and subroutines.

These features were not available at the market launch of the Intel 8008; they became available with the Intel 8080 (1974). The 8008 was an important transition CPU for Intel: work on the 8008 enabled the creation of the 8080 (which included the 8008 instruction set).
2nd Generation Processor Intel 8008:
[Figures: 8008 chip mounted in "C" package; 8008 close-up; C8008 in gray CerDIP, 18-pin]
Source: http://www.antiquetech.com/chips/8008.htm
2nd Generation Processors:
In 1974 NMOS technology (2 supply voltages) was introduced; in some special cases CMOS technology was also available. The 1st NMOS processor was the 6800 from Motorola. In 1976 the Intel 8080A became available in NMOS technology, followed later by the 8085, which was not a true successor; hence the 8086 was developed, which expanded the 8080 design from 8 to 16 bits. This generation is therefore the origin of two different processor lines: the 8080 led to the 80386, and the 6800 led to the 68030.

Summarizing the 2nd generation:
- introduction of the 8-bit word length,
- handling of asynchronously generated interrupts, which are important for process control,
- expansion of the instruction sets and addressing modes up to an internationally accepted standard.
3rd Generation Processors:

The development of processors and controllers was separated. Partly because of this, the 16-bit processing range was introduced and the existing 8-bit technology was improved.
The 16-bit technology was introduced in 1973 with the IMP-16 processor from National Semiconductor. In 1975 the 16-bit one-chip processor PACE was introduced in PMOS technology, as well as its NMOS version, the INS8900. The big benefit of the 16-bit word length is that the optimization strategies needed for solutions based on the 8-bit word length are no longer in focus, while performance and speed become much more important. This is reflected in the 16-bit variables in the definition of C, based on variable concepts such as SHORT INTEGER or INTEGER.

In 1976 the TMS9900 from Texas Instruments was introduced, in 1978 the 8086 from Intel, the first commonly used 16-bit processor, and in 1979 the Z8000 from Zilog and the 68000 from Motorola.
[Figures: TMS9900JDL in ceramic package; 8086 in 40-pin DIP package; Zilog Z8002 in 40-pin DIP package; Motorola MC68000 in 64-pin DIP package]

The TMS9900 was one of the first 16-bit processors, with a 15-bit address bus, a 16-bit data bus, and 3 internal 16-bit registers (PC, WP, and ST). With the introduction of the HMOS process (High Density Metal Oxide Semiconductor) in 1978, Intel's 16-bit processor 8086 had twice the packaging density of transistor elements compared with previously introduced NMOS processors. With the 8088, the successor of the 8086 released in 1979, Intel brought a processor to market whose distribution was immense thanks to the success of IBM's PC XT family. The Z8000 and the 68000 competed with the Intel 8086. But it was the Intel 8088, based on the 8086 with an external 8-bit bus and internally fully compatible with the 8086, that became the first commonly used µP within the IBM PC line.

The Zilog Z8000 and the Motorola 68000, both launched in 1979, competed with the Intel 8086, which gave rise to the x86 architecture. Originally the 8086 was intended as a temporary substitute for the iAPX 432 project, in an attempt to draw attention away from the less-delayed 16- and 32-bit processors of other manufacturers. The 68000 family from Motorola was designed as a brand-new CISC processor core without any compatibility constraints, while Intel chose a soft transition that preserved compatibility between the processor lines of the 8- and 16-bit world. This shows in the instruction sets and programming models used, e.g. the segmentation of memory into 64 KByte segments. The 68000 implements a 24-bit address bus, allowing it to address up to 16 MB of physical memory. The members of the 68000 family introduced a new bus system, the asynchronous bus, which includes an indication of the CPU phase concept.
3rd Generation Processor Intel 8086:
[Figure: Intel 8086, 40-pin DIP (dual in-line package)]
[Figure: Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm) in its packaging]
Source: http://en.wikipedia.org/wiki/Microprocessor

3rd Generation Processors: Summarizing the 3rd generation:
- no standardized development: new designs of 16-bit microprocessors and further development of 8-bit microprocessors,
- the 16-bit microprocessors showed operating system capability for the first time, especially with the multitasking concept,
- the 8-bit microprocessors became an industrial standard, which was expanded by integrating peripherals like serial and/or parallel interfaces,
- the very first steps of prototyping the microcontroller took place.

4th Generation Processors:
The 4th generation of processors was initiated in 1981 by the Intel iAPX 432, the first 32-bit processor, which was manufactured in HMOS technology as a 3-chip solution with 219,000 transistors, around 8 times more than the 8086 with 29,000 transistors. The address space was 16 MByte, like that of the 68000 from Motorola. The real step towards the 4th generation of processors was taken in 1984 with the Intel 80386, the Motorola 68020, and the National Semiconductor 32032, and was continued by the succeeding types 80486, 68030, 68040, 32332 and 32532. The 4th generation had a 32-bit standard word length and, above all, integrated functions for supporting operating systems, such as virtual memory management, cache memory, cache controllers, the concept of the virtual machine, the task switching concept to support multitasking in particular, and the so-called authorization concept.

The first generation of digital signal processors was introduced on the market concurrently with the fourth generation, due to a swiftly growing specialized market for signal processing systems. One of these signal processors was the Motorola DSP 56000, a 56-bit signal processor with an external 24-bit data bus.
http://www.soe.uoguelph.ca/webfiles/rdony/ENGG3390/DSP56000.pdf

Summarizing the 4th generation:
- 32-bit word width,
- support of operating system capability, especially the multitasking/multiuser concept,
- design of the 1st generation of digital signal processors with integer arithmetic and 56-bit word width.

The technological developments that led to the 5th generation of processors were essentially the expansion of the word length to 64 bits and the optimization of instruction cycle times, either by adopting superscalar structure concepts, which allow the execution of more than one instruction per clock cycle, or by the RISC concept with internal parallelism and pipelining. Processors of this generation also feature integer and floating-point units that can work concurrently.
The prominent exponent of this 5th generation of processors was the Pentium processor, launched in 1993.
Summarizing the 5th generation:
- improvement of the instruction times,
- embedding of additional units on the chip.

Through further technological developments, the 6th generation's processing speed increased drastically, so that clock rates > 1 GHz have become a matter of course today. This also results in better instruction execution times. These multimedia processors are marked by the integration of extra units on the chip. The development of high-performance processors, in particular for use in mobile systems like laptops, led to marked shifts within individual market segments. There are many prominent exponents of the 6th generation from Intel, such as:
- Pentium Pro, launched in 1995
- Pentium II, launched in 1997
- Pentium MMX, launched in 1997
- Celeron, launched in 1998
- Pentium III, launched in 1999

- Intel Pentium Pro Processor (1995): designed to fuel 32-bit server and workstation applications, enabling fast CAD, mechanical engineering and scientific computation; packaged together with a second, speed-enhancing cache memory chip; 5.5 million transistors.
- Intel Pentium II Processor (1997): designed specifically to process video, audio and graphics data efficiently; 7.5 million transistors; incorporates MMX technology.
- Intel Pentium II Xeon Processor (1998): designed to meet the performance requirements of mid-range and higher servers and workstations.
- Intel Celeron Processor (1998): designed for the value PC market segment; provides high performance at an exceptional price; delivers excellent performance for uses such as gaming and educational software.
- Intel Pentium III Processor (1999): designed to significantly enhance Internet applications, allowing users to browse through realistic online museums and stores and to download high-quality video; incorporates 9.5 million transistors and 70 new instructions; 0.25-micron technology was introduced.

Processor performance increase through clock cycle

Name                                           Clock Cycle   Application Area
Intel Celeron & AMD Duron                      ~ 2 GHz       cheap and fast processors for office applications
Intel Mobile Pentium 4 & AMD Opteron           ~ 2.5 GHz     processors for notebooks with low power
Intel Pentium 4 & Intel Xeon & AMD Athlon XP   ~ 3 GHz       high performance processors for CAD, server, games, etc.
Intel 8086 (1986)                              4.66 MHz      DOS word processing, accounting PC
Moore's Law

Its impact on the last 6 generations of µPs

Gordon Moore, co-founder of Intel, observed (1965) that, due to the acceleration of technology development:
- computing power doubles every 18 months,
- bandwidth doubles twice as fast,
- connections grow exponentially with each additional node, so that today most people are enmeshed in innumerable networks.
[Figure: results of Moore's Law, 1975 to 2005. Scales: transistors (10^4 T to 10^9 T), memory capacity (KB, MB, GB) and clock speed (1 MHz to 100 GHz). Clock speed refers to the cycles per second of the main clock of the CPU. Annotation: 2 ... 3 GHz CPU clock speed; faster clock speeds are elusive.]
If the clock speed (GHz) were a car, then the cache would be the traffic light: no matter how fast the car goes, it cannot get through unless the traffic light is green. The more speed and the more cache are available, the faster the processor is.
Clock speeds > 2 GHz are elusive
Distance travelled during one clock cycle:
- light in vacuum, assuming 1 GHz: ca. 30 cm
- light in vacuum, assuming 10 GHz: 3 cm
- signal in a semiconductor, assuming 10 GHz: 3 mm
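The distances above follow from dividing the propagation speed by the clock frequency. A minimal sketch in Python, assuming the vacuum speed of light for the first two cases; the factor 0.1 used for the semiconductor case is an assumption chosen only because it reproduces the 3 mm figure from the slide, and the function name is hypothetical:

```python
# Distance a signal can travel during one clock period.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum


def distance_per_cycle_cm(clock_hz, relative_speed=1.0):
    """Return the distance in centimetres covered in one clock cycle.

    relative_speed is the signal speed as a fraction of c (1.0 = vacuum).
    """
    period_s = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * relative_speed * period_s * 100.0


for clock_hz in (1e9, 10e9):
    print(f"{clock_hz / 1e9:.0f} GHz: {distance_per_cycle_cm(clock_hz):.1f} cm in vacuum")

# Assumption: a signal speed of 0.1 c matches the slide's 3 mm at 10 GHz.
print(f"10 GHz at 0.1 c: {distance_per_cycle_cm(10e9, 0.1) * 10:.1f} mm")
```

At 10 GHz a signal therefore cannot cross more than a few millimetres of chip within one cycle, which is one reason why clock speed cannot be raised arbitrarily.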
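Facts limiting clock speed: time delay through micro-wires (wire time delay)

[Figure: metal wire (Al, Cu) on an isolator (SiO2) above the substrate (semiconductor); wire and substrate form a capacitor with the isolator as dielectric]

Capacity of the wire (area $F$, isolator thickness $d$, relative permittivity $\varepsilon_r$ of the isolator):
$C = \varepsilon_0 \varepsilon_r \frac{F}{d}$

Resistance of the wire (length $l$, cross-section $A$, resistivity $\rho$):
$R = \rho \frac{l}{A}$

Voltage at the capacitor over time:
$U_C = U_0 \left(1 - e^{-\frac{t}{RC}}\right)$

[Figure: $U$ over $t$; the rise of $U_C$ is governed by the time constant $RC$, compared with the ideal step for $R \to 0$]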
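The charging relation $U_C = U_0 (1 - e^{-t/(RC)})$ can be evaluated numerically to see how the product $RC$ delays a signal on a wire. The following is only a sketch; the values of R and C are hypothetical and are not taken from the lecture:

```python
import math


def capacitor_voltage(u0, t, r, c):
    """U_C(t) = U_0 * (1 - exp(-t / (R * C))) for a step input U_0."""
    return u0 * (1.0 - math.exp(-t / (r * c)))


def time_to_reach(fraction, r, c):
    """Time until U_C reaches the given fraction (between 0 and 1) of U_0."""
    return -r * c * math.log(1.0 - fraction)


R = 100.0   # ohms, hypothetical wire resistance
C = 1e-12   # farads, hypothetical wire capacitance
tau = R * C  # time constant RC in seconds

# After one time constant the voltage has reached about 63 % of U_0.
print(capacitor_voltage(1.0, tau, R, C))
# Reaching 50 % of U_0 takes ln(2), about 0.69, time constants.
print(time_to_reach(0.5, R, C) / tau)
```

A larger R or C stretches the rise time, so the receiving gate sees a valid level later and the usable clock period grows.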
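Facts limiting clock speed: wire time delay

Assuming half of the structure length, $l_{new} = \frac{l}{2}$:

Resistance of the reduced wire length, with $A_{new} = \frac{A}{2}$:
$R_{new} = \rho \frac{l_{new}}{A_{new}}$

Capacity of the reduced wire length:
$C_{new} = \varepsilon_0 \varepsilon_r \frac{F_{new}}{d} = \varepsilon_0 \varepsilon_r \frac{F}{4d}$

[Figure: wire and wire (new) with $l_{new}$ and $F_{new}$]

Doubling the number of components decreases the structural area by a factor of 2.

Resistance of the reduced wire length:
$R_{new} = \rho \frac{l_{new}}{A_{new}} = \rho \frac{l / \sqrt{2}}{A / \sqrt{2}} = \rho \frac{l}{A}$

Capacity of the reduced wire length:
$C_{new} = \varepsilon_0 \varepsilon_r \frac{F_{new}}{d} = \varepsilon_0 \varepsilon_r \frac{F}{2d}$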

Through further technological developments, the 7th generation's processing performance increased so that clock rates > 1 GHz are state of the art today, but in combination with a dual-core kernel. This results in better instruction execution times.
These CPUs can execute multiple instructions per clock cycle, which dramatically speeds up a program. Other factors influence speed, like the mix of functional units, bus speeds, available memory/cache, the length of the pipeline with 32 kernels, and the type and order of instructions in the programs being run. Hyper-Threading Technology (HTT) was developed, which adds internal data processing resources with a second register set and independent I/O logic. To the operating system, such a processor appears as a dual-core architecture.
Dual Core

Through further technological developments, the 8th generation's processing performance increased so that clock rates > 1 GHz are state of the art today, but in combination with a multi-core kernel. This results in better instruction execution times. The problem with this processor kernel is the disproportionate increase of leakage currents with increased clock speed. The Pentium 4 Extreme Edition consumes 130 W and cannot be cooled sufficiently to allow the higher clock speed that was originally announced. Henceforth, the future lies in more units working in parallel, as in the AMD Athlon, which has a lower clock speed.
Multi Core, launched in 2006
Impact of Moore's Law / Cache Memory

[Figure: Moore's Law, 1975 to 2005. Scales: transistors (10^4 T to 10^9 T), memory capacity (KB, MB, GB) and clock speed (frequency, 1 MHz to 100 GHz). Annotation: 2 ... 3 GHz CPU, 300 ... 400 MHz memory.]

Cache Memory
[Figure: memory hierarchy: CPU registers, caches, main memory, bulk memory (disc), archive memory (streamer); axes: capacity and access time]

Moore's Law
Year   Transistors
2007   2 * 10^9
2009   4 * 10^9
2011   8 * 10^9
2013   16 * 10^9
2015   32 * 10^9
2017   64 * 10^9
2019   128 * 10^9
2021   256 * 10^9
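The transistor figures listed for 2007 to 2021 correspond to a doubling every two years starting from 2 * 10^9. A small sketch that regenerates this projection; the function name and the fixed two-year doubling period are assumptions read off the listed values, not a statement about real devices:

```python
def transistor_projection(start_year, start_count, end_year, doubling_years=2):
    """Return (year, transistor count) pairs, doubling every doubling_years."""
    table = []
    year, count = start_year, start_count
    while year <= end_year:
        table.append((year, count))
        year += doubling_years
        count *= 2
    return table


for year, count in transistor_projection(2007, 2 * 10**9, 2021):
    print(year, f"{count // 10**9} * 10^9")
```

Changing doubling_years to 1.5 would model the 18-month doubling quoted for computing power instead.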
Architecture of the future

Static (fast) memories substitute slow dynamic memories:
- floor space gain in processors
- higher speed
- less electrical power
- higher floor space required
(Intel)

Cache Memory
Memory access: the cache is a small but fast memory, close to the processor, that keeps a copy of parts of the memory.
- Processor and cache: fast / expensive technology, normally on the same chip
- Memory: slower / cheaper technology, normally on different chips

Summary:
- Standard processors: have standard bus systems and do not have instruction sets optimized for specialized applications. They are adapted to the respective application domain with the help of suitable programs.
- Special processors: among these are digital signal processors, which are specialized for the digital processing of analog signals in optimal time, as well as slave processors and coprocessors, which have well-defined instruction sets suited to the problem and to special tasks like arithmetic or graphics within the flow of a program.
- Controllers: have special interfaces, outside of the normal processor busses, for handling certain recurring tasks, e.g. a keyboard controller.
- Microcomputers: connect the processor (central processing unit) with data and program memory, real-time clocks, AD and DA converters, counters and interfaces to external peripheral devices. The address, data and control bus serves as the connection between the components.
- Single-chip microcomputers: a special case of the general microcomputer, with clock generator, memory, interfaces, counters etc. all integrated on one chip.
- Single-board computers: a further special case of the microcomputer, in which all microcomputer components are completely laid out on one board.
- Microcomputer systems: freely programmable systems that are built from a microcomputer and its peripheral devices.
1.2 Introduction
Digital computers consist of three main components:
- the processor or central processing unit (CPU),
- memory that stores program instructions and data,
- input/output hardware that communicates with other devices.
The CPU consists of two main components:
- the arithmetic logic unit (ALU),
- the control unit (CU).
The CU is a complex state machine that controls the internal operation of the digital processor.

CPU, CU, memory, and I/O are linked by an electrical highway, a communication interface for data transmission, called the bus. Typically, signals on the bus include:
- memory address,
- memory data,
- bus status.
Bus status signals indicate the current bus operation: memory read (MR), memory write (MW), or input/output operation (I/O).

Abbreviations:
PC: Program Counter
IR: Instruction Register
AC: Accumulator
ALU: Arithmetic Logic Unit
MAR: Memory Address Register
MDR: Memory Data Register
Internally, the CPU contains a small number of registers, built from so-called D flip-flops, which are used to store data inside the processor.
Remember: D flip-flop (see LV 18.003, Module IP 7.3). The output always takes on the state of the D input at the moment of a rising clock edge, and never at any other time. The flip-flop is called a D flip-flop for this reason: the output takes the value of the D input, or data input, and delays it by one clock count. The D flip-flop can be interpreted as a primitive memory cell.
[Figure: symbol of the D flip-flop]
Truth table ('X' denotes a don't-care condition, meaning the signal is irrelevant):
Clock         D   Q          Qprev
Rising edge   0   0          X
Rising edge   1   1          X
Non-rising    X   constant
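The edge-triggered behaviour of the truth table can be sketched as a small behavioural model in Python. The class and method names are hypothetical and the model ignores setup and hold times:

```python
class DFlipFlop:
    """Behavioural model of a rising-edge-triggered D flip-flop."""

    def __init__(self, q=0):
        self.q = q            # stored output value Q
        self._prev_clk = 0    # last clock level seen, used to detect edges

    def tick(self, clk, d):
        """Apply the clock and data levels and return the output Q."""
        if self._prev_clk == 0 and clk == 1:  # rising clock edge
            self.q = d                        # Q takes the value of D
        # On any non-rising clock condition Q stays constant.
        self._prev_clk = clk
        return self.q


ff = DFlipFlop()
stimulus = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]  # (clock, D) pairs
trace = [ff.tick(clk, d) for clk, d in stimulus]
print(trace)  # [0, 1, 1, 1, 0]
```

Q changes only at the second and fifth stimulus, the two rising edges; the change of D while the clock is high is ignored.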
Modern processors contain one or more arithmetic logic units (ALU) inside the CPU. The ALU is used to perform arithmetic and logical operations on data values.
ALU operations include (at minimum):
- add,
- subtract,
- logical and/or operations,
- shift.
Register-to-bus connections are hard-wired for simple point-to-point connections. If one of several registers can drive the bus, the connections are constructed using:
- multiplexing,
- open collector outputs,
- tri-state outputs.
1.3 Computer Programs and Instructions

A computer program is a sequence of instructions that performs a desired operation. Instructions are stored in memory.
Example 1: An instruction may contain 16 bits. The high eight bits of the instruction contain the op code, which specifies the operation, such as add, sub, etc., that will be performed by the instruction. Typically, an instruction sends one set of data values through the ALU to perform this operation. The low eight bits of each instruction contain a memory address field.
Bits 15 to 8: op code    Bits 7 to 0: address
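The 16-bit instruction format of Example 1 can be packed and unpacked with shifts and masks. A minimal sketch; the function names and the example values 0x02 and 0x01 are arbitrary illustrations:

```python
def encode(op_code, address):
    """Pack an 8-bit op code (bits 15..8) and an 8-bit address (bits 7..0)."""
    if not (0 <= op_code <= 0xFF and 0 <= address <= 0xFF):
        raise ValueError("op code and address must each fit in 8 bits")
    return (op_code << 8) | address


def decode(word):
    """Split a 16-bit instruction word into (op code, address)."""
    return (word >> 8) & 0xFF, word & 0xFF


word = encode(0x02, 0x01)
print(f"{word:04X}", decode(word))  # 0201 (2, 1)
```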

Depending on the op code, the address may point to a data location or to the location of another instruction, as shown in the following example.
Example 2: Basic Processor Instructions
Instruction Mnemonic   Operation Performed                       Hypothetical op code value
ADD address            AC = AC + content of memory address       00
STORE address          content of memory address = AC            01
LOAD address           AC = content of memory address            02
JUMP address           PC = address                              03
JNEG address           If AC < 0 then PC = address               04

Example 3.1: A computer program in assembly language and machine language to compute A = B + C may have a sequence of 3 instructions:
Assembly Language   Machine Language
LOAD B              0201
ADD C               0002
STORE A             0103
Program variables A, B, and C are typically stored in dedicated memory locations. The symbolic representation of the instructions (shown in the first column) is called assembly language. The representation derived from the assembly language program (shown in the second column) is called machine language; it is the binary pattern that is actually loaded into the computer's memory. Machine language can be derived using the given instruction format. First, the op code of each instruction, as listed in Example 2, provides the first two hexadecimal digits of the machine instruction. Second, the data values of the variables are assigned to memory addresses; in the machine code shown, B, C, and A are stored at the hexadecimal addresses 01, 02, and 03. The address provides the last two hexadecimal digits of each machine instruction.
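The translation from assembly language to machine language in Example 3.1 can be sketched as a tiny assembler. The op code values are those of Example 2; the symbol addresses (B = 01, C = 02, A = 03) are chosen so that the output matches the machine language column of Example 3.1, and the function name is hypothetical:

```python
# Op code values from Example 2.
OP_CODES = {"ADD": 0x00, "STORE": 0x01, "LOAD": 0x02, "JUMP": 0x03, "JNEG": 0x04}


def assemble(lines, symbols):
    """Translate 'MNEMONIC operand' lines into 16-bit machine words."""
    words = []
    for line in lines:
        mnemonic, operand = line.split()
        # The operand is either a symbolic variable name or a hex address.
        address = symbols[operand] if operand in symbols else int(operand, 16)
        words.append((OP_CODES[mnemonic] << 8) | address)
    return words


program = ["LOAD B", "ADD C", "STORE A"]
symbols = {"B": 0x01, "C": 0x02, "A": 0x03}  # assumed variable addresses
print(" ".join(f"{word:04X}" for word in assemble(program, symbols)))
# 0201 0002 0103
```

A real assembler would also assign the symbol addresses itself and place data after the last instruction; both steps are left out here.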

Example 3.2: The assignment of data must not conflict with instruction addresses. Under normal circumstances, data are stored in memory after all of the instructions of the program. Assume that the program starts at address 0; the three instructions of program example 3.1 will then use the memory addresses 0, 1, and 2. The instructions of program example 3.1 all perform data operations and execute in strictly sequential order. Instructions such as JUMP and JNEG are used to transfer control to a different address. Jump and branch instructions do not execute in strictly sequential order; they are used to implement control structures such as an IF ... THEN statement or program loops.
Assemblers: computer programs that automatically convert the symbolic assembly language into the binary machine language.
Compilers: programs that automatically translate higher-level languages, such as C, C++, etc., into a sequence of machine instructions.
1.4 Processor Fetch, Decode and Execute Cycle
The processor reads, or fetches, an instruction from memory, decodes the instruction to determine what operations are required, and then executes the instruction:
Fetch next instruction → Decode instruction → Execute instruction
Implementing the fetch, decode, and execute cycle requires several register transfer operations and clock cycles. A specific state machine (the control unit, CU) controls the sequence of operations within the processor. The PC contains the address of the current instruction. To fetch the next instruction from memory, the processor must increment the PC. The processor must then send the address in the PC to memory over the bus by loading the MAR and start a memory read operation on the bus. The instruction data will appear on the memory data bus lines and will be latched into the MDR. Because execution of the instruction may require an additional memory cycle, the instruction is saved in the CPU's IR. Using the value in the IR, the instruction can now be decoded. Execution of the instruction will require additional operations in the CPU, e.g. additional memory operations.
[Figure: Instruction pathway within the CPU. German labels in the figure: SAR = MAR, SPR = MDR, Steuerwerk = control unit, Speicher = memory]

The CPU contains a general purpose register, the accumulator AC, and the PC. The AC is the primary register used to perform data calculations and to hold temporary program data in the processor. After completing the execution of an instruction, the processor begins the cycle again by fetching the next instruction. The fetch, decode, and execute cycle is implemented in a computer using a sequence of register transfer operations. The next instruction is fetched from memory with the following register transfer operations:
MAR = PC
Read Memory, MDR = Instruction value from memory
IR = MDR
PC = PC + 1

After this sequence of register transfer operations, the current instruction is held in the IR. This instruction is one of the possible machine instructions such as ADD, LOAD, STORE, etc.
The op code field is tested to decode the specific machine instruction.

The address field of IR contains the address of possible data operands. Using the address field, a memory read is started in the decode phase.

The decode state transfers control to one of several possible next states based on the op code value. Each instruction requires a (short) sequence of register transfer operations to implement or execute that instruction. These register transfer operations are then performed to execute the instruction.
When execution of the current instruction is completed, the cycle repeats by starting a memory read operation (MR) and returning to the fetch state. A state machine, the so-called control unit (CU), is used to control these internal processor states and control signals.
The data path used for the implementation of the processor kernel consists of:
- registers,
- memory interface,
- ALU,
- bus structures that are used to connect them.
Three busses are used to connect the registers: address bus, data bus, and control bus. On the bus lines, a slash / with a number indicates the number of bits (width) of the respective bus. Data values present on the active busses are shown in hexadecimal numbers.
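The fetch, decode, and execute sequence described above can be sketched as a small simulator. It uses the register transfer steps MAR = PC, MDR = memory read, IR = MDR, PC = PC + 1 and the hypothetical op code values of Example 2. The 256-word memory, the data addresses 0x10 to 0x12, the data values and the 16-bit wrap-around are assumptions made only for this illustration:

```python
ADD, STORE, LOAD, JUMP, JNEG = 0x00, 0x01, 0x02, 0x03, 0x04


def run(memory, max_steps):
    """Execute up to max_steps instructions; return the final (PC, AC)."""
    pc, ac = 0, 0
    for _ in range(max_steps):
        # Fetch: register transfers as listed in the text.
        mar = pc
        mdr = memory[mar]      # memory read
        ir = mdr
        pc = pc + 1
        # Decode: split IR into op code (high byte) and address (low byte).
        op_code, address = ir >> 8, ir & 0xFF
        # Execute: one short sequence of transfers per instruction.
        if op_code == ADD:
            ac = (ac + memory[address]) & 0xFFFF
        elif op_code == STORE:
            memory[address] = ac
        elif op_code == LOAD:
            ac = memory[address]
        elif op_code == JUMP:
            pc = address
        elif op_code == JNEG:
            if ac & 0x8000:    # sign bit set means AC < 0
                pc = address
        else:
            raise ValueError(f"unknown op code {op_code:02X}")
    return pc, ac


memory = [0] * 256
memory[0:3] = [0x0211, 0x0012, 0x0110]  # LOAD B, ADD C, STORE A
memory[0x11], memory[0x12] = 4, 3       # B = 4, C = 3 (assumed data)
pc, ac = run(memory, 3)
print(pc, ac, memory[0x10])             # 3 7 7
```

Each loop iteration corresponds to one pass through the fetch, decode and execute states of the control unit; a hardware implementation spreads these steps over several clock cycles.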
1.4 Processor Fetch, Decode and Execute Cycle
The change of data in registers: register transfer
- Load: read a memory location into a register
- ALU operation: input certain registers through the ALU, store the result back in a register
- Store: write a register to a memory location

[Figure: processor consisting of control unit (controller, control/status, PC, IR) and data path (ALU, registers), connected to I/O and memory]

Control unit: configures the data path operations. The sequence of desired operations (instructions) stored in memory is the program.

The instruction cycle is broken into several sub-operations, each taking one clock cycle, e.g.:
- Fetch: get the next instruction into the IR
- Decode: determine what the instruction means
- Fetch operands: move data from memory to a data path register
- Execute: move data through the ALU
- Store results: write data from a register to memory

Example memory contents (registers: PC, IR, R0, R1):
001 load R0, M[100]
002 inc R1, R0
003 store M[101], R1
...
100 10
101
...
CU sub-operations for the instruction at address 001:

Fetch (register transfer cycle 1): get the next instruction into the IR. PC: program counter, always points to the next instruction. IR: holds the fetched instruction. State: PC = 001, IR = load R0, M[100].

Decode: determine what the instruction means. State: PC = 001, IR = load R0, M[100].

Fetch operands: move data from memory to a data path register. The value 10 from M[100] is moved into R0.

Execute: move data through the ALU. This particular instruction does nothing during this sub-operation.

Store results: write data from a register to memory. This particular instruction does nothing during this sub-operation.
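The three-instruction example (load R0, M[100]; inc R1, R0; store M[101], R1) can be traced with a short sketch that mirrors the sub-operations. The tuple encoding of instructions is hypothetical, and the reading of inc R1, R0 as R1 = R0 + 1 is an assumption, since the slides do not define it:

```python
def run_program(program, memory):
    """Run the instructions in address order; return the final registers."""
    regs = {"R0": 0, "R1": 0}
    for pc in sorted(program):
        ir = program[pc]                 # Fetch: instruction into IR
        op, dst, src = ir                # Decode: what does it mean?
        if op == "load":
            regs[dst] = memory[src]      # Fetch operands: memory -> register
        elif op == "inc":
            regs[dst] = regs[src] + 1    # Execute: data through the ALU
        elif op == "store":
            memory[dst] = regs[src]      # Store results: register -> memory
        else:
            raise ValueError(f"unknown operation {op}")
    return regs


program = {
    1: ("load", "R0", 100),    # load R0, M[100]
    2: ("inc", "R1", "R0"),    # inc R1, R0 (assumed: R1 = R0 + 1)
    3: ("store", 101, "R1"),   # store M[101], R1
}
memory = {100: 10, 101: 0}
regs = run_program(program, memory)
print(regs, memory)  # {'R0': 10, 'R1': 11} {100: 10, 101: 11}
```

Each instruction uses only some of the five sub-operations, which matches the remark that the load instruction does nothing during execute and store results.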

After considering this introductory example, it should be obvious that a thorough understanding of each instruction, the hardware organization, busses, control signals, and timing is required to design computer architectures as well as advanced computer architectures: some operations can be performed in parallel, others must be performed sequentially, and others have to be scalable. A bus can only transfer one value per clock cycle, and a von Neumann ALU can only compute one value per clock cycle, so the ALU, the bus structures, and the data transfers limit the operations that can be done in parallel during a single clock cycle. In the states examined, three busses were used for register transfers. Timing in critical paths, such as ALU delays and memory access times, determines the clock speed at which these operations can be performed.
An implementation with multiple clock cycles per instruction was used in early generation processors. These processors had limited hardware, since the VLSI technology at that time supported orders of magnitude fewer gates on a chip than is possible in current devices. Current generation processors, such as those used in PCs, have a hundred or more instructions and use additional means to speed up program execution. Instruction formats are more complex, with up to 32 data registers and with additional instruction bits that are used for longer address fields and more powerful addressing modes, as mentioned before.

Pipelining converts fetch, decode, and execute into a parallel operation mode instead of a sequential one. As an example, with three-stage pipelining the fetch unit fetches instruction n+2 while the decode unit decodes instruction n+1 and the execute unit executes instruction n. With this faster pipelined approach, an instruction finishes execution every clock cycle, rather than every three clock cycles as in the simple von Neumann design concept introduced above.
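The throughput gain of the three-stage pipeline can be quantified with a short calculation. The sketch assumes an ideal pipeline in which every stage takes one clock cycle and no stalls occur; the function names are hypothetical:

```python
def cycles_sequential(n_instructions, n_stages=3):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=3):
    """The first instruction fills the pipeline; then one finishes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


for n in (1, 8, 1000):
    seq, pipe = cycles_sequential(n), cycles_pipelined(n)
    print(f"{n} instructions: {seq} vs. {pipe} cycles, speedup {seq / pipe:.2f}")
```

For long instruction sequences the speedup approaches the number of stages, here 3; real pipelines fall short of this because of branches and data dependencies.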
Cooperating sequential logic circuit
Pipelining: enhance instruction throughput
[Figure: dishwashing analogy with the stages washing and drying for items 1 to 8, shown over time without pipelining and with pipelining]
[Figure: pipelined instruction throughput over time for instructions 1 to 8 with the stages fetch instruction, decode, fetch operands, execute, store results]

Superscalar processors are pipelined architectures that contain multiple fetch, decode and execute units. Superscalar computers can execute several instructions in one clock cycle. Most current generation processors, including those in PCs, are both pipelined and superscalar. An example of a pipelined architecture is the reduced instruction set computer (RISC) architecture.
To demonstrate the operation of a computer architecture, a VHDL model of it has to be built and analyzed by simulation. The simple computer example introduced above shall be used to demonstrate a simple VHDL design. It may come as a surprise that this design fits easily into a FLEX 10K20 device, a PLD from Altera.
1.5 VHDL Model of a Processor

The name VHDL, an abbreviation for Very High Speed Integrated Circuits Hardware Description Language, causes some confusion. The language was initiated to describe integrated circuits (ICs) that were not yet available, i.e. that were still in development or were to be developed in the future. Description in this context means documentation; the description for development itself was performed by other methods, e.g. graphical environments.

Timetable for VHDL:
1970s: VHDL was discussed as a documentation standard within military programs (DoD) to reduce the costs of service and maintenance.
1982: IBM, TI and Intermetrics were contracted to develop the hardware description language.
1987: The first standard: IEEE 1076-1987
1993: The second (and still valid) standard: IEEE 1076-1993
1998: A new standard was introduced: IEEE 1076.1 (also called VHDL-AMS, for Analogue and Mixed Signals)

VHDL was thus originally developed for the description and documentation of integrated circuits, not for their development!

As far as tools are concerned, documentation and simulation tools are available for workstations as well as for personal computers (except for VHDL-AMS). Synthesis tools, in contrast, pose some problems: the complete standard is not synthesizable, and a subset for this purpose is not defined. As a consequence, someone may write correct VHDL code that passes simulation without any problem and is synthesized by compiler I, while compiler II rejects the code or generates a digital system three times larger than that of compiler I.

A first VHDL example shows the procedure for describing a digital circuit in this language. The example consists of a clocked buffer for four bits. The output buffers have the states 0, 1 and high impedance (Z), depending on the output enable control line reg_oe. The four-bit input bus is loaded into the output register by a positive edge on the clock line reg_clk.
    -- Edge-clocked register for 4 bits including output enable
    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY reg4b IS
        PORT(
            reg_in  : IN  std_logic_vector (3 DOWNTO 0);  -- Input to the register
            reg_out : OUT std_logic_vector (3 DOWNTO 0);  -- Output
            reg_clk : IN  std_logic;                      -- Clock signal for storage
            reg_oe  : IN  std_logic );                    -- and output enable
    END reg4b;
VHDL source code: Entity for reg4b, a clocked buffer with a width of four bits

The first part is called ENTITY (VHDL is case-insensitive for keywords and identifiers). The ENTITY describes the interface of the new part to the outside world; it contains a GENERIC declaration for configuration parameters (not shown in the example) and a PORT declaration for the interface signals. This source code draws on three (!) libraries: STD and WORK are always included by default, and the LIBRARY ieee is named explicitly. This library contains many so-called packages; the statement USE ieee.std_logic_1164.ALL makes the PACKAGE std_logic_1164 and all (ALL) types, functions and procedures defined in it available to the compiler. The ENTITY has the name reg4b, which from now on uniquely identifies this part. The ENTITY description is framed by the keyword ENTITY and the END reg4b in the last line.

The PORT declaration contains four signals (they are signals, although the keyword SIGNAL is not used here), each with a unique name, an object type (std_logic or std_logic_vector) and a mode (IN or OUT). The mode determines whether a signal is read only (IN) or write only (OUT). Other modes are INOUT (read and write accessible) and BUFFER (readable by all, writable by only one source). The object types std_logic and std_logic_vector are not declared in the VHDL standard IEEE 1076 but in the additional standard IEEE 1164. The object types predefined in VHDL itself include BIT (values 0 or 1) and BOOLEAN (true or false). For the purpose of this description, pure logic values are not sufficient. The IEEE 1164 standard, which is implemented in the library ieee, therefore defines the nine-valued object type std_logic. This object type contains 0 and 1 as well as Z and other values.
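The role of the additional std_logic values can be illustrated with a small simulation-only sketch. The entity std_logic_demo and the signals drv_a, drv_b and bus_line are hypothetical names introduced here for illustration; they are not part of the lecture code. The sketch assumes a VHDL-93 simulator and shows how two drivers on one resolved std_logic signal combine.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical demonstration entity (not part of the lecture code)
ENTITY std_logic_demo IS
END ENTITY std_logic_demo;

ARCHITECTURE sim OF std_logic_demo IS
    -- The nine values of std_logic are:
    -- 'U' uninitialized, 'X' forcing unknown, '0', '1', 'Z' high impedance,
    -- 'W' weak unknown, 'L' weak 0, 'H' weak 1, '-' don't care
    SIGNAL drv_a, drv_b : std_logic := 'Z';
    SIGNAL bus_line     : std_logic;  -- resolved type: several drivers allowed
BEGIN
    -- Two concurrent drivers on the same signal
    bus_line <= drv_a;
    bus_line <= drv_b;

    check: PROCESS
    BEGIN
        WAIT FOR 10 ns;
        ASSERT bus_line = 'Z' REPORT "both drivers off: expected Z" SEVERITY error;

        drv_a <= '1';
        WAIT FOR 10 ns;
        ASSERT bus_line = '1' REPORT "one driver active: expected 1" SEVERITY error;

        drv_b <= '0';
        WAIT FOR 10 ns;
        ASSERT bus_line = 'X' REPORT "conflicting drivers: expected X" SEVERITY error;

        drv_a <= 'Z';
        WAIT FOR 10 ns;
        ASSERT bus_line = '0' REPORT "remaining driver: expected 0" SEVERITY error;

        WAIT;  -- stop the process
    END PROCESS check;
END ARCHITECTURE sim;
```

With the type BIT this description would be rejected, because BIT is not a resolved type and allows only one driver per signal.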

    ARCHITECTURE behave_reg4b OF reg4b IS
        SIGNAL reg_internal : std_logic_vector (3 DOWNTO 0) := "0000";
    BEGIN
        reg4b_storage: PROCESS( reg_in, reg_clk )
        BEGIN
            IF rising_edge( reg_clk ) THEN
                reg_internal <= reg_in;
            ELSE
                NULL;  -- No action requested
            END IF;
        END PROCESS reg4b_storage;

        reg4b_output: PROCESS( reg_oe, reg_internal )
        BEGIN
            IF reg_oe = '0' THEN
                reg_out <= reg_internal;
            ELSE
                reg_out <= (OTHERS => 'Z');
            END IF;
        END PROCESS reg4b_output;
    END behave_reg4b;
Architecture of reg4b

Only one ENTITY exists for a complete description of a digital part in VHDL, but many ARCHITECTUREs may exist. This was defined in the language standard to enable the design engineer to produce several descriptions of one part using different levels of abstraction and views. The previous code shows an ARCHITECTURE for the ENTITY reg4b, which is called behave_reg4b. The ARCHITECTURE contains a local SIGNAL called reg_internal, which stores the input at every rising edge of reg_clk, and two PROCESSes. Both PROCESSes are largely independent of each other. PROCESS reg4b_storage describes the storage using the predefined function rising_edge() (IEEE 1164) and is sensitive to reg_in and reg_clk. This sensitivity, formally described in the sensitivity list, means that any change on these signals starts a new computation of this process in the simulation.

PROCESS reg4b_output, sensitive to reg_oe and reg_internal (the coupling between both processes), describes the output behavior of the ARCHITECTURE. The assignment (OTHERS => 'Z') sets all elements of the std_logic_vector to high impedance (Z). This is a form of named association in value assignments and shows in addition that assignments to complete vectors are allowed in VHDL. Both PROCESSes and all statements outside any PROCESS are simulated concurrently, but inside one PROCESS the computation is performed in sequential order, which is an important part of the simulation concept.
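The behavior described above can be checked by simulation with a small testbench. The following is a minimal sketch, assuming that the entity reg4b and its architecture behave_reg4b have been compiled into the library work and that a VHDL-93 simulator is used. The testbench name reg4b_tb and the signals tb_in, tb_out, tb_clk and tb_oe are hypothetical names chosen for illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical testbench; a testbench entity has no ports
ENTITY reg4b_tb IS
END ENTITY reg4b_tb;

ARCHITECTURE test OF reg4b_tb IS
    SIGNAL tb_in, tb_out : std_logic_vector (3 DOWNTO 0);
    SIGNAL tb_clk        : std_logic := '0';
    SIGNAL tb_oe         : std_logic := '1';  -- '1' disables the output
BEGIN
    -- Assumes reg4b(behave_reg4b) from the text is compiled into work
    dut: ENTITY work.reg4b(behave_reg4b)
        PORT MAP( reg_in => tb_in, reg_out => tb_out,
                  reg_clk => tb_clk, reg_oe => tb_oe );

    stimulus: PROCESS
    BEGIN
        tb_in <= "1010";
        WAIT FOR 10 ns;
        ASSERT tb_out = "ZZZZ" REPORT "output should be high impedance" SEVERITY error;

        tb_clk <= '1';                 -- rising edge stores "1010"
        WAIT FOR 10 ns;
        ASSERT tb_out = "ZZZZ" REPORT "output still disabled" SEVERITY error;

        tb_oe <= '0';                  -- enable the output
        WAIT FOR 10 ns;
        ASSERT tb_out = "1010" REPORT "stored value expected" SEVERITY error;

        tb_in  <= "0101";              -- new input, but only a falling edge
        tb_clk <= '0';
        WAIT FOR 10 ns;
        ASSERT tb_out = "1010" REPORT "no storage on falling edge" SEVERITY error;

        tb_clk <= '1';                 -- next rising edge stores "0101"
        WAIT FOR 10 ns;
        ASSERT tb_out = "0101" REPORT "new value expected" SEVERITY error;

        WAIT;                          -- end of stimulus
    END PROCESS stimulus;
END ARCHITECTURE test;
```

The stimulus process runs sequentially, while the two processes inside the register react concurrently to the signal changes it produces.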

[Figure: Frame structure of a VHDL description. Entity: one interface (black box). Architecture: one or more descriptions and styles (structural description A1, behavioural description A2, combined behavioural and structural description A3). Configuration: choice of a configuration (configuration C1, configuration C2); default: last compiled version.]

The keywords for the frame structure of a VHDL description are:

ENTITY: The interface to the outside world is defined here.
ARCHITECTURE: The internal functionality is described here.
CONFIGURATION: The configuration actually wanted (if more than one is possible) may be defined here.
PACKAGE / PACKAGE BODY: Frequently used declarations, functions etc. may be defined here. This is an important part of the library system of VHDL.
    ENTITY halfadder IS
        PORT( a0, b0: IN bit;
              s0, c1: OUT bit );
    END ENTITY;

    ARCHITECTURE behave_halfadder OF halfadder IS
    BEGIN
        s0 <= a0 XOR b0;
        c1 <= a0 AND b0;
    END;

    ARCHITECTURE structural_halfadder OF halfadder IS
        COMPONENT xor2
            PORT( x1, x2: IN bit; xout: OUT bit );
        END COMPONENT;
        COMPONENT and2
            PORT( a1, a2: IN bit; aout: OUT bit );
        END COMPONENT;
    BEGIN
        xor_instance: xor2 PORT MAP( x1 => a0, x2 => b0, xout => s0 );
        and_instance: and2 PORT MAP( a1 => a0, a2 => b0, aout => c1 );
    END structural_halfadder;
Comparison of behavioral and structural description of a half-adder
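How a CONFIGURATION selects one of several architectures can be sketched with a small testbench for the half-adder. The names halfadder_tb, test and cfg_behave are hypothetical and introduced here for illustration; the sketch assumes that halfadder and behave_halfadder are compiled into the library work. Selecting structural_halfadder instead would additionally require entities for xor2 and and2, which the text does not show.

```vhdl
-- Hypothetical testbench for the half-adder
ENTITY halfadder_tb IS
END ENTITY halfadder_tb;

ARCHITECTURE test OF halfadder_tb IS
    COMPONENT halfadder
        PORT( a0, b0: IN bit; s0, c1: OUT bit );
    END COMPONENT;
    SIGNAL a, b, s, c : bit;
BEGIN
    dut: halfadder PORT MAP( a0 => a, b0 => b, s0 => s, c1 => c );

    stimulus: PROCESS
    BEGIN
        a <= '0'; b <= '0'; WAIT FOR 10 ns;
        ASSERT s = '0' AND c = '0' REPORT "0 + 0 failed" SEVERITY error;
        a <= '0'; b <= '1'; WAIT FOR 10 ns;
        ASSERT s = '1' AND c = '0' REPORT "0 + 1 failed" SEVERITY error;
        a <= '1'; b <= '0'; WAIT FOR 10 ns;
        ASSERT s = '1' AND c = '0' REPORT "1 + 0 failed" SEVERITY error;
        a <= '1'; b <= '1'; WAIT FOR 10 ns;
        ASSERT s = '0' AND c = '1' REPORT "1 + 1 failed" SEVERITY error;
        WAIT;  -- end of stimulus
    END PROCESS stimulus;
END ARCHITECTURE test;

-- The configuration binds the component instance dut
-- to the behavioral architecture of halfadder
CONFIGURATION cfg_behave OF halfadder_tb IS
    FOR test
        FOR dut: halfadder
            USE ENTITY work.halfadder(behave_halfadder);
        END FOR;
    END FOR;
END CONFIGURATION cfg_behave;
```

Simulating cfg_behave as the top-level unit runs the same testbench against the chosen architecture; a second configuration could bind dut to another architecture without changing the testbench.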

The digital computer model can be viewed as a VHDL-based state machine that implements the fetch, decode, and execute cycle. The first few lines declare the internal registers of the processor along with the states needed for the fetch, decode, and execute cycle. The computer's RAM is implemented using the LPM_RAM_DQ function.
    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;
    USE ieee.std_logic_unsigned.ALL;
    LIBRARY lpm;
    USE lpm.lpm_components.ALL;

    ENTITY scomp IS
        PORT(
            clock, reset             : IN  STD_LOGIC;
            program_counter_out      : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);
            register_AC_out          : OUT STD_LOGIC_VECTOR(15 DOWNTO 0);
            memory_data_register_out : OUT STD_LOGIC_VECTOR(15 DOWNTO 0) );
    END scomp;

    ARCHITECTURE a OF scomp IS
        TYPE STATE_TYPE IS (reset_pc, fetch, decode, execute_add, execute_load,
                            execute_store, execute_store3, execute_store2, execute_jump);
        SIGNAL state : STATE_TYPE;
        SIGNAL instruction_register, memory_data_register : STD_LOGIC_VECTOR(15 DOWNTO 0);
        SIGNAL register_AC             : STD_LOGIC_VECTOR(15 DOWNTO 0);
        SIGNAL program_counter         : STD_LOGIC_VECTOR(7 DOWNTO 0);
        SIGNAL memory_address_register : STD_LOGIC_VECTOR(7 DOWNTO 0);
        SIGNAL memory_write            : STD_LOGIC;
    BEGIN
        -- lpm function for the computer's memory (256 16-bit words)
        memory: lpm_ram_dq
            GENERIC MAP(
                lpm_widthad         => 8,
                lpm_outdata         => "UNREGISTERED",
                lpm_indata          => "REGISTERED",
                lpm_address_control => "UNREGISTERED",
                lpm_file            => "program.mif",
                lpm_width           => 16 )
            PORT MAP(
                data    => register_AC,
                address => memory_address_register,
                we      => memory_write,
                inclock => clock,
                q       => memory_data_register );

        program_counter_out      <= program_counter;
        register_AC_out          <= register_AC;
        memory_data_register_out <= memory_data_register;

        PROCESS (clock, reset)
        BEGIN
            IF reset = '1' THEN
                state <= reset_pc;
            ELSIF clock'EVENT AND clock = '1' THEN
                CASE state IS
                    -- reset the computer; need to clear some registers
                    WHEN reset_pc =>
                        program_counter         <= "00000000";
                        memory_address_register <= "00000000";
                        register_AC             <= "0000000000000000";
                        memory_write            <= '0';
                        state                   <= fetch;

The fetch state adds one to the PC and loads the instruction into the IR.

                    -- fetch instruction from memory and add 1 to PC
                    WHEN fetch =>
                        instruction_register <= memory_data_register;
                        program_counter      <= program_counter + 1;
                        memory_write         <= '0';
                        state                <= decode;

After the rising edge of the clock signal, the decode state starts. In decode, the low 8 bits of the instruction register are used to start a memory read operation in case the instruction needs a data operand from memory. The decode state contains a CASE statement that decodes the instruction using the opcode value in the high 8 bits of the instruction. This means that the computer can have up to 256 different instructions, although only four are implemented here.
                    -- decode instruction and send out address of any data operand
                    WHEN decode =>
                        memory_address_register <= instruction_register(7 DOWNTO 0);
                        CASE instruction_register(15 DOWNTO 8) IS
                            WHEN "00000000" => state <= execute_add;
                            WHEN "00000001" => state <= execute_store;
                            WHEN "00000010" => state <= execute_load;
                            WHEN "00000011" => state <= execute_jump;
                            WHEN OTHERS     => state <= fetch;
                        END CASE;
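The instruction format used in the decode state (opcode in the high 8 bits, operand address in the low 8 bits) can be illustrated with a small simulation-only sketch. The entity decode_demo and the instruction word instr are hypothetical and chosen for illustration; the word shown encodes opcode 00000010 (LOAD in the listing above) with the operand address 00010000.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical demonstration entity (not part of the lecture code)
ENTITY decode_demo IS
END ENTITY decode_demo;

ARCHITECTURE sim OF decode_demo IS
    -- Assumed example word: opcode 00000010 (LOAD), operand address 00010000
    CONSTANT instr : std_logic_vector (15 DOWNTO 0) := "0000001000010000";
BEGIN
    check: PROCESS
        VARIABLE opcode  : std_logic_vector (7 DOWNTO 0);
        VARIABLE address : std_logic_vector (7 DOWNTO 0);
    BEGIN
        opcode  := instr (15 DOWNTO 8);  -- high 8 bits select the instruction
        address := instr (7 DOWNTO 0);   -- low 8 bits address the operand

        ASSERT opcode = "00000010" REPORT "unexpected opcode" SEVERITY error;
        ASSERT address = "00010000" REPORT "unexpected address" SEVERITY error;
        WAIT;  -- stop the process
    END PROCESS check;
END ARCHITECTURE sim;
```

With 8 address bits the instruction can reach all 256 words of the memory, which matches the lpm_widthad value of 8 in the memory instance.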

After the rising edge of the clock signal, control transfers to an execute state that is specific for each instruction.
                    -- Execute the ADD instruction
                    WHEN execute_add =>
                        register_ac             <= register_ac + memory_data_register;
                        memory_address_register <= program_counter;
                        state                   <= fetch;
                    -- Execute the STORE instruction; needs 3 clock cycles for memory write
                    WHEN execute_store =>
                        -- write register_AC to memory
                        memory_write <= '1';
                        state        <= execute_store2;
                    -- this state ensures that the memory address is valid
                    -- until after memory_write goes inactive
                    WHEN execute_store2 =>
                        memory_write <= '0';
                        state        <= execute_store3;
                    WHEN execute_store3 =>
                        memory_address_register <= program_counter;
                        state                   <= fetch;

                    -- Execute the LOAD instruction
                    WHEN execute_load =>
                        register_ac             <= memory_data_register;
                        memory_address_register <= program_counter;
                        state                   <= fetch;
                    -- Execute the JUMP instruction
                    WHEN execute_jump =>
                        memory_address_register <= instruction_register(7 DOWNTO 0);
                        program_counter         <= instruction_register(7 DOWNTO 0);
                        state                   <= fetch;
                    WHEN OTHERS =>
                        memory_address_register <= program_counter;
                        state                   <= fetch;
                END CASE;
            END IF;
        END PROCESS;
    END a;

It should be noted that the execute phase takes one clock cycle for some instructions and more than one clock cycle for others. Instructions that write to memory require more than one execute state because of memory timing constraints. As seen in the STORE instruction, the memory address and data need to be stable before and after the memory write signal is 1; hence additional states are used to avoid violating memory setup and hold times. When an instruction finishes its final execute state, the MAR is loaded with the PC to start the fetch of the next instruction, and control returns to the fetch state.
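The cycle counts follow directly from the state sequence of the listing, in which every state lasts one clock cycle. The short worked count below is a sketch; the symbol T (clock cycles per instruction) and the four-instruction program are assumptions introduced here for illustration.

```latex
\begin{align*}
T_{\text{ADD}} = T_{\text{LOAD}} = T_{\text{JUMP}}
  &= \underbrace{1}_{\text{fetch}} + \underbrace{1}_{\text{decode}}
   + \underbrace{1}_{\text{execute}} = 3 \\
T_{\text{STORE}}
  &= \underbrace{1}_{\text{fetch}} + \underbrace{1}_{\text{decode}}
   + \underbrace{3}_{\text{execute\_store, execute\_store2, execute\_store3}} = 5 \\[4pt]
\text{Assumed program LOAD, ADD, STORE, JUMP:}\quad
T_{\text{total}} &= 3 + 3 + 5 + 3 = 14
\end{align*}
```

The reset_pc state adds one further clock cycle once after reset and is not counted per instruction.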
1.6 Simulation of the VHDL Model of a Processor
1.7 MORE

Pipeline processors carry out a command in several clock steps that generally run concurrently.
Superscalar processors process more than one command per clock cycle, e.g. through multithreading, based on their corresponding internal functional units.
RISC (Reduced Instruction Set Computer) processors carry out one command per clock cycle. They are also designated as scalar architectures and contain many registers and a few simple commands.
CISC (Complex Instruction Set Computer) processors contain many large commands but few registers.
DSPs (Digital Signal Processors) are especially geared towards digital signal processing and have specialized instruction sets, e.g. combined multiplication and addition commands that are carried out in one clock cycle.

