Advanced Computer Architecture (ACA) /lecture: LV64-446, Module MV5.1
1.1 Digital Computers History
1st Generation Processors:
Intel 4004: 4-bit µP containing about 2,300 transistors, capable of processing 4 bits of information at a time at a rate of roughly 60,000 instructions per second (~0.06 MIPS); market launch 1971. Two patents cover Intel's µP 4004:
Patent # 3,821,715, Memory System for a Multi-Chip Digital Computer, in the names of Ted Hoff, Stan Mazor and Federico Faggin
Patent # 3,753,01, Power supply settable, bi-stable circuit, in the name of Federico Faggin
1st Generation Processor Intel 4004:
[Figure: Intel 4004 in white/gold CerDIP, 16-pin, and its architecture]
[Figure: Intel 4004 microchip]
Source: http://www.intel.com/museum/archives/4004.htm
2nd Generation Processors:
Intel 8008: 1st 8-bit µP, implemented in PMOS technology, architecturally different from the 4004, with a 14-bit-wide address range (16 KB) and 8-bit-wide data words, capable of processing 8 bits of information at a time at a clock rate of up to ~0.8 MHz; market launch 1972. The instruction set encompassed 48 commands. The reason for the 8008 was the need to develop a keyboard controller that handled 7 or 8 bits, which was twice the 4004's processing range.
Scopes of the 2nd generation are:
increase of the address space up to 16 bit (65,536 memory locations),
optimization of the clock cycles of the instructions,
addition of interrupt facilities through interrupt requests and subroutines.
These features were not available at the market launch of the Intel 8008; they became available with the Intel 8080 (1974). The 8008 was an important transition CPU for Intel: work on the 8008 enabled the creation of the 8080 (which included the 8008 instruction set).
2nd Generation Processor Intel 8008:
[Figure: Intel 8008 close-up]
Source: http://www.antiquetech.com/chips/8008.htm
2nd Generation Processors:
In 1974 NMOS technology (2 supply voltages) was introduced; in some special cases CMOS technology was also available. The 1st NMOS processor was the 6800 from Motorola. In 1976 the Intel 8080A in NMOS technology was available, followed later by the 8085. The 8085 was not a true successor; hence the 8086 was developed, which expanded the 8080 design from 8 to 16 bit. Thus two lineages start in this generation: the 8080 leading to the 80386, and the 6800 leading to the 68030.
Summarizing the 2nd generation:
introduction of the 8-bit word length,
handling of asynchronously generated interrupts, which are important for process control,
expansion of the instruction sets and addressing modes up to an internationally accepted standard,
partial separation of the development of processors and controllers; because of this, the 16-bit processing range was introduced and the existing 8-bit technology was improved.
16-bit technology was introduced in 1973 with the IMP-16 processor from National Semiconductor. In 1975 the 16-bit one-chip processor PACE in PMOS technology was introduced, as well as its NMOS version, the INS8900. The 16-bit word length has the big benefit that optimization strategies for solutions based on the 8-bit word length are no longer in focus, while performance and speed become more important. This is reflected in the 16-bit variables in the definition of C, based on variable concepts like SHORT INTEGER or INTEGER.
3rd Generation Processor Intel 8086:
[Figure: die of an Intel 80486DX2 microprocessor (actual size: 12 × 6.75 mm) in its packaging]
Source: http://en.wikipedia.org/wiki/Microprocessor
Summarizing the 5th generation:
improvement of the instruction times,
embedding of additional units on the chip.
Intel Pentium Pro Processor (1995): designed to fuel 32-bit server and workstation applications, enabling fast CAD, mechanical engineering and scientific computation; packaged with a second, speed-enhancing cache memory chip; bears 5.5 million transistors.
Intel Pentium II Processor (1997): designed specifically to process video, audio and graphics data efficiently; bears 7.5 million transistors; incorporates MMX technology.
Intel Pentium II Xeon Processor (1998): designed to meet the performance requirements of mid-range and higher servers and workstations.
Intel Pentium Celeron Processor (1999): designed for the PC market segment; provides high performance at an exceptional price; delivers excellent performance for uses such as gaming and educational software.
Intel Pentium III Processor (1999): designed to significantly enhance Internet applications, allowing users to browse through realistic online museums and stores and download high-quality video; incorporates 9.5 million transistors and 70 new instructions; 0.25-micron technology was introduced.
Clock Cycle | Application Area
~ 2 GHz | cheap and fast processors for office applications
~ 2.5 GHz | processors for notebooks with low power
~ 3 GHz | high performance processors for CAD, server, games, etc.
Clock speed refers to the cycles per second of the main clock of the CPU.
CPU clock speeds of 2 ... 3 GHz are typical; faster clock speeds are elusive.
[Figure: clock frequency and memory capacity over time T, from 1975 (MHz and KB range) to 2005 (GHz and MB range)]
If the clock speed in GHz were a car, the cache would be the traffic light: no matter how fast the car goes, it cannot pass before the light turns green. The more clock speed and the more cache are available, the faster the processor is.
Clock speeds > 2 GHz are elusive
Assuming 1 GHz: light in vacuum travels ca. 30 cm in 1 clock cycle.
Assuming 10 GHz: 3 cm.
In a semiconductor, assuming 10 GHz: 3 mm.
[Figure: clock cycle]
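The distances quoted for one clock cycle follow from distance = speed / frequency. A minimal sketch in Python; the function name `distance_per_cycle_m` and the velocity factor of 0.1 for the semiconductor case are assumptions, the latter chosen only because it reproduces the quoted 3 mm at 10 GHz.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum, in m/s


def distance_per_cycle_m(clock_hz, velocity_factor=1.0):
    """Distance in metres a signal covers during one clock period.

    velocity_factor scales the vacuum speed of light; 0.1 is an assumed
    value that reproduces the 3 mm quoted for a semiconductor at 10 GHz.
    """
    return velocity_factor * C_VACUUM_M_PER_S / clock_hz


for label, f_hz, factor in [
    ("1 GHz, vacuum", 1e9, 1.0),
    ("10 GHz, vacuum", 10e9, 1.0),
    ("10 GHz, semiconductor (assumed factor 0.1)", 10e9, 0.1),
]:
    print(f"{label}: {distance_per_cycle_m(f_hz, factor) * 100:.2f} cm")
```

At 10 GHz a signal can therefore cross only a few millimetres of chip within one cycle, which is why the clock period constrains the physical size of a synchronous structure.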
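[Figure: wire of length l and cross-section A above the substrate, separated by an isolator (ε_r) of thickness d; F is the area of the wire facing the substrate]
Capacity of the wire: $C = \varepsilon_0 \varepsilon_r \frac{F}{d}$
Resistance of the wire: $R = \rho \frac{l}{A}$
Charging of the capacity through the resistance: $U_C = U_0 \left(1 - e^{-t/(RC)}\right)$
[Figure: U over t for the time constant RC, compared with the case R → 0]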
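The charging law U_C = U_0 (1 - e^(-t/RC)) can be evaluated numerically to see how the RC product delays a signal edge. This is a minimal sketch; the function names and the example values R = 1 kΩ and C = 1 pF are assumptions for illustration, not values from the lecture.

```python
import math


def capacitor_voltage(t, u0, r, c):
    """Voltage across the wire capacitance t seconds after a step to u0."""
    return u0 * (1.0 - math.exp(-t / (r * c)))


def time_to_reach(fraction, r, c):
    """Time until U_C reaches fraction * U_0, for 0 <= fraction < 1.

    Obtained by solving fraction = 1 - exp(-t / (R*C)) for t.
    """
    return -r * c * math.log(1.0 - fraction)


R_OHM = 1_000.0  # assumed wire resistance
C_FARAD = 1e-12  # assumed wire capacitance

print(f"tau = RC = {R_OHM * C_FARAD:.2e} s")
print(f"time to 50 % of U_0: {time_to_reach(0.5, R_OHM, C_FARAD):.2e} s")
print(f"U_C after one tau: {capacitor_voltage(R_OHM * C_FARAD, 1.0, R_OHM, C_FARAD):.3f} * U_0")
```

The time to reach a given logic threshold scales linearly with RC, so any reduction of the wire's R or C shortens the delay by the same factor.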
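Facts limiting clock speed: wire time delay
Assuming half of the length l of the structure: $l_{new} = \frac{l}{2}$
Resistance of reduced wire length: $R = \rho \frac{l_{new}}{A_{new}}$ with $A_{new} = \frac{A}{2}$
Capacity of reduced wire length: $C = \varepsilon_0 \varepsilon_r \frac{F_{new}}{d} = \varepsilon_0 \varepsilon_r \frac{F}{4d}$
[Figure: wire and wire (new), with $l_{new}$ and $F_{new}$]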
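Resistance of reduced wire length: $R = \rho \frac{l_{new}}{A_{new}} = \rho \frac{l/2}{A/2} = \rho \frac{l}{A}$
Capacity of reduced wire length: $C = \varepsilon_0 \varepsilon_r \frac{F_{new}}{d} = \varepsilon_0 \varepsilon_r \frac{F}{2d}$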
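The effect of shrinking a wire on its RC delay can be checked by evaluating R = ρ l / A and C = ε0 εr F / d before and after scaling. The sketch below assumes that l, A and F are each halved while d stays unchanged; other scaling assumptions give a different factor. All geometry and material values, and the function name `wire_rc`, are hypothetical.

```python
EPS_0 = 8.854e-12  # vacuum permittivity in F/m


def wire_rc(rho, length, cross_section, eps_r, facing_area, distance):
    """Return (R, C, R*C) of a wire over a substrate."""
    resistance = rho * length / cross_section
    capacitance = EPS_0 * eps_r * facing_area / distance
    return resistance, capacitance, resistance * capacitance


# Hypothetical values, chosen only for illustration.
old = wire_rc(rho=1.7e-8, length=1e-3, cross_section=1e-13,
              eps_r=3.9, facing_area=1e-9, distance=1e-6)
# Assumed scaling: l, A and F halved, d unchanged.
new = wire_rc(rho=1.7e-8, length=0.5e-3, cross_section=0.5e-13,
              eps_r=3.9, facing_area=0.5e-9, distance=1e-6)

print(f"R ratio:  {new[0] / old[0]:.2f}")
print(f"C ratio:  {new[1] / old[1]:.2f}")
print(f"RC ratio: {new[2] / old[2]:.2f}")
```

Under this assumption the resistance stays the same and the capacitance halves, so the time constant RC halves as well.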
These CPUs can execute multiple instructions per clock cycle, which dramatically speeds up a program. Other factors influence speed, like the mix of functional units, bus speeds, available memory/cache, the length of the pipeline with 32 kernels, and the type and order of instructions in the programs being run. Development of HTT (Hyper-Threading Technology): more internal data processors with a second register set and independent I/O logic. To the operating system, such a processor appears as a dual core architecture.
Dual Core
Cache Memory
Memory hierarchy, with capacity and access time increasing from top to bottom:
CPU registers
Caches
Main memory
Bulk memory (disc)
Archive memory (streamer)
Moore's Law
Year | Transistors
2007 | 2 * 10^9
2009 | 4 * 10^9
2011 | 8 * 10^9
2013 | 16 * 10^9
2015 | 32 * 10^9
2017 | 64 * 10^9
2019 | 128 * 10^9
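The table doubles the transistor count every two years, starting from 2 * 10^9 in 2007. A minimal sketch of that rule; the function name and the default parameters are assumptions that simply restate the table.

```python
def projected_transistors(year, base_year=2007, base_count=2e9, doubling_years=2):
    """Transistor count under a fixed doubling period (defaults taken from the table)."""
    return base_count * 2 ** ((year - base_year) / doubling_years)


for year in range(2007, 2020, 2):
    print(year, f"{projected_transistors(year):.0e}")
```

The same function can be used to test other doubling periods, e.g. `projected_transistors(2019, doubling_years=1.5)`.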
Effects in processors (Intel): floor space gain, higher speed, less electrical power.
Memory access: a cache is a small but fast memory, close to the processor.
1.2 Introduction
Digital computers consist of three main components:
processor or central processing unit (CPU),
memory that stores program instructions and data,
input/output hardware that communicates with other devices.
The CPU consists of two main components: the arithmetical logical unit (ALU) and the control unit (CU). The CU is a complex state machine that controls the internal operation of the digital processor.
CPU, CU, memory, and I/O are linked by an electrical highway, a communication interface for data transmission, called the bus. Typically, signals on the bus include: memory address, memory data, bus status.
Bus status signals indicate the current bus operation: memory read (MR), memory write (MW), input/output operation (I/O).
PC: Program Counter
IR: Instruction Register
AC: Accumulator
ALU: Arithmetical Logical Unit
MAR: Memory Address Register
MDR: Memory Data Register
Internally, the CPU contains a small number of registers, built from so-called D flip-flops, which are used to store data inside the processor.
Remember: D flip-flop (see LV 18.003, Module IP 7.3)
The output always takes on the state of the D input at the moment of a rising clock edge, and never at any other time. The flip-flop is called D flip-flop for this reason: the output takes the value of the D input (Data input) and Delays it by one clock count. A D flip-flop can be interpreted as a primitive memory cell.
Truth table ('X' denotes a don't-care condition, meaning the signal is irrelevant):
Clock | D | Q | Qprev
Rising edge | 0 | 0 | X
Rising edge | 1 | 1 | X
Non-rising | X | constant |
[Figure: symbol of the D flip-flop]
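The truth table can be mirrored by a small behavioural model: Q follows D only when the clock goes from 0 to 1 and holds its value otherwise. This is a sketch in Python; the class and method names are assumptions for illustration.

```python
class DFlipFlop:
    """Behavioural model of a rising-edge-triggered D flip-flop."""

    def __init__(self, q=0):
        self.q = q
        self._prev_clk = 0

    def tick(self, clk, d):
        """Apply the current clock and data levels and return Q."""
        if self._prev_clk == 0 and clk == 1:  # rising edge: Q takes D
            self.q = d
        self._prev_clk = clk  # any other case: Q stays constant
        return self.q


ff = DFlipFlop()
for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]:
    print(f"clk={clk} d={d} -> q={ff.tick(clk, d)}")
```

A register of the CPU is then simply several such cells sharing one clock signal.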
Modern processors contain one or more arithmetic and logical units (ALU) inside the CPU. The ALU is used to perform arithmetic and logical operations on data values.
ALU operations include (at minimum): add, subtract, logical and/or operations, shift.
Register-to-bus connections are hard-wired for simple point-to-point connections. If one of several registers can drive the bus, the connections are constructed using: multiplexing, open collector outputs, tri-state outputs.
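The minimum set of ALU operations named above (add, subtract, and, or, shift) can be sketched as one function that selects the operation and truncates the result to the word width. The 8-bit width, the operation names and the function name `alu` are assumptions for illustration.

```python
MASK = 0xFF  # assumed 8-bit word width


def alu(op, a, b=0):
    """Return the result of the selected operation, truncated to 8 bits."""
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "shl":  # shift left by one bit
        result = a << 1
    elif op == "shr":  # shift right by one bit
        result = a >> 1
    else:
        raise ValueError(f"unknown ALU operation: {op}")
    return result & MASK


print(alu("add", 200, 100))  # wraps around at 8 bits
print(alu("and", 0b1100, 0b1010))
```

Masking the result models the fixed register width: a carry out of the top bit is simply lost unless a status flag records it.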
Example 2: Basic Processor Instructions
Instruction Mnemonic | Operation Performed | Hypothetic op code value
ADD address | AC <- AC + content of memory address | 00
STORE address | content of memory address <- AC | 01
LOAD address | AC <- content of memory address | 02
JUMP address | PC <- address | 03
JNEG address | If AC < 0 then PC <- address | 04
Program variables A, B, and C are typically stored in dedicated memory locations. The symbolic representation of the instructions (shown in the first column) is called assembly language. The representation derived from the assembly language program (shown in the second column) is called machine language; it is the binary pattern that is actually loaded into the computer's memory. Machine language can be derived using the given instruction format. First, the op code of each instruction (shown in the first column) provides the first two hexadecimal digits of the machine instruction. Second, the data values of A, B, and C are assigned to hexadecimal addresses 01, 02, 03 in memory; the address provides the last two hexadecimal digits of each machine instruction.
Assemblers: computer programs that automatically convert the symbolic assembly language into the binary machine language.
Compilers: programs that automatically translate higher-level languages, such as C, C++, etc., into a sequence of machine instructions.
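The translation from assembly language to machine language described above (two hexadecimal op code digits followed by two hexadecimal address digits) can be sketched as a toy assembler. The op code values are the hypothetical ones of the basic processor instructions; the three-line program computing A = B + C and the function name `assemble` are assumptions for illustration.

```python
OPCODES = {"ADD": 0x00, "STORE": 0x01, "LOAD": 0x02, "JUMP": 0x03, "JNEG": 0x04}


def assemble(source, symbols):
    """Translate 'MNEMONIC operand' lines into 4-digit hexadecimal words.

    The operand is either a symbol from `symbols` or a hexadecimal address.
    """
    words = []
    for line in source:
        mnemonic, operand = line.split()
        address = symbols[operand] if operand in symbols else int(operand, 16)
        words.append(f"{OPCODES[mnemonic]:02X}{address:02X}")
    return words


# Assumed example program: A = B + C, with A, B, C at addresses 01, 02, 03.
program = ["LOAD B", "ADD C", "STORE A"]
symbols = {"A": 0x01, "B": 0x02, "C": 0x03}
print(assemble(program, symbols))
```

A real assembler additionally resolves labels for jump targets and places data and code at non-overlapping addresses; both are left out of this sketch.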
1.4 Processor Fetch, Decode and Execute Cycle
The processor reads or fetches an instruction from memory, decodes the instruction to determine what operations are required, and then executes the instruction:
Fetch next instruction -> Decode instruction -> Execute instruction
Implementing the fetch, decode, and execute cycle requires several register transfer operations and clock cycles. A specific state machine (the control unit, CU) controls the sequence of operations within the processor. The PC contains the address of the current instruction. To fetch the next instruction from memory, the processor must increment the PC. The processor must then send the address in the PC to memory over the bus by loading the MAR, and start a memory read operation on the bus. The instruction data will appear on the memory data bus lines and will be latched into the MDR. Execution of the instruction may require an additional memory cycle, so the instruction is saved in the CPU's IR. Using the value of the IR, the instruction can now be decoded. Execution of the instruction will require additional operations in the CPU, e.g. additional memory operations.
The current instruction is held in the IR. This instruction is one of the possible machine instructions such as ADD, LOAD, STORE, etc.
When execution of the current instruction is completed, the cycle repeats by starting a memory read operation (MR) and returning to the fetch state. A state machine, the so-called control unit (CU), is used to control these internal processor states and control signals.
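The fetch, decode and execute cycle can be sketched as a loop over the registers PC, MAR, MDR, IR and AC. The sketch uses the hypothetical op code values 00 to 04 for ADD, STORE, LOAD, JUMP and JNEG; the 16-bit word with an 8-bit op code and an 8-bit address, the memory layout, and the fixed step count (the instruction set has no halt instruction) are assumptions.

```python
ADD, STORE, LOAD, JUMP, JNEG = 0x00, 0x01, 0x02, 0x03, 0x04


def run(memory, steps):
    """Execute `steps` instructions starting at address 0; return (PC, AC)."""
    pc, ac = 0, 0
    for _ in range(steps):
        # Fetch: send PC to memory via MAR, latch the word into MDR, save it in IR.
        mar = pc
        mdr = memory[mar]
        ir = mdr
        pc += 1
        # Decode: split the instruction word into op code and address field.
        opcode, address = ir >> 8, ir & 0xFF
        # Execute: may need an additional memory cycle.
        if opcode == ADD:
            ac = (ac + memory[address]) & 0xFFFF
        elif opcode == STORE:
            memory[address] = ac
        elif opcode == LOAD:
            ac = memory[address]
        elif opcode == JUMP:
            pc = address
        elif opcode == JNEG:
            if ac & 0x8000:  # sign bit set: AC < 0 in two's complement
                pc = address
        else:
            raise ValueError(f"unknown op code {opcode:02X}")
    return pc, ac


memory = [0] * 256
memory[0:3] = [0x0211, 0x0012, 0x0110]  # LOAD 11, ADD 12, STORE 10
memory[0x11], memory[0x12] = 4, 3       # assumed data values
print(run(memory, 3), memory[0x10])
```

In hardware each of the commented phases is one or more states of the control unit; the loop body corresponds to one pass through that state machine.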
The data path is used for the implementation of the processor kernel, consisting of registers, memory interface, ALU, and the bus structures that connect them. Three busses are used to connect the registers: address bus, data bus, control bus. On the bus lines, a slash / with a number indicates the number of bits (width) of the respective bus. Data values present on the active busses are shown in hexadecimal numbers.
The change of data in registers: register transfer
Data path operations:
Load
ALU operation: input certain registers through the ALU, store the result back in a register
Store: write a register to a memory location
[Figure: processor with control unit (controller, control/status register) and data path (ALU, registers), connected to I/O and memory]
The instruction cycle is broken into several sub-operations, each taking one clock cycle, e.g.:
Fetch: get the next instruction into the IR
Decode: determine what the instruction means
Fetch operands: move data from memory to a data path register
Execute: move data through the ALU
Store results: write data from a register to memory
Example program and memory contents used for the sub-operations:
001 load R0, M[100]
002 inc R1, R0
003 store M[101], R1
Memory: M[100] = 10
CU Sub-Operation Fetch
Get the next instruction into the IR.
PC: program counter, always points to the next instruction.
IR: holds the fetched instruction.
[Figure: processor and memory during the fetch of 001 load R0, M[100]]
CU Sub-Operation Decode
Determine what the instruction means.
[Figure: processor and memory during the decode of 001 load R0, M[100]]
CU Sub-Operation Fetch Operands
Move data from memory to a data path register.
[Figure: IR holds load R0, M[100]; the value 10 from M[100] is moved into R0]
CU Sub-Operation Execute
Move data through the ALU. This particular instruction does nothing during this sub-operation.
[Figure: IR holds load R0, M[100]; R0 holds 10]
CU Sub-Operation Store Results
Write data from a register to memory. This particular instruction does nothing during this sub-operation.
[Figure: IR holds load R0, M[100]; R0 holds 10]
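The three-instruction example (load R0, M[100]; inc R1, R0; store M[101], R1) can be traced with a small register-transfer model. The sketch assumes that `inc R1, R0` means R1 = R0 + 1 and that every instruction passes through all five sub-operations at one clock cycle each; the tuple encoding of the program and the function name are hypothetical.

```python
SUB_OPERATIONS = ("fetch", "decode", "fetch operands", "execute", "store results")


def run_program(program, memory):
    """Run the program on registers R0 and R1; return (registers, clock cycles)."""
    regs = {"R0": 0, "R1": 0}
    cycles = 0
    for op, dst, src in program:
        if op == "load":     # memory location -> register
            regs[dst] = memory[src]
        elif op == "inc":    # assumed semantics: dst = src + 1, via the ALU
            regs[dst] = regs[src] + 1
        elif op == "store":  # register -> memory location
            memory[dst] = regs[src]
        else:
            raise ValueError(f"unknown operation: {op}")
        cycles += len(SUB_OPERATIONS)  # one clock cycle per sub-operation
    return regs, cycles


memory = {100: 10}
program = [("load", "R0", 100), ("inc", "R1", "R0"), ("store", 101, "R1")]
regs, cycles = run_program(program, memory)
print(regs, memory, cycles)
```

With five cycles per instruction the three instructions need 15 clock cycles, which is the cost that pipelining later reduces.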
An implementation approach with multiple clock cycles per instruction was used in early generation processors. These processors had limited hardware, since the VLSI technology at that time supported orders of magnitude fewer gates on a chip than is possible in current devices. Current generation processors, such as those used in PCs, have a hundred and more instructions, and use additional means to speed up program execution. Instruction formats are more complex, with up to 32 data registers and with additional instruction bits that are used for longer address fields and more powerful addressing modes, as mentioned before.
Cooperating sequential logic circuit
Pipelining: Enhance instruction throughput
[Figure: washing and drying of 8 loads over time, without pipelining and with pipelining]
[Figure: pipelined execution of 8 instructions over time]
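The throughput gain of pipelining can be estimated by counting time steps. The sketch assumes stages of equal duration and no stalls; the stage counts used in the example (2 for washing and drying, 5 for the instruction pipeline) are an interpretation of the figures, and the function names are hypothetical.

```python
def unpipelined_time(n_items, n_stages):
    """Each item passes through all stages before the next item starts."""
    return n_items * n_stages


def pipelined_time(n_items, n_stages):
    """A new item enters every time step; the first item needs n_stages steps."""
    if n_items == 0:
        return 0
    return n_stages + n_items - 1


for name, items, stages in [("washing/drying", 8, 2), ("instructions", 8, 5)]:
    print(name, unpipelined_time(items, stages), pipelined_time(items, stages))
```

The latency of a single item is unchanged; only the number of items completed per unit of time improves, approaching one item per time step for long sequences.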
To demonstrate the operation of a computer architecture, a VHDL model of it is built and analyzed by simulation. The simple computer example introduced above is used to demonstrate a simple VHDL design. It may come as a surprise that this design fits easily into a FLEX 10K20 device, a PLD from Altera.
-- Edge-clocked register for 4 bits including output enable
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY reg4b IS
  PORT(
    reg_in  : IN  std_logic_vector (3 DOWNTO 0);  -- Input to the register
    reg_out : OUT std_logic_vector (3 DOWNTO 0);  -- Output
    reg_clk : IN  std_logic;                      -- Clock signal for storage
    reg_oe  : IN  std_logic );                    -- and output enable
END reg4b;
VHDL source code: Entity for reg4b, a clocked buffer with a width of four bits
[Figure: architecture of reg4b as a black box, with structural description A1 and behavioural description A2, selected by configuration C1 or configuration C2]
ENTITY: The interface to the outside world is defined here.
ARCHITECTURE: The functionality inside is described here.
CONFIGURATION: The configuration actually wanted (if more than one is possible) may be defined here.
PACKAGE / PACKAGE BODY: Frequently used declarations, functions etc. may be defined here. This is an important part of the library system of VHDL.
-- Half adder: interface
ENTITY halfadder IS
  PORT(
    a0, b0: IN bit;
    s0, c1: OUT bit );
END ENTITY;

-- Behavioural description: sum and carry as logic equations
ARCHITECTURE behave_halfadder OF halfadder IS
BEGIN
  s0 <= a0 XOR b0;
  c1 <= a0 AND b0;
END;

-- Structural description: the same function built from xor2 and and2 components
ARCHITECTURE structural_halfadder OF halfadder IS
  COMPONENT xor2
    PORT( x1, x2: IN bit; xout: OUT bit );
  END COMPONENT;
  COMPONENT and2
    PORT( a1, a2: IN bit; aout: OUT bit );
  END COMPONENT;
BEGIN
  xor_instance: xor2 PORT MAP( x1 => a0, x2 => b0, xout => s0 );
  and_instance: and2 PORT MAP( a1 => a0, a2 => b0, aout => c1 );
END structural_halfadder;
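Both architectures describe the same function, s0 = a0 XOR b0 and c1 = a0 AND b0. A quick way to check the expected truth table before simulating the VHDL is a reference model in Python; the function name mirrors the entity, everything else is an assumption for illustration.

```python
def half_adder(a0, b0):
    """Return (s0, c1) for two input bits, as in the behavioural architecture."""
    s0 = a0 ^ b0  # sum bit: XOR
    c1 = a0 & b0  # carry bit: AND
    return s0, c1


print("a0 b0 | s0 c1")
for a0 in (0, 1):
    for b0 in (0, 1):
        s0, c1 = half_adder(a0, b0)
        print(f" {a0}  {b0} |  {s0}  {c1}")
```

The printed table can serve as the expected values of a VHDL test bench for either architecture.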
WHEN execute_add =>
  register_ac <= register_ac + memory_data_register;
  memory_address_register <= program_counter;
  state <= fetch;
-- Execute the STORE instruction; needs 3 clock cycles for memory write
WHEN execute_store =>
  -- write register_A to memory
  memory_write <= '1';
  state <= execute_store2;
-- this state ensures that the memory address is valid until after memory_write goes inactive
WHEN execute_store2 =>
  memory_write <= '0';
  state <= execute_store3;
WHEN execute_store3 =>
  memory_address_register <= program_counter;
  state <= fetch;
1.7 MORE
Pipeline processors carry out a command in several clock steps; the steps of successive commands generally run concurrently.
Superscalar processors process more than one command per clock cycle, e.g. through multithreading, based on their corresponding internal functional units.
RISC (Reduced Instruction Set Computer) processors carry out one command per clock cycle. They are also designated as scalar architectures and contain many registers and a few simple commands.
CISC (Complex Instruction Set Computer) processors contain many large commands but few registers.
DSPs (Digital Signal Processors) are especially geared towards digital signal processing and offer specialized instruction sets, e.g. combined multiplication and addition commands that are carried out in one clock cycle.