Microprocessor Systems: Introduction & Historical Review

Microprocessor Systems
Introduction & Historical Review
Dr. Taisir Eldos
“Study the past if you would define the future” - Confucius

µP Systems
★ Digital systems have prominent roles in life; social, commercial, industrial, scientific, etc.
★ Performance, cost, size, power, etc. vary; depending on what they are designed to do
★ Computing platforms like Desktops, Laptops & Tablets represent a small fraction of the world’s computing; the industrial and embedded platforms are numerous
★ Artificial Intelligence (AI) has a touch on everything today, and this requires sophisticated high performance processors to process vast amounts of data, as in autonomous driving
★ Ubiquitous computing is about hardware & software engineering appearing everywhere in life
❖ Handheld computing devices
❖ Internet of Things (IoT) & Internet of Medical Things (IoMT)
★ Telematics is a technology that combines informatics and telecommunications (Internet) for specific applications like fleet management, using data collected by devices like sensors
★ In modern societies, people deal with computing devices more than 70 times in a typical day
❖ Automatic Tellers
❖ Security Gates
❖ Automotives
❖ Phones
❖ Tickets
Microprocessor Systems, Draft Lectures by Dr. Taisir Eldos, Jordan University of Science and Technology

General Purpose Computer
★ A typical microprocessor system consists of:
❖ Processor and support logic; clock, reset, etc.
❖ Memory where code is executed and data is stored; primary memory & secondary storage
❖ Input & Output; peripherals that interact with the outside world via an interface
❖ Connections; to connect all parts
❖ Glue Logic; to arbitrate the various operations
★ Power is supplied to the various parts at the required voltage
❖ Conventional operating voltages
✦ 12 V, Hard Disk Drives & Liquid Crystal Displays
✦ ±12 V, RS-232 Communication Links
✦ 5.0 & 3.3 V, Legacy & Modern chips
✦ 1.8 V, MultiMedia Cards, Flash Storage
❖ Modern operating voltages
✦ 1.5 V, 1.2 V or 1.1 V for Memory
✦ 0.7 V to 1.4 V for Central Processing Units
✦ 0.6 V to 1.0 V for Graphic Processing Units
[Figure: Processor with Memory, Storage and an Interface to Mouse, Keyboard, Screen, Touch, Camera, Printer, Mic and Speaker]

Design Example - Small Computer
★ Define the target class of users, hence functions, performance & price (as range)
★ Write down what this system is going to do; the workload
★ Write down the specifications to achieve the workload
★ Search for the system components based on the target application
❖ CPU class, complexity, price & performance
❖ MEM class, size, price & performance
❖ I/O components, types & numbers
❖ Form Factor, Power Supply, etc.
★ Analyze trade-offs; cost & performance
★ Fewer components means
❖ Less assembly cost
❖ Less testing cost
❖ Smaller system
❖ More reliability
❖ Less shipping cost
❖ Less power consumption

Quality of Design
★ Primary Metrics
❖ Price, depends on the cost of material and design time and effort
❖ Power, important in portable systems for battery life and in large systems for cost
❖ Performance, doing more work in less time is always a requirement
★ Secondary Metrics
❖ Form factor: Size, Weight, etc.
❖ Operating range: Temperature, Radiation, etc.
❖ Reliability: Mean Time Between Failures (MTBF)
★ Design Steps:
❖ Project statement; Objectives, Deliverables, Constraints, Milestones, etc.
❖ Project specifications; for Managers, Developers, Testers, Clients, etc.
❖ Project analysis; Cost, Assumptions, Discrepancies, Choices, etc.
❖ Co-Design (Hardware & Software) tradeoffs; Maximize performance & Minimize cost
❖ Implementation tradeoffs
❖ Construction
❖ Testing
❖ Documentation

The Beginning
★ The Post Office Research Station invested in designing a machine to break German codes in WWII
★ Designed for a specific task (not a programmable computer)
★ Primitive machines; bulky, clumsy and slow
❖ 1,600 Vacuum Tubes in Mark #1
❖ 2,400 Vacuum Tubes in Mark #2
★ 9 kW of power consumption
★ No RAM at all! Purpose specific design
★ Input / Output?
❖ Input was paper tape
❖ Output was indicator lamps
★ Started in 1943
★ Retired in 1960
[Figure: COLOSSUS 1943, UK]

Programmable Computer
★ The USA built a machine for ballistics, the Electronic Numerical Integrator And Computer (ENIAC), completed in 1945 and retired in 1955
★ Programmable by physical rewiring
★ University of Pennsylvania (Philadelphia)
★ Primitive machines; bulky, clumsy and slow
★ Caused city brownout when turned on
★ ENIAC was capable of doing:
❖ 5,000 Add Per Second
❖ 350 & 38 Multiply & Divide Per Second
❖ 10 to 20 FLOPS (By software routines)
❖ Took 70 hours to compute 𝝅 to 2037 places
[Figure: ENIAC 1946, USA]
★ Modern Computers (TOP500 List, #1)
❖ 2004, NEC’s Earth Simulator (Japan): 36 TFLOPS, 6,400 CPUs & $500,000,000
❖ 2012, Cray’s Titan (USA): 18 PFLOPS
❖ 2021, Fugaku (Japan): 440 PFLOPS
❖ 2022, Frontier (USA): 1100 PFLOPS
❖ M3 Max Based Personal System: 16.4 TFLOPS & $3,000

Integrated Circuits
★ In the late 1950s, transistors replaced the electronic valves as switches to build gates, and were used as discrete elements in making computers, since they are smaller, faster and more reliable
★ Today’s transistors are built using the Complementary Metal Oxide Semiconductor (CMOS) type for density and power consumption
★ Integrated Circuits (ICs) are about placing the whole circuit, transistors and connections, on a single die, yielding smaller space and more reliability
★ Dies are packaged in what are called chips; with various forms and sizes
★ Contacts or pins on the chip provide communication with others
❖ Used as Address, Data, Control, Power & Test connections
❖ Started with 16 and went 18, 40, 64, …, now 6096
★ How does the number of transistors affect performance?
★ How does a wider data bus affect performance?
★ How does a wider address bus affect performance?
★ How does a higher clock frequency affect performance?
★ How does power consumption relate to performance?
★ Why do we have many VCC & GND inputs?
[Figure: CPU with VCC, GND, Address Bus (AB), Data Bus (DB) and Control Bus (CB) connections]

1st Generations (1946 - 1958)
★ Designed in 12 months & built in 18 months for the US Army, $500,000 ($7 Million in today’s dollars)
★ Technology
❖ Processing: Electronic valves
❖ Memory: Core memory and Magnetic tape
★ Specifications
❖ 1000’s of vacuum tubes and 1000’s of electromagnetic relays
❖ Huge power demand (160 kW), with liquid cooling
❖ Weight of 30 Tons and 15 x 9 x 2.5 m (like 12 offices)
★ Computing Power
❖ Basic functions; thousands of decimal additions/second
❖ Compute ballistic tables (for the military)
★ Ease of use
❖ Hardwired program for ballistic tables computation, took 3 weeks to change task
❖ Clumsy input/output
★ Reliability
❖ Used to fail a few times a day, with only a few days without failure

2nd Generations (1958 - 1965)
★ Technology
❖ Discrete transistors, invented in 1947 (used in 1958)
★ Compared to electronic valves
❖ Size: 5 x 5 x 5 mm versus 15 x 15 x 40 mm (50+ times smaller)
❖ Terminal count: 3 versus 5 or 6 (2 times fewer)
❖ Power: 5 mW versus 250 mW (50 times less)
❖ Voltage: 5 V or less versus 120 V (more than 20 times lower)
❖ Frequency: 1 MHz versus 0.1 MHz (10 times better)
❖ Mean Time Between Failures (at least 10 times better)
★ Today’s transistors are far better; much faster & lower power consumption
★ Historical review of Silicon
❖ Jöns Berzelius discovered Silicon in 1823 in Sweden
❖ John Bardeen, Walter Brattain & William Shockley invented the Transistor at Bell Labs in 1947, and received the Nobel Prize in Physics in 1956
❖ Jack Kilby demonstrated the first Integrated Circuit in 1958, and Robert Noyce developed the first monolithic silicon one in 1959
❖ Robert Noyce and Gordon Moore founded Intel Corporation in 1968

3rd Generations (1965 - 1975)
★ Integrated Circuits (IC), transistors and connections on a wafer, invented in 1959
★ Integration levels, based on components per chip
❖ 1962: 10 Transistors, Small Scale Integration (SSI)
❖ 1966: 100 Transistors, Medium Scale Integration (MSI)
❖ 1969: 1,000 Transistors, Large Scale Integration (LSI)
★ 1971 Intel 4004: 2300 transistors, 10 µm, 16-pin
❖ PMOS, 4-bit data & 12-bit address, 740 kHz, 46 instructions
★ 1972 Intel 8008: 3500 transistors, 10 µm, 18-pin
❖ PMOS, 8-bit data, 14-bit address, 800 kHz, 48 instructions
★ 1974 Intel 8080: 6000 transistors, 6 µm, 40-pin
❖ NMOS, 8-bit data, 16-bit address, 2 MHz, 10 times faster than 8008
[Figure: Intel 4004, Intel 8008 and Intel 8080 chips]
★ The 4004 could do 60,000 Decimal Operations Per Second
★ But to have an idea about its Instructions Per Second (IPS) performance:
❖ Cycles Per Instruction (CPI) = 8
❖ Instructions Per Cycle (IPC) = 1 / CPI = 0.125
❖ Instructions Per Second (IPS) = IPC x F = 740,000 x 0.125 = 92,500 IPS = 92.5 KIPS
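The IPS estimate above can be reproduced in a few lines. This is a minimal sketch; the function name `instructions_per_second` is a hypothetical helper, and the inputs are the slide's own figures for the 4004 (740 kHz clock, 8 cycles per instruction).

```python
def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """Return IPS = IPC x F, where IPC = 1 / CPI."""
    ipc = 1.0 / cpi
    return ipc * clock_hz


# Intel 4004 figures quoted above: 740 kHz clock, CPI = 8
ips_4004 = instructions_per_second(740_000, 8)
print(f"4004: {ips_4004:,.0f} IPS = {ips_4004 / 1000:.1f} KIPS")
```

The same helper can be applied to any processor for which a clock rate and an average CPI are known.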

Modern Systems
★ Integration Levels (Tr for Transistor, K, M & B for Kilo, Million & Billion)
❖ 1975: 10 KTr, Very Large Scale Integration (VLSI)
❖ 1980: 100 KTr, Ultra Large Scale Integration (ULSI)
❖ 1990: 1 MTr, Extremely Large Scale Integration (ELSI)
❖ 2000: 10 MTr, VLSI for all as a generic name
❖ 2010: 1 BTr
❖ 2013: 10 BTr
❖ 2015: 20 BTr
❖ 2018: 40 BTr
❖ 2022: 60 BTr
❖ 2023: 90 BTr
★ Today, chips are made of multiple dies for higher yields
★ Process, node or fab, used to refer to the transistor dimension (10 µm was the beginning)
★ Process today refers to feature size (like channel length); getting smaller and smaller
❖ A 3 nm process can pack around 300 MTr/mm²; a transistor cell is around 60 x 60 nm²
❖ The 6 nm diameter copper wires used are 10,000 times thinner than a human hair (60 µm)

Silicon: Ingots, Wafers, Dies & Chips
★ Earth’s crust has 46% Oxygen, 28% Silicon, 8% Aluminum by weight; sand is mostly Silicon dioxide
★ Chip-Grade Silicon impurities must not exceed 1 in 10^9 (Carbon, Oxygen & Others)
★ Fabrication takes place in sophisticated foundries that cost 5 to 30 Billion dollars, and it may cost 660 Million dollars to set up the production line for a layout (440 & 220 for 5 & 7 nm)
★ A chip takes 1000 to 2000 steps and 10 to 15 weeks to make; it may have 100 material layers, including 5 to 15 metal layers to route wires (> 100 km, consuming 5% to 15% of the power)
★ A 300 mm raw wafer costs $200 to $400 while a processed one costs $1,000 to $20,000
★ More than 1 Trillion chips are made every year & more than half of them come from TSMC
[Figure: Sand & Silicon; Melting & Crystallization; Ingot, 400 kg; Wafers, 15 - 45 cm; Dies ready to cut; A single die; A die, well and pins; Chip Packaging]

The Die
❖ Intel 80286, 1982: 134 KTr, 7 mm x 7 mm, 2.9 KTr / mm²
❖ Intel i7 Octa Core, 2014: 2.6 BTr, 18 mm x 20 mm, 7.2 MTr / mm²

Dies Per Wafer (Yield)
★ Wafers are 7.6, 10, 12, 15, 20, 30, 33 & 45 cm in diameter and 0.1 mm to 0.9 mm thick
★ Defect rates range from 0.05 to 0.1 per cm²; a 30 cm wafer may have 30 to 70 defects
❖ Smaller dies have better yields, and
❖ Larger wafers have better yields, this led to using 45 cm wafers & multi-die chips
★ Largest monolithic die was 30 mm, and today’s most effective size is 10 mm to 15 mm
★ Apple’s M1 SoC, 5 nm & 120 mm² die; a 30 cm wafer makes 450 dies, cost is $50 per chip
[Figure: 15 cm wafer & 30 mm die, 6 Good & 3 Defective; panels with Silicon Utilization = 30%, 50% and 70%]
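To get a feel for how die size and defect rate interact, the sketch below combines two textbook approximations that are not taken from the lecture: a first-order estimate of gross dies per wafer (wafer area over die area, minus an edge-loss term) and a Poisson model for the fraction of defect-free dies. The function names are hypothetical. Fed with the M1 figures above (30 cm wafer, 120 mm² die, 0.05 to 0.1 defects per cm²), it lands in the same ballpark as the quoted 450 dies rather than matching it exactly.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """First-order estimate: wafer area over die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    whole = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(whole - edge_loss)


def poisson_yield(defects_per_cm2: float, die_area_mm2: float) -> float:
    """Fraction of dies with zero defects, assuming randomly scattered defects."""
    die_area_cm2 = die_area_mm2 / 100
    return math.exp(-defects_per_cm2 * die_area_cm2)


gross = gross_dies_per_wafer(300, 120)
for d in (0.05, 0.1):
    good = gross * poisson_yield(d, 120)
    print(f"D = {d}/cm2: {gross} gross dies, about {good:.0f} good dies")
```

Under this model, halving the die area raises the yield of each die, which is the effect the "smaller dies have better yields" bullet describes.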

30 cm Wafer & 3 nm Features
★ To see a 5 nm thick copper wire in a die as a 75 µm thick hair, we need to magnify by 15,000
❖ A 7 mm x 7 mm x 450 nm Die x 15,000 = 105 m x 105 m x 7 cm (11 dunum area)
❖ A transistor cell on the die, 60 nm x 60 nm, will look like a point made by a pen; 0.9 mm
★ The wafer to feature size ratio is 30 cm / 3 nm = 100,000,000
★ With 7 mm x 7 mm die size, the yield of a 30 cm wafer is around 1,000 (nearly 200 defective)
★ If the wafer were as wide as the Earth at the equator, the 3 nm feature would be like 13 cm (bread slice size)
★ UV light sources are a few µm to 1 mm above the wafer; let’s say 0.1 mm
★ Lithography is like a chef flying an aircraft at 10 km above the ground, topping 13 cm bread slices on the ground without making any mess …
❖ Jam
❖ Butter
❖ Thyme, etc.
★ CMOS NOT Gate (middle)
❖ 8 Features wire pitch
❖ 16 Features gate pitch
❖ Billions of Gates
[Figure: Wafer, Die and CMOS NOT Gate; labels: 300 km x 300 km, Apartment Size]

Chip & Sockets
★ To connect various components comprising a system; CPU, ROM, RAM, PIO, SIO, PTC,
etc., we need a board to place and connect those chips
★ Every chip has pins, leads, balls or contacts, to connect the die in the well with the outside
★ Chips sit on sockets, whose mechanical design may improve heat dissipation
★ AMD Genoa, LGA6096 package, has 6096 contacts to support 12 memory channels for 96 cores, more than 1 GB of L3 cache, 128 PCIe lanes, 3.7 GHz & 400 W.
★ Does this explain why 6096 contacts?
★ Today, balls have 36 µm pitch
[Figure: package types; Dual In Line Package (DIL / DIP), Thin Small Outline Package (TSOP), Plastic Leaded Chip Carrier (PLCC), Surface Mount Technology (SMT / SMD), Pin Grid Array (PGA), Reduced Pin Grid Array (rPGA), Ball Grid Array (BGA), Land Grid Array (LGA)]

Chip Packaging
★ Flip-Chip is a technique that has been in use for a long time now; dies are placed face down to minimize wiring by having direct contact with pins, balls or lands
★ Extreme shrinking of transistors causes tunneling (the transistor conducts when it should not)
❖ Fabrication gets harder and harder due to the extremely small features
❖ Yield, the percentage of good chips in a wafer, gets lower and lower
★ The fabrication cost per mm² depends on the process; it doubled by moving from 14 nm to 7 nm
★ Multi-die packages reduce cost, increase yield and provide for customizable modular design
❖ 3D Stacking, dies stacked vertically on top of each other; CPUs, GPUs, SDRAM, I/O, etc.
❖ Chiplet (Tiles), dies placed next to each other horizontally, connected via an interposer
❖ 3D Chiplet, imagine a complex 100 cm² chip squeezed onto 25 cm² (4-story chip)
[Figure: 3D Stacking; Chiplet / Tiles; 3D Chiplet]

Yield: Chiplet vs. Monolithic
★ Yield is the product of: wafer yield, die yield, packaging yield and burn-in yield
★ Yield decreases significantly with increasing die area; may become unfeasible …
★ Consider a 200 mm² chip, the burn-in yield is …
❖ Monolithic (Single Die) 200 mm² has a yield of 40%, but …
❖ Chiplet (4 Dies x 50 mm² each) has a far better yield of 70%; 75% more in relative terms
[Figure: yield, 10% to 90%, for 4-Chiplet versus Monolithic over a horizontal axis from 20 to 400]

Enhancing Performance
★ Performance metrics: Clock, IPC, MIPS, GFLOPS, but the absolute metric is the task time
★ Ultimate performance is completion time of a task; productivity test. Run your app to figure
out how good a machine is. To reduce this time:
❖ Wider Data Bus
✦ Early days: 4, 8, 16, 32 and today 64 bits
✦ Means more bits, hence information, transferred each cycle or unit of time
❖ Wider Address Bus
✦ Early days: 12, 14, 16, 24, 32, 36 and today: 40, 41, …, 48, 50, & 52 (4 PB of memory)
✦ Larger address space means accessing more Code & Data directly in fast memory,
without referencing slower storage devices like HDD
❖ Larger number of registers, hence more high-speed data available (192, not all named)
❖ Larger number of functional units; Adders, Multipliers, etc., hence more things done in parallel
❖ Deeper pipelines; 10 to 30 stages in CPUs and 100s in GPUs
❖ Faster clocking; 6 GHz max by Intel Core i9 with 24 cores consuming 250 W
❖ Larger caches (2 x 64KB L1, 256 KB L2 & 250 MB L3)
❖ More cores (100 Cores) & hence more memory channels (8)
★ Leading to huge: number of transistors, number of contacts, and power consumption
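The address bus figures above follow from the fact that n address lines select 2^n locations. The sketch below assumes byte-addressable memory and binary multiples (1 KB = 1024 B), neither of which is stated in the lecture; `address_space_bytes` and `human_readable` are hypothetical helpers.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of directly addressable bytes for a given address width."""
    return 2 ** address_bits


UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def human_readable(size: int) -> str:
    """Express a power-of-two byte count using binary multiples."""
    index = 0
    while size >= 1024 and index < len(UNITS) - 1:
        size //= 1024
        index += 1
    return f"{size} {UNITS[index]}"


for bits in (12, 16, 24, 32, 40, 48, 52):
    print(f"{bits}-bit address bus: {human_readable(address_space_bytes(bits))}")
```

The last line of output corresponds to the 52-bit case quoted above as 4 PB of memory.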

Modern Computing
★ The Central Processing Unit (CPU) consists of:
❖ Arithmetic Unit (AU), to perform arithmetic, logic and shift operations
❖ Register Files (RF), to store operands and results temporarily
❖ Control Unit (CU), to dictate what to do in every cycle
★ CPU is good at general data manipulation and operations on integers and strings, not mathematically intensive operations like floating point number crunching
★ Floating Point Unit (FPU) is a special AU that is good at floating point arithmetic
★ Using an FPU next to the CPU enhances performance; they work in parallel and more
★ Integrating the FPU & the CPU on a die makes it even faster
★ Integrating more CPUs enhances parallelism and yields faster performance
[Figure: Early days, Later, Integrate, Integrate more; blocks labeled CPU, FPU, EU, CPU#1, CPU#2, CPU#3]

Modern Computing
★ An EU with split L1 caches (Data & Code) increases performance significantly
★ Adding a larger unified L2 cache will improve performance even more by reducing the memory latency, via increasing the hit ratio for the workload
★ That is a typical single core processor
★ We may integrate 2, 4, 6, 8, etc. of them to work in parallel on different programs or threads
★ A shared L3 cache improves the performance even more
★ Multi-Core processors require large bandwidth; a large amount of data and instructions to keep the cores busy, hence we use multiple memory channels
[Figure: Single Core (EU, L1 C, L1 D, L2 CD); Dual Core L1/L2; Dual Core L1/L2/L3 (shared L3 CD)]

Architectural Enhancement
★ Technological enhancements like clocking faster do not pay off much anymore
★ Architectural enhancements like packing more functional units require packing more transistors on the chip
★ Fabs could integrate only 200,000 transistors on a chip in the early 80’s! Where did they go?
★ Researchers realized that more than half went to the control sections, due to the large instruction set and number of addressing modes.
★ Also need to add more and more on die, like the Memory Management Unit (MMU). What to do?
★ Make the control unit less complex; save transistors for registers, caches, function units, and it runs fast as it is small and hardwired
★ This led to a new design philosophy
❖ Reduced Instruction Set Computers (RISC)
✦ Hardware centric, easier on the designer & harder on the compiler
✦ Examples: ARM, MIPS, PowerPC, PA-RISC, SUN SPARC, RISC-V, etc.
❖ Complex Instruction Set Computers (CISC)
✦ Software centric, easier on the programmer & harder on the designer
✦ Examples: Intel x86, AMD Ryzen, Motorola MC68000, etc.

Design Trends - RISC vs. CISC
★ RISC
❖ Small instruction set, typically < 100 (RISC-V has 47, with all addressing modes nearly 200)
❖ Small number of addressing modes, typically ≤ 5
❖ Uniform instruction length; take 1 word
❖ Simple control logic, hence fast and easy on the real estate (< 20%)
★ CISC
❖ Large instruction set, typically > 200 (Intel’s x86 > 1500)
❖ Large number of addressing modes, typically ≥ 8
❖ Variable length instructions, few to several (x86: 1 to 15 bytes & VAX 11/780: 1 to 57 bytes)
❖ Complex control logic, consumes more real estate and has to use microprogramming (> 50%)
★ The two design styles coexist; each has its advantages … and they borrow from each other
★ Some processors have RISC cores or engines and a hardware shell that translates CISC instructions to RISC sequences on the fly, and caches them for fast processing
★ However, RISC is more power efficient (higher ratio of Performance to Power Consumption)
★ Today, 20% to 30% of the die real estate is non-core; chip cache, graphics unit, memory controller, clock distribution network, connections fabric, communications links and others
★ Around half of the non-core real estate goes for level 3 cache which is shared (in GB now)

Specifications - RISC vs. CISC
★ RISC compilers generate larger binary codes; around 25% to 50% more (may even double).
★ A high level language source code may generate:
❖ 1500 to 1800 instructions for a CISC processor, and
❖ 2000 to 2400 instructions for a RISC processor
★ Not all RISCs or CISCs are equal …
★ Hence, binary files vary in instruction count
❖ Normally, within ±10% close
❖ Quite similar ones, within ±5% close
★ RISC compilers are hard to design and take longer to run
★ RISC / CISC code segments
❖ A & B are variables
❖ R1 & R2 are registers

CISC Code       RISC Code
AND R1, R2      AND R1, R2
MOVE A, B       LOAD A, R1
                STORE R1, B
ADD R1, A       LOAD A, R2
                ADD R1, R2
                STORE R2, A
ADD A, B        LOAD A, R1
                LOAD B, R2
                ADD R1, R2
                STORE R2, B

★ Example
❖ A source code compiled for a RISC generated 1987 instructions
❖ Assuming 30% fewer instructions when compiled for a CISC; 0.7 x 1987 = 1390.9 ≈ 1400
❖ Estimated to generate ±10% when compiled for another RISC; about 1800 to 2200
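The example's arithmetic can be checked with a short script. This is only a sketch of the slide's rules of thumb (30% fewer instructions on a CISC, ±10% between similar processors); the function names are hypothetical, and the unrounded results differ slightly from the rounded figures quoted in the example.

```python
from typing import Tuple


def cisc_estimate(risc_count: int, reduction: float = 0.30) -> float:
    """Estimated CISC instruction count, given the fraction saved relative to RISC."""
    return risc_count * (1 - reduction)


def similar_isa_range(count: int, spread: float = 0.10) -> Tuple[float, float]:
    """Lower and upper instruction counts for a similar processor (plus or minus spread)."""
    return count * (1 - spread), count * (1 + spread)


risc_count = 1987
low, high = similar_isa_range(risc_count)
print(f"CISC estimate: {cisc_estimate(risc_count):.1f} instructions")
print(f"Another RISC: {low:.0f} to {high:.0f} instructions")
```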

State of the Art - Specifications
★ CPUs, general purpose cores
❖ Wide range of instructions for any task
❖ Typically, 10s of cores depending on platform
✦ 2 to 4 cores, Controllers (Basic ones use 32-bit single core)
✦ 4 to 8 cores, Gaming (overclocked & power hungry)
✦ 8 to 12 cores, Laptops (slower & power efficient)
✦ 8 to 12 cores, Desktops (faster & more power consuming)
✦ 12 to 24 cores, Workstations (fast & power hungry)
✦ 24 to 64 cores, Servers (slower but many tasks at once)
✦ 48 to 192 cores, Data Centers
★ GPUs, graphic and compute cores
❖ Special purpose graphic cores & compute-oriented for highly parallel tasks
❖ 100s to 1000s of GPU cores on chip as co-processors add on
★ NPUs & TPUs, Neural Processing Units & Tensor Processing Units
❖ Specialized cores for machine learning and artificial intelligence
❖ Typically, 10s of cores on die

State of the Art - Components
★ Cache Memory:
❖ L1: 64 KB Code & 64 KB Data per core
❖ L2: 256 KB/512 KB, Unified Code/Data per core
❖ L3: 8 MB Laptops, 16 MB Desktops, 64 MB Workstations
❖ L3: 8x32 = 256 MB in AMD Epyc Rome & 1.1 GB in AMD Epyc Genoa for Servers
★ UnCore Components
❖ Memory Management Unit, Virtual Memory and Multi-Channel Interfaces
❖ Thermal Control, Maximize core utilization and performance within the power envelope
★ Clock: operating frequency varies with the number of cores and workload…
❖ 9.0 GHz under nitrogen cooling for a short period of time (overclocked for several minutes)
❖ 6.0 GHz low core count; up to 24 cores
❖ 3.2 GHz high core count, up to 64 cores (with some cores running at 4.2 GHz)
❖ 3.0 GHz very high core count, up to 192 cores
★ Today, processors have two types of Compute Cores
❖ Performance Cores (P-Core), high performance for foreground operations (more power)
❖ Efficiency Cores (E-Core), low power to run background operations
[Figure: 24-core Intel Core i9 costs $700; 96-core AMD Epyc Genoa costs $3,000]

Why 6 GHz Clock Frequency? Why not 50 GHz?
★ We have been stuck at less than 6 GHz for a long time now! With less than 3% increase every year
★ Why not clock faster for better performance as we used to?
❖ Higher frequency requires higher operating voltage causing heating; a big challenge
❖ Signals encounter resistance leading to degradation; causing integrity issues
❖ Higher frequency produces more electromagnetic interference; causing crosstalk
❖ Higher frequency requires faster switching transistors; size and material limitation
❖ High frequency requires more precise manufacturing processes and materials
★ 50 GHz clock? Assuming all above issues are sorted out
❖ Signal travels in the chip, silicon & wires, at 100,000 to 200,000 km/s speed
❖ This is equivalent to 100 mm/ns to 200 mm/ns, let us assume 100 mm/ns
❖ A clock cycle = 1 / 50 GHz = 0.02 ns, the signal travels only 2 mm, die side must be small enough for the signal to travel back and forth edge to edge
❖ Hence the die side must not exceed 1 mm
❖ By proportion, Apple SoC A17 Pro has a 150 to 160 mm² die and clocks at 3.78 GHz, the die side must be reduced to (3.78/50)x13 = 1 mm to run at 50 GHz
★ Such a die size, 1 mm², can pack only 250 to 300 MTr using a 3 nm process
★ Hardly enough for a single core, so many cores at 5 GHz is better
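The travel-distance argument above can be turned into a small calculation. The sketch below uses the slide's assumed signal speed of 100 mm/ns and its round-trip rule (die side at most half the distance covered in one cycle); the function names are hypothetical.

```python
def distance_per_cycle_mm(clock_ghz: float, signal_speed_mm_per_ns: float = 100.0) -> float:
    """Distance a signal covers in one clock cycle, in millimetres."""
    cycle_ns = 1.0 / clock_ghz  # a 1 GHz clock has a 1 ns cycle
    return signal_speed_mm_per_ns * cycle_ns


def max_die_side_mm(clock_ghz: float, signal_speed_mm_per_ns: float = 100.0) -> float:
    """Largest die side that allows an edge-to-edge round trip within one cycle."""
    return distance_per_cycle_mm(clock_ghz, signal_speed_mm_per_ns) / 2


for f in (3.78, 6.0, 50.0):
    print(f"{f} GHz: {distance_per_cycle_mm(f):.2f} mm per cycle, "
          f"die side up to {max_die_side_mm(f):.2f} mm")
```

At 3.78 GHz the same rule gives a limit of roughly 13 mm, in line with the die side used in the proportion above.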

How fast is 6 GHz ?
★ Man walks in steps of 60 cm at least; a stride or cycle is 120 cm or 0.0012 km
★ If he ticks at 6 GHz, his speed is 7,200,000 km/s (24 times faster than light in vacuum)
★ Earth roundtrips to …
❖ Sun: 2 x 150,000,000/7,200,000 = 41.7 s
❖ Moon: 2 x 384,400/7,200,000 = 0.107 s = 107 ms
★ Earth orbiting takes …
❖ 40,075/7,200,000 = 0.00557 s = 5.6 ms (180 rps)
❖ While it takes …
✦ Light, 135 ms (A blink of eye)
✦ 2,000 km/h supersonic jet, 20 hours
✦ 5 km/h man walk, 1 year (straight & non-stop)
★ What can a processor do in such a small cycle?
❖ Arithmetic operations on integers and floating point numbers
❖ Logic & Shift operations on strings and numbers
[Figure: L step and R step, 60 cm each; Stride, 166.67 ps; around the Earth, 5.6 ms; Earth to Moon, 53.4 ms each way]
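The walking analogy is plain distance-over-speed arithmetic, sketched below with the slide's distances. The speed of light constant and the helper name `travel_time_s` are additions for illustration; results may differ from the slide by a rounding step.

```python
STRIDE_KM = 0.0012          # one stride (two 60 cm steps), from the slide
CLOCK_HZ = 6e9              # 6 GHz, one stride per clock tick
LIGHT_KM_S = 299_792.458    # speed of light in vacuum

speed_km_s = STRIDE_KM * CLOCK_HZ


def travel_time_s(distance_km: float, speed: float = speed_km_s) -> float:
    """Time in seconds to cover a distance at the given speed in km/s."""
    return distance_km / speed


print(f"Walker speed: {speed_km_s:,.0f} km/s ({speed_km_s / LIGHT_KM_S:.0f} times light)")
print(f"Sun round trip: {travel_time_s(2 * 150_000_000):.1f} s")
print(f"Moon round trip: {travel_time_s(2 * 384_400) * 1000:.0f} ms")
print(f"Around the equator: {travel_time_s(40_075) * 1000:.1f} ms")
print(f"Light around the equator: {travel_time_s(40_075, LIGHT_KM_S) * 1000:.0f} ms")
```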

State of the Art - Power
★ Power consumption is a significant issue; it varies from class to class & vendor to vendor
❖ Legacy single core processors draw up to 0.3 A @ 5 V or 1.5 W max
❖ Modern single core processors draw up to 0.4 A @ 1.5 V or 0.6 W max
❖ Low core count processors draw up to 100 A @ 1.0 V or 100 W max
❖ High core count processors draw up to 600 A @ 0.6 V or 360 W max
★ Apple M3 Max has 92 BTr on 420 mm², consumes 20 to 90 W (1 nW/Transistor max)

Class           Power (W)
Embedded        < 1
Smart Phone     1 to 5
Tablet          5 to 10
Notebook        10 to 20
Laptop          20 to 30
Desktop         40 to 120
Gaming          60 to 120
Workstation     100 to 200
Server          100 to 400

State of the Art - Packaging
★ Consider a server class processor like the 64-core AMD Epyc Rome with 40 BTr chiplet:
❖ 8 dies, 8 cores and 32 MB L3 Cache per die, and
❖ 8 memory channels system controller with 128 PCIe Lanes, I2C, SATA, USB, RTC, etc.
❖ 0.9 V core voltage and 200 W maximum power consumption at 3.1 GHz
★ Memory: using 288-pin SDRAM DDR4 modules, there have to be 8 x 288 = 2304 pins
★ Graphics, Chipset & I/O: around 1000 pins
★ Power: 200 / 0.8 = 250 A; with 0.5 A bonding wires, it needs 250 / 0.5 = 500 pairs = 1,000 pins
★ Number of pins/contacts exceeds 4000. It in fact has 4094 contacts and uses the SP3 socket
❖ Dies fabricated using a 7 nm process (8 dies on the left and right sides)
❖ System Control fabricated using 14 nm (rectangle in the middle)
★ Latest: Intel Xeon BGA5903 & AMD Genoa SP5 LGA6096
[Figure: Chip Top; Chiplet; Chip Bottom; 5.8 cm x 7.5 cm]
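The contact budget above can be tallied in code. This is a rough sketch using only the slide's estimates (8 memory channels of 288 pins, about 1000 I/O pins, 200 W at 0.8 V with 0.5 A per contact); the helper names are hypothetical. The total is an estimate that overshoots the actual 4094 contacts quoted above.

```python
import math


def memory_pins(channels: int, pins_per_module: int) -> int:
    """Contacts needed when every memory channel gets a full module interface."""
    return channels * pins_per_module


def power_pins(power_w: float, voltage_v: float, amps_per_pin: float) -> int:
    """Supply contacts, counting one VCC and one GND contact per current path."""
    current_a = power_w / voltage_v
    # round() guards against floating-point noise before taking the ceiling
    pairs = math.ceil(round(current_a / amps_per_pin, 6))
    return 2 * pairs


mem = memory_pins(8, 288)          # 8 channels of 288-pin DDR4
io = 1000                          # slide's estimate for graphics, chipset & I/O
pwr = power_pins(200, 0.8, 0.5)    # 200 W at 0.8 V, 0.5 A per contact
print(f"Memory {mem} + I/O {io} + Power {pwr} = {mem + io + pwr} contacts")
```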

Intel’s processors: 1974 versus 2018 (44 years)
★ Specifications
❖ 8080: 6 Thousand Transistors, 20 mm² die using 6 µm node, 2 MHz, 10 CPI & 1 W
❖ i9: 7 Billion Transistors, 200 mm² die using 14 nm node, 5 GHz, 0.01 CPI & 50 W
★ Transistor Count
❖ 8080 chip has 6,000 / 20 = 300 Tr per square mm
❖ i9 chip has 7,000,000,000 / 200 = 35,000,000 Tr per square mm
❖ i9 transistor density is 35,000,000 / 300 = 116,667 times higher
★ CPI Performance
❖ i9 has 10 / 0.01 = 1,000 times better cycle utilization, although its cycle is 2,500 times shorter
★ Instructions per Second (IPS) Performance
❖ 8080 is 2 MCPS / 10 CPI = 0.2 MIPS = 200 KIPS
❖ i9 is 5,000 MCPS / 0.01 CPI = 500,000 MIPS. Hence i9 is 2,500,000 times better
★ Power Efficiency: Million Instructions Per Joule (MIPJ)
❖ 8080 reaches 0.2 MIPS / 1 JPS = 0.2 MIPJ = 200 KIPJ
❖ i9 reaches 500,000 MIPS / 50 JPS = 10,000 MIPJ. Hence 50,000 times more efficient
★ Gordon Moore predicted that the number of transistors on a chip doubles every year (later figures: 1.5 & 2 years)
★ This prediction went up and down since 1965, 2^(44/2) = 4 Million times (sort of performance)
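The comparison above reduces to three ratios, which the following sketch recomputes from the slide's specifications. The `Chip` class and its property names are hypothetical, introduced only to keep the arithmetic in one place.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    transistors: float
    die_mm2: float
    clock_mhz: float
    cpi: float
    power_w: float

    @property
    def density(self) -> float:
        """Transistors per square millimetre."""
        return self.transistors / self.die_mm2

    @property
    def mips(self) -> float:
        """Million instructions per second = clock in MHz / CPI."""
        return self.clock_mhz / self.cpi

    @property
    def mipj(self) -> float:
        """Million instructions per joule = MIPS / watts."""
        return self.mips / self.power_w


i8080 = Chip("8080", 6e3, 20, 2, 10, 1)
i9 = Chip("i9", 7e9, 200, 5000, 0.01, 50)

print(f"Density ratio: {i9.density / i8080.density:,.0f}")
print(f"MIPS ratio: {i9.mips / i8080.mips:,.0f}")
print(f"MIPJ ratio: {i9.mipj / i8080.mipj:,.0f}")
print(f"Doublings every 2 years over 44 years: {2 ** (44 // 2):,}")
```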

Performance Bottleneck - Memory Wall
★ Clock speed is an indication of how fast processors work, but what is done in this cycle?!
★ MIPS as a metric tells how fast instructions are executed, but what do they do? Too much, too little? Higher MIPS CPUs may produce less work
★ MFLOPS is a better metric, but only one aspect is addressed, floating point arithmetic capability. Apple’s A17 SoC is rated 2.5 TFLOPS
★ The ultimate performance measure is time; how long it takes a processor to complete a task
★ Benchmarks are collections of programs with a range of activities to index computers
★ Benchmarks may focus on integer, floating point, graphics, etc. for specific users
★ A major problem of today’s systems is called the Memory Wall
❖ Annual performance growth of processor is > 30%, while
❖ Annual performance growth of memory is < 10%
★ So, for a workload with 70% of the time processing & 20% of the time memory accesses, we get:
❖ Processing time is reduced from 0.7xT to 0.7xT/1.3 = 0.54xT
❖ Memory access time is reduced from 0.2xT to 0.2xT/1.1 = 0.18xT, and
❖ The rest, 0.1xT, is unchanged; typically input/output related
★ System speed up is S = T / (0.54xT + 0.18xT + 0.1xT) = 1.22, or 22% performance growth
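The memory wall example generalizes to any split of the workload. A minimal sketch, assuming each time fraction simply shrinks by its own growth factor; `system_speedup` is a hypothetical helper, and the inputs are the slide's 70% / 20% / 10% split.

```python
def system_speedup(fractions, growths):
    """Overall speedup when each time fraction is divided by its own growth factor."""
    new_time = sum(f / g for f, g in zip(fractions, growths))
    return sum(fractions) / new_time


# Slide's workload: 70% processing, 20% memory access, 10% other (unchanged)
fractions = [0.7, 0.2, 0.1]
growths = [1.3, 1.1, 1.0]   # 30% faster processor, 10% faster memory
print(f"Speedup: {system_speedup(fractions, growths):.2f}")
```

Shifting more of the time into the memory fraction lowers the result, which is the point of the memory wall argument.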

Performance Enhancement - Amdahl’s Law
★ If a task has completion time T and a fraction α of it gets enhanced by a factor β then
❖ Current completion time = T = (1-α)T + αT
❖ New completion time is Tx = (1-α)T + (α/β)T
❖ Speedup is S = T / Tx = 1 / ((1-α) + α/β)
★ What is the impact of doubling the core count? doubling the clock rate? and doubling both?
★ Assume a workload with P percentage of parallel activities & C percentage of computing activities
❖ Core Doubling, programs have some degree of parallelism (core scalability)
✦ Highly Parallel: α = 0.9 yields S = 1 / (0.1 + 0.9 / 2) = 1.8 (80% extra performance)
✦ Highly Serial: α = 0.2 yields S = 1 / (0.8 + 0.2 / 2) = 1.1 (10% extra performance)
❖ Clock Doubling, programs have some degree of number crunching (frequency scalability)
✦ Compute-Bound, α = 0.8 yields S = 1 / (0.2 + 0.8 / 2) = 1.7 (70% extra performance)
✦ Memory-Bound, α = 0.5 yields S = 1 / (0.5 + 0.5 / 2) = 1.3 (30% extra performance)
★ Compute the speedup of a task using 4 times the core count & 50% higher clocking frequency
❖ Assume the task spends 70% of its time processing with 80% degree of parallelism
❖ 4 times the core count yields speedup of SN = 1 / (0.2 + 0.8 / 4) = 2.5
❖ 50% faster clocking yields speedup of SF = 1 / (0.3 + 0.7 / 1.5) = 1.3
❖ Total speedup is S = SN x SF = 2.5 x 1.3 = 3.25; while expected to be 4 x 1.5 = 6
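Amdahl's formula is short enough to code directly. The sketch below implements S = 1 / ((1-α) + α/β) as given above and repeats the worked example; `amdahl_speedup` is a hypothetical name. Because it keeps full precision, the combined figure comes out slightly above the slide's 3.25, which uses SF rounded to 1.3.

```python
def amdahl_speedup(alpha: float, beta: float) -> float:
    """Speedup when a fraction alpha of the task is accelerated by a factor beta."""
    return 1.0 / ((1.0 - alpha) + alpha / beta)


# Worked example from the slide: 4x cores on the 80% parallel part,
# 1.5x clock on the 70% compute-bound part.
s_cores = amdahl_speedup(0.8, 4)
s_clock = amdahl_speedup(0.7, 1.5)
print(f"Cores: {s_cores:.2f}, clock: {s_clock:.2f}, combined: {s_cores * s_clock:.2f}")
```

Multiplying the two speedups follows the slide's approach and treats the two enhancements as independent.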

Power Consumption
★ Power consumption varies significantly by manufacturer, platform, target customer, etc.
★ Full load power consumption may reach 10 times the idle state; 1 nW/Transistor max
★ Power efficiency is a hot issue; how much throughput per unit of energy (in Data Centers)
★ Single Board Computers (SBC) are small size lightweight subsystems with decent computing performance, memory and general input/output terminals (some are expandable)
❖ Power consumption of 5 to 20 W, native operating system, a version of Linux or Windows
❖ Used as low level computers (desktop) or high level controllers (plant or process control)
★ IoT, IoMT, Embedded systems may consume < 1 W
★ Handheld Devices, < 5 W (Small Battery) & Portable devices, 10 to 20 W (Large Battery)
★ Desktops, 200 to 400 W
★ Workstations & Gaming, 400 to 600 W
★ Servers, 1000 to 2000 W, depends on capacity and attachments
★ SuperComputers & Data Centers are exceptional; 100s kW to 100s MW (Millions of cores)
★ Humans, 100 W (50 W sleeping to 1000 W exercising), depends on size, age, style, etc.
❖ 10 to 20 W goes for brain activities
❖ 60 to 75 W goes for metabolism; breathing, heart activities, blood circulation, etc.

Power - Personal Systems
★ Below is a power consumption breakdown, assuming commercial to professional desktops
★ For battery operated systems, the time for a battery to be depleted depends on the power consumption of the various parts; mainly processor, memory, solid state storage
★ An M3 Max MacBook Pro has a 72.4 WH / 11.46 V Battery that lasts for about 10 hours
❖ 72.4 WH / 10 H = 7.24 W; the share of M3 Max is around 4 W
❖ How is that possible if M3 Max consumes 20 to 90 W? The OS turns parts on & off as needed

Component     Power (W)   Depends on: Vendor, Model, Size, Speed, Workload, etc.
CPU           40 – 100    Cores, Frequency, etc., 4 to 16 CPU cores & 16 to 32 GPU cores
GPU           80 – 300    Cores, Frequency, etc., 1,024 to 16,384 cores in discrete adapters
Chipset       20 – 40     Complexity, Number of chips, etc.
SDRAM         5 – 10      Number, Type, Frequency, etc.
HDD           5 – 10      Number, Size, Speed, Spinning/Idle, etc.
SSD           2 – 4       Number, Size, Technology, etc.
Fan(s)        5 – 15      Number, Size, Speed, etc.
Attachments   ?           Number, Type, etc., USB max 5 W & USB-C max 240 W
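The battery arithmetic for the M3 Max MacBook Pro above can be checked with a minimal Python sketch. The helper names are hypothetical; the only inputs are the 72.4 WH capacity, the 11.46 V rating and the 10 hour run time quoted in the slide.

```python
def average_power_w(capacity_wh: float, runtime_h: float) -> float:
    """Average system power that drains the battery in runtime_h hours."""
    return capacity_wh / runtime_h


def runtime_h(capacity_wh: float, power_w: float) -> float:
    """Hours of operation at a constant average power draw."""
    return capacity_wh / power_w


def capacity_ah(capacity_wh: float, voltage_v: float) -> float:
    """Battery charge capacity in ampere-hours."""
    return capacity_wh / voltage_v


avg = average_power_w(72.4, 10)   # average draw over a 10 hour run
charge = capacity_ah(72.4, 11.46)  # same battery expressed in Ah
print(f"average power = {avg:.2f} W, charge = {charge:.2f} Ah")
```

At a sustained full load the same battery would empty much sooner; for example `runtime_h(72.4, 80)` is under one hour, which is why the average rather than the peak figure determines battery life.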

Power - Corporate Systems
★ Frontier SuperComputer is #1 in 2023: 1.1 EFLOPS / 23 MW, cost $600 Million
★ Has 606,208 CPU cores, 8,335,360 GPU cores & 700 PB Memory (8,000 pounds on 680 m2)
★ Data Centers use 3% of the world’s power; 50 to 150 MW each & some exceed 600 MW
★ Google’s 35 data centers consume 16 TWH per year (100 countries consume less than this)
★ Google’s largest, in Finland, 681 MW of renewable energy (0.3 of Jordan’s consumption)
★ A rack holds 14, 21 or 42 units (3U/2U/1U), housing: Servers, Storage, Switches & UPS
★ A rack can house around 6,000 cores, consuming 10 to 30 KW
★ A data center may have 10 to 10,000 racks
★ Around 40% of the energy goes for cooling
★ Consider a bank data center of 10 racks …
❖ 20 KW / rack computing
❖ 10 KW / rack cooling
❖ 285 Fils / KWH power grid tariff for banks
✦ 10 x (20 + 10) = 300 KW total power
✦ 300 x 24 x 30 = 216,000 KWH/month
✦ 216,000 x 0.285 = 61,560 JD/month
[Figure: rack layout; Switch, Servers, Storage & UPS]
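The bank data center estimate above follows a simple chain: total power, monthly energy, monthly bill. A small Python sketch, with a hypothetical function name and the slide's 30-day month and 0.285 JD/KWH tariff (285 Fils) as assumptions, makes the steps explicit.

```python
def monthly_energy_cost(racks: int, compute_kw: float, cooling_kw: float,
                        tariff_jd_per_kwh: float, days: int = 30):
    """Return (total power in KW, energy in KWH/month, cost in JD/month)."""
    total_kw = racks * (compute_kw + cooling_kw)   # every rack computes and is cooled
    energy_kwh = total_kw * 24 * days              # running around the clock
    cost_jd = energy_kwh * tariff_jd_per_kwh
    return total_kw, energy_kwh, cost_jd


power, energy, cost = monthly_energy_cost(10, 20, 10, 0.285)
print(f"{power} KW, {energy} KWH/month, {cost:.0f} JD/month")
```

Changing `racks` or the per-rack figures shows how quickly the bill scales; the cooling share here is 10 / 30, or one third of the total.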

System Structure
Dr. Taisir Eldos
“Science can amuse and fascinate us all, but it is engineering that changes the world.” - Isaac Asimov
Microprocessor System
★ Microprocessor systems vary greatly in complexity, performance, size, cost, etc.
★ But almost all have the same major components:
❖ CPU, to carry out tasks by running programs, along with Clock, Reset & Real Time Clock
❖ MEM, to store programs and data (input data and results)
❖ I/O, to get information into and out of the system
❖ Mechanical structure to hold components, connections, glue logic, etc. (Motherboard)
[Figure: system block diagram; CPU with RST, CLK & RTC support, I/P and O/P through DEC, ENC & BUF, and ROM & RAM]

Basic Microprocessor System - Description
★ Kernel (Core)
❖ Central Processing Unit (CPU), the heart of any computer system
❖ CPU support: reset signal generator, clock signal generator, real-time clock module
★ Storage (Hierarchy; cache memory, main memory and mass storage)
❖ Solid state memory comes in different flavors (mostly random access and volatile)
✦ Read Only Memory; ROM, PROM, EPROM, EEPROM
✦ Read Write Memory; SRAM, DRAM, SDRAM
❖ Mass storage comes in different flavors (Non-volatile)
✦ Electro-Optical, Magnetic & Mechanical Devices; Hard Disk Drive, Compact Disk, Tape Drive, etc.
✦ Electronic; Solid State Drive (SSD), Secure Digital Cards (SD), Multi-Media Cards (MMC)
★ Peripheral Interface, to get data in and out …
❖ Input: Keyboard, Microphone, Camera, etc. & Pointing devices like Mouse, Touchpad, Joystick, etc.
❖ Output: Monitor, Printer, Speaker, etc.
★ Glue Logic, to connect all parts via buses (sets of wires to transfer Data, Address & Control info)
❖ Buffers, resolve the fan-out problem hence allow driving more and more loads in a complex system
❖ Decoders, partition the address space (select one memory or I/O chip for action)
❖ Encoders, resolve conflicts like requesting a service by many devices

Types of Computers
★ Based on the way we interact with them, they are either special purpose or general purpose
❖ General purpose computers are designed to support a wide range of applications
❖ Special purpose computers are designed and optimized to carry out specific tasks
★ Both types use Processors, Memory & Storage with varying capabilities, and different kinds of ports to support peripherals. However, they differ in the target application …
★ Special Purpose Computers
❖ Specific applications; Controllers & Embedded Systems
❖ Examples
✦ Simple: Oven, Fridge, Washer, Dryer, Traffic Controller, Automatic Teller Machine, etc.
✦ Complex: Medical Equipment, Airplane Autopilot, Autonomous Driving, Missile Guidance System, Industrial Plants
★ General Purpose Computers
❖ Variety of applications; Scientific, Accounting, Editing, Financial, Games, etc.
❖ Examples
✦ Simple: Portable Computer, Personal Computer, Workstation
✦ Complex: Network, Server, Data Center, Supercomputer

General vs. Special
★ CPU (Raspberry Pi: Quad-Core ARM, 1 MB L2 Cache, 2.5 GHz)
❖ General: 64-bit data, 40 to 50-bit address, many cores, and multiple memory channels
❖ Special: 8/16/32/64-bit data, 16 to 36-bit address, few cores, single memory channel
★ RAM (Raspberry Pi: 2/4/8 GB)
❖ General: ≥ 4 GB to support complex operating system, multi-tasking & large data
❖ Special: ≤ 4 GB to support real time operating system or control program
★ Storage
❖ TeraBytes HDD/SSD
❖ GigaBytes SSD/eMMC/SD/microSD
★ Communication
❖ Display & Camera
❖ Wi-Fi & Bluetooth
❖ Ethernet & USB
❖ PCIe for functionality
★ Special, GPIO for sensors and actuators
★ Home Servers & Plant Controllers, etc.
[Figure: Raspberry Pi board; labels: CPU, RAM, Wi-Fi, BT, Display, Camera, General Programmable Input Output, USB-C Power, HDMI 1, HDMI 2, Ethernet, USB-2 & USB-3]

Most Common Platform - Desktops
★ Microprocessor systems have parts that vary in count, size, performance, power, etc.
★ Workstation MoBo has many sockets for multiple CPUs, GPUs & SDRAM slots
★ Desktop MoBo has many ports for connectivity & expansion slots for functionality
★ Notebook MoBo has limited ports & expansion slots (Zero?)
★ All require power supplies that meet their needs
★ VRM is a DC-DC power supply for the CPU, GPU, MEM
[Figure: desktop motherboard (MoBo); labels: CPU, SDRAM, Voltage Regulator Module (VRM), Ports, Power Supply, Expansion Slots & Add-in Card]

General Microprocessor System - Minimal & Typical
[Figure: minimal and typical system schematic; CPU, ROM, RAM, PIO, SIO & PCT share the data bus (DB) and address bus (AB), with control signals RD*, WE*, OE*, PGM*, RS, CS* & INT*; a memory decoder (DEC) enabled by MREQ* decodes A15 & A14 into outputs 0* to 3*; an I/O decoder enabled by IOREQ* decodes A7 & A6 into outputs 0* to 3* (including SIOCS* & PCTCS*); support signals CLK, RST*, RTC, INT* & NMI*; PIO, SIO & PCT connect to I/O devices]

Functional Description
★ CPU, Central Processing Unit; the core element in any microprocessor system
★ ROM, Read Only Memory; non-volatile memory to hold code and fixed data
★ RAM, Random Access Memory; read/write memory to hold computing results
★ PIO, Parallel Input Output; subsystem providing data exchange for near devices
★ SIO, Serial Input Output; subsystem providing data exchange for far devices
★ PCT, Programmable Counter Timer; subsystem providing timing signals
★ Support Logic
❖ CLK, Clock; square wave signal necessary for the CPU to function
❖ RST, Reset; power-on pulse to restart the CPU
❖ RTC, Real Time Clock; battery operated calendar subsystem
★ Glue Logic
❖ DEC, Decoding elements; selects one of many devices for a transaction
✦ DEC, for memory enabled by memory access operations
✦ DEC, for input/output enabled by input/output access operations
❖ IPE, Interrupt Priority encoder; selects the highest priority device to serve
★ I/O DEV, Input Output Devices via which the computer interacts with the world
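How a decoder "selects one of many devices" can be illustrated with a short Python model of an active-low 2-to-4 decoder. This is only a sketch under stated assumptions: a 16-bit address bus, the memory decoder enabled by MREQ* with A15 and A14 as its select inputs (as labelled in the system schematic), and hypothetical function names. Which chip hangs on which output is not specified here.

```python
def decode_2to4(enable_n: int, b: int, a: int) -> list:
    """Active-low 2-to-4 decoder; returns [Y0*, Y1*, Y2*, Y3*]."""
    outputs = [1, 1, 1, 1]            # all outputs inactive (high)
    if enable_n == 0:                 # the decoder works only while E* is low
        outputs[(b << 1) | a] = 0     # exactly one output goes low
    return outputs


def memory_select(address: int, mreq_n: int) -> list:
    """Split a 64 KB space into four 16 KB blocks using A15 & A14."""
    a15 = (address >> 15) & 1
    a14 = (address >> 14) & 1
    return decode_2to4(mreq_n, a15, a14)


print(memory_select(0x0000, 0))   # first 16 KB block  -> Y0* low
print(memory_select(0x4000, 0))   # second 16 KB block -> Y1* low
print(memory_select(0x4000, 1))   # no memory request  -> nothing selected
```

Each output would drive the CS* input of one chip, so at most one device responds to any memory access.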

Special Microprocessor System - Traffic Lights Controller
★ A typical road intersection requires Red, Orange & Green lights to indicate the right to pass
★ Each path; East, West, South & North has:
❖ 8 outputs to switch the lights via four 8-bit latches LAX (LAXe, LAXw, LAXs & LAXn)
❖ 2 inputs from sensors to detect Emergency (E) & Congestion (C) read via a buffer BUF
[Figure: traffic lights controller schematic, NORTH path shown; labels: CPU, ROM, RAM, BUF, DEC & LAX on the data bus (DB) and address bus (AB), control signals RD*, WE*, OE*, PGM*, CS*, MRQ* & CK, sensor inputs E & C, plus CLK & RST*]

Traffic Light Controller - Functional Description
★ CPU, Central Processing Unit; the core element in any microprocessor system
★ ROM, Read Only Memory; non-volatile memory to hold code and fixed data
★ RAM, Random Access Memory; read/write memory, can be used for E & P times logging
★ LAX, Latch that holds the output state for path lights (4 latches, 1 per path)
★ BUF, Buffer via which we read the sensors (1 buffer, 2 bits per path)
★ Support & Glue Logic
❖ CLK, Clock; square wave signal necessary for the CPU to function
❖ RST, Reset; power-on pulse to restart the CPU
❖ DEC, selects ROM, RAM, BUF, or one LAX of the 4 (Output Latches)
★ BUF inputs come from Emergency sensors (E), for path priority as detected by sonar or Bluetooth receivers (Ambulances or Firetrucks), and Pressure sensors (P), to extend the path time when the car queue length exceeds a threshold
★ Outputs of each latch drive high-power transistors to operate the Red & Green lights of the four groups: L-turn, S-lanes, R-turn & Walking lights (U-turn is assumed to go with the L-turn)
★ During the time between deactivating and activating the Red & Green, the Yellow will be turned on automatically, as neither is active in this time, to save a dedicated output
★ A clock of 1 MHz is more than enough to operate such a simple design

Micro Controller Unit (MCU)
★ MCU integrates CPU, ROM, RAM, I/O, support and glue logic, communication ports, counters, timers, etc., to handle general control tasks
❖ 8-bit, 16-bit and 32-bit processors
❖ 1 KB to 1 MB ROM (PROM or Flash)
❖ 1 KB to 1 MB SRAM
❖ 4, 8, … , 48 pins of General I/O, with one or more alternative functions
★ 4 to 64-pin chips, with 4 or even more functions assigned to each pin
★ Some chips can even change the pin number of a function (layout)
❖ Low end: 1 to 10 MHz clock, 20 to 30 pins, < 1 W
❖ High end: 20 to 100 MHz clock, 100 to 300 pins, 1 to 5 W
★ Cost ranges from a few cents to a few dollars
★ Designing with MCU vs. CPU has many advantages:
❖ Shorter time to market (mostly ready and tested)
❖ Smaller size (most of the components integrated)
❖ Lower power
❖ Cheaper
[Figure: Arduino Controller, Wi-Fi module & Quad Relay Module]

System On Chip (SoC)
★ SoC integrates more components to support specific functions like audio processing, video
processing, communications, bus interfaces, etc.

❖ Global Positioning System (GPS)
❖ Global System for Mobile communications (GSM)
❖ Near Field Communication (NFC)
❖ Radio Frequency Identification (RFID)
❖ LiDAR, Barometer, Accelerometer, Gyroscope, Compass, etc.
❖ Biometrics: Heart Rate & ECG, Blood Oxygen, Pressure, Glucose, etc.
❖ Security: Face Recognition, Fingerprint Recognition, etc.
★ Examples of SoC specific functions:
❖ Smartphones & Tablets
✦ An SoC for a smartphone may include graphics, audio, video processing parts
❖ Televisions
✦ Sophisticated video functions, like scaling, upscaling, color processing, etc.
❖ Networks
✦ Switches & Routers use SoC to handle packet processing and routing fast
★ ARM based chips production alone exceeds 7 Billion chips per year
[Figure: ESP32 SoC]

Form Factors - General
★ Motherboards vary in size to accommodate enough components for the target platform
★ Even personal systems motherboards come in many form factors; desktop, tower, etc.
❖ From a single SoC to multi-socket CPUs
❖ From a single channel SDRAM to multi-channel SDRAM
❖ More ports, even of the same function, like:
✦ Video Graphics Array (VGA), Legacy
✦ Separate Video (S-Video), Legacy
✦ Digital Visual Interface (DVI), Legacy
✦ High Definition Multimedia Interface (HDMI)
✦ miniHDMI
✦ microHDMI
✦ DisplayPort
✦ miniDisplayPort
✦ USB-C
[Figure: Workstation MoBo, Laptop MoBo, PC mini MoBo & Desktop MoBo]

Sample Processors
★ Specifications vary significantly based on vendor and target platforms; smartphone, tablet,
laptop, desktop, gaming machine, workstation, server, etc.
★ Below, typical specs for low-end, medium and high-end platforms

               A17 Pro              M3 Max                   AmpereOne
Vendor         Apple                Apple                    Ampere Computing
Target         Handheld devices     Personal Computers       Servers & Data Centers
Process        3 nm                 3 nm                     5 nm
Power          2 to 8 W             8 to 40 W                150 to 350 W
Transistors    20 Billion           (not listed)             (not listed)
CPU            6 Cores, 3.8 GHz     16 Cores, 3.6 GHz        192 Cores, 2.8 GHz
GPU            6 Cores, 1.5 GHz     40 Cores, 1.4 GHz        No GPU Cores
NPU            16 cores             32 cores                 (not listed)
L1 per core    256 KB               128 KB                   16 KB Code + 64 KB Data
L2             16 MB                8 MB per core cluster    2 MB per core
L3             (not listed)         64 MB                    64 MB
DRAM           8 GB LPDDR5          128 GB LPDDR5            8 TB DDR5
System         $1000 to $1500       $1500 to $3000           $4000 to $8000

Example: Pro Laptop
★ Apple MacBook Pro models (2023) are based on in-house designed M3 SoC chips
★ Flavors: M3, M3 Pro & M3 Max (Maybe Ultra soon)
★ All are 3 nm process with M3 Max having 92 BTr on a 420 mm² die
★ M3 Max system consumes 80 W full load & 3.5 W system nominal, 70 WH Battery lasts 20 Hours
❖ CPU, 12 Performance Cores; 4.05 GHz, 192 KB Code + 128 KB Data L1 & 32 MB L2 shared
❖ CPU, 4 Efficiency Cores; 2.75 GHz, 128 KB Code + 64 KB Data L1 & 4 MB L2 shared
❖ GPU, 40 Cores; 1.6 GHz, 6400 Compute units delivering 4.26 TFLOPS (FP32)
By
❖ NPU, 16 Cores; 1.125 GHz yielding 18 TOPS for AI & ML
❖ Unified CPU/GPU 128 GB LPDDR5-6400 SDRAM
✦ 32 Channels x 16-bit each
✦ 6.4 x 2 = 12.8 GBps per channel
✦ 12.8 x 32 = 409.6 ≈ 400 GBps unified throughput
★ Integrates other functions

❖ Audio & Video processing
❖ Security & High speed data transfer
❖ Supports up to 8 TB SSD (7.4 GBps)
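The unified memory throughput quoted above comes from transfer rate times channel width times channel count. The Python sketch below redoes that arithmetic; the function name is hypothetical and the only inputs are the 6400 MT/s rate, the 16-bit channel width and the 32 channels stated in the slide.

```python
def channel_bandwidth_gbps(transfer_rate_mtps: float, width_bits: int) -> float:
    """Peak bandwidth of one memory channel in GB per second."""
    bytes_per_transfer = width_bits / 8
    return transfer_rate_mtps * 1e6 * bytes_per_transfer / 1e9


per_channel = channel_bandwidth_gbps(6400, 16)   # LPDDR5-6400, 16-bit channel
total = per_channel * 32                          # 32 channels in parallel
print(f"{per_channel:.1f} GBps per channel, {total:.1f} GBps total")
```

The exact product is 409.6 GBps, which is usually quoted as 400 GBps.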

Form Factors - Compact
★ Single Board Computer (SBC) is a small board with an engine capable of running anything from proprietary real time operating systems or light operating systems to full-fledged operating systems like Windows and Linux
★ Integrates processor, memory, mass storage, peripherals, and good connectivity
★ They can be used as computers, controllers, IoT devices, etc.
★ Low power, no fan, system on a tiny board; 15 to 60 cm²
★ Below are a few examples of such systems; they can be used for processing just like general purpose computers but at varying scale
★ System on Module (SoM), integrates components for specific tasks on a small board
★ From a few dollars to a few hundred dollars; Multi-Core
★ Low-end MCUs can get down to a few cents per piece
[Figure: Raspberry Pi Zero 2 W, Raspberry Pi 4 & LattePanda]

Desktop & Control
★ Systems are packaged based on needs; to fit smartphones, tablets, notebooks, desktops, workstations, servers, stand alone controllers, embedded systems, etc.
★ An example of computer / controller; with many ports for Keyboard, Mouse, Monitor, etc.
★ Raspberry Pi 5 is built around an SoC with 4-core 64-bit ARM running at 2.4 GHz
★ 2 GB, 4 GB & 8 GB LPDDR4 (for a few to several tens of dollars)
❖ Gbps Ethernet with PoE support
❖ Dual USB 2, dual USB 3, USB-C (Power)
❖ Dual micro HDMI for 4K displays & Stereo Audio port
❖ Dual Camera support
❖ Dual Band Wi-Fi & Bluetooth
❖ MicroSD card

❖ 40-pin GPIO & PCIe x1 Gen. 2
❖ RTC & 15 W Power Jack
★ Pi OS, Android & Linux support

❖ High end controller, or
❖ Low end computer (heat sink? fan?)

Artificial Intelligence
★ AI applications require supercomputing performance in small size and low power
★ nVIDIA offers SoM based small form factor packages with a range of connectivity; Wi-Fi, Bluetooth, Gb Ethernet, CAN, USB 3, 4K HDMI, 16 GB eMMC, I2C, I2S, SPI, 5 MP Camera interface, A/V encode/decode, tons of GPIO and options
❖ TX2 (15 W, $500, 2 TFLOPS), a multi-core CPU and 256-core GPU, 8 GB LPDDR4, 32 GB eMMC
❖ Jetson Nano (10 W, $99, 478 GFLOPS), a slimmed-down version with 128-core GPU, 4 GB LPDDR4 & 16 GB eMMC
[Figure: nVIDIA Jetson Nano SoM, nVIDIA Jetson Nano SBC & nVIDIA Jetson TX2 SBC]

Edge Computing
★ An AI processing daemon, using nVIDIA Jetson Xavier NX SoMs (System on Module)
❖ CPU: 6 ARM v8.2 Cores, 6 MB L2 / 4 MB L3 Cache (1.9 GHz with 2 cores active & 1.4 GHz with 4/6)

❖ GPU: 384 CUDA Cores & 48 Tensor Cores
❖ MEM: 8 GB LPDDR4 128-bit / 51.2 GBps, 16 GB eMMC & microSD and NVMe SSD
★ 84 Tera Operations Per Second (TOPS); good for Edge Computing
★ Size & Cost:
❖ 4 x $400 + $200 = $1800
❖ 70 W in 1000 cc package
[Figure: Jetson Xavier NX Board (70 mm x 45 mm), Quad Jetson Xavier NX Carrier (110 mm x 110 mm) & Jetson Mate Cluster Box (120 mm x 120 mm x 8 mm)]

Back End Computing
★ Large Language Models (LLMs) require high performance platforms, as they are compute intensive & data intensive models
★ AMD Instinct integrates a large number of cores and memory to match this demand as back end computers in data centers; each box is …
❖ 300 compute cores & 20,000 processor cores
❖ 192 GB & 5.3 TB/s memory
❖ 750 W & 5.2 PFLOPS

Purpose Built Systems
★ Increasing demand on performance led chip makers to look for ways to make them at
affordable price and in reasonable time; small dies with good yield & short time to market
★ Purpose Built is a way to assemble chips to meet special needs, like Data-Centric applications
❖ Qualcomm Centriq, designed for performance & optimized for power to handle Data Center workloads
❖ NVIDIA Drive Adam System, Quad Orin, 4 x 254 > 1000 TOPS for autonomous driving; a data center on wheels, to process large amounts of data from many sensors & cameras
[Figure: NVIDIA Orin SoC; 254 TOPS, 17 BTr.]

Computing Platforms - Personal
★ Workstations, Notebooks, Tablets & SmartPhones
★ Wearables; Watches, Headsets & Glasses
★ Networks, Personal Area Network (PAN), Internet of Things (IoT) & Cloud services
[Figure: personal computing platforms connected through a ROUTER]

Computing Platforms - Corporate
★ Workstations, Thin Clients, Terminals, Kiosks and other equipment
★ Supercomputers, Minicomputers and/or Servers
★ Clouds, Infrastructures, Platforms & Services; Public, Private & Hybrid Cloud forms
[Figure: On-premise Servers & Cloud Servers linked through ROUTERS & SWITCHES]

Computing Platforms - SuperComputers
★ Supercomputers are quite powerful machines dedicated to heavy computations like scientific research, simulation, design of complex systems
★ Data Centers serve a huge number of users & applications over the Internet; mostly CPUs
★ Supercomputers serve a limited number of users & applications; mostly GPUs

Computing Platforms - Factories
★ Programmable Logic Controller (PLC) is a modular special purpose automation system
★ Consists of CPU modules, I/O modules, Links, etc.
★ Programmed using special languages, like Ladder Diagram (LD), Instruction List (IL), etc.
★ Used in small control applications and large industrial plants …
❖ Controllers; traffic lights, elevators, automatic doors, car wash, remote monitoring, etc.
❖ Industry; automobile industry, oil and gas industry, equipment industry, food industry, etc.
[Figure: PLC system, Automobile Assembly Line & Silicon Foundry FOUPs]

System Components
Dr. Taisir Eldos

Clock Signal Generator
★ Every system needs a clock; a square wave signal with some characteristics like frequency,
duty cycle, voltage level, fall and rise times, etc.
★ Inverters have an input threshold VTH
❖ VI < VTH ➞ VO = VOH
❖ VI > VTH ➞ VO = VOL
★ Schmitt inverter has a hysteresis property; it has two input threshold voltage levels
❖ VI < VL ➞ VO = VOH
❖ VI > VH ➞ VO = VOL
★ Inverter
❖ VTH = 1.5 V
★ TTL Schmitt Inverter
❖ VL = 0.8 V
❖ VH = 1.7 V
★ CMOS Schmitt Inverter
❖ VL = 1.0 to 2.0 V
❖ VH = 2.4 to 3.2 V
[Figure: transfer curves, VO versus VI; the inverter switches between VOH & VOL at VTH, the Schmitt inverter at VL & VH]

Clock Signal Generator
★ Charging Equations: VI is Initial Voltage, VF is Final Voltage & VC is Capacitor Voltage
★ Capacitor Charge …
❖ Equation: VC = VF − (VF − VI) x e^(−T/RC), where VI = VL & VF = VCC
❖ TH = T when VC = VH, hence: TH = RC x ln ((VCC − VL)/(VCC − VH))
★ Capacitor Discharge …
❖ Equation: VC = (VI − VF) x e^(−T/RC), where VI = VH & VF = 0
❖ TL = T when VC = VL, hence: TL = RC x ln (VH/VL)
★ To achieve 50% duty cycle
❖ TH = TL, hence VCC = VH + VL
❖ For example, VL = 1.67 V & VH = 3.33 V (1/3 & 2/3 of 5 V)
★ Consider an inverter with: VL = 1.9 V & VH = 3.1 V
❖ TH = TL = RC x ln (3.1/1.9) = 0.49 RC
❖ T = TL + TH = 0.98 RC ≈ RC
❖ F = 1 / T ≈ 1 / RC
★ With R = 1 KΩ & C = 2 nF
❖ F ≈ 500 KHz & DC = 50%
[Figure: 74HC14 Schmitt inverter with feedback resistor R and capacitor C generating CK; waveforms of VC swinging between VL & VH and of CK between 0 & VCC, showing Mark (TH) & Space (TL)]
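The charge and discharge times of the Schmitt inverter clock can be evaluated numerically. The Python sketch below is a minimal illustration with a hypothetical function name; it implements TH = RC x ln((VCC − VL)/(VCC − VH)) and TL = RC x ln(VH/VL) from the bullets and uses the example values R = 1 KΩ, C = 2 nF, VL = 1.9 V, VH = 3.1 V.

```python
import math


def schmitt_rc_timing(r_ohm: float, c_farad: float,
                      vcc: float, vl: float, vh: float):
    """Return (t_high, t_low, frequency, duty_cycle) of the RC Schmitt oscillator."""
    tau = r_ohm * c_farad
    t_high = tau * math.log((vcc - vl) / (vcc - vh))  # capacitor charges VL -> VH
    t_low = tau * math.log(vh / vl)                   # capacitor discharges VH -> VL
    period = t_high + t_low
    return t_high, t_low, 1.0 / period, t_high / period


t_h, t_l, freq, duty = schmitt_rc_timing(1e3, 2e-9, 5.0, 1.9, 3.1)
print(f"TH = {t_h * 1e6:.3f} us, TL = {t_l * 1e6:.3f} us")
print(f"F = {freq / 1e3:.1f} KHz, duty cycle = {duty:.1%}")
```

The exact frequency comes out slightly above 500 KHz because T = 0.98 RC rather than RC; the 50% duty cycle follows from VL + VH = VCC.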

Clock Signal Generator - Pierce Oscillator
★ Crystal (Quartz) is a Silicon Dioxide (SiO2) compound with piezoelectric property
★ With the piezoelectric effect, crystals act like passive tuning forks with very little damping
★ Accuracy is measured in parts per million (ppm) or parts per billion (ppb). With ± 3.4 ppm …
❖ ± 3.4 x 10^−6 x (30 x 24 x 60 x 60) = ± 8.8 seconds monthly
❖ ± 3.4 x 10^−6 x (365.25 x 24 x 60) = ± 1.8 minutes annually
★ Cut precisely to act like an RLC resonance circuit with very high Q-Factor, 10^6 or better
❖ Precision: 100 to 10 parts per million (ppm) in the temperature range 20 to 70 °C
❖ Stability: 10 to 100 ppb/°C due to heating & 10 to 100 ppb/year due to aging
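The ppm accuracy figures above translate into clock drift by a single multiplication. A minimal Python sketch, with a hypothetical helper name and the slide's 30-day month and 365.25-day year as assumptions:

```python
def drift_seconds(ppm: float, interval_seconds: float) -> float:
    """Worst-case time error accumulated over an interval for a given ppm accuracy."""
    return ppm * 1e-6 * interval_seconds


MONTH_S = 30 * 24 * 60 * 60        # 30-day month in seconds
YEAR_S = 365.25 * 24 * 60 * 60     # average year in seconds

monthly = drift_seconds(3.4, MONTH_S)
yearly = drift_seconds(3.4, YEAR_S)
print(f"monthly drift = {monthly:.1f} s, yearly drift = {yearly / 60:.1f} min")
```

The same helper works for ppb figures by passing, for example, `0.01` ppm for 10 ppb.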
★ Low frequency crystals are bulky, less precise & hard to manufacture
★ Systems use a low frequency clock like 100 MHz as base & components have multipliers
❖ CPUs on-board multipliers produce 1.8 to 5.8 GHz
❖ GPUs on-board multipliers produce 1.0 to 3.2 GHz
❖ Other subsystems may have their own
✦ USB, use 12 MHz to generate 48 MHz, 960 MHz
✦ Sound, 44.1 KHz, 48 KHz, 96 KHz
✦ Ethernet; 25 MHz for 10/100 Mbps, 125 MHz for 1 Gbps
[Figure: Pierce oscillator circuit with labels 1 MΩ, 1 KΩ & 22 pF; photos of Crystals and an Oscillator]

High Precision Clocks
★ Crystal Oscillator (XO)
❖ 10^−5, 100 to 10 ppm
❖ One second shift per week
❖ Cheap & Small for commercial apps, 7 x 7 x 3 mm
★ Temperature Compensated Crystal Oscillator (TCXO)
❖ 10^−6, 10 to 1 ppm
❖ One second shift per month
❖ Thermal sensor to adjust frequency, 7 x 7 x 3 mm
★ Oven Controlled Crystal Oscillator (OCXO)
❖ 10^−8, 100 to 10 ppb
❖ One second shift per year (instrumentation, airborne systems, etc.)
❖ Temperature is kept at 100 °C for stability, 9 x 9 x 5 cm
★ Laser Controlled Crystal Oscillator (LCXO)
❖ 10^−11, 100 to 10 ppt
❖ One second shift in thousands of years
❖ Cost thousands of dollars, high precision apps, 20 x 20 x 7 mm
[Figure: XO, TCXO, OCXO & LCXO packages]

Atomic Clocks
★ Atomic clocks are crystal oscillators with precision control; Lasers or Photonics
★ Extremely precise, but bulky (closet size) & Costly (Millions of dollars) for:
❖ Timing Standards (International Atomic Time, TAI, is based on 400 clocks in 69 labs around the world)
❖ Positioning, Navigation, Surveying & Air Traffic Control (Satellite Based)
❖ Financial Transactions & Securities Exchange
❖ Communications & Network Synchronization
❖ Climate, Scientific Research & Space Exploration
★ Fountain Clock (CFC), Cesium @ 9.2 GHz
❖ 3.3 nanoseconds per year
❖ One second shift in 300 Million years
★ Optical Lattice Clock (OLC), Strontium @ 429 THz
❖ 33 picoseconds per year
❖ One second shift in 30 Billion years
★ Optical Lattice Clock (OLC), Ytterbium @ 642 THz
❖ 1 picosecond per year
❖ One second shift in 1 Trillion years
★ Life Expectancy: 10 to 50 years

Reset Signal Generator
★ On power up, the capacitor charges according to: Vc = Vcc x (1 − e^(−T/RC))
★ If the high threshold of 74HC14 is 3.2 V (which is nearly 63.2% of the 5 V final voltage value), it takes RC seconds to reach, but a bit more to reach 3.6 V and trigger; about 1.27 RC
★ The output Vx is active high; it goes high on power up or button release, and low after some time that depends on R, C and the threshold voltage of the Schmitt inverter
★ With C = 4.7 µF & Rc = 20 KΩ, we get 20 KΩ x 4.7 µF = 94 ms pulse duration
★ If C max ratings are 12 V / 0.5 A, then Rd > 5 V / 0.5 A = 10 Ω for safe discharge (100 Ω gives ample margin)
★ Rd must be small enough to discharge before the push button is released. If the push button time is 10 ms for example, Rd x C < 10 ms, or else the capacitor starts charging before reaching the low threshold
[Figure: reset circuit; Rc charges C from 5.00 V, push button PB discharges C through Rd, 74HC14 outputs Vx; plot of Vc versus Time with the 3.2 V (63.2%) & 3.6 V levels and pulse duration Tc]
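The reset pulse duration follows from solving Vc = Vcc x (1 − e^(−T/RC)) for T at the inverter threshold. The Python sketch below is a minimal illustration; the function name is hypothetical, and the 100 Ω discharge resistor and 10 ms button press are example values only.

```python
import math


def time_to_threshold(r_ohm: float, c_farad: float, vcc: float, vth: float) -> float:
    """Time for an RC circuit charging from 0 V toward vcc to reach vth."""
    if not 0 < vth < vcc:
        raise ValueError("threshold must lie between 0 V and vcc")
    return -r_ohm * c_farad * math.log(1.0 - vth / vcc)


RC = 20e3 * 4.7e-6                                # Rc = 20 KΩ, C = 4.7 µF
t_32 = time_to_threshold(20e3, 4.7e-6, 5.0, 3.2)  # threshold at 3.2 V
t_36 = time_to_threshold(20e3, 4.7e-6, 5.0, 3.6)  # threshold at 3.6 V
print(f"RC = {RC * 1e3:.0f} ms, T(3.2 V) = {t_32 * 1e3:.1f} ms, "
      f"T(3.6 V) = {t_36 * 1e3:.1f} ms")

# Discharge check: Rd x C must be well below the push button time
rd_ok = 100 * 4.7e-6 < 10e-3
print("Rd = 100 ohm discharges in time:", rd_ok)
```

A 3.2 V threshold is reached in about one time constant, while 3.6 V takes roughly 1.27 RC, so the 94 ms figure is a lower estimate of the pulse duration.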

Reset Signal - Cold versus Warm
★ In one time constant, RC seconds, the capacitor charges 63.2% of 5V or 3.2 V
★ Assuming 74HC14 with VL = 1.6 V & VH = 3.2 V thresholds
★ Green curve is VC and Blue curve is VX. Initially VC is 0 hence output is High
❖ Green charges to VH, causing the inverter output to go Low, ending the pulse
❖ Green discharges to VL, causing the inverter to start and end another pulse
❖ Red discharges slowly (due to high Rd) and does not get to VL, no reset pulse generated
★ Cold & Warm Reset
❖ Cold reset, when the power supply is turned on; Tc ≈ RC, depends on VH inverter type
❖ Warm reset, when the power supply is on and the push button is pressed; Tw ≤ Tc
★ The 555 timer circuit requires 2 resistors & 2 capacitors to construct a robust pulse generator
❖ Operates at 5 V to 15 V
❖ Robust; 50 years of reputation
❖ Popular; a billion sold every year
❖ Cheap; a few cents only
❖ 8-pin chip has a single timer
❖ 14-pin chip has a dual timer
❖ Available in CMOS too
[Figure: VC & VX versus Time; levels 5.0 V (VCC), 3.2 V (VH), 1.6 V (VL) & 0.0 V; push button press (PBP), cold reset pulse Tc & warm reset pulse Tw]

555 Timer Based Reset Generator
★ The 555 timer uses a 3-resistor voltage divider to generate the 1.67 V & 3.33 V reference points
★ The output goes high on power up or button release & goes low when C charges to 3.33 V
★ Hence T = 1.1 x R x C because it occurs when 5 x (1 − e^(−T/RC)) = 3.33 V
★ Rx (1 MΩ) initiates the pulse on power up or button release & Cx (10 nF) eliminates noise
❖ Compute the pulse duration with R = 18 KΩ / 5% & C = 6.8 µF / 20%
✦ T = 1.1 x 18 x (1− 0.05) x 6.8 x (1 − 0.20) = 102.3 ms (Good for 100 ms requirement)
✦ Note that electrolytic capacitors lose value over time due to liquid evaporation
❖ Compute the 10% resistor value that generates 50 ms using C = 4.7 µF / 15%
✦ 50 ms = 1.1 x R x (1− 0.10) KΩ x 4.7 x (1 − 0.15) µF, hence R = 12.64 KΩ
✦ If only 12 KΩ, 15 KΩ, 27 KΩ are available, pick 15 KΩ (Need at least 12.64 KΩ)
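The two tolerance exercises above can be written as a short worst-case calculation. The Python sketch below uses hypothetical helper names and the T = 1.1 x R x C relation; resistances are in KΩ and capacitances in µF so that the product is in ms.

```python
def pulse_ms(r_kohm: float, c_uf: float) -> float:
    """555 pulse duration in ms (KΩ x µF = ms)."""
    return 1.1 * r_kohm * c_uf


def worst_case_min_ms(r_kohm: float, r_tol: float, c_uf: float, c_tol: float) -> float:
    """Shortest pulse when both parts sit at the low end of their tolerance."""
    return pulse_ms(r_kohm * (1 - r_tol), c_uf * (1 - c_tol))


def min_resistor_kohm(t_ms: float, r_tol: float, c_uf: float, c_tol: float) -> float:
    """Nominal resistance that still guarantees t_ms in the worst case."""
    return t_ms / (1.1 * (1 - r_tol) * c_uf * (1 - c_tol))


def pick_standard(required: float, available: list) -> float:
    """Smallest available value that is not below the requirement."""
    return min(v for v in available if v >= required)


t_min = worst_case_min_ms(18, 0.05, 6.8, 0.20)
r_req = min_resistor_kohm(50, 0.10, 4.7, 0.15)
r_pick = pick_standard(r_req, [12, 15, 27])
print(f"worst-case pulse = {t_min:.1f} ms, required R = {r_req:.2f} KΩ, pick {r_pick} KΩ")
```

Using the low end of both tolerances gives the shortest possible pulse, which is the case that matters when a minimum reset duration must be guaranteed.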
[Figure: LM555 reset generator; pinout: 1 GND, 2 TRIG, 3 OUT, 4 RST, 5 CONT, 6 THR, 7 DIS & 8 VCC; external parts R, C, Rx, Cx & push button PB on a 5 V supply; output pulse of duration T]

Central Processing Unit (CPU)
★ DB: 4, 8, 16, 32, 64
★ AB: 12, 14, 16, 20, 24, 32, 36, 40, 41, 42, 43, …, 48, 50 & 52.
★ Address Space: Kilo = 2^10, Mega = 2^20, Giga = 2^30, Tera = 2^40, Peta = 2^50, Exa = 2^60, Zetta = 2^70
❖ Binary prefix: 2^10 = 1024 = Ki
❖ Decimal prefix: 10^3 = 1000 = K
★ CB: 10’s to 100’s of signals
❖ Input: Clock, Reset, Interrupt, Wait, Bus Error, Bus Request, etc.
❖ Output: Clock, Address Strobe, Data Strobe, Read, Write, Halt, etc.
❖ Multiplexed: Input/Output due to pins shortage, like Reset & Halt in the MC68000
★ PB: 5.0, 3.3, 2.0, 1.9, 1.8, 1.5, 1.2 V, …; with dynamic voltage scaling it covers a range:
❖ CPU Cores: 0.7 – 1.3 V (low core count), 0.6 – 1.1 V (high core count) & 1.5 V (Gaming)
❖ CPU Logic: 1.2 – 1.5 V
❖ GPU: 0.6 – 1.0 V
❖ MEM Controller: 1.2 – 1.5 V
❖ I/O Logic: 1.8 – 3.3 V
★ Packages: DIL, PLCC, SMT, PGA, BGA, LGA …
★ Pin count: 16, 18, 28, 40, 48, 52, 64, 100s, 1000s
★ AB & DB of 48 & 64 imply 2^3 x 2^48 = 2^51 B = 2 PB
[Figure: CPU block with DB, AB & CB* connections]
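The relation between bus widths and addressable memory can be checked with a few lines of Python. The helper names are hypothetical, and the sketch assumes every address selects one full data-bus word, as in the 48-bit / 64-bit example above.

```python
def addressable_bytes(address_bits: int, data_bits: int) -> int:
    """Total bytes reachable: 2^AB locations, each DB/8 bytes wide."""
    return (2 ** address_bits) * (data_bits // 8)


def format_binary(n_bytes: int) -> str:
    """Express a power-of-two byte count with 1024-based prefixes."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
    value, index = n_bytes, 0
    while value >= 1024 and index < len(units) - 1:
        value //= 1024      # exact for powers of two, truncates otherwise
        index += 1
    return f"{value} {units[index]}"


space = addressable_bytes(48, 64)
print(format_binary(space))                      # 48-bit AB, 64-bit DB
print(format_binary(addressable_bytes(16, 8)))   # classic 8-bit CPU
```

The 64-bit data bus contributes the factor 2^3, which is why the total is 2^51 bytes rather than 2^48.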

Read Only Memory (ROM, PROM, EPROM, EEPROM)
★ DB: 4, 8 and 16
★ AB: 4, 5, 6, …, 20 (More recently …)
★ CB: CS*/CE* (Chip Select/Enable), OE* (Output Enable or Read) and PGM* (for Programming or Write using a programming device)
★ PB: GND, VCC (5.0 V), VPP = 12 V (for programming)
❖ Mask ROM (MROM), hardcoded with data from the factory
❖ Programmable ROM (PROM), blank to program in house using special equipment
❖ Erasable Programmable ROM (EPROM), UV Light erased, and programmed many times
❖ EEPROM, Electrically byte erasable and Flash is the same but block erasable. Why?
★ The 2764 is a 28-pin 8 KB EPROM (8 K x 8 b), why not 27 pins?
❖ AB = 13 (8 K implies 3 + 10)
❖ DB = 8
❖ CB = 3 (CS*, OE*, PGM*)
❖ PB = 3 (VCC, GND & VPP in some chips)
❖ Unused pin! NC, VCC, GND, CS2* …
★ Some chips have dual function pins, like CE*/VPP
[Figure: 2764 block with DB, AB, CS*, OE* & PGM*]

Flash Memory (NOR / NAND)
★ Uses 1 floating gate transistor per cell; to trap and release electrons under high voltage
★ NOR is less dense, random access at the byte/word level good for code; eXecute In Place (XIP)
★ NAND is denser, sequential, cheaper, good for data; formatted as blocks, pages, etc.
★ A chip consists of: dies, planes, blocks, pages, strings of bits, and requires erase before write
★ Units of erase and write are blocks not bytes
★ Millions of erase/program cycles & decades of retention
★ Standard Bus
❖ DB of 4, 8, 16
❖ AB of 4, … up to 20 or more
❖ CB: CS*, OE* and WE*
★ Serial Peripheral Interface (SPI) for low pin count packages
❖ Serial Clock & Select: SCLK & CS*
❖ Serial Data In & Serial Data Out: SDI & SDO
★ PB: GND and 1.8 V, 3.3 V, 5.0 V (25 V generated internally)
★ Packages: DIL & TSOP with 4, 8, 16, 18, 20 and 42 pins
★ 3D-NAND or V-NAND is reliable, cheap, small and low power
[Figure: serial Flash block with SCLK, SDI, SDO & CS*; parallel Flash block with DB, AB, CS*, OE* & WE*]

Static Random Access Memory (SRAM)
★ Uses 4 to 6 transistors per cell or bit structure (even 8 and 10 for some applications)
★ The misleading “Random Access” used here means accessing any random location takes the same amount of time. This is opposed to Direct Access in which it depends on where the data is stored
★ DB: 4, 8 and 16
★ AB: 8, 9, …, 20 (More recently …)
★ CB: CS*/CE*, OE* and WE*
❖ CS*/CE*: Chip Select/Enable, there can be many active low and active high
❖ OE*: Output Enable, to read data out
❖ WE*: Write, to write data in
★ PB: GND and 5.0 V (Today, many work on much less like 2.0 V)
★ Packages: DIL and SMT with 24, 28, 32, 40 pins (8-pin chips use a serial bus)
★ The 62128 is a 28-pin 16 KB SRAM (16 K x 8 b), again not 27!
❖ AB = 14 (16 K implies 4 + 10)
❖ DB = 8
❖ CB = 3 (CS*, OE*, WE*)
❖ PB = 2 (VCC, GND)
[Figure: 62128 block with DB, AB, CS*, OE* & WE*]
★ What is the capacity of a 32-pin 8-bit data SRAM chip? 32 − 8 − 3 − 2 = 19 address lines, hence 2^19 = 512 KB
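The pin budget reasoning for SRAM chips can be expressed as a small Python sketch. The helper names are hypothetical, and it assumes the 3 control pins (CS*, OE*, WE*) and 2 power pins listed above, with every remaining pin used as an address line.

```python
def sram_address_lines(total_pins: int, data_bits: int,
                       control_pins: int = 3, power_pins: int = 2) -> int:
    """Address lines left after data, control and power pins are assigned."""
    return total_pins - data_bits - control_pins - power_pins


def sram_capacity_bytes(total_pins: int, data_bits: int) -> int:
    """Capacity when every remaining pin is an address line."""
    lines = sram_address_lines(total_pins, data_bits)
    return (2 ** lines) * data_bits // 8


lines = sram_address_lines(32, 8)
capacity = sram_capacity_bytes(32, 8)
print(f"{lines} address lines, {capacity // 1024} KB")

# The 62128 uses 14 + 8 + 3 + 2 = 27 of its 28 pins
used_62128 = 14 + 8 + 3 + 2
print("62128 pins in use:", used_62128)
```

Real chips may leave a pin unconnected or add a second chip select, as the 62128 example shows, so the result is an upper bound for a given package.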

Dynamic Random Access Memory (DRAM)
★ 1 transistor per cell instead of 4; to control a charge on a capacitor, slower but cheaper
★ DB: 1, 2, 4, 8, 16
★ AB: 8, 9, …, 16, 17, 18 (18 x 2 = 36, yields 2^36 = 64 GB; today: 8 GB x 8 Dies = 64 GB)
★ CB: CS*, OE*, WE*, CAS* and RAS*
❖ CS*/CE*: Chip Select/Enable, there can be many active low and active high
❖ OE*: Output Enable, to read data out
❖ WE*: Write, to write data in
❖ RAS*: Row Address Select; latches the upper half to select a page
❖ CAS*: Column Address Select; latches the lower half
★ PB: GND and 5.0 V (3.3 V, 2.0 V, 1.2 V, 1.1 V & 1.0 V today)
★ Packages: DIL & SMT with 16, 18, 20 & 40 pins
★ How many pins in 4 GB DRAM, assuming 4-bit wide? 28
❖ Format: 4 GB = 8 G x 4 b, AB = ⌈33/2⌉ = 17; 2 steps only
❖ DB = 4
❖ CB = 5 (CS*, OE*, WE*, CAS*, RAS*)
❖ PB = 2 (VCC, GND), more for high density chips
[Figure: DRAM block with DB, AB, CS*, OE*, WE*, RAS* & CAS*]
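The DRAM pin count exercise above can be reproduced step by step in Python. The function name is hypothetical; the sketch assumes a row/column multiplexed address bus (two steps), 5 control pins and 2 power pins, as in the slide.

```python
def dram_pin_count(capacity_bytes: int, data_bits: int,
                   control_pins: int = 5, power_pins: int = 2):
    """Return (address bits, multiplexed address pins, total pins)."""
    locations = capacity_bytes * 8 // data_bits       # number of data_bits-wide words
    address_bits = (locations - 1).bit_length()       # bits needed to address them
    multiplexed = (address_bits + 1) // 2             # ceil(bits / 2): row then column
    total = multiplexed + data_bits + control_pins + power_pins
    return address_bits, multiplexed, total


bits, mux_pins, pins = dram_pin_count(4 * 2 ** 30, 4)
print(f"{bits} address bits, {mux_pins} address pins, {pins} pins in total")
```

Multiplexing halves the address pins at the cost of a two-step access, which is why RAS* and CAS* appear among the control signals.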

Parallel Input Output (PIO)
★ Integrates 2 or 3 ports that can be programmed as input or output
★ Ports are Byte, Nibble or Bit programmable, depending on vendor and purpose
★ Used as peripherals, hence use the slow synchronous mode but can use the asynchronous mode too
★ They have a reset input to initialize ports as inputs to avoid damage; there will be a contention that leads to damage if a port is randomly set as output while connected to an input device
★ May have an Interrupt Output, to notify the CPU when an action is complete or required
★ Naturally, they have a data bus to communicate with the CPU, a few address lines to select a port, read/write control and a chip select (from a decoder)
★ Different vendors have chips with different flavors
❖ Parallel Input Output (PIO); Zilog
❖ Parallel Peripheral Interface (PPI); Intel
❖ Parallel Interface Adaptor (PIA); Motorola
[Figure: PIO block with DB, AB, CS*, OE*, WE*, RST* & INT* on the CPU side and ports PA, PB & PC on the device side]

Serial Input Output (SIO)
★ Integrates one or two serial communication channels, bit rates from 1.2 Kbps to 1.5 Mbps
★ Serial communications were used to link terminals or printers to mainframes over 100s of feet
using +12/−12 V drivers, as opposed to the standard parallel 0/+5 V
★ Each channel has Transmit, Receive and Handshaking signals (on the right side of the symbol)
★ Has an interrupt output, to signal events like character received or sent
★ Some chips have a FIFO buffer for each direction; 16, 32, …, 128
★ Different vendors have chips with different flavors
❖ Asynchronous Communication Interface Adaptor (ACIA, ACA); Motorola
❖ Universal Asynchronous Receiver Transmitter (UART); National SC
❖ Dual Asynchronous Receiver Transmitter (DART); Zilog
★ Asynchronous serial protocol may use:
❖ 5-wire cable with hardware handshaking
❖ 3-wire cable with software handshaking
★ Chips provide more controls for modems
[Figure: SIO block symbol; CPU side: DB, AB, CS*, OE*, WE*, RST*, INT*; channel side: TxD, RxD, CtS*, RtS*, RxC, TxC]

Programmable Counter Timer (PCT)
★ Integrates a few counting modules; typically three 16-bit counters, and provides many modes
of operation; Pulse generator, Square wave generator, etc.
★ Output of one module can be used as a clock for another to form 32-bit or even 48-bit counters
★ Mostly byte oriented, low operating frequency, used in timing signal generation like periodic
interrupt for multitasking
★ Typical chips have three channels or counting elements, called counting modules
★ There can be more chip specific controls in some chips; like Clock, Reset, etc.
★ Different vendors have chips with different flavors
❖ Programmable Interval Timer (PIT); Intel
❖ Programmable Timer Module (PTM); Motorola
❖ Counter Timer Circuit (CTC); Zilog
★ Each counting module, M0, M1 and M2 has:
❖ 2 inputs: Clock & Gate (Enable)
❖ 1 output: Out
★ Modules are totally independent, each operates in any mode
★ Any module can generate an interrupt when done
★ Can be cascaded; to make 32-bit or 48-bit modules
[Figure: PCT block symbol; DB, AB, CS*, OE*, WE*, INT*; modules M0, M1 and M2]

Real Time Clock (RTC)
★ Integrates a clock generator, a few bytes of read/write memory and some control logic, along
with 40 to 60 bytes of Non-Volatile Storage (NVS) to keep the time & date (for a few dollars)
★ The ticking logic gets power from VCC when the system is on, or VBB otherwise
❖ Lithium Battery, small coin-like non-rechargeable 3 V / 100s mAh, good for a few years
❖ SuperCapacitor (UltraCapacitor), a few minutes of charge gives months of operation
★ To generate the 1 Hz ticker with high precision; 10 or 20 ppm, we use
❖ 2^15 = 32,768 Hz, need 15 FFs to divide to get 1 Hz
❖ 2^22 = 4,194,304 Hz, need 22 FFs to divide to get 1 Hz
★ Why those oddball numbers? 32,700 Hz, 32,000 Hz, etc. need complex next state logic
❖ 2^13 = 8,192 Hz? Fewer FFs compared to 2^15, but less precise
❖ 2^12 = 4,096 Hz? Fewer FFs, but less precise, more power, bulky, fragile & hum
★ Host reads information over the I2C bus for data exchange
★ OUT can be set to show: 1 or 32,768 Hz
★ Some systems get information from other sources:
❖ Radios, Cellular communication, Internet, etc.
❖ Chipset Integrated function, Intel’s Mobos
[Figure: RTC block symbol; XTL1, XTL2, SCL, SDA, OUT and VBB (Battery / Capacitor)]
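The flip-flop counts quoted for the RTC crystals follow directly from the crystal frequency being a power of two. A small sketch, with hypothetical function names, that also turns a ppm figure into seconds of drift per day:

```python
def divider_stages(crystal_hz):
    """Toggle flip-flops needed to divide a power-of-two crystal frequency down to 1 Hz."""
    if crystal_hz < 1 or crystal_hz & (crystal_hz - 1):
        # A plain chain of divide-by-2 stages only reaches exactly 1 Hz from a power of two
        raise ValueError("frequency is not a power of two")
    return crystal_hz.bit_length() - 1

def drift_seconds_per_day(ppm):
    """Worst-case clock error accumulated over one day for a given crystal tolerance."""
    return 86_400 * ppm / 1_000_000

print(divider_stages(32_768), divider_stages(4_194_304))
print(drift_seconds_per_day(10), drift_seconds_per_day(20))
```

A frequency such as 32,000 Hz raises the error here, which mirrors the point on the slide: it would need next state logic beyond a simple divide-by-2 chain.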

Glue Logic: Decoders & Encoders
★ Glue logic provides physical support functions for data and control flow in the system
★ Decoders select one out of many devices, memory or input/output chips, for a transaction
★ Decoders can be cascaded to support more chip select controls and have more control over the
space. This can be achieved using
❖ DEC, 2-to-4, 3-to-8, etc. are binary decoders in one or two stages normally
❖ ROM or PLA, flexible but slow and needs a programming step which is an added cost
★ Encoders arbitrate events like interrupts; report the code of the highest priority active device
request to serve; the lowest is always active to report no request.
[Figure: 74LS139 2-to-4 decoder (E*, A, B; Y0* to Y3*), 74LS138 3-to-8 decoder (E1, E2*, E3*, A, B, C; Y0* to Y7*) and 74LS148 8-to-3 priority encoder (E*, I0* to I7*; A*, B*, C*)]

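The behavior of the decoder and the priority encoder can be modeled in a few lines. This is a behavioral sketch only: signal names follow the figure labels, A is assumed to be the least significant select input, and the encoder returns a plain index rather than the inverted code a real 74LS148 drives on its pins.

```python
def decode_3to8(a, b, c, e1=1, e2_n=0, e3_n=0):
    """Active-low outputs Y0*..Y7* of a 3-to-8 decoder; all stay high unless enabled."""
    outputs = [1] * 8
    if e1 == 1 and e2_n == 0 and e3_n == 0:
        outputs[(c << 2) | (b << 1) | a] = 0   # exactly one chip select goes low
    return outputs

def priority_encode(requests_n):
    """Index of the highest-numbered active (low) request line, or None if none is active."""
    for index in range(len(requests_n) - 1, -1, -1):
        if requests_n[index] == 0:
            return index
    return None

print(decode_3to8(a=1, b=0, c=1))                    # selects device 5
print(priority_encode([0, 1, 1, 0, 1, 0, 1, 1]))     # lines 0, 3 and 5 are requesting
```

Tying line 0 permanently low, as the slide suggests, makes the encoder report code 0 when no real device is requesting, so the `None` case never occurs in that arrangement.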
Glue Logic: Buffers & Latches
★ Buffers are used for two reasons
❖ Physical, to resolve the fan out issues by strengthening the signal power; it is an electronic
function, although it might be inverting or non-inverting. Buffers allow low driving power
devices like CMOS to drive many higher power ones like TTL
❖ Logical, to read a specific input to the data bus when enabled, as an input port
★ Address bus is unidirectional and hence needs unidirectional buffers; each 4 have one Enable
★ Data bus needs bidirectional buffers (bus transceivers), hence Enable & Direction controls
★ Latches or Flip-Flops are used as output ports; data on the data bus is written into them by
activating the clock, to be read by another party when the output is enabled
[Figure: 74LS374 octal D flip-flop (D, Q, CK, OE*), 74LS244 octal buffer (A1 to B1 with E1*, A2 to B2 with E2*) and 74LS245 octal bus transceiver (A, B, E*, D)]

Programmable Logic Devices
★ Programmable logic structures have thousands to billions of transistors that can be
programmed to achieve specific functions; generally called Programmable Logic Devices (PLDs)
★ PLDs are used to implement complex logical expressions and even complex systems
❖ Programmable Logic Arrays (PLA)
❖ Field Programmable Gate Arrays (FPGA)
❖ Application Specific Integrated Circuits (ASIC)
★ PLAs are generally used to implement simple logic expressions, while FPGAs and ASICs
consist of a huge number of complex blocks and an interconnection network managed by
switches, and hence can be used to implement complex systems like controllers or processors
★ PLA structure
❖ AND/OR sections, with programmable connections
❖ Fuses are initially robust (Red), making a connection; blown fuses (Blue) disconnect

❖ AND sections generate product terms
❖ OR sections generate sum of products
❖ XOR takes the complement of a function
❖ Variables passed true and complemented, and only one can be taken (if any)
❖ To invert a function, AND/NOR connect to 1, else to 0 (exactly one of them)

PLA Structure (6-input & 10-output Example)
[Figure: PLA with inputs A to F, each passed true and complemented into the AND section; fuse legend: B = Blown, R = Robust]
★ Example product term: (A’•1)(B’•1)(1•C)(1•1)(1•1)(1•F) = A’B’CF
★ Product terms: A’BE’F, BCD, AC’E’, A’B’CF
★ F1 = A’BE’F + A’B’CF
★ F2 = (BCD + AC’E’)’
★ F10 = ?
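The two outputs read off the PLA example can be checked by evaluating the product terms directly. This is a sketch of the logic only, not of the fuse map; the function name and the 0/1 encoding of the inputs are assumptions.

```python
from itertools import product

def pla(a, b, c, d, e, f):
    """Evaluate F1 and F2 of the example for 0/1 inputs."""
    inv = lambda x: 1 - x                   # the complemented line of an input
    p1 = inv(a) & b & inv(e) & f            # A'BE'F
    p2 = b & c & d                          # BCD
    p3 = a & inv(c) & inv(e)                # AC'E'
    p4 = inv(a) & inv(b) & c & f            # A'B'CF
    f1 = p1 | p4                            # OR section, output passed true
    f2 = (p2 | p3) ^ 1                      # OR section, then XOR with 1 to complement
    return f1, f2

# Count the input combinations (out of 64) for which F1 is 1
ones = sum(pla(*bits)[0] for bits in product((0, 1), repeat=6))
print(ones)
print(pla(0, 0, 1, 0, 0, 1))   # A'B'CF is true here
```

Each product term leaves some inputs unconnected (both fuses blown, contributing 1 to the AND), which is why a 4-literal term covers 4 of the 64 input combinations.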

Processor Power Module
★ Motherboards host processors, chipsets, support components, and slots of various kinds to
host memory modules, storage modules, graphics cards, ports, etc. Each needs a specific voltage
★ Processor Power Modules (PPMs) or Voltage Regulator Modules (VRMs) are local power
supplies; step down 5 or 12 V DC to a stable, reliable lower DC voltage (using buck converters)
★ VRMs consist of: Pulse Generator, MOSFET Switches, Chokes & Capacitors
★ It has a feedback control to adjust the duty cycle, pulse width, to stabilize the output
★ Output based on input identification code VID (5, 6 or 8 bits to specify the required value)
★ Codes may imply: 0.55, 0.56, 0.57, …, or 3.0 V (Some codes reserved for control; shut off)
★ Switches & Chokes make phases; more phases …
❖ Larger currents using cheaper components
❖ Less ripple, noise & heat
★ An 8 + 2 PPM has 8 CPU & 2 MEM phases
★ VRMs may have 16 phases to supply up to 1000 A
❖ A phase may supply 15 to 60 A
❖ A light core requires 2 to 4 W & 5 A
❖ A power core requires 8 to 16 W & 15 A

Power vs. Frequency & Voltage
★ The power consumption is directly proportional to frequency and to the square of voltage
★ So, P = K x F x V^2; the constant K depends on architecture, design, fab process, complexity, etc.
★ A processor with: F = 2.4 to 3.6 GHz, V = 0.8 to 1.4 V & Pmax = 37 W
❖ 3.0 GHz & 1.4 V
✦ P = 37 x (3.0/3.6) x (1.4/1.4)^2
✦ P ≈ 31 W
❖ 3.0 GHz & 1.1 V
✦ P = 37 x (3.0/3.6) x (1.1/1.4)^2
✦ P ≈ 19 W
★ A processor with:
❖ F: 3.2 to 4.4 GHz (normally, 3.8 GHz)
❖ V: 0.9 to 1.3 V (normally, 1.1 V)
❖ Normally, consuming 15 W
★ Compute
❖ Minimal power, Pmin
❖ Maximum power, Pmax
[Figure: plot of Voltage (V), 0.0 to 2.0, and Power (W), 0 to 50, versus Frequency (GHz), 2.0 to 4.0]
★ Find K … from the plot: 50 = K x 4 x 1.2 x 1.2, so K = 8.7 Ω^-1 GHz^-1
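The scaling examples on this slide can be reproduced with a short sketch of P = K x F x V^2. The function names are illustrative, and the Pmin/Pmax figures assume that the minimum sits at the lowest frequency and voltage and the maximum at the highest, with the same K as the normal operating point.

```python
def scaled_power(p_ref, f_ref, v_ref, f, v):
    """Power at (f, v) relative to a known point (p_ref, f_ref, v_ref); K cancels out."""
    return p_ref * (f / f_ref) * (v / v_ref) ** 2

def k_constant(p, f, v):
    """Solve P = K * F * V^2 for K (W per GHz per V^2 when F is in GHz)."""
    return p / (f * v ** 2)

# First processor: Pmax = 37 W at 3.6 GHz and 1.4 V
print(round(scaled_power(37, 3.6, 1.4, 3.0, 1.4)))   # 3.0 GHz at 1.4 V
print(round(scaled_power(37, 3.6, 1.4, 3.0, 1.1)))   # 3.0 GHz at 1.1 V

# K read from the plot point 50 W at 4 GHz and 1.2 V
print(round(k_constant(50, 4, 1.2), 1))

# Second processor: 15 W at the normal 3.8 GHz and 1.1 V
p_min = scaled_power(15, 3.8, 1.1, 3.2, 0.9)
p_max = scaled_power(15, 3.8, 1.1, 4.4, 1.3)
print(round(p_min, 2), round(p_max, 2))
```

Working with ratios avoids computing K at all, which is how the slide handles the first processor.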

Buck Converters
★ Using a voltage divider or zener diode to step down voltage is neither efficient nor flexible
★ A buck converter is an efficient and flexible way to get stable voltage with the required value
★ The sawtooth is the choke current during HS/LS FETs activation (mutually opposite)
★ VOUT = DC x VIN, where DC is the duty cycle. If VIN = 5 V, then:
❖ 25% duty cycle yields VOUT = 1.25 V
❖ 50% duty cycle yields VOUT = 2.5 V
★ 4 A at 2.5 V means 2 A at 5 V (100% efficiency)
★ 4 A at 2.5 V means 2.5 A at 5 V (80% efficiency)
[Figure: buck converter; drivers switch the HS FET and LS FET between VIN and the choke; waveforms of pulse P, VIN and VOUT versus Time]
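The duty cycle and efficiency figures above follow from two one-line relations. The sketch below uses hypothetical function names and treats the converter as ideal apart from a single efficiency factor.

```python
def buck_output(v_in, duty):
    """Average output voltage of a buck converter for a duty cycle between 0 and 1."""
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty cycle must be between 0 and 1")
    return duty * v_in

def input_current(v_in, v_out, i_out, efficiency=1.0):
    """Input current drawn to deliver i_out at v_out: P_in = P_out / efficiency."""
    return (v_out * i_out) / (efficiency * v_in)

print(buck_output(5, 0.25), buck_output(5, 0.50))
print(input_current(5, 2.5, 4))          # lossless case
print(input_current(5, 2.5, 4, 0.8))     # 80% efficient
```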

Buck Converter - Multi Phase
★ While output voltage is controlled by the duty cycle, more phases increase current capacity
★ A choke and two control switches, HS FET & LS FET, constitute a phase
★ Pulse controller generates 100 kHz to 10 MHz overlapping or non-overlapping pulses
★ Higher frequency switching produces smoother output but causes electromagnetic
interference that requires design care
[Figure: 3-phase buck converter; the Pulse Generator drives Phase #1, #2 and #3 with pulses P1, P2 and P3 from VIN to a common VOUT; 3-phase non-overlapping with 20% duty cycle, 1 V output using 5 V input]

Software Model
Dr. Taisir Eldos
“It is soft; you can swallow it” - Terry Eldos

Introduction
★ How to tell a machine to do something? A task ….
❖ Binary, sequence of bytes or words; hard to write, read, understand, comment, etc.
❖ Hex, direct mapping of every nibble, Binary to Hex
❖ Symbolic, use mnemonics to represent instructions
❖ Assembly, symbolic language with directives (pseudo-instructions)
★ Assembly enhances reuse and code readability by using labels, comments, etc.
★ Colors indicate matching code across all levels; items in black are pseudo-instructions with no translation

Line  Label  Code          Comment               Address  Binary     Hex
1     code   org $1200     Place Code at $1200
2            ld a, init    Initialize Counter    $1200    0011 1110  $3E
                                                 $1201    0100 0111  $47
3     loop   nop           Delay                 $1202    0000 0000  $00
4            dec a         Decrement Counter     $1203    0011 1101  $3D
5            jp nz, loop   Exit Counter on Zero  $1204    1100 0010  $C2
                                                 $1205    0000 0010  $02
                                                 $1206    0001 0010  $12
6     init   equ $47       Initial Count
7            end code

The Assembler translates the Code column into the Binary/Hex columns
Intel 8080 machine code (the mnemonics shown follow the Zilog Z80 syntax)

Memory Organization
★ Memory can be viewed as a matrix; rows of data items of specific width
★ The width in bits times the number of rows is the capacity
★ Wider implies fewer addresses; 32 bits of information can be organized as:
❖ 2 x 16b, Word wide; 1 Address bit
❖ 4 x 8b, Byte wide; 2 Address bits
❖ 8 x 4b, Nibble wide; 3 Address bits
❖ 16 x 2b, Crumb wide; 4 Address bits
❖ 32 x 1b, Bit wide; 5 Address bits
[Figure: the same 32 bits viewed as 1 long word (LW0), 2 words (W0, W1), 4 bytes (B0 to B3), 8 nibbles (N0 to N7), 16 crumbs (C0 to C15) or 32 bits (B0 to B31)]

Memory Organization & Indexing
★ Memory organized as Byte Wide, Word Wide, Long Word Wide, Very Long Word Wide
★ Byte indexing allows addressing individual bytes but needs extra bits to select a byte in a word
★ A 256 GB memory, formatted as 64 G x 32 b, requires
❖ Log2 (64 G) = 6 + 30 = 36 address bits to select a word, and
❖ Log2 (32 / 8) = 2 address bits to select a byte within a word
❖ A total of 36 + 2 = 38 address bits; which gives 2^38 B = 256 GB
[Figure: memory organized as Byte (B), Word (W), Long Word (LW) and Very Long Word (VLW) wide, each row selected by the upper address bits; with Byte Indexing the lowest 1, 2 or 3 address bits select B1 to B0, B3 to B0 or B7 to B0 within the row; example: M(10011) = $48]
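The address bit count for the 256 GB example, and the split of a byte address into a row part and a byte-within-row part, can be sketched as follows. Function names are hypothetical; capacities are in bytes and widths in bits.

```python
def address_bits(capacity_bytes, word_bits):
    """Return (bits to select a word, bits to select a byte within the word)."""
    words = capacity_bytes * 8 // word_bits
    word_select = (words - 1).bit_length()
    byte_select = (word_bits // 8 - 1).bit_length()
    return word_select, byte_select

def split_address(address, word_bits):
    """Split a byte address into (row number, byte index within the row)."""
    byte_select = (word_bits // 8 - 1).bit_length()
    return address >> byte_select, address & ((1 << byte_select) - 1)

GB = 2**30
print(address_bits(256 * GB, 32))        # word select bits, byte select bits
print(split_address(0b10011, 32))        # the M(10011) example on a long word wide memory
print(split_address(0b10011, 64))        # the same address on a very long word wide memory
```

The total number of address bits is the same whatever the width; widening the memory only moves bits from the row part to the byte-select part.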

Endianness (Data Mapping)
★ Processor Endianness is how it maps multiple byte register content to byte indexed memory
★ A processor supports either Little Endian (LE) or Big Endian (BE) at any one time
❖ LE: Lower Data maps to Lower Address
❖ BE: Higher Data maps to Lower Address
★ When a 4-byte data item is copied to address $1002, it has to be spread over $1002, $1003, $1004 & $1005
❖ LE, B0 goes to $1002
❖ BE, B0 goes to $1005

CPU REGISTER: B3 B2 B1 B0

Address   Little Endian   Big Endian
$1002     B0              B3
$1003     B1              B2
$1004     B2              B1
$1005     B3              B0

Little Endian: Low Address ↔ Low Data
Big Endian: Low Address ↔ High Data
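The two mappings can be reproduced with Python's built-in `int.to_bytes`, which takes the byte order as an argument. The `store` helper and the dictionary used as memory are illustrative assumptions; the register value is chosen so each byte names itself.

```python
def store(memory, address, value, size, byteorder):
    """Spread a multi-byte value over consecutive byte addresses."""
    for offset, byte in enumerate(value.to_bytes(size, byteorder)):
        memory[address + offset] = byte

register = 0xB3B2B1B0          # B3 is the most significant byte, B0 the least

little, big = {}, {}
store(little, 0x1002, register, 4, "little")
store(big, 0x1002, register, 4, "big")

for address in range(0x1002, 0x1006):
    print(f"${address:04X}  LE: {little[address]:02X}  BE: {big[address]:02X}")
```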

Endianness & Alignment
★ If registers are multi-byte and memory is byte indexed then there has to be a mapping:
❖ Little Endian (LE), maps lower order data item to lower order memory address
❖ Big Endian (BE), maps lower order data item to higher order memory address
★ The Endian term also applies to:
❖ File formats; when an application stores multi-byte or multi-word data items, and this is
why we have standards
❖ Networking & Serial Transmission; least or most significant bit transmitted first in time
★ Examples
❖ LE: ARM & Intel x86 architectures
❖ BE: Motorola 68K & Sun SPARC architectures
❖ LE/BE: MIPS architecture, supports both using different implementations
❖ LE & BE: PowerPC 601 architecture is switchable, one implementation supports both
★ Endianness switching can be done in one of two ways
❖ Either by software, during operation
❖ At start up, using some motherboard setting jumper
★ Alignment is about allowing or disallowing multi-byte data at odd addresses; data fragmentation

Memory Alignment
★ When accessing multiple bytes at byte indexed memory, we need to consider fragmentation.
★ Regarding this, processors are either:
❖ Aligned: does not allow fragmentation; fast but may waste memory (Temporal advantage)
❖ Unaligned: allows fragmentation; compact but may be slower (Spatial advantage)
★ Example
❖ Define the following constants, at Address $1000 (Little Endian)
✦ $0123,
✦ $45,
✦ $6789,
✦ $A4,
✦ $79,
✦ $2483

          Unaligned       Aligned
Address   Odd    Even     Odd    Even
$1000     01     23       01     23
$1002     89     45       −      45
$1004     A4     67       67     89
$1006     83     79       79     A4
$1008            24       24     83

★ Dash indicates skipped location
★ Skipped locations are just left alone, and can still be accessed by their addresses
★ Fragmented words require two transactions to access
★ Grouping; words then bytes or bytes then words resolves the spatial issue
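The aligned and unaligned layouts of the six constants can be generated mechanically. This is a sketch under the slide's assumptions (Little Endian, word alignment on even addresses); the `layout` function and the dictionary used as memory are hypothetical.

```python
def layout(constants, origin, aligned, byteorder="little"):
    """Place (value, size) constants in order; return (memory, skipped addresses)."""
    memory, skipped, address = {}, [], origin
    for value, size in constants:
        if aligned and size > 1 and address % 2:
            skipped.append(address)       # leave the odd byte alone, start on an even address
            address += 1
        for offset, byte in enumerate(value.to_bytes(size, byteorder)):
            memory[address + offset] = byte
        address += size
    return memory, skipped

CONSTANTS = [(0x0123, 2), (0x45, 1), (0x6789, 2), (0xA4, 1), (0x79, 1), (0x2483, 2)]

unaligned, skipped_u = layout(CONSTANTS, 0x1000, aligned=False)
aligned, skipped_a = layout(CONSTANTS, 0x1000, aligned=True)

for name, memory in (("Unaligned", unaligned), ("Aligned", aligned)):
    print(name)
    for address in sorted(memory):
        print(f"  ${address:04X}: {memory[address]:02X}")
print("skipped:", [f"${a:04X}" for a in skipped_a])
```

In the unaligned layout $6789 straddles $1003 and $1004, which is the fragmented word that costs two transactions on a word wide bus.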

Assembler Directives: EQU & ORG
★ Assembler directives or pseudo-instructions are meant to direct the Assembler on how to
assemble the code, like:
❖ EQU, Equate: binds a name to a value
❖ ORG, Origin, location counter: where to place Code and Data in memory
❖ END, Indicate the end of program
★ Consider this piece of code, assuming a Big Endian processor

Many    EQU     32
More    EQU     $48
Code    ORG     $001006
        MOVE.B  #Many, D1       ; $123C, $0020
        MOVEQ   #More, D5       ; $7A48

Address   Content
$1006     12 3C
$1008     00 20
$100A     7A 48

★ Instruction MOVE.B #$20, D1 in binary coding is translated to $123C0020, as the processor
is aligned
❖ Instruction word $123C is for the instruction opcode, size, addressing modes
❖ Instruction word $0020 is for the source literal (Many)
★ Instruction MOVEQ #$48, D5 is encoded as $7A48

Assembler Directives - DC & DS
★ Another two important directives are
❖ DC, Define Constant, allocates memory and loads it with data at assembly time
❖ DS, Define Storage, allocates memory to be used at run time
★ Consider the piece of code (For a Big Endian Processor)

        ORG     $1002
BArray  DS.B    3               ; Allocate 3 bytes
WArray  DS.W    2               ; Allocate 2 words
BData   DC.B    12              ; Allocate a byte and write $0C
        DC.W    12              ; Allocate a word and write $000C
Message DC.B    “ABC 123”       ; Allocate bytes for ASCII string and write
                                ; $41 for “A”, …, $20 for “ ”, … and $33 for “3”
        ORG     $1018
Marks   DC.L    23, $23         ; Reserve & Write $00000017 & $00000023

Address   Even   Odd
$1000
$1002     −      −
$1004     −      X
$1006     −      −
$1008     −      −
$100A     0C     X
$100C     00     0C
$100E     41     42
$1010     43     20
$1012     31     32
$1014     33
$1016
$1018     00     00
$101A     00     17
$101C     00     00
$101E     00     23

(− marks storage reserved by DS; X marks a location skipped for alignment)
★ Labels are used as friendly alternatives to addresses
★ To access the string “ABC 123”, we use Message as a pointer

Memory Alignment - Waste
★ The DC directive allocates memory and writes data
★ To reduce memory waste, re-arrange by packing bytes
★ Assemblers can re-arrange (or the programmers)

As written                  Words then Bytes
        ORG   $1000                 ORG   $1000
B1      DC.B  $13           W1      DC.W  $1234
W1      DC.W  $1234         W2      DC.W  $5678
B2      DC.B  $24           W3      DC.W  $A1B2
W2      DC.W  $5678         W4      DC.W  $C3D4
B3      DC.B  $35           B1      DC.B  $13
W3      DC.W  $A1B2         B2      DC.B  $24
B4      DC.B  $46           B3      DC.B  $35
W4      DC.W  $C3D4         B4      DC.B  $46
B5      DC.B  $AB           B5      DC.B  $AB

Address   As written   Words/Bytes   Bytes/Words
$1000     13 X         12 34         13 24
$1002     12 34        56 78         35 46
$1004     24 X         A1 B2         AB X
$1006     56 78        C3 D4         12 34
$1008     35 X         13 24         56 78
$100A     A1 B2        35 46         A1 B2
$100C     46 X         AB            C3 D4
$100E     C3 D4
$1010     AB

★ B2 ends up in 3 different places; $1004, $1009, $1001; same data
★ Locations marked by X are skipped for alignment
❖ Xs have no labels, but are accessible at run time
❖ Re-arranging, Words/Bytes or Bytes/Words, eliminates or reduces waste

MC68000 Programmer’s Model
★ General Purpose Registers (Light Gray)
❖ Data Registers
✦ 8 x 32-bit registers, D0, D1,…, D7
✦ L, W, B segmentation
❖ Address Registers
✦ 7 x 32-bit registers, named A0, A1, …, A6
✦ L, W segmentation
★ Special Purpose Registers (Dark Gray)
❖ Stack Pointers, no segmentation
✦ 32-bit, User Stack Pointer (A7, USP)
✦ 32-bit, Supervisor Stack Pointer (A7’, SSP)
❖ Program Counter (PC), no segmentation
✦ 32-bit, also called Instruction Pointer (IP)
✦ 24-bit address bus (most significant byte not used)
❖ 16-bit Status Register (SR)
✦ 8-bit System Byte (SB)
✦ 8-bit Condition Code Register (CCR)
[Figure: register set D0 to D7, A0 to A6, A7, A7’, PC and SR; SR = SB : CCR, bits 15 to 0: T − S − − I2 I1 I0 − − − X N Z V C]

CCR or Flags
★ Consider the 4-bit Binary Adder below, to understand what the flags are about
★ The XOR acts like a MUX, and the control input A’/S dictates the operation:
❖ A’/S = 0, XORs pass B along with carry of Cin = 0 to add, hence F = A + B
❖ A’/S = 1, XORs pass B’ along with carry of Cin = 1 to subtract, hence F = A – B
★ Spilling means the outcome does not fit the size; it is expressed through C (assuming inputs are
unsigned numbers) & V (assuming inputs are signed numbers)
[Figure: 4-bit ripple adder/subtractor; four full adders (FA) take A3 to A0 directly and B3 to B0 through XOR gates controlled by A’/S, with Cin = A’/S; outputs F3 to F0 and the flags C, V, N and Z]
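The adder/subtractor and its four flags can be modeled bit by bit. This sketch follows the circuit as described on the slide, so C is the raw carry out of the last full adder; the function name is hypothetical, and processors commonly report C after a subtraction as a borrow, which is the inverse of this raw carry.

```python
def add_sub_4bit(a, b, subtract):
    """4-bit ripple adder/subtractor; returns (F, flags) with flags C, V, N and Z."""
    carry = subtract                          # Cin is tied to the A'/S control
    result = 0
    carry_into_msb = 0
    for bit in range(4):
        x = (a >> bit) & 1
        y = ((b >> bit) & 1) ^ subtract       # the XOR passes B (add) or B' (subtract)
        if bit == 3:
            carry_into_msb = carry
        total = x + y + carry                 # one full adder
        result |= (total & 1) << bit
        carry = total >> 1
    flags = {
        "C": carry,                           # unsigned result does not fit
        "V": carry ^ carry_into_msb,          # signed result does not fit
        "N": result >> 3,                     # sign bit of the result
        "Z": int(result == 0),
    }
    return result, flags

print(add_sub_4bit(0b0111, 0b0001, 0))   # 7 + 1 overflows the signed range
print(add_sub_4bit(0b1111, 0b0001, 0))   # 15 + 1 overflows the unsigned range
print(add_sub_4bit(0b0101, 0b0101, 1))   # 5 - 5
```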

Addressing Modes
Dr. Taisir Eldos

Addressing Modes - Symbols and Notation
★ Addressing modes are the methods by which the instructions access their operands
★ We use Register Transfer Language (RTL) to describe operations at the hardware level
★ Assume that:
❖ <s> is the source operand, and
❖ <d> is the destination operand
★ Then, the ADD & MOVE instructions of the processor are described as in the comment section

ADD     <s>, <d>        ; d ← d + s     add s to d and store into d
MOVE    <s>, <d>        ; d ← s         store copy of s into d

★ Here the source operand comes first, some Assemblers use destination first
★ Data Types
❖ $ means Hexadecimal
❖ @ means Octal
❖ % means Binary
❖ ‘ …’ means ASCII
❖ No prefix means Decimal

Addressing Modes …
No.  Addressing Mode  Description
1    Literal          Immediate number
2    Absolute.W       Direct or Absolute Short (Word Address, to sign extend)
3    Absolute.L       Direct or Absolute Long (Longword Address, full address)
4    Di               Data Register Direct
5    Ai               Address Register Direct
6    (Ai)             Address Register Indirect
7    (Ai)+            Address Register Indirect with Post-increment
8    −(Ai)            Address Register Indirect with Pre-decrement
9    (d16, Ai)        Address Register Indirect with Displacement
10   (d8, Ai, Xj)     Address Register Indirect with Displacement & Index (X is D or A)
11   (d16, PC)        Program Counter Relative with Displacement
12   (d8, PC, Xj)     Program Counter Relative with Displacement & Index (X is D or A)
13   Embedded         3-bit / 8-bit immediate number encoded within the instruction
14   Implied          SR, CCR, USP, PC

Immediate or Literal
★ In this mode, the actual operands follow the instruction
★ Allows a constant to be setup when program is written
★ The # is used to tell the Assembler “it’s immediate”
★ Typical application is to set up control loops and delay counters
★ Example

MOVE.B  #$83, D3        ; D3(7:0) ← $83
MOVE.W  #$83, D3        ; D3(15:0) ← $0083
MOVE.L  #$83, D3        ; D3(31:0) ← $00000083
MOVE.L  #$1A483, D3     ; D3(31:0) ← $0001A483

MOVE.B  #$100, D3       ; Syntax error
                        ; Immediate value $100 exceeds the byte capacity
                        ; Must be in the range 0 to 255 or -128 to +127
MOVE.B  D3, #$83        ; Syntax error
                        ; Immediate addressing mode makes no sense as destination
                        ; As destination must be alterable; register or memory

Absolute or Direct
★ Instruction contains the operand’s address not its value
★ Long, 32-bit address, accesses 16 Mbytes
★ Short, 16-bit signed, to be sign extended internally
❖ Sign = 0, upper word is 0s; range is $000000 − $007FFF (Lowest 32KB block)
❖ Sign = 1, upper word is 1s; range is $FF8000 − $FFFFFF (Highest 32KB block)
★ If sign extending a word address changes its value then it has to go long
★ Short takes less space and time; better if it fits; Assemblers decide

MOVE.L  D3, $17004      ; M($017004) ← D3(31:16); M($017006) ← D3(15:0)
                        ; Two transactions, High order data first (BE), Long Abs
MOVE.W  D3, $7234       ; M($007234) ← D3(15:0)
                        ; Short fits because SE($7234) = $007234
MOVE.W  D3, $8234       ; M($008234) ← D3(15:0)
MOVE.W  D3, $8234.w     ; M($FF8234) ← D3(15:0)
                        ; Sign Extending $8234 yields $FF8234
                        ; So, if the address is $008234 it has to go long
                        ; Otherwise it will be considered $FF8234

[Figure: 16 MB memory map in 32 KB blocks: 000000 − 007FFF, 008000 − 00FFFF, 010000 − 017FFF, 018000 − 01FFFF, …, FE0000 − FE7FFF, FE8000 − FEFFFF, FF0000 − FF7FFF, FF8000 − FFFFFF]
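The rule for when an absolute address can be encoded short comes down to whether sign extending its low word reproduces the address. A sketch with hypothetical function names, assuming the 24-bit address bus described earlier:

```python
def sign_extend_16(word):
    """Sign extend a 16-bit value to 32 bits."""
    word &= 0xFFFF
    return (word | 0xFFFF0000) if word & 0x8000 else word

def short_address(word):
    """Address actually reached on the 24-bit bus when a word address is used."""
    return sign_extend_16(word) & 0xFFFFFF

def fits_short(address):
    """True when the low word alone, after sign extension, gives the same address."""
    return short_address(address & 0xFFFF) == (address & 0xFFFFFF)

for address in (0x007234, 0x008234, 0xFF8234, 0x017004):
    kind = "short" if fits_short(address) else "long"
    print(f"${address:06X} -> {kind}")
```

Only the lowest and the highest 32 KB blocks pass the test, which matches the two ranges listed on the slide.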

Register Direct
★ Does not involve a memory address, hence fast
★ Effective address of operand is the register name
★ The MC68K data path is 32 bits. Hence register-register transfer takes 4 clock cycles regardless
of the size; it is done as a single micro-operation
★ Examples

MOVE.L  D0, D1          ; D1(31:0) ← D0(31:0)
MOVE.W  D0, D1          ; D1(15:0) ← D0(15:0)
MOVE.B  D0, D1          ; D1(7:0) ← D0(7:0)

★ Direct Address Register is not allowed as destination of MOVE
★ A dedicated instruction called MOVEA is used instead (Assembly restriction, not processor OpCode)

MOVEA.L D0, A0          ; A0(31:0) ← D0(31:0)
MOVEA.W D0, A0          ; A0(31:0) ← SE(D0(15:0)), the word is sign extended to fill A0
MOVEA.W A1, A0          ; A0(31:0) ← SE(A1(15:0))
MOVEA.B D0, A0          ; Syntax error, Address registers can not be byte sized
MOVEA.W D0, D1          ; Syntax error, Only address registers allowed for MOVEA

Address Register Indirect
★ Specified by enclosing the address register in parentheses
★ Fast, address is in the CPU and can be dynamically changed
★ Application: arrays, records, linked lists, etc.
★ Processor state is usually in hexadecimal even without the prefix $
★ Examples, Big Endian processor
❖ A1 = $1000
❖ A5 = $1002
❖ A6 = $1008
❖ D4 = $31295730

MOVE.W  (A1), D3        ; D3(15:0) ← M(A1)
                        ; D3(15:0) = $1234 & D3(31:16) unchanged
MOVE.W  D4, (A5)        ; M(A5) ← D4(15:0)
MOVE.L  D4, (A6)        ; M(A6) ← D4(31:0)
                        ; Ok to write it this way for the sake of learning, but in implementation …
                        ; M(A6) ← D4(31:16) then M(A6+2) ← D4(15:0)

Address   Content
$1000     12 34
$1002     57 30
$1004
$1006
$1008     31 29
$100A     57 30
$100C
$100E

Address Register Indirect with Post-increment
★ Auto adjustment provides faster access to structured data items; tables, arrays, etc.
★ Increment by 1 for .B, 2 for .W and 4 for .L instructions, hence less time and space
★ Exception is A7 (USP) and A7’ (SSP), where 2 is used for .B, to preserve alignment
★ Note that RTL uses one statement for .L sized memory accesses, but in fact it is done in
two cycles because it’s a word sized data bus

MOVE.L  (A0)+, D3       ; D3(31:0) ← M(A0); A0 ← A0 + 4
MOVE.W  (A0)+, D3       ; D3(15:0) ← M(A0); A0 ← A0 + 2
MOVE.B  (A0)+, D3       ; D3(7:0) ← M(A0); A0 ← A0 + 1
MOVE.W  D3, (A0)+       ; M(A0) ← D3(15:0); A0 ← A0 + 2
MOVE.L  (A7)+, D4       ; D4(31:0) ← M(A7); A7 ← A7 + 4
MOVE.W  (A7)+, D4       ; D4(15:0) ← M(A7); A7 ← A7 + 2
MOVE.B  (A7)+, D4       ; D4(7:0) ← M(A7); A7 ← A7 + 2

Address Register Indirect with Pre-decrement
★ Auto adjustment, increment or decrement; faster access to structured data items; tables,
arrays, etc.
★ Decrement by 1 for .B, 2 for .W and 4 for .L instructions, hence less time and space
★ Exception is A7 (USP) and A7’ (SSP), where 2 is used for .B, to preserve alignment
★ Applications include accessing data structures

MOVE.L  –(A0), D3       ; A0 ← A0 – 4; D3(31:0) ← M(A0)
MOVE.W  –(A0), D4       ; A0 ← A0 – 2; D4(15:0) ← M(A0)
MOVE.B  –(A0), D5       ; A0 ← A0 – 1; D5(7:0) ← M(A0)
MOVE.W  D4, –(A0)       ; A0 ← A0 – 2; M(A0) ← D4(15:0)
MOVE.B  D4, –(A7)       ; A7 ← A7 – 2; M(A7) ← D4(7:0)

★ Latency hiding, which of the two modes: –(Ai) and (Ai)+ is faster?
❖ As source, (Ai)+ is faster as we use then increment, but –(Ai) has to decrement first and
has to wait 2 clock cycles before use
❖ As destination, the pre-dec latency is also hidden, they are just as fast
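The order of use and adjustment in the two auto-adjusting modes, together with the A7 byte exception, can be captured in a small model. The register file is a plain dictionary and all names are illustrative; only the address arithmetic is modeled, not the data transfer.

```python
SIZES = {"B": 1, "W": 2, "L": 4}

def step(size, register):
    """Adjustment amount; byte accesses through A7 move by 2 to keep the stack aligned."""
    if size == "B" and register == "A7":
        return 2
    return SIZES[size]

def post_increment(registers, register, size):
    """(An)+ : use the current address, then advance the register."""
    address = registers[register]
    registers[register] = (address + step(size, register)) & 0xFFFFFFFF
    return address

def pre_decrement(registers, register, size):
    """-(An) : move the register back first, then use the new address."""
    registers[register] = (registers[register] - step(size, register)) & 0xFFFFFFFF
    return registers[register]

registers = {"A0": 0x1000, "A7": 0x2000}
print(hex(post_increment(registers, "A0", "L")), hex(registers["A0"]))
print(hex(pre_decrement(registers, "A0", "W")), hex(registers["A0"]))
print(hex(pre_decrement(registers, "A7", "B")), hex(registers["A7"]))
```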

Address Register Indirect with Displacement
★ Effective address computed by adding the content of address register to a signed 16-bit word,
d16, which is encoded as part of the instruction
★ Effective Address <ea> is the sum of address register content plus displacement
★ Applications include accessing data structures with records and fields

MOVE.L  (12, A1), D3    ; D3 ← M(ea) where ea = A1 + $C
MOVE.W  (– 6, A2), D0   ; D0(15:0) ← M(ea) where ea = A2 – $6
MOVE.B  D1, ($24, A3)   ; M(ea) ← D1(7:0) where ea = A3 + $24

★ Some Assemblers require the displacement written before the parenthesis; like MOVE.L
12(A1), D3 as opposed to MOVE.L (12, A1), D3
★ Example
❖ If in the above instructions A1 = $123400 and A2 = $123468, then
❖ The effective address of the source in the first instruction is
ea = $00123400 + $0000000C = $0012340C
❖ The effective address of the source in the second instruction is
ea = $00123468 + $FFFFFFFA = $00123462 ($FFFFFFFA is – 6)

Address Register Indirect with Displacement & Index
★ Effective address is the sum of three components; the address register, the longword or sign
extended lower order word of the index register, and the offset or displacement, which is 8-bit
signed or d8
★ Most complex addressing mode
★ Good for structures, like the element in row r column c in matrix m

MOVE.L  (6, A1, D0.W), D3       ; D3 ← M(ea) where ea = A1 + SE(D0(15:0)) + $6
MOVE.L  D4, ($24, A2, A5)       ; M(ea) ← D4 where ea = A2 + A5 + $24

★ SE means Sign Extend Xj if needed, then compute: ea = d8 + [Ai] + [Xj]
★ Index register is Xj but only .L or .W allowed, and the offset is d8
★ Example
❖ For the first instruction, assume: A1 = $1234A6, D0 = $12348812, then
❖ The effective address ea = $001234A6 + $FFFF8812 + $6 = $0011BCBE
❖ Why is the effective address lower than A1? Because the index is negative
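The effective address arithmetic for the displacement and the displacement with index modes can be checked with 32-bit wrap-around arithmetic. The function names are hypothetical; the operands are the ones used in the worked examples of the last two slides.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` of value as a two's complement signed number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def ea_displacement(base, d16):
    """(d16, Ai): address register plus a signed 16-bit displacement."""
    return (base + sign_extend(d16, 16)) & MASK32

def ea_indexed(base, index, index_size, d8):
    """(d8, Ai, Xj): base plus index (whole longword, or sign extended low word) plus d8."""
    bits = 16 if index_size == "W" else 32
    return (base + sign_extend(index, bits) + sign_extend(d8, 8)) & MASK32

print(f"${ea_displacement(0x123400, 12):08X}")
print(f"${ea_displacement(0x123468, -6):08X}")
print(f"${ea_indexed(0x1234A6, 0x12348812, 'W', 6):08X}")
```

With D0.W only the low word $8812 of the index takes part, and its sign bit is set, so the result lands below A1.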

Program Counter Relative
★ Special case of register indirect, where PC is used instead of Address registers
❖ Displacement; ea = PC + d16
❖ Displacement & Index: ea = PC + Xj + d8
★ OpCode extension is a word that is d16, or d8 and 5 bits encoding X; j and W/L

CPY     MOVE.W  (MSG, PC), D1   ; Copies M(MSG) = $4131 to D1(15:0)
        .                       ; MSG is d16 representing the distance to MSG label
        .                       ; from the updated value of the PC which is CPY + 2
MSG     DC.B    “A1”

★ Actual distance is encoded to be added to the PC in execution
★ Useful in making relocatable code, i.e. Position Independent Code (PIC) to reside anywhere in
memory
★ Example:
❖ Assume: MOVE.W instruction is at address $1000 & MSG at address $1008
❖ Then the displacement MSG to be encoded is $1008 - $1002 = $6
❖ Then, the instruction encoding is $323A $0006
❖ When it executes: Source ea = $1002 + $6 = $1008
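The displacement computed in the example depends only on the distance between the instruction and the label, which is what makes the code relocatable. A sketch, with hypothetical function names and the opcode word $323A taken from the slide:

```python
def pc_displacement(instruction_address, target_address):
    """16-bit displacement from the updated PC (instruction address + 2) to the target."""
    d16 = target_address - (instruction_address + 2)
    if not -0x8000 <= d16 <= 0x7FFF:
        raise ValueError("target is out of range for a 16-bit displacement")
    return d16 & 0xFFFF

def encode_move_w_pc_relative(instruction_address, target_address, opcode=0x323A):
    """Opcode word followed by the extension word holding d16."""
    return [opcode, pc_displacement(instruction_address, target_address)]

original = encode_move_w_pc_relative(0x1000, 0x1008)
relocated = encode_move_w_pc_relative(0x5000, 0x5008)   # same code loaded elsewhere
print([f"${word:04X}" for word in original])
print(original == relocated)
```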

Stack Operations
★ Stacks are memory sections accessible using dedicated pointers
★ In MC68000, there are two; Supervisor Stack & User Stack

MOVE.W  D0, –(SP)       ; SP ← SP – 2, M(SP) ← D0(15:0)
MOVE.W  (SP)+, D0       ; D0(15:0) ← M(SP), SP ← SP + 2

★ Assume the following and run the code segment below it
❖ SSP = $8428 & USP = $6214 & S = 1
❖ D1 = $12345678, D2 = $23456789 & D3 = $1A3B5

MOVE.W  D1, –(SP)       ; [1] SP ← SP – 2, M(SP) ← D1(15:0)
MOVE.L  D2, –(SP)       ; [2] SP ← SP – 4, M(SP) ← D2(31:16)
                        ; [3] M(SP+2) ← D2(15:0)
MOVE    #$00, SR        ; Switch State, S = 0
MOVE.L  D1, –(SP)       ; [4] SP ← SP – 4, M(SP) ← D1(31:16)
                        ; [5] M(SP+2) ← D1(15:0)
MOVE.W  D2, –(SP)       ; [6] SP ← SP – 2, M(SP) ← D2(15:0)
MOVE.W  D3, –(SP)       ; [7] SP ← SP – 2, M(SP) ← D3(15:0)

Address   Content   Step
$620C     A3 B5     7
$620E     67 89     6
$6210     12 34     4
$6212     56 78     5
$6214                       (initial USP)
…
$8422     23 45     2
$8424     67 89     3
$8426     56 78     1
$8428                       (initial SSP)
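The seven numbered steps of the stack example can be replayed with a small model of a pre-decrement push on a Big Endian machine. The `push` and `run` names and the dictionary used as memory are assumptions; the switch from the supervisor stack to the user stack is modeled simply by changing which pointer is used.

```python
def push(memory, sp, value, size):
    """-(SP) write: move the pointer down by size, then store the bytes high order first."""
    sp -= size
    for offset, byte in enumerate(value.to_bytes(size, "big")):
        memory[sp + offset] = byte
    return sp

def run():
    memory = {}
    ssp, usp = 0x8428, 0x6214
    d1, d2, d3 = 0x12345678, 0x23456789, 0x0001A3B5
    # S = 1: SP is the Supervisor Stack Pointer
    ssp = push(memory, ssp, d1 & 0xFFFF, 2)      # step 1
    ssp = push(memory, ssp, d2, 4)               # steps 2 and 3
    # MOVE #$00, SR clears S: SP is now the User Stack Pointer
    usp = push(memory, usp, d1, 4)               # steps 4 and 5
    usp = push(memory, usp, d2 & 0xFFFF, 2)      # step 6
    usp = push(memory, usp, d3 & 0xFFFF, 2)      # step 7
    return memory, ssp, usp

memory, ssp, usp = run()
print(f"SSP = ${ssp:04X}, USP = ${usp:04X}")
for address in sorted(memory)[::2]:
    print(f"${address:04X}: {memory[address]:02X} {memory[address + 1]:02X}")
```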
