ECE222


The most natural way to represent a number in a computer system is by a string of bits called a binary number.

Consider an n-bit binary number B = b_{n-1} b_{n-2} … b_1 b_0. We usually interpret the positional number system from
MSB to LSB as:

V(B) = b_{n-1}·2^(n-1) + b_{n-2}·2^(n-2) + … + b_1·2^1 + b_0·2^0

where V(B) is the value of the number.
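As a sketch, the positional interpretation can be checked in Python (the function name `value` is just for illustration):

```python
# Sketch: value of an unsigned n-bit number, V(B) = sum of b_i * 2^i,
# with bits listed from MSB to LSB
def value(bits):
    n = len(bits)
    return sum(b * 2 ** (n - 1 - i) for i, b in enumerate(bits))

print(value([1, 0, 1, 1]))  # 1*8 + 0*4 + 1*2 + 1*1 = 11
```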

Internally, computers generally use a fixed number of bits (8, 16, 32, etc.).

The natural method to perform arithmetic operations is modular arithmetic.

• Addition moves clockwise around the number wheel, while subtraction moves counter-clockwise.
• When we cross from 111 to 000, a "carry" is generated.
• When we cross from 000 to 111, a "borrow" is generated.
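A minimal Python sketch of this 3-bit modular "wheel" arithmetic (the helper names are hypothetical):

```python
# 3-bit modular arithmetic: patterns 000..111 represent 0..7, all results mod 8
N = 3
MOD = 1 << N  # 8

def add_mod(a, b):
    total = a + b
    carry = total >= MOD   # crossing 111 -> 000 generates a carry
    return total % MOD, carry

def sub_mod(a, b):
    diff = a - b
    borrow = diff < 0      # crossing 000 -> 111 generates a borrow
    return diff % MOD, borrow

print(add_mod(0b111, 0b001))  # (0, True): wrapped past 111, carry generated
print(sub_mod(0b000, 0b001))  # (7, True): wrapped past 000, borrow generated
```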

We also need to be able to represent both positive and negative numbers. Three systems are used for
representing such numbers:

1. Sign-and-magnitude
Dedicate the MSB to the sign.
2. 1's complement
Invert all the bits in the positive number to get the corresponding negative number. With 3 bits, this form
gives 8 patterns but only 7 distinct numbers (0 is represented twice).
3. 2's complement
Copy the bits up to and including the first '1' from the right, then invert the remaining bits to the left. You
can also flip all the bits (the 1's complement) and add 1. With 3 bits, this form gives 8 patterns and 8 distinct numbers.

Sign-and-magnitude and 1's complement have two representations for 0. This is not a good use of bits.
There is a more formal approach to calculate these forms:

• To form the 1's complement of N, an n-bit number:

Ñ = (2^n − 1) − N

• To form the 2's complement of N, an n-bit number:

N* = 2^n − N (equivalently, the 1's complement plus 1)

The interpretation of an n-bit 2's complement number B = b_{n-1} … b_1 b_0 is:

V(B) = −b_{n-1}·2^(n-1) + b_{n-2}·2^(n-2) + … + b_1·2^1 + b_0·2^0

If the MSB (b_{n-1}) is a '1', this indicates a negative number.
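These definitions translate directly into a small Python sketch (function names are illustrative; n-bit patterns are held as plain integers):

```python
# Sketch of the formal 1's/2's complement definitions
def ones_complement(N, n):
    return (2 ** n - 1) - N          # invert all n bits

def twos_complement(N, n):
    return (2 ** n - N) % 2 ** n     # or: ones_complement(N, n) + 1

def interpret(pattern, n):
    # n-bit 2's complement interpretation: the MSB carries weight -2^(n-1)
    if pattern >= 2 ** (n - 1):      # MSB is 1 -> negative
        return pattern - 2 ** n
    return pattern

n = 3
print(format(twos_complement(3, n), "03b"))  # 101: the 3-bit pattern for -3
print(interpret(0b101, n))                   # -3
```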

In the addition and subtraction of the 2's complement, carry and borrow must be ignored.

To add two numbers, add their n-bit representations, ignoring the carry-out bit from the MSB position. The sum will
be the algebraically correct value in 2's-complement representation if the actual result is in the range −2^(n−1)
through 2^(n−1) − 1.

To subtract two numbers X and Y, that is, to perform X − Y, form the 2's-complement of Y, then add it to X using the
add rule. Again, the result will be the algebraically correct value in 2's-complement representation if the actual result
is in the range −2^(n−1) through 2^(n−1) − 1.

Now there is no problem crossing 111 to 000 (−1 to 0) or vice versa. However, a problem occurs when crossing
011 to 100 (3 to −4) or vice versa; this is an "overflow" or "underflow". It is important to note that if a carry or
borrow occurs, it has no meaning for signed numbers and is ignored.
In 1's-complement addition, if no carry occurs, the answer is correct. However, if a carry occurs, the correct answer
is obtained by adding back 1 (end-around carry).

When performing addition/subtraction, if the signs of the addends are the same, the sign of the answer must also be
the same; if it is not, an overflow has occurred. If the signs of the addends are different, ignore any carries - the
answer will always be correct. The same rules apply for 2's complement.
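The sign-based overflow rule can be sketched in Python for 4-bit 2's-complement addition (names are illustrative; the carry out of the MSB is ignored for signed results):

```python
# Sketch: 4-bit 2's-complement addition with carry and overflow flags
N_BITS = 4
MASK = (1 << N_BITS) - 1

def to_signed(v):
    v &= MASK
    return v - (1 << N_BITS) if v >= 1 << (N_BITS - 1) else v

def add_signed(x, y):
    raw = (x & MASK) + (y & MASK)
    carry = raw > MASK                # crossing 1111 -> 0000
    result = to_signed(raw)
    # overflow: same-sign addends but a different-sign result
    overflow = (x >= 0) == (y >= 0) and (result >= 0) != (x >= 0)
    return result, carry, overflow

print(add_signed(3, 2))    # (5, False, False)
print(add_signed(7, 1))    # (-8, False, True): crossing 0111 -> 1000
print(add_signed(-1, 1))   # (0, True, False): the carry has no meaning and is ignored
```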

The grouping of bits was done to ease writing. Taking bits one at a time gives the binary system, grouping 3 bits
gives the octal system, and grouping 4 bits gives the hexadecimal system.
Electronic digital computers as we know them today have been developed since the 1940s. A long, slow evolution
of mechanical calculating devices preceded the development of electronic computers. Here, we briefly sketch the
history of computer development.

• A series of complex mechanical devices, constructed from gear wheels, levers and pulleys were used to
perform basic operations of addition, subtraction, multiplication and division. Holes of punched cards were
mechanically sensed and used to control the automatic sequencing of a list of calculations which essentially
provided a programming capability. These devices enabled the computation of complete mathematical tables
of logarithms and trigonometric functions as approximated by polynomials. Output results were punched on
cards or printed on paper.
• Electromechanical relay devices, such as those used in early telephone switching systems, provided the means
for performing logic functions in computers built in the late 1930s and early 1940s.
• The first design for a digital computer came from Charles Babbage, with Ada Lovelace regarded as the "first" programmer.
• Vacuum tube circuits were used to perform logic operations and to store data. This technology initiated the
modern era of electronic digital computers

• The key concept of a stored program was introduced at the same time as the development of the first
electronic digital computer. Programs and their data were located in the same memory, as they are today. This
facilitates changing existing programs and data or preparing and loading new programs and data.
• Assembly language was used to prepare programs and was translated into machine language for execution.
• Basic arithmetic operations were performed in a few milliseconds, using vacuum tube technology to
implement logic functions.
• I/O functions were performed by devices similar to typewriters.
• Magnetic core memories and magnetic tape storage devices were also developed.
• Colossus, an electronic digital computer, was built by British codebreakers during World War II from over
1700 vacuum tubes. It was used to break the codes of the German Lorenz SZ-40 cipher machine that was
used by the German High Command.
• This era also saw the first electronic general-purpose computer, the ENIAC, the EDVAC, an electronic
computer designed to be a stored-program computer as well as the IBM 701/704/709.

• The transistor was invented at AT&T Bell Laboratories in the late 1940s and quickly replaced the vacuum
tube in implementing logic functions. This fundamental technology shift marked the start of the second
generation.
• Magnetic core memories and magnetic drum storage devices were widely used.
• Magnetic disk storage devices were developed in this generation.
• The earliest high-level languages, such as Fortran, were developed, making the preparation of application
programs much easier.
• Compilers were developed to translate these high-level language programs into assembly language, which
was then translated into executable machine-language form.
• This era saw the development of the DEC PDP-1 (the first "cheap" computer), the IBM 7090 and 7094 ("fast"
computers) and the CDC 6600.

• Texas Instruments and Fairchild Semiconductor developed the ability to fabricate many transistors on a single
silicon chip, called integrated-circuit technology. This enabled faster and less costly processors and memory
elements to be built. This began to replace magnetic core memories.
• Other developments included the introduction of microprogramming, parallelism and pipelining.
• Operating system software allowed efficient sharing of a computer system by several user programs.
• Cache and virtual memories were developed. Cache memory makes the main memory appear faster than it
really is and virtual memory makes it appear larger.
• System 360 mainframe computers from IBM (IBM System 360) and the line of PDP minicomputers from
Digital Equipment Corporation (DEC PDP-11) were dominant commercial products.
• Integrated circuit fabrication techniques had evolved to the point where complete processors and large
sections of the main memory of small computers could be implemented on single chips.
• Tens of thousands of transistors could be placed on a single chip, and the name VLSI was coined to describe
this technology.
• A complete processor fabricated on a single chip became known as a microprocessor.
• Companies such as Intel, National Semiconductor, Motorola, Texas Instruments and Advanced Micro
Devices have been the driving forces of this technology.
• A particular form of VLSI technology, called Field Programmable Gate Arrays (FPGAs) has allowed system
developers to design and implement processor, memory and I/O circuits on a single chip to meet the
requirements of specific applications, especially in embedded computer systems.
• Embedded computer systems, portable notebook computers and versatile mobile telephone handsets are now
in widespread use. Personal desktop computers and workstations interconnected by wired or wireless local
area networks and the Internet, with access to database servers and search engines, provide a variety of
powerful computing platforms.
• Supercomputers and Grid computers, at the upper end of high performance computing, are used for weather
forecasting, scientific and engineering computation and simulations.
• This generation saw the development of the VAX 9000, IBM 3090, Cray X-MP (supercomputer) and the Intel
4004/8008/8080 (microprocessors that enabled personal computers).

• This saw the development of the Cray MPP (massively parallel supercomputer) and the Fujitsu VPP 500.

Milestone machines in high-performance computing:

• 2009 - IBM Roadrunner: 12,960 IBM PowerXCell CPUs and 6,480 AMD Opteron dual-core processors; 2.35 MW; 296 racks; 104 TB; 1.7 petaflops (floating-point operations per second)
• May 2010 - Cray Jaguar: 224,256 x86-based AMD Opteron processor cores; 1.75 petaflops
• Cray XK6: AMD Opteron 6200 16-core CPUs and NVIDIA Tesla 20-series GPUs (graphics processors); 63.8 teraflops per cabinet, scalable to 50 petaflops
• 2012 - IBM Sequoia: 7.9 MW; 3,000 square feet; 1.6 PB; 16.32 petaflops
• 2013 - Titan: 18,688 AMD CPUs (16 cores in each chip) and 18,688 NVIDIA Tesla GPUs; 8.2 MW; 40 PB; 27 petaflops
• June 2017 - Sunway TaihuLight (China): 93 petaflops
• 2017 - Tianhe-2 (China): 33.86 petaflops
• June 2019 - Aurora: US government contract with Intel and Cray; target of 1.5 exaflops

• It is important to note that although the growth in performance may look linear when plotted, the scale is exponential.

• NSA Prism Data Center, Bluffdale, Utah:

It is estimated that they collect 10 TB of data every 6 hours. The facility will have capacity for 5 zettabytes
(1 zettabyte = 1 billion terabytes).
Since their introduction in the 1940s, digital computers have evolved into many different types that vary widely in
size, cost, computational power and intended use. Modern computers can be divided roughly into four general
categories:
• Embedded computers are integrated into a larger device or system in order to automatically monitor and
control a physical process or environment. They are used for a specific purpose rather than for general
processing tasks. Typical applications include industrial and home automation, appliances,
telecommunication products and vehicles.
• Personal computers support a variety of applications such as general computation, document preparation,
computer-aided design, audiovisual entertainment, etc. There are a number of classifications for personal
computers:
○ Desktop computers serve general needs and fit within a typical personal workplace
○ Workstation computers offer higher computational capacity and more powerful graphical display
capabilities for engineering and scientific work
○ Portable and notebook computers provide the basic features of a personal computer in a smaller
lightweight package.
• Servers and enterprise systems are large computers that are meant to be shared by a potentially large number
of users who access them from some form of personal computer over a public or private network. Such
computers may host large databases and provide information processing for a government agency or a
commercial organization.
• Supercomputers and grid computers normally offer the highest performance. They are the most expensive and
physically the largest category of computers. Supercomputers are used for the highly demanding computation
needed in weather forecasting, engineering design and simulation, and scientific work. They have a high cost.
Grid computers provide a more cost-effective alternative. They combine a large number of personal
computers and disk storage units in a physically distributed high-speed network, called a grid, which is
managed as a coordinated computing resource. By evenly distributing the computational workload across the
grid, it is possible to achieve high performance on large applications ranging from numerical computation to
information searching.

There are three basic components:


• Input/Output (I/O)
Input devices allow input of data from the outside world; these include keyboards, tape, switches, sensors, etc.
Output devices display results and actions to the outside world; these include displays, printers, relays, etc.
• Memory
This is the storage of programs and data. Generally, however, there is a tradeoff between cost and speed: high-speed
RAM costs more per unit of storage than bulk storage devices. For example, the DVD is the cheapest memory device
per unit storage but it is also the slowest. There are two classes of storage, called primary and secondary.
○ Primary storage is a fast memory that operates at electronic speeds. Programs must be stored in this
memory while they are being executed. This memory consists of a large number of semiconductor
storage cells, each capable of storing one bit of information.
○ Secondary storage consists of external devices such as DVDs and flash memory devices.
• Central Processing Unit (CPU)
The CPU consists of the control unit, the arithmetic and logic unit (ALU) and temporary storage. The control unit
is responsible for the execution of instructions, timing and sequencing. The ALU is responsible for performing
arithmetic and logic operations. The temporary storage includes fast on-chip registers and cache.

The operation of a computer can be summarized as follows:


• The computer accepts information in the form of programs and data through an input unit and stores it in
memory.
• Information stored in the memory is fetched under program control into an arithmetic and logic unit, where it
is processed
• Processed information leaves the computer through an output unit
• All activities in the computer are directed by the control unit
There are two basic organizations - single and multiple (two) bus structures. A bus is a collection of wires grouped
together for a single purpose.
• Single bus - all components are connected by a single "set" of "wires"

• Two bus structure (costs a lot more)


• The memory consists of many millions of storage cells, each of which can store a bit of information having
the value 0 or 1. Because a single bit represents only a very small amount of information, bits are seldom
handled individually.
• The bits that make up the main memory are organized as a series of fixed sized "words" which are referenced
by their memory "address". This is so that a group of n bits can be stored or retrieved in a single, basic
operation.
• Each group of n bits is referred to as a word of information, and n is called the word length.

• The lowest memory address is 0 while the highest memory address is 2^n − 1.
• Each address points to a unique location in the memory.
• The contents stored at a memory address are independent of the address itself.

• Modern computers have word lengths that typically range from 16 to 64 bits. If the word length of a
computer is, for example 32 bits, a single word can store a 32-bit signed number or four ASCII-encoded
characters, each occupying 8 bits. A unit of 8 bits is called a byte.
• Machine instructions may require one or more words for their representation.

• Accessing the memory to store or retrieve a single item of information requires distinct names or addresses
for each location.
• It is customary to use numbers from 0 to 2^n − 1, for some value of n, as the addresses of successive
locations in the memory.
• Hence, the memory can have up to 2^n memory locations.
• The 2^n addresses constitute the address space of the computer.
• Definition (Byte): 8 bits
• Definition (Word Length): 16 to 64 bits (4 to 8 bytes)
• It is impractical to assign distinct addresses to individual bit locations in the memory.
• Instead, successive addresses refer to successive byte locations in the memory. The term byte addressable
memory is used for this assignment.
• Byte locations have addresses 0, 1, 2….Thus, if the word length of the machine is 32 bits, successive words
are located at addresses 0, 4, 8…., with each word consisting of four bytes.

There are two ways to assign bits:


• Big-Endian: lower byte addresses are used for the most significant bytes of the word
• Little-Endian: lower byte addresses are used for the least significant bytes of the word

• In both cases, byte addresses are taken as the addresses of the successive words in the memory of a computer
with 32-bit word length. These are the addresses used when accessing the memory to store or retrieve a word.
• Which is better? Either works.
• Which is preferred? Big-Endian, because the bytes appear in the same order in which the number is written.
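The two byte orderings can be observed with Python's struct module; a small sketch:

```python
import struct

word = 0x12345678  # a 32-bit word, four bytes

big = struct.pack(">I", word)     # Big-Endian: most significant byte at the lowest address
little = struct.pack("<I", word)  # Little-Endian: least significant byte at the lowest address

print(big.hex())     # 12345678
print(little.hex())  # 78563412
```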
• Both program instructions and data operands are stored in the memory.
• How to execute an instruction? The processor control circuits must cause the word (or words) containing the
instruction to be transferred from the memory to the processor. Operands and results must also be moved
between the memory and the processor. Thus, two basic operations involving the memory are needed - load
and store.

• Load (also called Read or Fetch)
○ Definition: gets information from a specific memory location, transferring a copy of the contents of
that location to the processor.
○ Memory contents: remain unchanged.
○ How to initiate: the processor sends the address of the desired location to the memory and requests
that its contents be read. The memory reads the data stored at that address and sends them to the
processor.
• Store (also called Write)
○ Definition: transfers an item of information from the processor (CPU) to a specific memory location.
○ Memory contents: the former contents of that location are overwritten.
○ How to initiate: the processor sends the address of the desired location to the memory, together with
the data to be written into that location. The memory then uses the address and data to perform the
write.
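The load/store behaviour above can be sketched with a toy memory class (the class and method names are invented for illustration, not a real API):

```python
# Toy memory supporting the two basic operations
class Memory:
    def __init__(self):
        self.cells = {}

    def load(self, address):
        # Read/Fetch: a copy is returned; the memory contents remain unchanged
        return self.cells.get(address, 0)

    def store(self, address, data):
        # Write: the former contents of the location are overwritten
        self.cells[address] = data

mem = Memory()
mem.store(0x100, 42)
print(mem.load(0x100))  # 42
mem.store(0x100, 7)     # overwrites the former contents
print(mem.load(0x100))  # 7
```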
• Definition (Computer program): A computer program is a sequence of arithmetic, logic and control
operations (transfers, etc).
• A computer must have instructions capable of performing four types of operations:
○ Data transfers between the memory and the process registers
○ Arithmetic and logic operations on data
○ Program sequencing and control
○ I/O transfers

• Definition (Instructions): Instructions are a series of binary bits stored in memory and interpreted by the
CPU. Instructions may consist of one or more words in memory. Instructions are also characterized by the
number of addresses (for operands which also exist in memory) they require. They specify an operation to be
performed and the operands involved.

• 3-address instructions - ADD A, B, C: operands found in locations A and B are added and the result is stored
in location C (Operation, Source1, Source2, Destination)
• 2-address instructions - ADD A, B: operands found in locations A and B are added and the result replaces
the contents of B (Operation, Source, Destination)
• 1-address instructions - ADD A: one of the operands is found in a default location (such as a CPU register)
• 1 1/2-address instructions - ADD A, Reg: one of the operands is found in a named CPU register
• 0-address instructions - PUSH or POP: operands are in a fixed location (such as a stack)

• 3 address instructions are very flexible but programs are very large (multiple word instructions)

RISC vs CISC Instruction Sets


• One of the most important characteristics that distinguish different computers is the nature of their instruction set.
There are two fundamentally different approaches to the design of instruction sets for modern computers.
• RISC (Reduced Instruction Set Computers):
○ One popular approach is based on the premise that higher performance can be achieved if each
instruction occupies exactly one word in memory, and all operands needed to execute a given
arithmetic or logic operation specified by an instruction are already in processor registers. This
approach is conducive to an implementation of the processing unit in which the various operations
needed to process a sequence of instructions are performed in "pipelined" fashion to overlap activity
and reduce total execution time of a program.
○ The restriction that each instruction must fit into a single word reduces the complexity and the number
of different types of instructions that may be included in the instruction set of a computer
○ SUMMARY: Small set of instructions with few addressing options but are very fast and efficient -
easier to design compiler but longer programs
• CISC (Complex Instruction Set Computers):
○ Complex instructions, which may span more than one word of memory, and which may specify more
complicated operations.
○ SUMMARY: Many instructions in the instruction set with many addressing options - very difficult to get the
compiler to actually use some instructions, but smaller programs
• Key RISC characteristics are:
○ Each instruction fits in a single word. Hence, there are simple addressing modes
○ A load/store architecture is used
▪ Operands in memory are accessed using load and store instructions to move to registers
▪ All operands involved in an arithmetic or logic operation must either be in processor registers,
or one of the operands may be given explicitly within the instruction word
○ There are a small number of instructions
○ Larger programs
• Key CISC characteristics are:
○ Not constrained to the load/store architecture, in which arithmetic and logic operations can be
performed only on operands that are in processor registers - many complex addressing modes,
operands can be in registers or memory
○ Instructions do not necessarily have to fit into a single word. Some instructions may occupy a single
word, but others may span multiple words.
○ Programs are smaller
• Harvard Architecture
○ Stores machine instructions and data in separate memory units that are connected by different busses.
○ There are at least two memory address spaces to work with, so there is a separate memory address register
for machine instructions and another for data.
○ Computers designed with the Harvard architecture are able to run a program and access data
independently, and therefore simultaneously.
○ It has a strict separation between data and code. Thus, it is more complicated.

• Von Neumann Architecture


○ Single bus for data and instructions.
○ Memory contains instructions and data that run the program
○ Since you cannot access program memory and data memory simultaneously, this architecture is
susceptible to bottlenecks and system performance is affected.
○ It is more efficient in its use of memory space, but because code and data share the same memory it is
easier for malware to take hold. However, it is cheaper.
• How does the CPU know which is which? Due to prior memory
allocation
• Definition (Program Code): This is a "sequence" of instructions
placed in "sequential" memory locations.
• It uses a special register called the Program Counter (PC) to keep
track of program execution

• Instruction Register (IR): holds the current instruction (not its address) that is being executed
• Program Counter (PC): stores the address of the next instruction in RAM (memory)
• Memory Data Register (MDR): stores the data that is to be sent to or fetched from memory
• Memory Address Register (MAR): stores the address of the memory location currently being accessed
• CONTROL: enables the CPU to communicate with memory
• General Purpose Registers: used for different purposes such as holding intermediate results

• The CPU registers are high-speed memory locations built into the microprocessor. The CPU uses these
memory locations to store data and instructions temporarily for processing.
• There are two parts to the execution of an instruction:
1. Fetch the instruction from memory (as pointed to by the PC) and place it in the Instruction Register
(IR).
2. Perform the specific instruction:
i. Fetch operands
ii. Arithmetic/logic operations
iii. Store results in memory
iv. Update PC and repeat

Let us consider an instruction ADD A, B which is in location M of memory.

• Fetch the first instruction in the program from the main memory.
The PC is the key register here. Copy the PC into the MAR. The
address M is first placed in the MAR. A read signal is generated
by the CPU.
• Because the MAR is clocked, the PC is unaltered. Read the
memory into the MDR. The content of location M, i.e., ADD A,
B instruction is placed in MDR which is then placed in the IR.
The instruction is decoded and required signal components are
activated to perform the operation.
• Instruction is executed, that is, the ADD operation is performed.
• First Operand address A is placed in MAR and read signal is
generated. The operand content 5 is fetched from memory and
placed in MDR. This data 5 is sent to ALU by the single bus.
Second operand address B is placed in MAR and read signal is
generated. Operand data 2 is fetched and placed in MDR. This
data, 2 is then sent to ALU by the single bus. Once two operands
are available inside the ALU, arithmetic addition is performed
due to the signals generated by the system. The result 7 is
temporarily placed in the general purpose register R0.
• The result is then sent to the MDR by the single bus. The address of the
memory location where the result should be stored is placed in the
MAR. Once the result data and address are placed in the MDR and
MAR respectively, a write signal is generated, which transfers the
result data from the CPU to the proper memory location.
• The considered instruction ADD A, B is in 2-address instruction
format. In this format, 2nd operand source is destination also.
Hence, MAR will hold B and result data 7 is placed to
location B of memory.
• PC is incremented to get the next instruction.
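The walkthrough can be condensed into a small Python sketch. Register names follow the notes; the memory layout (instruction at location M = 0, operands A and B at addresses 0x10 and 0x14) is invented for illustration:

```python
# Fetch/execute sketch for "ADD A, B" (2-address form: the result replaces B)
memory = {0x00: ("ADD", 0x10, 0x14),  # location M holds the instruction
          0x10: 5,                    # operand A
          0x14: 2}                    # operand B
PC = 0x00

# Fetch: PC -> MAR, memory read -> MDR, MDR -> IR
MAR = PC
MDR = memory[MAR]
IR = MDR

# Execute: fetch both operands through MAR/MDR, add in the ALU, write back to B
op, A, B = IR
MAR = A; MDR = memory[MAR]; operand1 = MDR    # first operand (5) fetched
MAR = B; MDR = memory[MAR]; operand2 = MDR    # second operand (2) fetched
R0 = operand1 + operand2                      # ALU result held in R0
MAR = B; MDR = R0
memory[MAR] = MDR                             # write signal stores 7 at location B

PC += 4                                       # PC incremented to the next instruction
print(memory[0x14], PC)  # 7 4
```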
Load Register R1 with the contents of location with Label A [contents of R1 <- contents of A]

In this notation there is no real difference between LDR and MOV - both load a register (in ARM itself, LDR reads from memory while MOV copies a register or immediate value).

( ) - contents of

The processor contains a register called the PC which holds the address of the next instruction to be executed. To
begin executing a program, the address of its first instruction (i) must be placed into the PC. Then, the processor
control circuits use the information in the PC to fetch and execute instructions, one at a time, in the order of
increasing addresses. This is called straight-line sequencing. During the execution of each instruction, the PC is
incremented by 4 to point to the next instruction. Thus, after the Store instruction at location i+12 is executed, the
PC contains the value i+16, which is the address of the first instruction of the next program segment.

Executing a given instruction is a two-phase procedure. In the first phase, called instruction fetch, the
instruction is fetched from the memory location whose address is in the PC. This instruction is placed in the IR
in the processor. At the start of the second phase, called instruction execute, the instruction in IR is examined to
determine which operation is to be performed. The specified operation is then performed by the processor. This
involves a small number of steps such as fetching operands from the memory or from processor registers,
performing an arithmetic or logic operation, and storing the result in the destination location. At some point
during this two-phase procedure, the contents of the PC are advanced to point to the next instruction. When the
execute phase of an instruction is completed, the PC contains the address of the next instruction, and a new
instruction fetch phase can begin.
• Sometimes, the next sequential instruction is not the one to be executed - loops, conditional tests, etc. Branch
instructions are required. This type of instruction loads a new address into the program counter. As a
result, the processor fetches and executes the instruction at this new address, called the branch target,
instead of the instruction at the location that follows the branch instruction in sequential address order.
• Consider adding a series of N numbers together - we could just write a straight-line program to add them,
but this does not generalize to arbitrary N.

#: immediate operand
R0: holds the partial sum
R1: keeps track of how many times to repeat
R2: contains the address of the first element of the array
[ ]: indirect - [R2] (or ((R2))) takes the contents of the memory location pointed to by the contents of R2

*A step was left out: you need to change R2, incrementing it by 4 bytes each iteration to get the next number to
add. Otherwise, you would have added the same number over and over.
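The loop (with the missing R2 increment restored) can be mirrored in Python; the addresses and data values are illustrative:

```python
# Sketch of the add-N-numbers loop: R0 = partial sum, R1 = count, R2 = pointer
memory = {100: 10, 104: 20, 108: 30}  # N = 3 numbers in consecutive words
R0 = 0      # partial sum
R1 = 3      # how many times to repeat
R2 = 100    # address of the first element of the array

while R1 > 0:
    R0 += memory[R2]  # ADD [R2], R0 - indirect through R2
    R2 += 4           # the step the notes flag: advance to the next word
    R1 -= 1           # decrement the count; branch back while non-zero

print(R0)  # 60
```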

• What would happen if there are too many numbers (so the sum no longer fits)?

○ After every addition you should check whether there are enough bits to represent the number.
○ Otherwise, you can add more bits.
• A conditional branch (Bcc, where cc names the condition) tests the contents of a condition code register or status register
• The four most common flags are:
○ N (negative) set if the MSB is 1, cleared otherwise
○ Z (zero) set if result is 0
○ V (overflow)
○ C (carry)
• The condition code flags are usually set upon the result of any arithmetic or logic operation
• Type of numbers (2's complement or unsigned) is determined solely by what flags are tested
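A sketch of how the four flags could be derived after an n-bit addition (the function is illustrative, not any particular processor's flag logic):

```python
# Sketch: computing the N, Z, C, V flags after an n-bit addition
def flags_after_add(x, y, n=8):
    mask = (1 << n) - 1
    raw = (x & mask) + (y & mask)
    result = raw & mask
    N = (result >> (n - 1)) & 1 == 1         # MSB of the result is 1
    Z = result == 0                          # result is zero
    C = raw > mask                           # carry out of the MSB
    # V: same-sign operands but a different-sign result (2's complement overflow)
    sx, sy, sr = (x >> (n - 1)) & 1, (y >> (n - 1)) & 1, (result >> (n - 1)) & 1
    V = sx == sy and sr != sx
    return N, Z, C, V

print(flags_after_add(0x7F, 0x01))  # (True, False, False, True): 127 + 1 overflows signed
print(flags_after_add(0xFF, 0x01))  # (False, True, True, False): carry out, zero result
```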

B - no test - Unconditional - always take the branch
BEQ - Z = 1 - Equal to zero - comparison equal or zero result
BNE - Z = 0 - Not equal to zero - comparison not equal or non-zero result
BCS - C = 1 - Carry set - arithmetic operation gave a carry out
BCC - C = 0 - Carry clear - arithmetic operation did not produce a carry
• There are many methods of addressing operands
• Not all are implemented in all processors (some are more useful than others)
• RISC will use a small subset of these.
• Programs are normally written in a high level language, which enables the programmer to conveniently
describe the operations to be performed on various data structures. When translating a high level language
program into assembly language, the compiler generates appropriate sequences of low level instructions
that implement the desired operations. The different ways for specifying the locations of instruction
operands are known as addressing modes.

Name Assembler Syntax Addressing Example


(Context) Function
Immediate #Value Operand = Value MOVE #6, R1
(R1) <- 6

Register Ri EA = Ri MOVE R1, R2


(R2) <- (R1)

Absolute (Direct) LOC EA = LOC MOVE LOC, R2


R2 <- (LOC)

Register Indirect (Ri) or (LOC) EA = (Ri) or (LOC) ADD (R2), R1


(R1) <- ((R2)) + (R1)

ADD (LOC), R3
(R3) <- ((LOC)) + (R3)

Index X(Ri) EA = (Ri) + X MOVE 8(R0), R3


(R3) <- ((R0) + 8)

Base with Index (Ri, Rj) EA = (Ri) + (Rj) MOVE (R1, R2), R3
(R3) <- ((R1) + (R2))

Base with Index and X(Ri, Rj) EA = (Ri) + (Rj) + X MOVE 8(R1, R2), R3
Offset (R3) <- ((R1)+(R2)+8)

Relative X(PC) EA = (PC) + X BNE LOOP

If Z =1, (PC) <- (PC) + #


(the appropriate value) else
(PC) <- (PC) + relative
offset
X

Autoincrement (Ri)+ EA = (Ri) ADD (R2)+, R3


(or Auto Post Increment) (R3) <- ((R2)) + (R3)
(R2) <- (R2) + inc

Autodecrement -(Ri) EA = (Ri) ADD -(R2), R3


(or Auto Pre Decrement) (R2) <- (R2) - inc
(R3) <- ((R2)) + (R3)

EA = Effective address VALUE = a signed number X = index value
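Several of these EA calculations can be sketched in Python with a toy register file and memory (the contents are invented for illustration):

```python
# Effective-address (EA) computation for a few addressing modes
reg = {"R0": 200, "R1": 4, "R2": 300}
mem = {200: 11, 204: 22, 208: 33, 300: 44, 304: 55, 312: 66}

ea = reg["R2"]                    # Register Indirect (R2): EA = (R2)
print(mem[ea])                    # 44

ea = reg["R0"] + 8                # Index 8(R0): EA = (R0) + 8
print(mem[ea])                    # 33

ea = reg["R2"] + reg["R1"]        # Base with Index (R2, R1): EA = (R2) + (R1)
print(mem[ea])                    # 55

ea = reg["R2"] + reg["R1"] + 8    # Base with Index and Offset 8(R2, R1)
print(mem[ea])                    # 66

ea = reg["R0"]                    # Autoincrement (R0)+: EA = (R0), then increment
loaded = mem[ea]
reg["R0"] += 4                    # word-sized increment
print(loaded, reg["R0"])          # 11 204
```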


• In Advanced RISC Machines (ARM), the word length is 32 bits, memory is byte-addressable using 32-bit
addresses and the processor registers are 32 bits long.
• Three operand lengths are used in moving data between the memory and the processor registers: byte (8
bits), half word (16 bits) and word (32 bits).
• Word and half-word addresses must be aligned, that is, they must be multiples of 4 and 2, respectively.
• Both little-endian and big-endian memory addressing schemes are supported.
• RISC-Style Aspects:
○ All instructions have a fixed length of 32 bits
○ Only Load and Store instructions access memory
○ All arithmetic and logic instructions operate on operands in processor registers
• CISC-Style Aspects:
○ Autoincrement, autodecrement and PC-relative addressing modes are provided
○ Condition codes (N, Z, V and C) are used for branching and for conditional execution of
instructions.
○ Multiple registers can be loaded from a block of consecutive memory words, or stored in a block,
using a single instruction (stack)

• There are 15 general purpose registers (R0-R14) plus a 32 bit program counter (R15)
• There is also a 32 bit current program status register (CPSR) which holds, among other things, the
condition code flags (N, Z, C, V)

• Six operating modes:


○ User
○ Fast Interrupt Mode (FIQ)
○ Normal Interrupt (IRQ)
○ Supervisor
○ Abort (for handling memory access violations)
○ Undefined (Undef) for handling undefined instructions
• Immediate Addressing: [#VALUE]
○ Only 16 or 12 bits of the 32-bit instruction are available for an immediate value, so there is no
single instruction that loads a 32-bit immediate operand
○ For some instructions, the number of bits available for an immediate operand is smaller still
○ A 32-bit constant can be assembled in a register using MOVW (writes the bottom 16 bits) followed
by MOVT (writes the top 16 bits)
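The MOVW/MOVT mechanism amounts to splitting a 32-bit constant into two 16-bit halves. A minimal sketch of that arithmetic (the function names are mine, not ARM syntax):

```python
# Splitting a 32-bit constant the way MOVW/MOVT load it:
# MOVW writes the bottom 16 bits, MOVT writes the top 16 bits.
def movw_movt_split(value):
    low = value & 0xFFFF            # MOVW Rd, #low  (bottom 16 bits)
    high = (value >> 16) & 0xFFFF   # MOVT Rd, #high (top 16 bits)
    return low, high

def reconstruct(low, high):
    return (high << 16) | low       # register contents after both moves

low, high = movw_movt_split(0x12345678)
print(hex(low), hex(high))          # 0x5678 0x1234
print(hex(reconstruct(low, high)))  # 0x12345678
```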
• Register (basic mode for arithmetic and logic operations)
• Absolute (load/store operation)
• Basic Indexed Addressing
○ Pre-indexed mode - The effective address of the operand is the sum of the contents of a base
register, Rn, and a signed offset
○ Examples:

• Relative Addressing Mode (up to 12 bits)


○ The PC (R15) can be used as the Base register Rn, along with an immediate offset, in the Pre-
indexed addressing mode.
○ Example: LDR R0, NUMB
○ The assembler will calculate the location of NUMB relative to the PC (12-bit)
• Indexed Address Mode with Writeback
○ This allows auto update of addresses (for example, going through a list)
○ Specified by suffix of "!" on instruction
○ Pre-indexed with writeback mode - The effective address of the operand is generated in the same
way as in the Pre-indexed mode, then the effective address is written back into Rn
○ Post-indexed mode - The effective address of the operand is the contents of Rn. The offset is then
added to this address and the result is written back into Rn.
○ Examples:

○ The contents of an index can also be scaled:
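The pre-indexed, pre-indexed-with-writeback, post-indexed, and scaled-index calculations above can be sketched as pure functions. This is a toy model, not ARM syntax; each function returns the effective address together with the new value of the base register Rn.

```python
# Indexed addressing with writeback, as (effective address, new Rn) pairs.
def pre_indexed_wb(rn, offset):
    # Pre-indexed with writeback: EA = Rn + offset, then Rn <- EA ("!")
    ea = rn + offset
    return ea, ea

def post_indexed(rn, offset):
    # Post-indexed: EA = Rn, then Rn <- Rn + offset
    ea = rn
    return ea, rn + offset

def scaled_index(rn, rm, shift):
    # Scaled index: EA = Rn + (Rm shifted left), e.g. [Rn, Rm, LSL #2]
    return rn + (rm << shift)

print(pre_indexed_wb(0x1000, 4))   # (0x1004, 0x1004)
print(post_indexed(0x1000, 4))     # (0x1000, 0x1004)
```

Both writeback forms leave Rn pointing 4 bytes further along, which is what makes them convenient for stepping through a list.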


• There are specific branch instructions on the ARM such as BEQ, BNE, etc.
• However, all instructions are conditionally executed: they are executed only if the current condition
codes satisfy the conditions specified in the 4-bit CC (condition code) field in the instruction.

Calculation of Branch Offset


• Conditional branch instructions contain a 24-bit 2's-complement value that is used to generate the branch offset
• If the branch is taken, the 24-bit offset is shifted two places to the left (to align it with a word boundary)
and sign-extended to 32 bits
• The offset is added to the "updated" version of the PC, which is actually pointing ahead of the instruction
since the PC has already been incremented (depending on the version of the ARM processor, it may be ahead by up
to 8 bytes).
• Good news! The assembler does the calculation for you.
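The assembler's calculation can be sketched directly from the rules above. This is a minimal model assuming the classic case where the "updated" PC is 8 bytes past the branch instruction; the function names are mine.

```python
# Branch offset encoding/decoding, assuming the updated PC = branch + 8.
def encode_offset(branch_addr, target_addr, pc_ahead=8):
    delta = target_addr - (branch_addr + pc_ahead)
    return (delta >> 2) & 0xFFFFFF       # 24-bit 2's-complement field

def apply_offset(branch_addr, field, pc_ahead=8):
    if field & 0x800000:                 # sign-extend the 24-bit field
        field -= 1 << 24
    return branch_addr + pc_ahead + (field << 2)  # shift left 2, add to PC

f = encode_offset(0x1000, 0x1000 + 8 + 92)  # forward branch of 92 bytes
print(f)                                    # 23, as in the BEQ example below
print(hex(apply_offset(0x1000, f)))         # back to the target address
```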

Example: The BEQ instruction (branch if equal to 0) causes a branch if the Z flag is set to 1. The
appropriate 24-bit value in the instruction is computed by the assembler; in this case it would be 92/4 = 23.
• Unlike most CISC processors, the condition code flags of the ARM are not set automatically on the
outcome of an arithmetic or logic operation (comparisons are the exception); flag setting is requested with an
S suffix on the OP code.
• Example: ADDS R0, R1, R2 performs the addition and sets the flags.
• Such instructions are usually followed by conditional branch instructions.

• ADD
• ADC (add with carry)
• SUB
• SBC (subtract with carry)
• RSB (reverse subtract)
• RSC (reverse subtract with carry)
• CMP (compare)
• TST (AND test)
• TEQ (exclusive-OR test)
• AND
• EOR (exclusive OR)
• ORR (OR operation)
• MVN (move negative: moves the 1's complement of the operand)
• Definition (Stack): An ordered list of elements, usually words, with the accessing restriction that elements
can be added or removed at one end of the list only (LIFO - last in first out)
• We define two operations: push (put) and pop (take)
• A push operation moves a new operand to the top-of-stack (TOS) (the previous top element is moved down)
• A pop operation takes the element at the top of the stack and moves the next lower element to the top
• The structure is sometimes referred to as a pushdown stack.
• ARM uses a Branch and Link (BL) instruction for subroutine calls.
• The return address (the next instruction in the calling routine) is stored in R14; if nesting occurs, it must be saved.
• In modern computers, a stack is implemented by using a portion of the main memory for this purpose. In the
ARM, the stack is realized in memory with the aid of a special register (R13) called the Stack Pointer (SP). It
is used to point to a particular stack structure called the processor stack.
• We use a stack that grows in the direction of decreasing memory addresses. The stack is usually placed in
higher memory and programs are placed in "lower" memory - in this way, there is less likelihood of a
collision.
• The SP is always pointing to the TOS.

Pushing a New Element Onto The Stack


• To push a new element, the SP is first decremented by the appropriate amount, and the new element is then
written using register indirect addressing via the stack pointer

STR R0, [R13, #-4]!

OR

SUB R13, R13, #4
STR R0, [R13]

where R13 is the SP, which always points to the TOS

The "!" suffix updates the stack pointer by writing the decremented address back into R13

Because the stack grows toward lower addresses, a new element is always written 4 bytes below the current TOS.

Popping an Element Off the Stack


• To pop an element off the stack, the top value is loaded from the stack into a register, and the SP is then
incremented by 4 so that it points to the new top element.

LDR R0, [R13]
ADD R13, R13, #4
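The push and pop conventions above can be sketched as a descending stack in Python, with a dictionary standing in for memory and an `sp` variable playing the role of R13 (the initial SP value is made up):

```python
# A descending stack as on the ARM: SP points at the TOS, push decrements
# first, pop reads then increments (word size = 4 bytes).
mem = {}
sp = 0x2000        # hypothetical initial SP; stack grows toward lower addresses

def push(value):
    global sp
    sp -= 4        # like STR R0, [R13, #-4]!
    mem[sp] = value

def pop():
    global sp
    value = mem[sp]  # like LDR R0, [R13] followed by ADD R13, R13, #4
    sp += 4
    return value

push(10)
push(20)
print(pop())    # 20 - last in, first out
print(pop())    # 10
print(hex(sp))  # 0x2000 - the stack is balanced again
```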
• Definition (Subroutine): A subroutine is a block of instructions that is executed each time its task needs to be
performed.
• Subroutines are used to produce more compact (albeit slower) code for several reasons
○ Avoids duplication
○ Reuse of code
○ Library code
○ Enable modular approach to programming
• Any program that requires the use of the subroutine simply branches to its starting location. When a program
branches to a subroutine, we say that it is calling the subroutine. The instruction that performs this branch
operation is named a Call instruction.
• After a subroutine has been executed, the calling program must resume execution, continuing immediately
after the instruction that called the subroutine. The subroutine is said to return to the program that called it
and it does so by executing a Return instruction.
• Since the subroutine may be called from different places in a calling program, provision must be made for
returning to the appropriate location. The location where the calling program resumes execution is the
location pointed to by the updated program counter (PC) while the Call instruction is being executed. Hence,
the contents of the PC must be saved by the Call instruction to enable correct return to the calling program.

• The way in which a computer makes it possible to call and return from subroutines is referred to as its
subroutine linkage method.

• The simplest subroutine linkage method is to save the return address in a specific location which may be a
register dedicated to this function.
• Such a register is called the link register. When the subroutine completes its task, the Return instruction
returns to the calling program by branching indirectly through the link register. This allows multiple levels of
subroutines, but the calling routine must safely store the previous return address before making another call.
Recursion is also not directly supported.
• We can also use a fixed memory location to store the return address. Although this is simple, its disadvantage
is that it allows only one level of subroutine.
• The Call instruction is just a special branch instruction that performs the following operations:
○ Stores the contents of the PC in the link register
○ Branch to the target address specified by the Call instruction
• The Return instruction is a special branch instruction that performs the operation
○ Branch to the address contained in the link register
• Definition (Subroutine Nesting): A common programming practice, called subroutine nesting, is to have one
subroutine call another. In this case, the return address of the second call is also stored in the link register,
overwriting its previous contents. Hence, it is essential to save the contents of the link register in some other
location before calling another subroutine. Otherwise, the return address of the first subroutine will be lost.

• Subroutine nesting can be carried out to any depth. Eventually, the last subroutine called completes its
computations and returns to the subroutine that called it. The return address needed for this first return is the
last one generated in the nested call sequence.
• That is, return addresses are generated and used in LIFO order. This suggests that the return addresses
associated with subroutine calls should be pushed onto the processor stack.
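The LIFO behavior of nested calls can be sketched with a link register plus a stack. This is a toy model of the linkage method described above (the variable `link` plays the role of R14; the addresses are made up):

```python
# Nested subroutine linkage: save the old link register before each
# nested call, restore it on return (LIFO order).
stack = []
link = None          # plays the role of R14

def call(return_addr):
    global link
    if link is not None:
        stack.append(link)   # save the previous return address first
    link = return_addr

def ret():
    global link
    addr = link
    link = stack.pop() if stack else None
    return addr

call(0x100)          # main calls SUB1; return address 0x100 in "R14"
call(0x204)          # SUB1 calls SUB2; the old R14 is pushed on the stack
print(hex(ret()))    # 0x204 - return to SUB1 first (LIFO order)
print(hex(ret()))    # 0x100 - then back to main
```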

• Stack Linkage:
○ Call to subroutine pushes address of next sequential instruction (from the PC/link register) onto the
stack, accessed through the stack pointer SP, before it calls another subroutine
○ Return pops the saved return address from the stack and loads it into the PC/link register
○ Levels and recursion only limited by stack space
○ Example (ARM): ARM uses R14 to store the return address. However, it must be pushed on stack if
nesting occurs.
• Definition (Parameter Passing): When calling a subroutine, a program must provide to the subroutine the
parameters, that is, the operands or their addresses, to be used in the computation. Later the subroutine returns
other parameters, which are the results of the computation. This exchange of information between a calling
program and a subroutine is referred to as parameter passing.
• There are several ways to pass parameters to/from subprograms:
○ Fixed locations (assuming you know them ahead of time)
○ Registers (straightforward and efficient if you have lots of them)
○ Via the stack (very powerful but can get confusing)
• Subroutine considerations:
○ Main or calling routine may be using some registers - they may depend upon the contents
○ Can be fatal if subprogram changes registers
• Procedures to remember when passing parameters:
○ Load parameters onto stack
○ Access parameters during subprogram
○ Return results to calling routine
○ Remove stack clutter! (# of pushes = # of pops)
○ Do not change the stack pointer to retrieve parameters during the subprogram. You can copy the SP to an
address register and use that instead

• Generally, there should only be one Return point from a subprogram


• What if there are multiple points where a return is required? Use a BRANCH to get to the Return point
• Machine instructions in the computer are represented by patterns of 0s and 1s.
• Such patterns are inconvenient to deal with when discussing or preparing programs. Therefore, we use
symbolic names to represent the patterns such as MOV, LDR, ADD, SUB, CLR.
• A complete set of such symbolic names and rules for their use constitutes a programming language, generally
referred to as an assembly language.
• The set of rules for using the mnemonics and for specification of complete instructions and programs is called
the syntax of the language.
• Programs written in an assembly language can be automatically translated into a sequence of machine
instructions by a program called an assembler.
• Differences between an assembler and a compiler:
○ Assembler: translates instructions one at a time, and the output is specific to one processor
○ Compiler: translates each high-level statement into many machine instructions, and the same source
program can be recompiled for different processors

• Definition (Assembler): The assembler program is one of a collection of utility programs that are a part of
the system software of a computer. The assembler, like any other program, is stored as a sequence of machine
instructions in the memory of the computer.
• A user program is usually entered into the computer through a keyboard and stored either in the memory or
on a magnetic disk. At this point, the user program is simply a set of lines of alphanumeric characters. When
the assembler program is executed, it reads the user program, analyzes it, and then generates the desired
machine language program. The latter contains patterns of 0s and 1s specifying instructions that will be
executed by the computer. The user program in its original alphanumeric text format is called a source
program, and the assembled machine-language program is called an object program.

• We must also be able to control certain aspects of the assembly process - assembler directives.
• In addition to providing a mechanism for representing instructions in a program, assembly language allows
the programmer to specify other information needed to translate the source program into the object program.
We have already mentioned that we need to assign numerical values to any names used in a program.
Suppose that the name TWENTY is used to represent the value 20. This fact may be conveyed to the
assembler program through an equate statement such as

TWENTY EQU 20

This statement does not denote an instruction that will be executed when the object program is run; in fact, it
will not even appear in the object program. It simply informs the assembler that the name TWENTY should
be replaced by the value 20 wherever it appears in the program. Such statements, called assembler directives
(or commands), are used by the assembler while it translates a source program into an object program.
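The way an assembler consumes an EQU directive can be sketched as a simple substitution pass. This is an illustrative toy, not a real assembler; the two source lines are hypothetical.

```python
# A minimal EQU pass: record each equate, emit nothing for the directive
# itself, and substitute the value wherever the name appears.
source = [
    "TWENTY EQU 20",
    "MOV R0, #TWENTY",
]

symbols = {}
output = []
for line in source:
    parts = line.split()
    if len(parts) == 3 and parts[1] == "EQU":
        symbols[parts[0]] = parts[2]   # directive: record, do not emit
    else:
        for name, value in symbols.items():
            line = line.replace(name, value)
        output.append(line)

print(output)   # ['MOV R0, #20'] - the EQU line does not reach the object program
```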

• You must be very specific in directing the assembler.


• Example:
○ AREA specifies start of instructions and data
○ ENTRY specifies the point where program execution starts
○ DCD (define constant data) labels and initializes the value of operands
▪ SUM DCD 0 associates SUM with a memory location and sets that location value to zero
▪ TEN EQU 10 associates the value 10 with the label TEN (10 is a decimal number)
We illustrate some of the ARM directives in the figure above, which gives a complete source program for adding
five numbers.

The AREA directive, which uses the argument CODE or DATA, indicates the beginning of a block of memory that
contains either program instructions or data. Other parameters are required to specify the placement of code
and data blocks into specific memory areas. The ENTRY directive specifies that program execution is to begin at
the following LDR instruction.

In the data area, which follows the code area, the DCD directives are used to label and initialize the data
operands. The word locations SUM and N are initialized to 0 and 5, respectively, by the first two DCD
directives. The address NUM1 is placed in the location POINTER by the next DCD directive. The combination of
the instruction LDR R2, POINTER and the data declaration POINTER DCD NUM1 is one of the ways that the
pseudoinstruction LDR R2, =NUM1 can be implemented.

The last DCD directive specifies that the five numbers to be added are placed in successive memory word
locations, starting at NUM1.
• One of the basic features of a computer is its ability to exchange data with other devices. This communication
capability enables a human operator, for example, to use a keyboard and a display screen to process text and
graphics.
• Types of Transfers
1. Parallel transfer
▪ Multiple bits are transferred simultaneously
▪ High speed but costly
▪ Problems with long distances (timing, etc.)
2. Serial transfer
▪ Uses a single wire and send data one bit at a time
▪ Slower speed
▪ Less costly over distance
▪ Two types:
□ Synchronous Serial Transfer
- Transmitter sends data bits along with a clock signal so the receiver knows when data is valid
- Higher speed but requires additional lines
- Speed can also be varied
□ Asynchronous Serial Transfer
- No clock signal is sent
- Sender and receiver agree on a baud rate (bps) and thus the duration of each bit
- Special signaling is required to synchronize sender/receiver (start bits, stop bits, parity bits)
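Asynchronous framing can be sketched concretely. This is a minimal model assuming one common frame format (1 start bit, 8 data bits sent LSB first, 1 even-parity bit, 1 stop bit); real devices configure these parameters.

```python
# Frame one byte for asynchronous serial transfer:
# start bit, 8 data bits (LSB first), even parity bit, stop bit.
def frame_byte(byte):
    bits = [0]                                  # start bit (line drops low)
    data = [(byte >> i) & 1 for i in range(8)]  # LSB first
    bits += data
    bits.append(sum(data) % 2)                  # even parity bit
    bits.append(1)                              # stop bit (line idles high)
    return bits

def unframe(bits):
    assert bits[0] == 0 and bits[-1] == 1       # check start/stop bits
    data = bits[1:9]
    assert sum(data) % 2 == bits[9]             # check parity
    return sum(b << i for i, b in enumerate(data))

f = frame_byte(ord("A"))
print(len(f))           # 11 bits on the wire for 8 bits of data
print(chr(unframe(f)))  # 'A'
```

The overhead (11 wire bits per 8 data bits) is the price paid for not sending a clock.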

▪ In serial transfer, connections are classified as to their type and direction:


□ Simplex - one-way connection
□ Half duplex - one direction at a time (e.g., push-to-talk)
□ Full duplex - bidirectional connection
• The components of a computer system communicate with each other through an interconnection network
which consists of circuits needed to transfer information between the processor, the memory unit, and a
number of I/O devices.

• There are several ways of accessing an I/O device:


○ Special instructions such as OUT dev i, IN dev i
○ General MOVE instructions - here the I/O device appears as a series of memory locations (memory-mapped
I/O)
• Each I/O device must appear to the processor as consisting of some addressable location. Some addresses in
the address space of the processor are assigned to these I/O locations, rather than to the main memory.
• These locations are usually implemented as bit storage circuits (flip-flops) organized in the form of registers.
It is customary to refer to them as I/O registers.
• Since the I/O devices and the memory share the same address space, this arrangement is called memory-
mapped I/O. It is used in most computers.
• With memory-mapped I/O, any machine instruction that can access memory can be used to transfer data to or
from an I/O device.
• For example, if DATAIN is the address of a register in an input device, the instruction LOAD R2, DATAIN
reads the data from the DATAIN register and loads them into processor register R2. Similarly, the instruction
Store R2, DATAOUT sends the contents of register R2 to location DATAOUT, which is a register in an
output device.
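The LOAD/Store example above can be sketched with a toy address map in which device registers share the address space with memory. The register addresses below are made up for illustration.

```python
# Memory-mapped I/O sketch: device registers live at ordinary addresses,
# so the same load/store operations reach memory or a device.
DATAIN  = 0x40000000   # hypothetical keyboard data register
DATAOUT = 0x40000004   # hypothetical display data register

memory = {0x00001000: 0}
io_registers = {DATAIN: ord("k"), DATAOUT: 0}

def load(addr):                        # LDR-style access
    return io_registers.get(addr, memory.get(addr, 0))

def store(addr, value):                # STR-style access
    if addr in io_registers:
        io_registers[addr] = value     # goes to the device, not memory
    else:
        memory[addr] = value

ch = load(DATAIN)        # read a character from the keyboard register
store(DATAOUT, ch)       # echo it to the display register
print(chr(io_registers[DATAOUT]))  # 'k'
```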
(Figures: single-bus and two-bus interconnection structures.)
• An I/O device is connected to the interconnection network by using a circuit, called a device interface, which
provides the means for data transfer and for the exchange of status and control information needed to
facilitate the data transfers and govern the operation of the device.
• The interface includes some registers that can be accessed by the processor.
• One register may serve as a buffer for data transfers, another may hold information about the current status of
the device, and yet another may store the information that controls the operational behavior of the device.
• These data, status, and control registers are accessed by program instructions as if they were memory
locations.
• Typical transfers of information are between I/O registers and the registers in the processor.
• I/O devices are memory-mapped, that is they look like one or more memory locations
• Simplest form of I/O is Program Controlled I/O - here, the program has complete control of the I/O operation.
For example: consider a task that reads characters typed on a keyboard, stores these data in the memory, and
displays the same characters on a display screen. A simple way of implementing this task is to write a
program that performs all functions needed to realize the desired action.

• In addition to transferring each character from the keyboard into the memory, and then to the display, it is
necessary to ensure that this happens at the right time.
• An input character must be read in response to a key being pressed. For output, a character must be sent to the
display only when the display device is able to accept it.
• The rate of data transfer from the keyboard to a computer is limited by the typing speed of the user, which is
unlikely to exceed a few characters per second. The rate of output transfers from the computer to the display
is much higher. It is determined by the rate at which characters can be transmitted to and displayed on the
display device, typically several thousand characters per second. However, this is still much slower than the
speed of a processor that can execute billions of instructions per second. The difference in speed between the
processor and I/O devices creates the need for mechanisms to synchronize the transfer of data between them.

• One solution to this problem involves a signaling protocol. On output, the processor sends the first character
and then waits for a signal from the display that the next character can be sent. It then sends the second
character, and so on.
• An input character is obtained from the keyboard in a similar way. The processor waits for a signal indicating
that a key has been pressed and that a binary code that represents the corresponding character is available in
an I/O register associated with the keyboard. Then the processor proceeds to read that code.
• General Procedure:
○ Processor checks device's status
○ Processor moves data (in or out)

• As a result, program controlled I/O is simple but:


○ Slow as the CPU is tied up
○ A lot of instruction (time) overhead per byte transfer
○ Higher speed devices may over-run the system
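The general procedure (check the device's status, then move the data) can be sketched as a busy-wait loop. This is a toy device model; the class and field names are mine, not a real API.

```python
# Program-controlled I/O: the processor busy-waits on a status bit
# before each transfer.
class Device:
    def __init__(self, chars):
        self.chars = list(chars)   # pending key codes (toy model)
        self.status_ready = True   # the "ready" status bit
        self.data = 0              # the data register

    def poll(self):
        while not self.status_ready:
            pass                   # CPU is tied up doing nothing useful
        self.data = ord(self.chars.pop(0))

received = []
keyboard = Device("hi")
for _ in range(2):
    keyboard.poll()                # wait until a key code is available
    received.append(chr(keyboard.data))

print("".join(received))           # "hi"
```

The busy-wait loop is exactly where the CPU time goes, which motivates DMA and interrupts below.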
• An alternative approach (instead of PC I/O) can be used to transfer blocks of data directly between the main
memory and I/O devices. A special control unit is provided to manage the transfer, without continuous
intervention by the processor. This approach is called direct memory access or DMA.
• The unit that controls DMA transfers is referred to as a DMA controller.
• It may be part of the I/O device interface, or it may be a separate unit shared by a number of I/O devices.
• The DMA controller performs the functions that would normally be carried out by the CPU when accessing
the main memory.
• For each word transferred, it provides the memory address and generates all the control signals needed. It
increments the memory address for successive words and keeps track of the number of transfers.
• Although a DMA controller transfers data without intervention by the CPU, its operation must be under the
control of a program executed by the CPU, usually an operating system routine. To initiate the transfer of a
block of words, the CPU sends to the DMA controller the starting address, the number of words in the block,
and the direction of the transfer. The DMA controller then proceeds to perform the requested operation. When
the entire block has been transferred, it informs the CPU by raising an interrupt.

• CPU does not require memory access every available cycle


• If there is a large volume of data to/from sequential locations in memory, DMA controller could take control
of memory bus, store data from I/O device directly into memory, release bus and wait for next byte/word
from I/O device.
• The CPU provides:
○ Start address in memory
○ Word count
○ Function (read/write)
○ Address of I/O device (if required)

DMA vs PC I/O

Program Controlled I/O:
• Cheap relative to DMA
• Simple
• CPU controls all operations
• Slow; the CPU is tied up

Direct Memory Access:
• Costly; the DMA controller must have facilities to generate addresses, drive the data bus, and generate
control signals
• Good for large amounts of data stored in consecutive memory locations
• Bus arbitration is required
PROCESS (CPU / DMA controller / device):
1. CPU loads the DMA controller with the transfer parameters.
2. CPU issues GO to the DMA controller.
3. When its data is ready, the device requests a DMA operation.
4. The DMA controller asserts Bus Request.
5. The CPU replies with Bus Grant, and the DMA controller assumes bus mastership.
6. The DMA controller places the address and R/W on the bus; the device puts its data on the bus and removes
its request.
7. After the memory function completes, the DMA controller removes the address and data, updates the memory
address and word count, and drops the Bus Request.
8. The CPU is bus master again while the device gets the next word.
9. Steps 3-8 repeat until the word count reaches 0, at which point the DMA controller indicates to the CPU
that the task is complete (via an interrupt).
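The controller side of this process can be sketched as a small state machine: the CPU programs the start address and word count, then the controller moves data, increments the address, decrements the count, and raises an interrupt at zero. The class name and fields are mine, for illustration only.

```python
# Toy DMA controller: programmed once by the CPU, then transfers a block
# without further CPU intervention.
class DMAController:
    def __init__(self, memory):
        self.memory = memory
        self.interrupt_raised = False

    def program(self, start_addr, word_count):   # done by the CPU
        self.addr = start_addr
        self.count = word_count

    def transfer(self, device_words):            # done by the controller
        for word in device_words[:self.count]:
            self.memory[self.addr] = word  # controller drives the bus
            self.addr += 4                 # increment the memory address
            self.count -= 1                # keep track of the transfers
        if self.count == 0:
            self.interrupt_raised = True   # tell the CPU the block is done

mem = {}
dma = DMAController(mem)
dma.program(0x1000, 3)
dma.transfer([11, 22, 33])
print(mem)                    # three words at consecutive addresses
print(dma.interrupt_raised)   # True
```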

• How often does the DMA get control of the bus?


○ Cycle stealing - let CPU continue processing
○ Burst mode
○ Special DMA channel (special memory)
• It would be easier in many cases if the device could signal the CPU when it needs service
• If we had a special control wire to allow this - CPU could be doing normal processing and only run the
Interrupt Service Routine when needed
• For example: if the program enters a wait loop in which it repeatedly tests the device status. During this
period, the processor is not performing any useful computation. There are many situations where other tasks
can be performed while waiting for an I/O device to become ready. To allow this to happen, we can arrange
for the I/O device to alert the processor when it becomes ready. It does so by sending an interrupt request to
the processor. Since the processor is no longer required to continuously poll the status of I/O devices, it can
use the waiting period to perform other useful tasks. Indeed, by using interrupts, such waiting periods can
ideally be eliminated.

• PROCESS:
○ Complete current instruction
○ Save processing environment
○ Go to service routine
○ Service interrupt
○ Return - restore environment
• Interrupts can occur at any time. This is why the proper procedure with using the stack must be observed.

• Interrupts bear considerable resemblance to subroutine calls.


What if there are multiple devices? How do we sort them out (priority)?
• Let us now consider the situation where a number of devices capable of initiating interrupts are connected to
the processor. Because these devices are operationally independent, there is no definite order in which they
will generate interrupts. For example, device X may request an interrupt while an interrupt caused by device
Y is being serviced, or several devices may request interrupts at exactly the same time.

• We need to ask the following questions:


○ How can the processor determine which device is requesting an interrupt?
○ Given that different devices are likely to require different interrupt-service routines, how can the
processor obtain the starting address of the appropriate routine in each case?
○ Should a device be allowed to interrupt the processor while another interrupt is being serviced?
○ How should two or more simultaneous interrupt requests be handled?

• The means by which these issues are handled vary from one computer to another, and the approach taken is
an important consideration in determining the computer's suitability for a given application.

• When an interrupt request is received it is necessary to identify the particular device that raised the request.
• Furthermore, if two devices raise interrupt requests at the same time, it must be possible to break the tie and
select one of the two requests for service. When the interrupt-service routine for the selected device has been
completed, the second request can be serviced.

• The information needed to determine whether a device is requesting an interrupt is available in its status
register. When the device raises an interrupt request, it sets to 1 a bit in its status register, which we will call
the IRQ bit.

• The simplest way to identify the interrupting device is to have the interrupt-service routine poll all I/O
devices in the system.
• The first device encountered with its IRQ bit set to 1 is the device that should be serviced. An appropriate
subroutine is then called to provide the requested service.
• The polling scheme is easy to implement. Its main disadvantage, however, is the time spent interrogating the
IRQ bits of devices that may not be requesting any service.
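The polling scheme above can be sketched in a few lines. The device list and names are made up; the scan order is what fixes the priority.

```python
# Software polling: scan the devices in a fixed order and service the
# first one whose IRQ bit is set. Scan order = priority.
devices = [
    {"name": "disk",     "irq": False},
    {"name": "keyboard", "irq": True},
    {"name": "printer",  "irq": True},
]

def poll_for_interrupt():
    for dev in devices:
        if dev["irq"]:
            dev["irq"] = False     # servicing clears the request
            return dev["name"]
    return None                    # time wasted interrogating idle devices

print(poll_for_interrupt())   # 'keyboard' - earlier in the scan order
print(poll_for_interrupt())   # 'printer'  - serviced on the next pass
```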
• Single IRQ line
○ All devices are connected in parallel.
○ This is simple: the CPU polls the status bit of each device.
○ However, priority is assigned by the polling order.
○ It is also slow.
• Multiple Interrupt Lines
○ Have several different IRQ lines.
○ Priority is assigned to the various lines (in the event that a device of higher priority needs service).
○ Higher-priority requests are serviced first.

○ What if there are more devices than levels?


▪ Use software polling to identify device
▪ Vectored interrupts - devices are "smart" and can indicate to the processor where to find
interrupt handler start address

○ To reduce the time involved in the polling process, a device requesting an interrupt may identify itself
directly to the processor. Then, the processor can immediately start executing the corresponding
interrupt-service routine. The term vectored interrupts refers to interrupt-handling schemes based on
this approach.
○ A device requesting an interrupt can identify itself if it has its own interrupt-request signal, or if it can
send a special code to the processor through the interconnection network.
○ The processor's circuits determine the memory address of the required interrupt service routine. A
commonly used scheme is to allocate permanently an area in the memory (somewhere in the bottom of
memory) to hold the addresses of interrupt-service routines. These addresses are referred to as
interrupt vectors, and they are said to constitute the interrupt-vector table.
○ When an interrupt request arrives, the information provided by the requesting device is used as a
pointer into the interrupt-vector table, and the address in the corresponding interrupt vector is
automatically loaded into the program counter.
○ A vectored interrupt is "smart" as it points us to where the service routine is. It is faster than software
polling. However, more hardware is needed to be "smart".
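The vectored scheme can be sketched as a table lookup: the requesting device supplies a vector number, and the corresponding entry is loaded into the PC. The table contents and addresses below are made up.

```python
# Vectored interrupts: the device's vector number indexes the
# interrupt-vector table, and the vector goes straight into the PC.
vector_table = {0: 0x00000100,   # hypothetical keyboard ISR address
                1: 0x00000180}   # hypothetical disk ISR address

pc = 0x00002000                  # executing somewhere in the main program

def take_interrupt(device_vector_number):
    global pc
    pc = vector_table[device_vector_number]  # no polling loop needed

take_interrupt(1)
print(hex(pc))   # execution jumps directly to the disk ISR
```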
• Hardware Polling (Daisy-Chain)
○ Used if we have more than one smart device.
○ Allows multiple requests.
○ Priority is assigned by the "physical" order of the devices.
○ If DEV1 is the one requesting, it will not let the ACK pass by.
○ If DEV2 was also requesting service, it will also put its vector into the ACK and then pass it along.
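The acknowledge-propagation rule can be sketched as a single pass down the chain: the first requesting device claims the ACK, so physical position determines priority. The device names are placeholders.

```python
# Daisy-chained interrupt acknowledge: the ACK ripples down the chain
# and the first requesting device blocks it from passing further.
def daisy_chain_ack(devices):
    # devices: list of (name, requesting) in physical chain order
    for name, requesting in devices:
        if requesting:
            return name        # this device keeps the ACK
    return None                # no one was requesting

chain = [("DEV1", False), ("DEV2", True), ("DEV3", True)]
print(daisy_chain_ack(chain))  # 'DEV2' - DEV3 must wait for the next ACK
```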

○ In many systems, there is a combination of methods


○ There are multiple IRQ lines
○ Daisy-chained INTR ACK
○ Vectored interrupts for each device

• Controlling Interrupts:
○ CPU
▪ Enabling of interrupts
▪ Priority structure (determines whether vectored interrupts or software polling is used)
▪ Masking of interrupts - CPU may only listen to certain priority
○ Device
▪ Enabling interrupt requests (generally a bit in the control register). Must be able to generate
interrupts
ARM Exceptions
• The ARM processor has two "normal" interrupt lines - I (normal) and F (fast interrupt request) - which
can be disabled in the status register. These interrupt-disable bits determine whether the processor is
interrupted when an interrupt request is raised on the corresponding line (IRQ or FIQ). The processor is
not interrupted if the disable bit is 1; it is interrupted if the disable bit is 0.
• Application programs run in User mode. However, user mode is not privileged and cannot manipulate these
bits (only supervisory mode can deal with them)
• System mode and the five exception modes are privileged modes. When the processor is in a privileged
mode, access to the status register is allowed so that the mode bits and the interrupt-disable bits can be
manipulated. This is done with instructions that are not available in User mode, which is an unprivileged
mode.

• There are five types of exceptions:


○ Fast interrupt (FIQ) mode is entered when an external device raises a fast-interrupt request to obtain
urgent service (priority 3)
○ Ordinary interrupt (IRQ) mode is entered when an external device raises a normal interrupt request
(priority 4)
○ Supervisor (SVC) mode/software interrupt is entered on power-up or reset, or when a user program
executes a Software Interrupt instruction (SWI) to call for an operating system routine to be executed
○ Memory access violation (Abort) mode is entered when an attempt by the current program to fetch an
instruction or a data operand causes a memory access violation (priority 2)
○ Unimplemented instruction (Undefined) mode is entered when the current program attempts to execute
an unimplemented instruction (priority 6)

• FIQ is intended for one device or a very small number of devices that require urgent service. In FIQ mode,
registers R8-R12 (general) and R13/R14 are replaced by banked registers, so any changes to these registers will
not affect the user registers after the exception has been serviced; they do not have to be saved or restored
(the exception will not affect registers from the main routine)
• IRQ exceptions are for dealing with "normal" interrupts
• In IRQ mode, only R13/R14 are replaced (along with the processor state in the status register), so any other
registers used in the exception service routine must be saved and restored (on the stack).

• Normally, the processor is running in User or System mode with the "normal" 16 registers available.
• When an exception occurs, switch is made to one of five exception modes where some of the 16 registers are
replaced by an equal number of banked registers.
• At any one time, there is only one device that controls the bus - Bus Master
• Definition (Bus Master): The device that initiates data transfer requests on the bus

• Bus Arbitration is required if there are multiple devices that can be Bus Master.
• There are two main methods:
○ Centralized Arbitration
▪ CPU (or a special device) supervises control of the bus
▪ Example: Multiple DMA controllers
□ DMA device requests control of bus by asserting BUS REQUEST (BR)
□ Processor activates BUS GRANT (BG1) which is connected in a daisy-chain
□ Bus use is indicated with BUS BUSY (BBSY) signal
○ Distributed Arbitration
▪ If devices are peers, then arbitration can be done without a central controller
▪ "Competition" starts when one or more devices activate the Start Arbitration signal
▪ Each device "bids" for control of the bus by placing the bits of its ID on an arbitration bus - all
competing devices look for their address on the bus
▪ The "highest" address wins control of bus
▪ Generally, that device will drop out of competition until all devices have had a chance to control
the bus
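The "highest address wins" rule above can be modeled bit-serially: on each arbitration line, a device drops out as soon as it sees a 1 where its own ID has a 0. A minimal Python sketch (device IDs and bus width are made up for illustration):

```python
def arbitrate(ids, width=4):
    """Simulate distributed arbitration over a wired-OR arbitration bus.

    From the most significant bit down, every competing device drives its
    ID bit onto the shared line; a device whose own bit is 0 while the
    line reads 1 knows a higher bidder exists and drops out. Assumes the
    IDs are unique, so exactly one device survives.
    """
    competing = set(ids)
    for bit in range(width - 1, -1, -1):
        # wired-OR: the line is 1 if any remaining competitor has this bit set
        if any((d >> bit) & 1 for d in competing):
            competing = {d for d in competing if (d >> bit) & 1}
    return competing.pop()
```

Running `arbitrate([5, 3, 12, 7])` yields 12, the highest competing ID, matching the "highest address wins" rule.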
• A bus requires a set of rules, often called a bus protocol, that governs how the bus is used by various devices.
The bus protocol determines when a device may place information on the bus, when it may load the data on
the bus into one of its registers, and so on. These rules are implemented by control signals that indicate what actions are to be taken and when.
• There are three classes of lines which make up the "Bus"
○ Data
○ Address
○ Control
• One control line, usually labelled R/W (the W being active-low), specifies whether a Read or Write operation is to be performed. It specifies Read when set to 1 and Write when set to 0. When several data sizes are possible, such as byte, halfword or word, the required size is indicated by other control lines.
• The bus control lines also carry timing information. They specify the times at which the processor and the I/O
devices may place data on or receive data from the data lines.
• A variety of schemes have been devised for the timing of data transfers over a bus.
• These can be broadly classified as either synchronous or asynchronous schemes.
• In any data transfer operation, one device plays the role of a master (the device that initiates data transfers by
issuing Read/Write commands on the bus - normally, the processor) and slave (the device addressed by the
master).

Synchronous Bus
• All devices derive timing information from a common control line called the bus clock.
• The signal on this line has two phases: a high level followed by a low level. The two phases constitute a clock
cycle. The first half of the cycle between the low-to-high and high-to-low transitions is often referred to as a
clock pulse.
• Clock pulses are evenly spaced and must be long enough to accommodate the slowest device.
• Consider a read operation from a device:

• The address and data lines are shown as if they are carrying both high and low signal levels at the same time. This is a common convention for indicating that some lines are high and some low, depending on the particular address or data values being transmitted.
• The crossing points indicate the times at which these patterns change.
• A signal line at a level half-way between the low and high signal levels indicates periods during which the signal is unreliable and must be ignored by all devices.

• At time t0, the master places the device address on the address lines and sends a command on the control lines
indicating a Read operation. The command may also specify the length of the operand to be read.
• Information travels over the bus at a speed determined by its physical and electrical characteristics.
• The clock pulse width, t1-t0, must be longer than the maximum propagation delay over the bus. Also, it must
be long enough to allow all devices to decode the address and control signals, so that the addressed device
(the slave) can respond at time t1, by placing the requested input data on the data lines.
• At the end of the clock cycle, at time t2, the master loads the data on the data lines into one of its registers.
• To be loaded correctly into the register, data must be available for a period greater than the setup time of the
register. Hence, the period t2-t1 must be greater than the maximum propagation time on the bus plus the setup
time of the master's register.
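The two timing constraints above fix a lower bound on the bus clock period. A small Python sketch with invented delay values (in nanoseconds, purely illustrative) shows the arithmetic:

```python
def min_clock_cycle(t_prop, t_decode, t_setup):
    """Minimum synchronous-bus clock cycle, per the constraints above.

    t1 - t0 >= t_prop + t_decode : the slave must receive and decode the
                                   address/command before responding.
    t2 - t1 >= t_prop + t_setup  : the data must propagate and then be
                                   stable for the master register's setup
                                   time before it is latched at t2.
    """
    high_phase = t_prop + t_decode   # clock pulse width, t1 - t0
    low_phase = t_prop + t_setup     # data-settling period, t2 - t1
    return high_phase + low_phase

# e.g. 4 ns propagation, 6 ns decode, 2 ns setup -> 16 ns minimum cycle
```

This makes concrete why a single-cycle synchronous bus runs at the speed of the slowest device: t_decode is set by the slowest slave on the bus.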
Synchronous Bus
• A similar procedure is followed for a Write operation. The master places the output data on the data lines
when it transmits the address and command information. At time t2, the addressed device loads the data into
its data register.
• The timing diagram is an idealized representation - in reality, propagation delays on bus wires and in the
circuits of the devices cause different parts of the circuit to see signals at different times.

• The diagram above shows reality - two views of each signal, except the clock.
• Because signals take time to travel from one device to another, a given signal transition is seen by different
devices at different times.
• The top view shows the signals as seen by the master and the bottom view as seen by the slave.
• We assume that the clock changes are seen at the same time by all devices connected to the bus.
• System designers spend considerable effort to ensure that the clock signal satisfies the requirement.
Synchronous Bus
• Multiple Clock Cycle Transfers
○ In the scheme just described, all transfers are completed in one clock cycle - as we noted, this is simple, but the clock cycle must be long enough to accommodate the slowest device, that is, its slow transfer rate. This forces all devices to operate at the speed of the slowest device.
○ Also, the processor has no way of determining whether the addressed device has actually responded.
At t2, it simply assumes that the input data are available on the data lines in a Read operation, or that
the output data have been received by the I/O device in a Write operation. If, because of a malfunction,
a device does not operate correctly, the error will not be detected.

○ SOLUTION: Add more signals to allow device to tell master when it is ready
○ To overcome these limitations, most buses incorporate control signals that represent a response from
the device. These signals inform the master that the slave has recognized its address and that it is
ready to participate in a data transfer operation.
○ They also make it possible to adjust the duration of the data transfer period to match the response
speeds of different devices. This is often accomplished by allowing a complete data transfer operation
to span several clock cycles. Then, the number of clock cycles involved can vary from one device to
another.

○ During clock cycle 1, the master sends address and command information on the bus, requesting a
Read operation. The slave receives this information and decodes it. It begins to access the requested
data on the active edge of the clock at the beginning of clock cycle 2. We have assumed that due to the
delay involved in getting the data, the slave cannot respond immediately. The data become ready and
are placed on the bus during clock cycle 3. The slave asserts a control signal called Slave-ready at the
same time. The master, which has been waiting for this signal, loads the data into the register at the end
of the clock cycle. The slave removes its data signals from the bus and returns its Slave-ready signal to
the low level at the end of cycle 3. The bus transfer operation is now complete, and the master may
send new address and command signals to start a new transfer in clock cycle 4.
○ The Slave-ready signal is an acknowledgement from the slave to the master, confirming that the
requested data have been placed on the bus. It also allows the duration of a bus transfer to change from
one device to another.
Asynchronous Bus
• An alternative scheme for controlling data transfers on a bus is based on the use of a handshake protocol
between the master and the slave to do transfers (no central clock).
• Definition (Handshake): A handshake is an exchange of command and response signals between the master
and the slave. It is a generalization of the way the Slave-ready signal is used in the previous figure.
• A control line called Master-ready is asserted by the master to indicate that it is ready to start a data transfer.
The slave responds by asserting Slave-ready.
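The Master-ready/Slave-ready exchange for an input (Read) transfer can be sketched as an ordered event trace. The event strings below are illustrative labels for the steps described in the notes, not part of any standard:

```python
def handshake_read(slave_data):
    """Trace the asynchronous (full handshake) input transfer.

    Returns the data latched by the master plus the ordered list of
    signal events exchanged between master and slave.
    """
    events = []
    events.append("master: place address + command on bus")
    events.append("master: assert Master-ready")
    events.append("slave: place data on bus, assert Slave-ready")
    latched = slave_data  # master latches data once Slave-ready is seen
    events.append("master: latch data, drop Master-ready")
    events.append("slave: remove data, drop Slave-ready")
    return latched, events
```

Because each edge is a response to the previous one, no common clock is needed; the transfer automatically stretches to match the slave's speed.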
Asynchronous Bus

• The timing for an output operation is essentially the same as for an input operation.
• In this case, the master places the output data on the data lines at the same time that it transmits the address
and command information.
• The selected slave loads the data into its data register when it receives the Master-ready signal and indicates
that it has done so by setting the Slave-ready signal to 1. The remainder of the cycle is similar to the input
operation.
• The I/O interface of a device consists of the circuitry needed to connect that device to the bus.
• On one side of the interface are the bus lines for address, data and control.
• On the other side are the connections needed to transfer data between the interface and the I/O device. This
side is called a port, and it can be either a parallel or a serial port.

• What properties do buses have?


○ Many can "listen" to the bus
○ Only one device can drive the bus at a time
○ Devices should always be "isolated" from the bus itself

• Inputs to registers can always "listen" to the bus. They are only clocked when they are addressed
• Outputs from registers are only transferred to the bus when required (tri-state drivers).
• Each device will require address decoding and control signal generation.

• A parallel port transfers multiple bits of data simultaneously to or from the device.
• A serial port sends and receives data one bit at a time.
• Communication with the processor is the same for both formats; the conversion from a parallel to a serial
format and vice versa takes place inside the interface circuit.
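The parallel-to-serial conversion done inside a serial interface is essentially a shift register. A minimal sketch (LSB-first bit order is an assumption here; real serial interfaces also add framing/start/stop bits):

```python
def parallel_to_serial(byte):
    """Shift an 8-bit value out one bit at a time, LSB first (assumed order)."""
    return [(byte >> i) & 1 for i in range(8)]


def serial_to_parallel(bits):
    """Reassemble the byte inside the receiving interface circuit."""
    value = 0
    for i, b in enumerate(bits):
        value |= b << i
    return value
```

Round-tripping a byte through both directions returns the original value, which is exactly what the interface circuit guarantees so that the processor sees the same parallel format either way.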

• Recall the functions of an I/O interface:


○ Provides a register for temporary storage of data
○ Includes a status register containing status information that can be accessed by the processor
○ Includes a control register that holds the information governing the behaviour of the interface
○ Contains address-decoding circuitry to determine when it is being addressed by the processor
○ Generates the required timing signals
○ Performs any format conversion that may be necessary to transfer data between the processor and the
I/O device, such as parallel-to-serial conversion in the case of a serial port
General Parallel Port
Consider a general 8-bit parallel port where each input/output line can be individually programmed

M68230 Parallel Interface/Timer


• Has three 8-bit bidirectional I/O ports
• 24 bit timer
• Port General Control Register (PGCR)
• Direction Control Registers
Interconnection Standards
• A typical desktop or notebook computer has several ports that can be used to connect I/O devices. Standard
interfaces have been developed to enable I/O devices to use interfaces that are independent of any particular
processor.
• For example, a memory key that has a USB connector can be used with any computer that has a USB port.
• Most standards are developed by a collaborative effort among a number of companies. In many cases, the
IEEE develops these standards further and publishes them as IEEE standards.

• Why do we need standards?


○ Allow devices from different manufacturers to interact
○ Allow different types of devices to operate on the same system

Examples of Standards:
• RS-232C
○ CCITT V.24
○ Standard for Serial Communications
○ Synchronous and asynchronous modes
○ Specifies electrical, physical, mechanical, procedural aspects of communications

• Nubus
○ 32 bit architecture
○ Up to 16 devices on one backplane
○ Designed for high speed/low cost
○ 10MHz clock
○ 32 bit shared address space
○ Single master - all other devices are slaves

• Multibus
○ INTEL initiative
○ 8, 16, 32 bit asynchronous transfers
○ IEEE 796 (16 bit)
○ IEEE 1296 (32 bit)

• IEEE 488
○ Standard for laboratory instrumentation
○ 8 bit parallel
▪ Up to 250Kbytes/sec
▪ Up to 20 metres
○ Modes
▪ Listener
▪ Talker
▪ Controller
Interconnection Standards

Examples of Standards:
• Peripheral Component Interconnect (PCI) Bus
○ First introduced in 1992
○ One of the first standard interfaces that was independent of a particular processor
○ This was developed as a low-cost, processor-independent bus.
○ It is housed on the motherboard of a computer and used to connect I/O interfaces for a wide variety of
devices.
○ A device connected to the PCI bus appears to the processor as if it is connected directly to the
processor bus.
○ Its interface registers are assigned addresses in the address space of the processor.
○ First "Plug-and-play" - connect device to the bus, software takes care of the rest

○ The PCI bus is connected to the processor bus via a controller called a bridge.
○ The bridge has a special port for connecting the computer's main memory.
○ The bridge translates and relays commands and responses from one bus to the other and transfers data between them.
○ The PCI bridge gives a separate physical connection for main memory.
○ For example, when the processor sends a Read request to an I/O device, the bridge forwards the command and address to the PCI bus. When the bridge receives the device's response, it forwards the data to the processor using the processor bus.

○ PCI supports three address spaces:


▪ Memory
▪ I/O
▪ Configuration - intended for Plug-and-Play capability. Each device has a configuration ROM
that holds information about the device
○ The system designer may choose to use memory-mapped I/O even with a processor that has a separate
I/O address space. In fact, this is the approach recommended by the PCI standard for wider
compatibility.
○ A 4-bit command accompanies a data transfer to indicate which space is being used.
○ Data transfers on a computer bus often involve bursts of data rather than individual words. Words
stored in successive memory locations are transferred directly between the memory and an I/O device,
which acts as a bus master.
○ The Bus Master is the initiator (either processor or DMA controller). The addressed device is the target. A burst of data transferred is called a transaction.
○ Signaling is a master/slave relationship.
○ The PCI bus is designed primarily to support multiple-word transfers. A Read or Write operation involving a single word is simply treated as a burst of length one.
Interconnection Standards

Examples of Standards:
• Small Computer System Interface (SCSI)
○ This refers to a standard bus defined by the American National Standards Institute - ANSI X3.131
○ The SCSI bus may be used to connect a variety of devices to a computer. It is particularly well-suited
for use with disk drives and is often found in installations such as institutional databases or email systems where many disk drives are used.
○ In the original specification of the SCSI standard, devices are connected to a computer via a 50-wire cable, which can be up to 25 metres in length and can transfer data at rates of up to 5 Megabytes/s (increased to 640 Megabytes/s in later versions). This speed depends on the number of devices and the length of the cable (fewer devices and shorter cables = higher rate)
○ Data are transferred either 8 bits or 16 bits in parallel, using clock speeds of up to 80 MHz.

○ The bus may use single-ended transmission, where each signal uses one wire, with a common ground return for all signals.
○ The address space of SCSI devices is not part of the processor address space
○ Up to 8 (or 16) devices can be connected
○ Acts as a DMA device to/from memory
○ A SCSI device is either an initiator or a target.
○ An initiator has the ability to address a particular target
○ There are no address lines - the data lines are used to identify devices
○ Control of the bus is decided by arbitration
○ A device "bids" for control of the bus by placing its ID on the 8 (or 16) data lines

Select device 5

○ After arbitration/selection -
Master/slave relationship to
transfer data
○ BSY is released when data transfer
is finished
○ If connection is suspended - it can
be reselected - target device now
acquires the bus and selects
initiator
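The SCSI "ID on the data lines" bidding can be sketched directly: a competing device with ID n asserts data line n, and once the bus settles, every device reads all the lines and the highest asserted line wins. A toy model (assumes an 8-bit data bus and unique IDs):

```python
def scsi_arbitrate(device_ids):
    """Simulate SCSI bus arbitration over the shared data lines.

    Each competing device asserts the data line matching its own ID;
    the lines are effectively wired-OR, so the bus carries every bid
    at once. The device owning the highest asserted line wins.
    """
    bus = 0
    for dev in device_ids:
        bus |= 1 << dev          # device n asserts data line n
    return bus.bit_length() - 1  # index of the highest asserted line
```

This is why no separate address lines are needed for arbitration: one data line per possible device ID is enough, which also caps the device count at the bus width (8 or 16).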
Interconnection Standards

Examples of Standards:
• Universal Serial Bus (USB)
○ Collaborative standard developed by computer and telecommunications industry
○ A large variety of devices are available with a USB connector
○ The commercial success of the USB is due to its simplicity and low cost.
○ The original USB specification supports two speeds of operation, called low-speed (1.5 Megabits/s)
and full-speed (12 Megabits/s).
○ Later, USB 2, called High-Speed USB, was introduced. It enabled data transfers at speeds up to 480 Megabits/s.
○ As I/O devices continued to evolve with even higher speed requirements, USB 3 (called Superspeed)
was developed. It supports data transfer rates up to 10 Gigabits/s.
○ More recently, USB4, delivered over the USB-C connector and incorporating Thunderbolt technology, was developed, supporting up to 40 Gigabits/s
○ Data is transmitted in serial form. Clock and data are combined to prevent any skew problems

○ The USB has been designed to meet several key objectives:


▪ Provide a simple, low-cost, and easy to use interconnection system
▪ Accommodate a wide range of I/O devices and bit rates, including Internet connections, and
audio and video applications
▪ Enhance user convenience through a "plug-and-play" mode of operation

○ USB Architecture
▪ The USB uses point-to-point connections and a serial transmission format.
▪ When multiple devices are connected, they are arranged in a tree structure.
▪ Each node of the tree has a device called a hub, which acts as an intermediate transfer point
between the host computer and the I/O devices.
▪ At the root of the tree, a root hub connects the entire tree to the host computer.
▪ The leaves (functions) of the tree are the I/O devices.

▪ The tree structure makes it possible to connect many devices using simple point-to-point serial links.
▪ In "normal" operation, a hub copies any messages it receives on its upstream connection to all of its downstream connections. For example: a message from the host computer is broadcast to all of the I/O devices, but only the addressed device(s) should listen
▪ A message from an I/O device is only sent upstream (I/O devices cannot hear each other).
Interconnection Standards

Examples of Standards:
• Universal Serial Bus (USB)

○ USB Architecture

▪ If I/O devices are allowed to send messages at any time, two messages may reach the hub at the same time and interfere with each other. For this reason, the USB operates strictly on the basis of polling.
▪ A device may send a message only in response to a poll message from the host processor. Hence, no two devices can send messages at the same time.
▪ This restriction allows hubs to be simple, low-cost devices
▪ The host is always in control. If the host wants to know something, it is done through polling. It polls the sender, then the receiver, and then transmits. There is no direct communication between devices.

▪ Arbitration: USB works on device polling only. I/O devices are only allowed to respond when
polled.
▪ This allows for simple, inexpensive hubs (no real arbitration required).
▪ Each device on the USB, whether it is a hub or an I/O device is assigned a 7-bit address. This
address is local to the USB tree and is not related in any way to the processor's address space.
▪ The root hub of the USB, which is attached to the processor, appears as a single device.
▪ The host software communicates with individual devices by sending information to the root
hub, which it forwards to the appropriate device in the USB tree.

▪ When a device is first connected to a hub, or when it is powered on, it has the address 0.
▪ Periodically, the host polls each hub to collect status information and learn about new devices
that may have been added or disconnected.
▪ When the host is informed that a new device has been connected, it reads the information in a
special memory in the device's USB interface to learn about the device's capabilities.
▪ It then assigns the device a unique USB address and writes that address in one of the device's
interface registers. It is this initial connection procedure that gives the USB its plug-and-play
capability.
▪ What happens if there are multiple USB devices? The processor will slow down (the host will have to poll each device periodically).
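The initial-connection (enumeration) procedure above can be sketched as a toy host model. The class and field names here are hypothetical, chosen only to mirror the steps in the notes: a newly attached device answers at address 0 until the host writes a unique 7-bit address into its interface register.

```python
class USBHost:
    """Toy model of USB plug-and-play address assignment."""

    def __init__(self):
        self.next_address = 1   # address 0 is reserved for new devices
        self.devices = {}

    def enumerate(self, device):
        """Assign a newly attached device a unique 7-bit address."""
        assert device["address"] == 0      # new devices answer at address 0
        assert self.next_address <= 127    # 7-bit address space
        device["address"] = self.next_address  # host writes the interface register
        self.devices[self.next_address] = device
        self.next_address += 1
        return device["address"]
```

The 7-bit address field explains the limit of 127 addressable devices per USB tree (address 0 being reserved for not-yet-enumerated devices).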
Interconnection Standards

Examples of Standards:
• Universal Serial Bus (USB)

○ USB Data
▪ There are two types of packets exchanged:
□ Control packets - address, acknowledgement, errors, etc.
□ Data packets - actual data
▪ Packets have a PID (packet identifier) that identifies the type of packet (four bits used to identify the type - transmitted twice).

▪ The host initiates a transmission with a token then data packets.
▪ The hub checks for errors then forwards the transmission to down-stream devices
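In the USB packet format, the 4-bit PID is transmitted twice within one byte, with the second copy sent as the bitwise complement of the first so the receiver can detect a corrupted PID. A small check routine:

```python
def make_pid_byte(pid):
    """Pack a 4-bit PID into a byte: low nibble = PID, high nibble = ~PID."""
    return (pid & 0x0F) | ((~pid & 0x0F) << 4)


def pid_ok(pid_byte):
    """Verify the received PID byte: the high nibble must be the
    bitwise complement of the low nibble."""
    low = pid_byte & 0x0F
    high = (pid_byte >> 4) & 0x0F
    return high == (~low & 0x0F)
```

A byte such as 0xFF fails the check (both nibbles identical), so a stuck-at fault on the wire is caught without any extra error-detection bits.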
Interconnection Standards

Examples of Standards:
• Universal Serial Bus (USB)

○ USB Isochronous Traffic


▪ An important feature of the USB is its ability to support the transfer of isochronous data in a
simple manner.
▪ Isochronous data need to be transferred at precisely timed regular intervals.
▪ To accommodate this type of traffic, the root hub transmits a uniquely recognizable sequence of
bits over the USB tree every millisecond.
▪ This sequence of bits, called a Start of Frame character, acts as a marker indicating the
beginning of isochronous data, which are transmitted after this character.
▪ Thus, digitized audio and video signals can be transferred in a regular and precisely timed
manner.
▪ USB creates 1 ms frames - each initiated with an SOF packet every 1 ms.
▪ Devices can then send data (e.g., a byte) on a regular basis
• The maximum size of the memory that can be used in any computer is determined by the addressing scheme.
For example: a computer that generates 16-bit addresses is capable of addressing up to 64K memory
locations.
• The memory is usually designed to store and retrieve data in word-length quantities. Consider, for example, a
byte-addressable computer whose instructions generate 32-bit addresses. When a 32-bit address is sent from
the processor to the memory unit, the high-order 30 bits determine which word will be accessed. If a byte
quantity is specified, the low-order 2 bits of the address specify which byte location is involved.
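The word/byte split described above is just a shift and a mask. A minimal sketch for the byte-addressable, 32-bit-address example:

```python
def split_address(addr):
    """Decompose a 32-bit byte address for a word-organized memory.

    The high-order 30 bits select the word; the low-order 2 bits select
    the byte within that word (4 bytes per word).
    """
    word = addr >> 2       # drop the 2 byte-offset bits
    byte = addr & 0b11     # byte position within the word
    return word, byte
```

For instance, byte address 0x1007 falls in word 0x401 at byte offset 3, since 0x1004-0x1007 all share the same 30 high-order bits.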

• The connection between the processor and its memory consists of address, data and control lines. The
processor uses the address lines to specify the memory location involved in a data transfer operation and uses
the data lines to transfer the data. At the same time, the control lines carry the command indicating a Read or
a Write operation and whether a byte or a word is to be transferred. The control lines also provide the
necessary timing information and are used by the memory to indicate when it has completed the requested
operation. When the processor-memory interface receives the memory's response, it asserts the MFC signal.
This is the processor's internal control signal that indicates that the requested memory operation has been
completed. When asserted, the processor proceeds to the next step in its execution sequence.

• Memory is built from a collection of memory cells. For example: 1 bit = 1 cell.
• Side note: What device can store 1 bit? A flip-flop. However, if we were to use flip-flops to build memory, it would be far too large.
• Cells are grouped into bytes/words
• A memory address returns a selected group of memory cells.
• Memory can be classified according to:
○ Primary or Secondary Memory
○ Access Memory - Random/Sequential
○ Memory Technology - Bipolar, CMOS, Magnetic, Optical
○ Memory Retention - Static/Dynamic
○ Memory Type - R/WM, ROM, EPROM, EEPROM, FLASH, SSD
Today:
• 16GB RAM costs $100CDN (approximately $0.007/MB)
• 2TB hard drive costs $100CDN (approximately $0.00005/MB)
• 100 4.7GB writeable DVDs cost $30CDN (approximately $0.000064/MB)
• UNIVAC Mercury Tube
  ○ How it works: Timed serial acoustic delay line
  ○ Memory space: 10-91 bit "words" per tube; 100 tubes = 91000 bits of memory
  ○ Side note: Why mercury? Because it is pretty dense. It must be very well-controlled at a specific temperature.
• Magnetic Drums
  ○ How it works: Long rotating cylinder with magnetic coating. Multiple read heads were placed along cylinder "tracks". Instructions had fields for the locations of up to 3 operands and the next instruction location. Creative programming could greatly improve performance.
  ○ Side note: Some drums spun at speeds up to 75000 rpm.
• Magnetic Core
  ○ How it works: Ferro-magnetic disks were used as memory cells. Wires were wrapped around in the x/y plane plus a sense wire. The hysteresis properties of the disks would remember the direction of polarization when current from the x/y lines was activated.
• Capacitors
  ○ How it works: A large bank of capacitors was used to create "dynamic memory" - capacitors were either charged (a "1") or not. Charge would slowly leak out, making it dynamic.
  ○ Side note: Developed at Bletchley Park (British crypto unit), which built some of the first computers used for analyzing codes
• Semiconductor
  ○ How it works: Transistors are used to form memory cells (F/F). VLSI technology allowed many to be placed on one chip. Individual cells can be placed in a 2-D array
  ○ Side note: Originally introduced in the late 1960s
2-D Memory

• Memory cells are usually organized in the form of an array, in which each cell is capable of storing one bit of information.
• Each row of cells constitutes a memory word, and all cells of a row are connected to a common line (word line), which is driven by the address decoder on the chip.
• The cells in each column are connected to a Sense/Write circuit by two bit lines, and the Sense/Write circuits are connected to the data input/output lines of the chip.
• The memory circuit in the figure stores 128 bits and requires 14 external connections for address, data and control lines. It also needs two lines for power supply and ground connections.
• Each address accesses one bit in the array
• Problem - as the size of the memory gets larger, address decoding gets much more complex: a 1K-bit memory addressed one bit at a time would require a 10-to-1024-line decoder. Instead, such a circuit can be organized as a 128 x 8 memory, requiring a total of 19 external connections.
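The connection counts quoted above can be reproduced with a little arithmetic. The "2 control lines" assumption (e.g., R/W plus chip-select) is inferred to match the totals given for the figures:

```python
import math


def external_connections(words, bits_per_word, include_power=False):
    """Count external connections for a words x bits_per_word memory chip.

    address lines = ceil(log2(words)), plus one data line per bit of the
    word, plus 2 assumed control lines (R/W and chip-select); optionally
    add 2 more for power supply and ground.
    """
    addr = math.ceil(math.log2(words))
    pins = addr + bits_per_word + 2
    return pins + 2 if include_power else pins


# 128 bits as 16 x 8:  4 + 8 + 2           = 14 connections (power counted separately)
# 1K bits as 128 x 8:  7 + 8 + 2 + 2 power = 19 connections
```

This shows why wider word organizations scale better: doubling the words adds only one address pin, while one-bit-per-address organizations need ever larger decoders.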
• Static memories retain their contents as long as power is applied - they are still volatile

• This is how a static RAM (SRAM) cell may be implemented. Two inverters are cross-connected to form a latch. The latch is connected to two bit lines by transistors T1 and T2. These transistors act as switches that can be opened or closed under control of the word line. When the word line is at ground level, the transistors are turned off and the latch retains its state.
• For example, if the logic value at point X is 1 and at point Y is 0, this state is maintained as long as the signal on the word line is at ground level. Assume that this state represents the value 1.

• This is a CMOS realization of the cell. Transistor pairs T3, T5 and T4, T6 form the inverters in the latch.
• Continuous power is needed for the cell to retain its state. If power is interrupted, the cell's contents are lost. When power is restored, the latch settles into a stable state, but not necessarily the previous state. Hence SRAMs are said to be volatile since their contents are lost when power is interrupted.
• A major advantage of CMOS SRAMs is their very low power consumption, because current flows in the cell only when the cell is being accessed (no current flows when the cell is not active).
• SRAMs can be accessed very quickly (a few nanoseconds).
• Cells typically require 6 transistors - low packing density
• Dynamic RAMs (DRAMs) are less expensive, higher-density RAMs which can be implemented with simpler cells. They do not retain their state for a long period unless they are accessed frequently for Read/Write operations.
• Cell is formed with an isolation transistor and a
capacitor.
• Memory is the storage of charge in the capacitor but
can be maintained for only tens of milliseconds.
• Since the cell is required to store information for a
much longer time, its contents must be periodically
refreshed by restoring the capacitor to its full value.
• This occurs when the contents of the cell are read or
when new information is written to it.
• Since cells are very simple, packing density can be
quite high.
• To reduce pin counts, the address is usually broken up into row/column components
• Control signals tell device which is to be latched (Column Address Strobe CAS and Row Address Strobe
RAS)
• Timing is controlled asynchronously (self-timed) but it must allow for delays in the circuit.
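The row/column multiplexing above is again a shift and a mask. The row and column widths below are parameters, not values fixed by the notes:

```python
def multiplex_address(addr, row_bits, col_bits):
    """Split a DRAM cell address into row and column halves.

    The two halves are presented on the same (shared) address pins and
    latched separately: first the row under RAS, then the column under
    CAS. This roughly halves the number of address pins needed.
    """
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    col = addr & ((1 << col_bits) - 1)
    return row, col
```

For a chip with 4096 rows and 512 column groups, 21-bit internal addresses can be driven over just 12 shared pins (the larger of the two halves).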

• The 4096 cells in each row are divided into 512 groups of 8, forming 512 bytes of data.
• In the DRAM shown, all cells in the row are selected and read - only 8 bits are put out on the data lines (selected by the column address)
• We could latch all 512 bytes of a row and read out the columns sequentially
• If these are sequential addresses - Fast Page Mode.

• Improvements in DRAM structure allow memory access to be directly synchronized to a clock signal. These are known as SDRAMs.
• The cell array is the same as in asynchronous DRAMs. The distinguishing feature is the use of a clock signal, the availability of which makes it possible to incorporate control circuitry on the chip that provides many useful features.
• For example: SDRAMs have built-in refresh circuitry with a refresh counter to provide the addresses of the rows to be selected for refreshing. As a result, the dynamic nature of these memory chips is almost invisible to the user.
• Data connections and addresses are buffered by registers.
• All contents of a row are loaded into latches and output sequentially.
• Internally, the Sense/Write amplifiers function as latches. A Read operation causes the contents of all cells in the selected row to be loaded into these latches.
• The data in the latches of the selected column are transferred into the data register, thus becoming available
on the data output pins.
• The buffer registers are useful when transferring large blocks of data at very high speed.
• SDRAMs have several modes of operation; for example, burst mode supports burst operations of different lengths.
• The figure shows a timing
diagram. First, the row address is
latched under control of the RAS
signal.
• The memory typically takes 5 or
6 clock cycles (we use 2 for
simplicity) to activate the
selected row.
• Then, the column address is
latched under control of the CAS
signal. After a delay of one clock
cycle, the first set of data bits is
placed on the data lines. The
SDRAM automatically
increments the column address to
access the next three sets of bits
in the selected row, which are
placed on the data lines in the
next 3 clock cycles.
• Synchronous DRAMs can deliver data at a very high rate since all the control signals needed are generated
inside the chip
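The timing just described can be captured as a rough cycle-count model. The 2-cycle row activation matches the notes' simplification (real devices take 5 or 6 cycles), and the function name is ours:

```python
# Rough cycle-count model of the SDRAM burst read described above:
# 2 cycles (the notes' simplification) to activate the selected row,
# 1 cycle from CAS to the first data word, then one word per clock.
def sdram_burst_cycles(burst_length: int, row_activate: int = 2,
                       cas_delay: int = 1) -> int:
    """Clocks from latching the row address to the last word of the burst."""
    return row_activate + cas_delay + burst_length

sdram_burst_cycles(4)  # 4-word burst as in the timing diagram -> 7 cycles
```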

Latency
• Amount of time required to write/read a byte/word of data to/from memory
• The time required to transfer depends also on the rate at which successive words can be transferred and on
the size of the block. The time between successive words of a block is much shorter than the time needed to
transfer the first word.
• Single word access is the worst case - burst mode "looks" much better, since the large overhead to start a
transfer is amortized over the short times to transfer successive words

Bandwidth
• A measure of how much data can be transferred per unit time (1 second)
• It depends on the speed of access to data (latency) and number of bits that can be transferred in parallel
(number of wires and speed of the link).
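The relationship above can be shown with a one-line calculation. The 64-bit bus width and 100 MHz rate are illustrative figures, not from the notes:

```python
# Peak bandwidth is simply the number of bits moved in parallel times
# the rate at which transfers occur on the link.
def peak_bandwidth_bps(bus_width_bits: int, transfers_per_second: int) -> int:
    return bus_width_bits * transfers_per_second

peak_bandwidth_bps(64, 100_000_000)  # 64-bit bus at 100 MT/s -> 6.4 Gbit/s
```

Note this is a peak figure; sustained bandwidth is lower because of the access latency before each block transfer.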

Double-Data-Rate SDRAM
• Faster version of SDRAM. New organizational and operational features to make it possible to achieve high
data rates during block transfers.
• The key idea is to take advantage of the fact that a large number of bits are accessed at the same time inside
the chip when a row address is applied.
• Uses two interleaved memory banks for back and forth switching.
• Transfers data on both edges of the clock (rising and falling).
• It has the same latency as SDRAM - double bandwidth (at best)
• Most effective in large block transfers - no advantage in individual word transfers
• Current standard is DDR3 (64 bit transfers)
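The "same latency, double bandwidth" point can be made concrete with a small sketch; the clock frequency and bus width below are illustrative assumptions:

```python
# DDR transfers data on both clock edges, so at the same clock frequency
# the peak rate doubles relative to SDRAM, while the latency to the
# first word is unchanged.
def peak_rate_bps(bus_width_bits: int, clock_hz: int,
                  transfers_per_clock: int = 1) -> int:
    return bus_width_bits * clock_hz * transfers_per_clock

sdr = peak_rate_bps(64, 100_000_000)                         # 6.4 Gbit/s
ddr = peak_rate_bps(64, 100_000_000, transfers_per_clock=2)  # 12.8 Gbit/s
```

This is why the advantage shows up only in large block transfers: a single-word access still pays the same first-word latency either way.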

Content Addressable Memory
• Supply a data word - the memory is searched and, if the word is present, it returns a list of locations where
the word can be found (Do you have this in your memory?)
• Ternary CAM ("wild cards")
○ Addition of ability to include 'Don't care' for some bits in search
○ Important in routers (e.g. Cisco routers) to forward packets toward a specific destination (Do I know
how to get there?)
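The ternary-CAM idea above can be sketched in software; the (pattern, mask) representation and the function name are our own modeling choice, not hardware detail from the notes:

```python
# Sketch of a ternary CAM lookup: each stored entry is a (pattern, mask)
# pair, where a 0 bit in the mask marks a "don't care" position. Like a
# CAM, the lookup returns every location whose entry matches the word.
def tcam_lookup(key: int, entries: list[tuple[int, int]]) -> list[int]:
    return [i for i, (pattern, mask) in enumerate(entries)
            if (key & mask) == (pattern & mask)]

entries = [
    (0b1010_0000, 0b1111_0000),  # matches any key whose top nibble is 1010
    (0b1010_1100, 0b1111_1111),  # exact match only
]
tcam_lookup(0b1010_1100, entries)  # -> [0, 1]: both entries match
```

In a real TCAM all entries are compared in parallel in one cycle; the list comprehension here is only a functional model of that search.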
Multi-Port Memories
• Some memory systems (cache applications) allow simultaneous, multiple port access to memory
• Expensive since it requires separate data pathways

RAMBUS Memory
• Used in gaming systems for extreme speeds
• It achieves a high data transfer rate by providing a high-speed interface between the memory and the
processor.
• The bandwidth of this connection is increased by using fewer wires driven at a much higher clock speed.
• The key feature is the use of differential signaling technique to transfer data to and from the memory chips.
Signals are transmitted using small voltage swings (0.3V) above and below a reference voltage.
• In previous cases - outputs were around 0V to Vcc (logic 1 is approximately 5V)
• Bus width is 8 or 16 (dual channel) bits
• Uses packets for transfer - no separate address lines

Read-Only Memory - ROM
• Memory can only be read
○ Different technologies:
▪ Mask programmable ROM (done at the factory, expensive setup, cost to purchase is cheap)
▪ Fuse programmable ROM (PROM) (programmed in "burner", cheap setup, expensive cost to
purchase). This has no contents when delivered since you will be writing the contents.

ROM (Fuse-Programmable)

• Information can be written into it only once (by burning fuses).
• A thin piece of metal (a fuse) ties the cell to ground (contents 0)
• If you put the device into a programmer and burn the fuse, the contents become 1.

Erasable PROM - EPROM

• EPROM is similar to ROM but cells can be programmed on or off. In this case, applying a voltage will put
a 1 onto the cell.
• A charge is "injected" into a region that is isolated. Electrons will "jump" the barrier and go onto the
embedded part of the chip.
• Contents can be erased by exposure to UV light and reprogrammed (usually in a special programming
device)
• They are therefore capable of retaining stored information for a long time.

Electronically Erasable PROM - EEPROM

• Found in operating systems
• Do not need to be removed from circuit - apply a special voltage to selectively erase and program contents
• Electrons jumping barrier causes stress and causes them to wear out
• For example, flash is an extension of EEPROM but generally, cells are written in blocks. Flash devices have
greater density which leads to higher capacity and a lower cost per bit. They require a single power supply
voltage and consume less power in their operation.
• If you are developing a new device, you will start with EEPROM, then move to PROM, then Mask Programmable
• Intel 3D Xpoint
○ Non Volatile Memory (When you shut computer down, everything is still there. Opposite to DRAM)
○ Based on bulk resistance change
○ Far faster than current SSD technology
○ 3D-stacked memory
• Generally, memory chips are not large enough to form the entire memory
• Use a series of chips to create the memory

• Memory access time is defined as the time between
access request and completion (latency + time to
get the first data)
• Memory cycle time is the time between successive
operations (larger than access time). For example:
the time between two successive Read operations.
• Memory cycle > Memory access
• How can we speed up access?
• Consider accessing consecutive memory locations.
You must wait for each successive access to
complete.

• If we interleave modules so that consecutive word accesses occur in different modules - we can start accesses
simultaneously.

• Advantage - much faster!
• Example:
○ DRAM - 8 clock cycles for initial
access then 4 cycles per successive
word
○ 8 word transfer for a single module
requires: 8 + (7 x 4) = 36 cycles
○ With 4 interleaved modules: 8 + 4
= 12 cycles
• Disadvantage: must populate the entire address space
• Memory must be present in every module for the interleaving to work, and this costs money.
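The cycle counts from the example above generalize directly. This reproduces the notes' model (8 cycles initial access, 4 cycles per successive word); the function name is ours:

```python
# Generalization of the example above: each of `modules` interleaved
# modules serves words/modules of the consecutive words, and their
# accesses overlap, so only the per-module rounds cost extra cycles.
def transfer_cycles(words: int, modules: int = 1,
                    initial: int = 8, per_word: int = 4) -> int:
    rounds = words // modules          # words each module must deliver
    return initial + (rounds - 1) * per_word

transfer_cycles(8, modules=1)  # -> 36 cycles for a single module
transfer_cycles(8, modules=4)  # -> 12 cycles with 4 interleaved modules
```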
