Motherboard


• Introduction

• The Boot Process


• BIOS and Boot Sequences
• BIOS Manufacturers
• BIOS and CMOS

Introduction
Inside every PC is a BIOS, which stands for Basic Input Output System. In a nutshell, the
BIOS is software that mediates between a computer's hardware and the operating system
and software applications. There are several types of BIOS, ranging from the motherboard
ROM BIOS to adapter BIOSes such as the video BIOS, drive controller BIOS, network adapter
BIOS, SCSI adapter BIOS, and so on. These BIOSes are the lowest level of software in a
computer, providing a set of small programs, or software routines, that allow the hardware
of a computer to interact with the operating system through a set of standard calls.
I hope to provide a thorough understanding of how the BIOS works and leave you with a
better understanding of its inner workings. At the same time, I hope to show how complex a
BIOS is in its relationship with the operating system and the software applications you use
every day. Enjoy.
The Boot Process
To get to the operating system, a computer must first boot from the BIOS. The BIOS
performs a number of tasks when a computer is started, from initializing the
microprocessor to initializing and testing hardware to starting the operating system.
Starting a computer is not a simple task. It's a methodical process that is performed every
time power is applied to the computer. Here is a detailed description of the boot process.
This process will vary between computers and BIOSes, but the overall goal is the same.
When you first turn on a computer, the very first operation performed by the CPU is to
read the address space at FFFF:0000h. This address space is only 16 bytes, which is not
nearly enough to house the BIOS found on a motherboard. Instead, this location contains a
special instruction called a jump command (JMP) that tells the processor where to go to
find and read the actual BIOS into memory. The process of the processor reading the jump
instruction and redirecting to the actual BIOS is called the bootstrap, or boot. So, when you
apply power, it's not the operating system that's working. It's the BIOS.
First, I want to get something straight: the CMOS and the BIOS are two different things.
The BIOS refers to the firmware instructions located on the BIOS ROM. CMOS refers to the
low-power RAM that holds the system's setup parameters. The BIOS reads the CMOS RAM
into memory at boot-up and provides the setup routine that allows you to change the
contents of CMOS, but the CMOS RAM/RTC device is a totally separate IC. The CMOS holds
the settings entered through the BIOS setup routine. This is why you "lose" the settings of a
system when the battery dies or when you clear the CMOS through a jumper on the
motherboard.
With today's high-performance 32-bit operating systems, the BIOS is used less, but it is
still there, always interacting with the operating system. Disk access, for example, is done
through the operating system with 32-bit routines, whereas the BIOS uses 16-bit routines.
Although the BIOS provides VGA support, Windows and other 32-bit operating systems use
software device drivers to work with the hardware. Early OSes, like DOS, worked directly
with the BIOS. DOS relied on the BIOS to perform most functions, like displaying characters
on the screen, sending output to the printer, reading input from the keyboard, and other
essential tasks. Modern drivers, which operate in protected mode (since they aren't written
for real mode, they can use memory above the 1 MB barrier that real mode imposes), allow
for several enhancements. They can access more memory, can be written in 32-bit code for
optimized execution, and are not limited in the amount of space available for their code.
However, regardless of OS, whether it's Windows 2000, Linux, or DOS, the BIOS and the
operating system still interact with each other.
Here is a basic rundown of what is happening:
1. Power is applied to the computer
When power is applied to the system and all output voltages from the power supply are
good, the power supply generates a Power Good signal, which is received by the
motherboard timer. When the timer receives this signal, it stops forcing a reset signal to
the CPU, and the CPU begins processing instructions.
2. Actual boot
The very first instruction performed by a CPU is to read the contents of a specific memory
address that is preprogrammed into the CPU. In the case of x86-based processors, this
address is FFFF:0000h, the last 16 bytes of memory at the end of the first megabyte. The
code the processor reads there is actually a jump command (JMP) telling the processor
where to go in memory to read the BIOS ROM. This process is traditionally referred to as
the bootstrap, but is now commonly referred to as boot, and the term has been broadened
to include the entire initialization process from applying power to the final stages of
loading the operating system.
3. POST
POST stands for Power On Self Test. It's a series of individual functions or routines that
perform various initialization steps and tests of the computer's hardware. The BIOS starts
with a series of tests of the motherboard hardware: the CPU, math coprocessor, timer ICs,
DMA controllers, and IRQ controllers. The order in which these tests are performed varies
from motherboard to motherboard. Next, the BIOS will look for the presence of a video
ROM between memory locations C000:0000h and C780:0000h. If a video BIOS is found, its
contents are tested with a checksum test. If this test is successful, the BIOS will initialize
the video adapter: it passes control to the video BIOS, which in turn initializes itself, and
then returns control once it's complete. At this point, you should see things like the video
card manufacturer's logo, the video card description, or the video card BIOS information.
Next, the BIOS will scan memory from C800:0000h to DF80:0000h in 2KB increments,
searching for any other ROMs that might be installed in the computer, such as those on
network adapter or SCSI adapter cards. If an adapter ROM is found, its contents are tested
with a checksum test. If the test passes, the card is initialized: control is passed to each
ROM for initialization, and the system BIOS resumes control after each one finishes
initializing. If these tests fail, you should see an error message displayed telling you
"XXXX ROM Error", where XXXX indicates the segment address at which the faulty ROM
was detected. Next, the BIOS will check memory at 0000:0472h. This address contains a
flag that tells the BIOS whether the system is starting from a cold boot or a warm boot. A
value of 1234h at this address tells the BIOS that the system was started from a warm
boot. This signature value appears in Intel little-endian format, that is, the least significant
byte comes first, so the bytes appear in memory as the sequence 34 12. In the event of a
warm boot, the BIOS will skip the remaining POST routines. If a cold start is indicated, the
remaining POST routines will be run. During POST, a single hexadecimal code is written to
port 80h as each test begins. Some other PCs send these codes to other ports, however:
Compaq sends them to port 84h; the IBM PS/2 models 25 and 30 send them to port 90h;
the model 20-286 sends them to port 190h; some EISA machines with an Award BIOS send
them to port 300h; systems with the MCA architecture send them to port 680h; and some
early AT&T, Olivetti, NCR, and other AT clones send them to a printer port at 3BCh, 278h,
or 378h. This code signifies what is being tested at any given moment. Typically, when the
BIOS fails at some point, this code will tell you what is failing.
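To make the checkpoint mechanism concrete: each POST step boils down to a one-byte write to an I/O port before the test runs. Below is a minimal sketch in C, assuming a GCC-style compiler with x86 inline assembly; the outb helper and post_checkpoint name are my own illustration, not code from any actual BIOS.

    #include <stdint.h>

    /* Hypothetical helper wrapping the x86 OUT instruction (GCC syntax). */
    static inline void outb(uint16_t port, uint8_t value)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
    }

    #define POST_PORT 0x80  /* 0x84 on Compaq, 0x90 on PS/2 models 25 and 30, etc. */

    /* Latch a checkpoint code before each test. If the machine hangs, the
       last code written identifies the failing test; a POST diagnostic card
       plugged into the bus displays it. */
    static void post_checkpoint(uint8_t code)
    {
        outb(POST_PORT, code);
    }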
4. Looking for the Operating System
Once POST is complete and no errors are found, the BIOS will begin searching for an
operating system. Typically, the BIOS will look for a DOS Volume Boot Sector on the floppy
drive. If no operating system is found there, it will search the next location, hard drive C.
If the floppy drive (A:) has a bootable floppy in it, the BIOS will load sector 1, head 0,
cylinder 0 from the disk into memory starting at location 0000:7C00h. The first program to
load will be IO.SYS, then MSDOS.SYS. If the floppy does not contain a DOS volume boot
sector, the BIOS will next search the computer's hard drive for a master partition boot
sector and load it into memory at 0000:7C00h. There are some occasions on which you will
encounter problems with the proper loading of the Volume Boot Sector. Below are some of
those:
A. If the first byte of the Volume Boot Sector is less than 6h, you will receive a
message similar to "Diskette boot record error".
B. If IO.SYS and MSDOS.SYS are not the first two files in the Volume Boot Sector,
you will see a message similar to "Non-system disk or disk error".
C. If the Volume Boot Sector is corrupt or missing, you will get a message similar to
"Disk boot failure".
Once the BIOS has searched for a bootable floppy device, it turns its attention to the next
boot device it's programmed to look for, typically the hard drive, or C:. As with a floppy,
the BIOS will attempt to load the Volume Boot Sector from sector 1, head 0, cylinder 0 of
the Master Boot Sector, or MBS, into memory starting at 0000:7C00h. The BIOS will check
the last two bytes of the MBS; they should be 55h and AAh, respectively. If they are not,
you will receive an error message similar to "No boot device available" and "System
initialization will halt". If they are correct, the BIOS will continue the loading process. At
this point, the BIOS will scan the MBR in search of any extended partitions. If any extended
partitions are identified, the original boot sector will search for a boot indicator byte
marking an active, bootable partition. If it cannot find one, you will receive a message
similar to "Invalid partition table".
Once an active partition is found, the BIOS will search for a Volume Boot Sector on the
bootable partition, load the VBS into memory, and test it. If the VBS is not readable or is
corrupt, you will see a message similar to "Error loading operating system". Next, the BIOS
will read the last two bytes of the VBS. These bytes should be 55h and AAh, respectively. If
they are not, you will see a message similar to "Missing operating system". It is at this
point that the BIOS begins loading the operating system.

Plug and Play


Intel and Microsoft took the first stab at Plug and Play with the Plug and Play specification
for ISA, released on May 28, 1993. Later, Compaq, Phoenix, and Intel developed the BIOS
specification for Plug and Play, first released on November 1, 1993. Plug and Play requires
that three elements of the system be written to its standards: the motherboard BIOS, the
operating system, and the boards and peripherals attached to the PC. Devices that don't
conform are considered legacy devices.
The basic procedure for Plug and Play is a three-step process. First, the system checks
what resources are needed by each expansion device. Next, the system coordinates
assignments of IRQs, DMA channels, and I/O ports to avoid conflicts, and finally, the
system tells the software what choices it has made. In order to do this, the BIOS calls upon
specific features of a Plug and Play expansion board. To achieve this, the expansion board
must be able to deactivate itself, ignoring normal control signals, to avoid conflicts with
other devices. In addition, each expansion board has registers that are accessed through
standard I/O port addresses so the BIOS and operating system can configure the board.
These ports are Address, Write Data, and Read Data.
The Address port functions like a pointer that expands the control registers accessible to
your system without stealing more system resources. The Plug and Play specification
defines eight card control registers and two large ranges: one range of 24 registers for
future expansion of the standard, and another of 16 registers for board makers to use for
their own purposes. The Address port allows the Write Data port to choose which logical
devices are active and which resources are used by them. Some boards, such as video
adapters and disk controller cards, start up active because they are needed at boot-up.
Other devices, such as sound cards and modems, come up inactive during boot and wait to
be configured for use by the operating system. Typically, any board that starts up inactive
stays that way until specifically activated by the operating system. Every Plug and Play
board has specific circuitry that handles this configuration process, always monitoring the
signals on the bus.
Every Plug and Play device operates in one of four states: Wait for Key, Isolation,
Configuration, and Sleep.
All Plug and Play devices, whether inactive or active, boot up in the Wait for Key state. In
this state, each board refuses to respond until it receives the Initiation Key. This key is a
32-step sequence sent by the host system to each expansion board, and the sequence must
be correct for initialization to succeed. Once it is, the expansion board shifts itself into the
Sleep state.
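To give a feel for that 32-step sequence, here is a hedged sketch of how a host could generate and send the key, following the ISA Plug and Play convention of writing a 32-byte linear-feedback shift register sequence (seeded with 6Ah) to the Address port at 0279h. The outb helper is the same hypothetical port-write wrapper sketched earlier; treat this as an illustration of the idea, not a reference implementation.

    #include <stdint.h>

    #define PNP_ADDRESS_PORT 0x0279  /* ISA PnP Address port */

    extern void outb(uint16_t port, uint8_t value);  /* port-write helper, as before */

    /* Send the initiation key: 32 bytes from an 8-bit LFSR seeded with 0x6A,
       feedback = bit 0 XOR bit 1, shifted in at bit 7. Boards sitting in the
       Wait for Key state ignore everything until this exact sequence arrives. */
    static void pnp_send_initiation_key(void)
    {
        uint8_t lfsr = 0x6A;
        int i;
        outb(PNP_ADDRESS_PORT, 0x00);  /* two zero writes reset each card's key detector */
        outb(PNP_ADDRESS_PORT, 0x00);
        for (i = 0; i < 32; i++) {
            outb(PNP_ADDRESS_PORT, lfsr);
            lfsr = (uint8_t)((lfsr >> 1) | (((lfsr & 1) ^ ((lfsr >> 1) & 1)) << 7));
        }
    }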
BIOS and Boot Sequences

Step   Phoenix Technologies                    American Megatrends
 1     Disable the NMI                         Check the CPU
 2     Power-on delay                          Test CMOS RAM
 3     Initialize chipsets                     BIOS ROM checksum
 4     Reset determination                     Test chipset(s)
 5     BIOS ROM checksum                       Test PIT
 6     Keyboard test                           Test DMA
 7     CMOS shutdown check                     Test base 64KB memory
 8     Controller disable                      Check serial and parallel ports
 9     Disable video                           Test PIC
10     Detect memory                           Check the Keyboard Controller (KBC)
11     PIT test                                Verify CMOS data
12     Check memory refresh                    Verify video system
13     Check low address lines                 Test RTC
14     Check low 64KB RAM                      Test CPU in protected mode
15     Initialize support ICs                  Verify PIC 2
16     Load INT vector table                   Check NMI
17     Check the Keyboard Controller (KBC)     Check the keyboard
18     Video tests                             Check the mouse
19     Load the BDA                            Check system RAM
20     Test memory                             Test disk controller
21     Check DMA registers                     Set shadow RAM areas
22     Check the keyboard                      Check extended ROMs
23     Perform high-level tests                Test cache controller
24     Load the OS                             Test CPU cache
25     -                                       Check hardware adapters
26     -                                       Load the OS

BIOS Manufacturers
There are a number of BIOS manufacturers, but the three leaders are Phoenix Software,
American Megatrends, and Award Software. Each of these manufacturers produces BIOSes
for PCs, and each has its strong points and weaknesses. It is not my goal or intention to
lean towards one manufacturer over another. One of my PCs has an Award BIOS and the
other has an AMI BIOS, so these are the BIOSes I will be using for most of this discussion.
The BIOS Functions
The BIOS is composed of several independent functions or routines that are distinct from
one another. Even though these routines are separate and distinct, they are stored in the
same memory location. "The BIOS" is a way to refer to all of these separate functions as a
group. There are functions that test the computer, routines that let software take control,
and PnP routines (in some BIOSes) that determine which peripherals are installed and
ensure these components do not conflict with one another in I/O activity and memory
allocation.

BIOS Basics

The BIOS (Basic Input Output System) is the program that enables a PC to boot after power-up.
The BIOS is a built-in set of routines that serve as an interface between the computer's
operating system and hardware devices. It is stored on a ROM chip, generally located near the
computer's real-time clock or lithium battery. By processing requests from applications as well
as drivers, the BIOS permits the user to maintain control of hardware settings.
BIOS manufacturers
• American Megatrends, Inc. (AMI)
• Award Software
• Microid Research
• Phoenix Technologies
Identifying your motherboard by BIOS number
It can often be difficult to determine the manufacturer of your motherboard due to poor or
incomplete documentation. If you are contemplating a BIOS upgrade, it is imperative that you
know the manufacturer of your motherboard in order to obtain the correct upgrade. Fortunately,
if your system uses an Award or AMI BIOS, there is a unique number that identifies the
manufacturer. After you turn on the computer, the BIOS version number should be displayed at
the bottom of the screen during the memory count. You can also use our Mobo ID Tools to look
up BIOS strings, or you can use the links below to look up your board:
• Wim's Award BIOS Lookup Page
• Wim's AMI BIOS Lookup Page
ROM BIOS upgrade vendors
• Micro Firmware, Inc.
• Midco Computers
• Unicore Software
Other BIOS resources
The links below lead to excellent references for anyone who needs a complete overview of BIOS
features and functions, as well as useful insight into system BIOS setup, upgrades, and optimization.
• PC Mechanic BIOS Guide
• PC Guide System BIOS
• BIOS Optimization Guide
• The BIOS Companion
DDR3
In computing, DDR3 SDRAM or double-data-rate three SDRAM is a modern kind of
dynamic random access memory with a high bandwidth interface. It is one of several variants of
dynamic RAM and/or associated interface techniques that have been used since the early 1970s,
and it is not directly compatible with any of the earlier types, not even with DDR2 SDRAM.
This is due to different signaling voltages, timings, and other factors.
DDR3 is a DRAM interface specification; the actual DRAM arrays that store the data are similar
to earlier types, with similar performance.
The primary benefit of DDR3 SDRAM over its predecessor, DDR2 SDRAM, is the ability to
transfer at twice the data rate (8× the speed of its internal memory arrays), enabling higher
bandwidth or peak data rates. In addition, the DDR3 standard allows for chip capacities of up to
8 gigabits, thus enabling a memory module size of 16 gigabytes (using 16 chips).
With two transfers per cycle of a quadrupled clock, a 64-bit-wide DDR3 module may achieve a
transfer rate of up to 64 times the memory clock speed in MB/s.
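As a worked example of that multiplier (my own arithmetic, using the slowest JEDEC grade from the module table later in this section): a 100 MHz memory clock gives a 400 MHz I/O clock, two transfers per I/O cycle, and 8 bytes per transfer, so

    $100\,\text{MHz} \times 4 \times 2 \times 8\,\text{B} = 6400\,\text{MB/s} = 64 \times 100\,\text{MB/s}$

which is DDR3-800, sold as PC3-6400.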
Overview

[Image: DDR, DDR2 and DDR3 modules for desktop PCs]


DDR3 memory provides a reduction in power consumption of 30% compared to DDR2 modules
due to DDR3's 1.5 V supply voltage, compared to DDR2's 1.8 V or DDR's 2.5 V. The 1.5 V
supply voltage works well with the 90 nanometer fabrication technology used in the original
DDR3 chips. Some manufacturers further propose using "dual-gate" transistors to reduce leakage
of current.[1]
According to JEDEC[2] the maximum recommended voltage is 1.575 volts and should be
considered the absolute maximum when memory stability is the foremost consideration, such as
in servers or other mission critical devices. In addition, JEDEC states that memory modules must
withstand up to 1.975 volts before incurring permanent damage, although they are not required to
function correctly at that level.
DDR3L is a low-voltage DDR3 standard introduced by JEDEC, with grades such as DDR3L-800,
DDR3L-1066, DDR3L-1333, and DDR3L-1600. DDR3L operates at a default voltage of 1.35 V and
uses at least 15% less power than DDR3 and roughly 40% less than DDR2. DDR3L modules are
labeled "PC3L".
The main benefit of DDR3 comes from the higher bandwidth made possible by DDR3's 8-burst-
deep prefetch buffer, in contrast to DDR2's 4-burst-deep or DDR's 2-burst-deep prefetch buffer.
DDR3 modules can transfer data at a rate of 800–2133 MT/s using both rising and falling edges
of a 400–1066 MHz I/O clock. Sometimes, a vendor may misleadingly advertise the I/O clock
rate by labeling the MT/s as MHz. The MT/s is normally twice that of MHz by double sampling,
one on the rising clock edge, and the other, on the falling. In comparison, DDR2's current range
of data transfer rates is 400–1066 MT/s using a 200–533 MHz I/O clock, and DDR's range is
200–400 MT/s based on a 100–200 MHz I/O clock. High-performance graphics was an initial
driver of such bandwidth requirements, where high bandwidth data transfer between
framebuffers is required.
DDR3 prototypes were announced in early 2005. Products in the form of motherboards appeared
on the market in June 2007[3] based on Intel's P35 "Bearlake" chipset with DIMMs at bandwidths
up to DDR3-1600 (PC3-12800).[4] The Intel Core i7, released in November 2008, connects
directly to memory rather than via a chipset. The Core i7 supports only DDR3. AMD's first
socket AM3 Phenom II X4 processors, released in February 2009, were their first to support
DDR3.
DDR3 DIMMs have 240 pins, are electrically incompatible with DDR2 and have a different key
notch location.[5] DDR3 SO-DIMMs have 204 pins.[6]
GDDR3 memory, having a similar name but being from an entirely dissimilar technology, has
been in use for graphic cards. GDDR3 has sometimes been incorrectly referred to as "DDR3".
Latencies
While the typical latencies for a JEDEC DDR2 device were 5-5-5-15, some standard latencies
for JEDEC DDR3 devices include 7-7-7-20 for DDR3-1066 and 8-8-8-24 for DDR3-1333.
DDR3 latencies are numerically higher because the I/O bus clock cycles by which they are
measured are shorter; the actual time interval is similar to DDR2 latencies (around 10 ns). There
is some improvement because DDR3 generally uses more recent manufacturing processes, but
this is not directly caused by the change to DDR3.
As with earlier memory generations, faster DDR3 memory became available after the release of
the initial versions. DDR3-2000 memory with 9-9-9-28 latency (9 ns) was available in time to
coincide with the Intel Core i7 release.[7] CAS latency of 9 at 1000 MHz (DDR3-2000) is 9 ns,
while CAS latency of 7 at 667 MHz (DDR3-1333) is 10.5 ns.
(CAS latency / I/O clock frequency in MHz) × 1000 = latency in ns
Example:
(7 / 667) × 1000 = 10.4948 ns ≈ 10.5 ns
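The same conversion as a tiny C helper (my own sketch, matching the formula above):

    /* CAS latency in nanoseconds, from the CL count and the I/O bus clock in MHz. */
    static double cas_latency_ns(int cas_cycles, double io_clock_mhz)
    {
        return (double)cas_cycles / io_clock_mhz * 1000.0;
    }

    /* cas_latency_ns(7, 667)  -> ~10.5 ns (DDR3-1333 at CL7)
       cas_latency_ns(9, 1000) -> 9.0 ns   (DDR3-2000 at CL9) */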
Extensions
Intel Corporation officially introduced the eXtreme Memory Profile (XMP) Specification on
March 23, 2007 to enable enthusiast performance extensions to the traditional JEDEC SPD
specifications for DDR3 SDRAM.[8]
Modules
JEDEC standard modules
Standard name   Memory clock   Cycle time   I/O bus clock   Data rate   Module name   Peak transfer   Timings
                (MHz)          (ns)         (MHz)           (MT/s)                    rate (MB/s)     (CL-nRCD-nRP)

DDR3-800D       100            10           400             800         PC3-6400      6400            5-5-5
DDR3-800E                                                                                             6-6-6

DDR3-1066E      133            7 1/2        533             1066        PC3-8500      8533            6-6-6
DDR3-1066F                                                                                            7-7-7
DDR3-1066G                                                                                            8-8-8

DDR3-1333F*     166            6            667             1333        PC3-10600     10667           7-7-7
DDR3-1333G                                                                                            8-8-8
DDR3-1333H                                                                                            9-9-9
DDR3-1333J*                                                                                           10-10-10

DDR3-1600G*     200            5            800             1600        PC3-12800     12800           8-8-8
DDR3-1600H                                                                                            9-9-9
DDR3-1600J                                                                                            10-10-10
DDR3-1600K                                                                                            11-11-11

DDR3-1866J*     233            4 2/7        933             1866        PC3-14900     14933           10-10-10
DDR3-1866K                                                                                            11-11-11
DDR3-1866L                                                                                            12-12-12
DDR3-1866M*                                                                                           13-13-13

DDR3-2133K*     266            3 3/4        1066            2133        PC3-17000     17066           11-11-11
DDR3-2133L                                                                                            12-12-12
DDR3-2133M                                                                                            13-13-13
DDR3-2133N*                                                                                           14-14-14

CL - Clock cycles between read or write and data output
nRCD - Clock cycles between activate and read or write
nRP - Clock cycles between precharge and activate
* optional
Note: All of the items listed above are specified by JEDEC in JESD79-3D.[9] RAM data rates in
between or above these listed specifications are not standardized by JEDEC; often they are
simply manufacturer optimizations using higher-tolerance or overvolted chips. Of these
non-standard specifications, the highest reported speed reached was equivalent to DDR3-2544 as
of May 2010.[10]
DDR3-xxx denotes data transfer rate, and describes raw DDR chips, whereas PC3-xxxx denotes
theoretical bandwidth (with the last two digits truncated), and is used to describe assembled
DIMMs. Bandwidth is calculated by taking transfers per second and multiplying by eight. This is
because DDR3 memory modules transfer data on a bus that is 64 data bits wide, and since a byte
comprises 8 bits, this equates to 8 bytes of data per transfer.
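For example (my own arithmetic, applying the rule above to one row of the JEDEC table):

    $1600\,\text{MT/s} \times 8\,\text{B/transfer} = 12800\,\text{MB/s}$

which is why DDR3-1600 chips appear on modules labeled PC3-12800.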
In addition to bandwidth and capacity variants, modules can
1. Optionally implement ECC, which is an extra data byte lane used for
correcting minor errors and detecting major errors for better reliability.
Modules with ECC are identified by an additional ECC or E in their
designation. For example: "PC3-6400 ECC", or PC3-8500E.[11]
2. Be "registered", which improves signal integrity (and hence potentially clock
rates and physical slot capacity) by electrically buffering the signals with a
register, at a cost of an extra clock of increased latency. Those modules are
identified by an additional R in their designation, whereas non-registered
(a.k.a. "unbuffered") RAM may be identified by an additional U in the
designation. PC3-6400R is a registered PC3-6400 module, PC3-6400R ECC is
the same module but with additional ECC.
3. Be fully buffered modules, which are designated by F or FB and do not have
the same notch position as other classes. Fully buffered modules cannot be
used with motherboards that are made for registered modules, and the
different notch position physically prevents their insertion.
Feature summary
DDR3 SDRAM components

• Introduction of asynchronous RESET pin


• Support of system-level flight-time compensation
• On-DIMM mirror-friendly DRAM pinout
• Introduction of CWL (CAS write latency) per clock bin
• On-die I/O calibration engine
• READ and WRITE calibration
DDR3 modules
• Fly-by command/address/control bus with on-DIMM termination
• High-precision calibration resistors
• Are not backwards compatible—DDR3 modules do not fit into DDR2 sockets;
forcing them can damage the DIMM and/or the motherboard[12]
Technological advantages compared to DDR2
• Higher bandwidth performance, up to 2133 MT/s standardized
• Slightly improved latencies as measured in nanoseconds
• Higher performance at low power (longer battery life in laptops)
• Enhanced low-power features
Market penetration
Although DDR3 was launched in 2007, DDR3 sales were not expected to overtake DDR2 until
the end of 2009, or possibly early 2010, according to Intel strategist Carlos Weissenberg,
speaking during the early part of their roll-out in August 2008[13] (the same view had been stated
by market intelligence company DRAMeXchange over a year earlier in April 2007.[14]) The
primary driving force behind the increased usage of DDR3 has been new Core i7 processors
from Intel and Phenom II processors from AMD, both of which have internal memory
controllers: the latter recommends DDR3, the former requires it. IDC stated in January 2009 that
DDR3 sales would account for 29 percent of the total DRAM units sold in 2009, rising to 72% by
2011.[15]
Successor
It was revealed at the Intel Developer Forum in San Francisco 2008 that the successor to DDR3
will be known as DDR4. It is currently in the design stage, and is expected to be released in
2012.[16] When released, it is expected to run at 1.2 volts or less,[17][18] versus the 1.5 volts of
DDR3 chips and have in excess of 2 billion data transfers per second.
See also
• Dual-channel architecture
• Triple-channel architecture
• List of device bandwidths

DDR

[Image: Generic DDR-266 memory in 184-pin DIMM form]
[Image: Corsair DDR-400 memory with heat spreaders]

Double data rate synchronous dynamic random access memory (DDR SDRAM) is a class of
memory integrated circuits used in computers. DDR SDRAM (sometimes referred to as DDR1
SDRAM) has been superseded by DDR2 SDRAM and DDR3 SDRAM, neither of which is forward
or backward compatible with DDR SDRAM, meaning that DDR2 or DDR3 memory modules will
not work in DDR-equipped motherboards, and vice versa.
Compared to single data rate (SDR) SDRAM, the DDR SDRAM interface makes higher transfer
rates possible by more strict control of the timing of the electrical data and clock signals.
Implementations often have to use schemes such as phase-locked loops and self-calibration to
reach the required timing accuracy.[1][2] The interface uses double pumping (transferring data on
both the rising and falling edges of the clock signal) to lower the clock frequency. One advantage
of keeping the clock frequency down is that it reduces the signal integrity requirements on the
circuit board connecting the memory to the controller. The name "double data rate" refers to the
fact that a DDR SDRAM with a certain clock frequency achieves nearly twice the bandwidth of
a single data rate (SDR) SDRAM running at the same clock frequency, due to this double
pumping.
With data being transferred 64 bits at a time, DDR SDRAM gives a transfer rate of (memory bus
clock rate) × 2 (for dual rate) × 64 (number of bits transferred) / 8 (number of bits/byte). Thus,
with a bus frequency of 100 MHz, DDR SDRAM gives a maximum transfer rate of 1600 MB/s.
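That formula is easy to sanity-check in code; a small sketch of my own:

    /* Peak DDR SDRAM transfer rate in MB/s:
       bus clock (MHz) x 2 transfers per cycle x 64 bits / 8 bits per byte. */
    static unsigned ddr_peak_mb_per_s(unsigned bus_clock_mhz)
    {
        return bus_clock_mhz * 2 * 64 / 8;
    }

    /* ddr_peak_mb_per_s(100) -> 1600 (DDR-200, PC-1600)
       ddr_peak_mb_per_s(200) -> 3200 (DDR-400, PC-3200) */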
"Beginning in 1996 and concluding in June 2000, JEDEC developed the DDR (Double Data
Rate) SDRAM specification (JESD79)."[3] JEDEC has set standards for data rates of DDR
SDRAM, divided into two parts. The first specification is for memory chips, and the second is
for memory modules.

SDRAM
Synchronous dynamic random access memory (SDRAM) is dynamic random access memory
(DRAM) that is synchronized with the system bus. Classic DRAM has an asynchronous
interface, which means that it responds as quickly as possible to changes in control inputs.
SDRAM has a synchronous interface, meaning that it waits for a clock signal before responding
to control inputs and is therefore synchronized with the computer's system bus. The clock is used
to drive an internal finite state machine that pipelines incoming instructions. This allows the chip
to have a more complex pattern of operation than an asynchronous DRAM, enabling higher
speeds.
Pipelining means that the chip can accept a new instruction before it has finished processing the
previous one. In a pipelined write, the write command can be immediately followed by another
instruction without waiting for the data to be written to the memory array. In a pipelined read,
the requested data appears after a fixed number of clock pulses after the read instruction, cycles
during which additional instructions can be sent. (This delay is called the latency and is an
important parameter to consider when purchasing SDRAM for a computer.)
SDRAM is widely used in computers; from the original SDRAM, further generations of DDR
(or DDR1) and then DDR2 and DDR3 have entered the mass market, with DDR4 currently being
designed and anticipated to be available in 2015.
SDRAM history
[Image: Eight SDRAM ICs on a PC100 DIMM package]

Although the concept of synchronous DRAM has been known since at least the 1970s and was
used with early Intel processors, it was only in 1993 that SDRAM began its path to universal
acceptance in the electronics industry. In 1993, Samsung introduced its KM48SL2000
synchronous DRAM, and by 2000, SDRAM had replaced virtually all other types of DRAM in
modern computers, because of its greater performance.
SDRAM latency is not inherently lower (faster) than asynchronous DRAM. Indeed, early
SDRAM was somewhat slower than contemporaneous burst EDO DRAM due to the additional
logic. The benefits of SDRAM's internal buffering come from its ability to interleave operations
to multiple banks of memory, thereby increasing effective bandwidth.
Today, virtually all SDRAM is manufactured in compliance with standards established by
JEDEC, an electronics industry association that adopts open standards to facilitate
interoperability of electronic components. JEDEC formally adopted its first SDRAM standard in
1993 and subsequently adopted other SDRAM standards, including those for DDR, DDR2 and
DDR3 SDRAM.
SDRAM is also available in registered varieties, for systems that require greater scalability such
as servers and workstations.
As of 2007[update], 168-pin SDRAM DIMMs are not used in new PC systems, and 184-pin DDR
memory has been mostly superseded. DDR2 SDRAM is the most common type used with new
PCs, and DDR3 motherboards and memory are widely available, and less expensive than still-
popular DDR2 products.
Today, the world's largest manufacturers of SDRAM include: Samsung Electronics, Panasonic,
Micron Technology, and Hynix.
SDRAM timing
There are several limits on DRAM performance. Most noted is the read cycle time, the time
between successive read operations to an open row. This time decreased from 10 ns for 100 MHz
SDRAM to 5 ns for DDR-400, but has remained relatively unchanged through DDR2-800 and
DDR3-1600 generations. However, by operating the interface circuitry at increasingly higher
multiples of the fundamental read rate, the achievable bandwidth has increased rapidly.
Another limit is the CAS latency, the time between supplying a column address and receiving the
corresponding data. Again, this has remained relatively constant at 10–15 ns through the last few
generations of DDR SDRAM.
In operation, CAS latency is a specific number of clock cycles programmed into the SDRAM's
mode register and expected by the DRAM controller. Any value may be programmed, but the
SDRAM will not operate correctly if it is too low. At higher clock rates, the useful CAS latency
in clock cycles naturally increases. 10–15 ns is 2–3 cycles (CL2–3) of the 200 MHz clock of
DDR-400 SDRAM, CL4-6 for DDR2-800, and CL8-12 for DDR3-1600. Slower clock cycles
will naturally allow lower numbers of CAS latency cycles.
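The controller-side arithmetic described here fits in a few lines; a sketch of my own, assuming the device's limit is known in nanoseconds:

    #include <math.h>

    /* Smallest legal CAS latency in clock cycles for a device whose
       analog limit is latency_ns, when run on a clock of clock_mhz:
       round the limit up to a whole number of clock periods. */
    static unsigned min_cas_cycles(double latency_ns, double clock_mhz)
    {
        return (unsigned)ceil(latency_ns * clock_mhz / 1000.0);
    }

    /* min_cas_cycles(15.0, 200.0) -> 3 (CL3 at DDR-400's 200 MHz clock)
       min_cas_cycles(15.0, 100.0) -> 2 (CL2 on 100 MHz SDR SDRAM)     */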
SDRAM modules have their own timing specifications, which may be slower than those of the
chips on the module. When 100 MHz SDRAM chips first appeared, some manufacturers sold
"100 MHz" modules that could not reliably operate at that clock rate. In response, Intel published
the PC100 standard, which outlines requirements and guidelines for producing a memory module
that can operate reliably at 100 MHz. This standard was widely influential, and the term "PC100"
quickly became a common identifier for 100 MHz SDRAM modules, and modules are now
commonly designated with "PC"-prefixed numbers (PC66, PC100 or PC133 - although the
actual meaning of the numbers has changed).
SDR SDRAM

[Image: 64 MB of sound memory on a Sound Blaster X-Fi Fatal1ty Pro: two Micron
48LC32M8A2-75 C SDRAM chips, 8 bits wide, running at 133 MHz (7.5 ns)][1]

Originally simply known as SDRAM, single data rate SDRAM can accept one command and
transfer one word of data per clock cycle. Typical clock frequencies are 100 and 133 MHz. Chips
are made with a variety of data bus sizes (most commonly 4, 8 or 16 bits), but chips are generally
assembled into 168-pin DIMMs that read or write 64 (non-ECC) or 72 (ECC) bits at a time.
Use of the data bus is intricate and thus requires a complex DRAM controller circuit. This is
because data written to the DRAM must be presented in the same cycle as the write command,
but reads produce output 2 or 3 cycles after the read command. The DRAM controller must
ensure that the data bus is never required for a read and a write at the same time.
Typical SDR SDRAM clock rates are 66, 100, and 133 MHz (periods of 15, 10, and 7.5 ns).
Clock rates up to 150 MHz were available for performance enthusiasts.
SDRAM control signals
All commands are timed relative to the rising edge of a clock signal. In addition to the clock,
there are 6 control signals, mostly active low, which are sampled on the rising edge of the clock:
• CKE Clock Enable. When this signal is low, the chip behaves as if the clock
has stopped. No commands are interpreted and command latency times do
not elapse. The state of other control lines is not relevant. The effect of this
signal is actually delayed by one clock cycle. That is, the current clock cycle
proceeds as usual, but the following clock cycle is ignored, except for testing
the CKE input again. Normal operations resume on the rising edge of the
clock after the one where CKE is sampled high.
Put another way, all other chip operations are timed relative to the rising
edge of a masked clock. The masked clock is the logical AND of the input
clock and the state of the CKE signal during the previous rising edge of the
input clock.
• /CS Chip Select. When this signal is high, the chip ignores all other inputs
(except for CKE), and acts as if a NOP command is received.
• DQM Data Mask. (The letter Q appears because, following digital logic
conventions, the data lines are known as "DQ" lines.) When high, these
signals suppress data I/O. When accompanying write data, the data is not
actually written to the DRAM. When asserted high two cycles before a read
cycle, the read data is not output from the chip. There is one DQM line per 8
bits on a x16 memory chip or DIMM.
• /RAS Row Address Strobe. Despite the name, this is not a strobe, but rather
simply a command bit. Along with /CAS and /WE, this selects one of 8
commands.
• /CAS Column Address Strobe. Despite the name, this is not a strobe, but
rather simply a command bit. Along with /RAS and /WE, this selects one of 8
commands.
• /WE Write enable. Along with /RAS and /CAS, this selects one of 8 commands.
This generally distinguishes read-like commands from write-like commands.
SDRAM devices are internally divided into 2 or 4 independent internal data banks. One or two
bank address inputs (BA0 and BA1) select which bank a command is directed toward.
Many commands also use an address presented on the address input pins. Some commands,
which either do not use an address, or present a column address, also use A10 to select variants.
The commands understood are as follows.
/CS  /RAS  /CAS  /WE  BAn   A10   An      Command
 H    x     x     x    x     x     x      Command inhibit (no operation)
 L    H     H     H    x     x     x      No operation
 L    H     H     L    x     x     x      Burst terminate: stop a burst read or burst write in progress
 L    H     L     H    bank  L     column Read: read a burst of data from the currently active row
 L    H     L     H    bank  H     column Read with auto precharge: as above, and precharge (close row) when done
 L    H     L     L    bank  L     column Write: write a burst of data to the currently active row
 L    H     L     L    bank  H     column Write with auto precharge: as above, and precharge (close row) when done
 L    L     H     H    bank  row   row    Active (activate): open a row for Read and Write commands
 L    L     H     L    bank  L     x      Precharge: deactivate the current row of the selected bank
 L    L     H     L    x     H     x      Precharge all: deactivate the current row of all banks
 L    L     L     H    x     x     x      Auto refresh: refresh one row of each bank, using an internal
                                          counter (all banks must be precharged)
 L    L     L     L    00    mode  mode   Load mode register: A0 through A9 are loaded to configure the
                                          DRAM chip; the most significant settings are CAS latency (2 or
                                          3 cycles) and burst length (1, 2, 4 or 8 cycles)
The various DDRx SDRAM standards use essentially the same commands, with minor additions.
Additional mode registers are distinguished using the bank address bits, and a third bank address
bit is added.
SDRAM operation
A 512 MB SDRAM DIMM (which contains 512 MiB = 512 × 1024² bytes = 536,870,912 bytes
exactly) might be made of 8 or 9 SDRAM chips, each containing 512 Mbit of storage, and each
one contributing 8 bits to the DIMM's 64- or 72-bit width. A typical 512 Mbit SDRAM chip
internally contains four independent 16 MByte memory banks. Each bank is an array of 8,192
rows of 16,384 bits each. A bank is either idle, active, or changing from one to the other.
The Active command activates an idle bank. It presents a 2-bit bank address (BA0–BA1) and a
13-bit row address (A0–A12), and causes a read of that row into the bank's array of all 16,384
column sense amplifiers. This is also known as "opening" the row. This operation has the side
effect of refreshing the dynamic (capacitive) memory storage cells of that row.
Once the row has been activated or "opened", Read and Write commands are possible to that
row. Activation requires a minimum amount of time, called the row-to-column delay, or tRCD,
before reads or writes to it may occur. This time, rounded up to the next multiple of the clock
period, specifies the minimum number of wait cycles between an Active command and a Read
or Write command. During these wait cycles, additional commands may be sent to other banks,
since each bank operates completely independently.
Both Read and Write commands require a column address. Because each chip accesses 8 bits of
data at a time, there are 2,048 possible column addresses, requiring only 11 address lines
(A0–A9, A11).
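To illustrate the geometry just described, here is a sketch of my own using the example chip's 2-bit bank, 13-bit row, and 11-bit column fields; real controllers choose their own bit mappings, so treat the layout as illustrative:

    #include <stdint.h>

    /* Split a flat word address inside the example 512 Mbit chip
       (4 banks x 8192 rows x 2048 columns of 8-bit words) into the
       fields the controller presents on BA0-BA1 and the address pins. */
    struct dram_addr { unsigned bank, row, col; };

    static struct dram_addr split_address(uint32_t word_addr)
    {
        struct dram_addr a;
        a.col  =  word_addr        & 0x7FF;   /* 11 column bits */
        a.row  = (word_addr >> 11) & 0x1FFF;  /* 13 row bits    */
        a.bank = (word_addr >> 24) & 0x3;     /*  2 bank bits   */
        return a;
    }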
When a Read command is issued, the SDRAM will produce the corresponding output data on the
DQ lines in time for the rising edge of the clock 2 or 3 clock cycles later (depending on the
configured CAS latency). Subsequent words of the burst will be produced in time for subsequent
rising clock edges.
A Write command is accompanied by the data to be written driven on to the DQ lines during the
same rising clock edge. It is the duty of the memory controller to ensure that the SDRAM is not
driving read data on to the DQ lines at the same time that it needs to drive write data on to those
lines. This can be done by waiting until a read burst has finished, by terminating a read burst, or
by using the DQM control line.
When the memory controller needs to access a different row, it must first return that bank's sense
amplifiers to an idle state, ready to sense the next row. This is known as a "precharge" operation,
or "closing" the row. A precharge may be commanded explicitly, or it may be performed
automatically at the conclusion of a read or write operation. Again, there is a minimum time, the
row precharge delay, tRP, which must elapse before that bank is fully idle and it may receive
another activate command.
Although refreshing a row is an automatic side effect of activating it, there is a minimum time
for this to happen, which requires a minimum row access time tRAS delay between an Active
command opening a row, and the corresponding precharge command closing it. This limit is
usually dwarfed by desired read and write commands to the row, so its value has little effect on
typical performance.
Command interactions
The no operation command is always permitted.
The load mode register command requires that all banks be idle, and a delay afterward for the
changes to take effect.
The auto refresh command also requires that all banks be idle, and takes a refresh cycle time tRFC
to return the chip to the idle state. (This time is usually equal to tRCD+tRP.)
The only other command that is permitted on an idle bank is the active command. This takes, as
mentioned above, tRCD before the row is fully open and can accept read and write commands.
When a bank is open, there are four commands permitted: read, write, burst terminate, and
precharge. Read and write commands begin bursts, which can be interrupted by following
commands.
Interrupting a read burst
A read, burst terminate, or precharge command may be issued at any time after a read command,
and will interrupt the read burst after the configured CAS latency. So if a read command is
issued on cycle 0, another read command is issued on cycle 2, and the CAS latency is 3, then the
first read command will begin bursting data out during cycles 3 and 4, then the results from the
second read command will appear beginning with cycle 5.
If the command issued on cycle 2 were burst terminate, or a precharge of the active bank, then no
output would be generated during cycle 5.
Although the interrupting read may be to any active bank, a precharge command will only
interrupt the read burst if it is to the same bank or all banks; a precharge command to a different
bank will not interrupt a read burst.
Interrupting a read burst with a write command is possible, but more difficult. It can be done if
the DQM signal is used to suppress output from the SDRAM so that the memory controller may
drive data over the DQ lines to the SDRAM in time for the write operation. Because the effects
of DQM on read data are delayed by 2 cycles, but the effects of DQM on write data are
immediate, DQM must be raised (to mask the read data) beginning at least two cycles before the
write command, but must be lowered for the cycle of the write command (assuming you want
the write command to have an effect).
Doing this in only two clock cycles requires careful coordination between the time the SDRAM
takes to turn off its output on a clock edge and the time the data must be supplied as input to the
SDRAM for the write on the following clock edge. If the clock frequency is too high to allow
sufficient time, three cycles may be required.
If the read command includes auto-precharge, the precharge begins the same cycle as the
interrupting command.
SDRAM burst ordering
A modern microprocessor with a cache will generally access memory in units of cache lines. To
transfer a 64-byte cache line requires 8 consecutive accesses to a 64-bit DIMM, which can all be
triggered by a single read or write command by configuring the SDRAM chips, using the mode
register, to perform 8-word bursts.
A cache line fetch is typically triggered by a read from a particular address, and SDRAM allows
the "critical word" of the cache line to be transferred first. ("Word" here refers to the width of the
SDRAM chip or DIMM, which is 64 bits for a typical DIMM.) SDRAM chips support two
possible conventions for the ordering of the remaining words in the cache line.
Bursts always access an aligned block of BL consecutive words beginning on a multiple of BL.
So, for example, a 4-word burst access to any column address from 4 to 7 will return words 4–7.
The ordering, however, depends on the requested address, and the configured burst type option:
sequential or interleaved. Typically, a memory controller will require one or the other.
When the burst length is 1 or 2, the burst type does not matter. For a burst length of 1, the
requested word is the only word accessed. For a burst length of 2, the requested word is accessed
first, and the other word in the aligned block is accessed second. This is the following word if an
even address was specified, and the previous word if an odd address was specified.
For the sequential burst mode, later words are accessed in increasing address order, wrapping
back to the start of the block when the end is reached. So, for example, for a burst length of 4,
and a requested column address of 5, the words would be accessed in the order 5-6-7-4. If the
burst length were 8, the access order would be 5-6-7-0-1-2-3-4. This is done by adding a counter
to the column address, and ignoring carries past the burst length.
The interleaved burst mode computes the address using an exclusive or operation between the
counter and the address. Using the same starting address of 5, a 4-word burst would return words
in the order 5-4-7-6. An 8-word burst would be 5-4-7-6-1-0-3-2. Although more confusing to
humans, this can be easier to implement in hardware, and is preferred by Intel microprocessors.
If the requested column address is at the start of a block, both burst modes return data in the
same sequential sequence 0-1-2-3-4-5-6-7. The difference only matters if fetching a cache line
from memory in critical-word-first order.
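Both orderings are simple bit manipulations on the column address; here is a small C sketch of my own that reproduces the examples above:

    /* Column address of the i-th word of a burst starting at address
       `start`, with burst length `bl` (a power of two). */
    static unsigned burst_addr_sequential(unsigned start, unsigned bl, unsigned i)
    {
        return (start & ~(bl - 1)) | ((start + i) & (bl - 1));
    }

    static unsigned burst_addr_interleaved(unsigned start, unsigned bl, unsigned i)
    {
        return (start & ~(bl - 1)) | ((start ^ i) & (bl - 1));
    }

    /* start = 5, bl = 4:
       sequential  -> 5, 6, 7, 4
       interleaved -> 5, 4, 7, 6   (matching the examples in the text) */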
SDRAM mode register
Single data rate SDRAM has a single 10-bit programmable mode register. Later double-data-rate
SDRAM standards add additional mode registers, addressed using the bank address pins. For
SDR SDRAM, the bank address pins and address lines A10 and above are ignored, but should be
zero during a mode register write.
The bits are M9 through M0, presented on address lines A9 through A0 during a load mode
register cycle.
1. M9: Write burst mode. If 0, writes use the read burst length and mode. If 1,
all writes are non-burst (single location).
2. M8, M7: Operating mode. Reserved, and must be 00.
3. M6, M5, M4: CAS latency. Generally only 010 (CL2) and 011 (CL3) are legal.
Specifies the number of cycles between a read command and data output
from the chip. The chip has a fundamental limit on this value in nanoseconds;
during initialization, the memory controller must use its knowledge of the
clock frequency to translate that limit into cycles.
4. M3: Burst type. A value of 0 requests sequential burst ordering, while 1 requests
interleaved burst ordering.
5. M2, M1, M0: Burst length. Values of 000, 001, 010 and 011 specify a burst
size of 1, 2, 4 or 8 words, respectively. Each read (and write, if M9 is 0) will
perform that many accesses, unless interrupted by a burst stop or other
command. A value of 111 specifies a full-row burst. The burst will continue
until interrupted. Full-row bursts are only permitted with the sequential burst
type.
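Putting those fields together, here is a sketch of my own showing how a controller might compose the mode word; the function name and argument encoding are illustrative only:

    /* Assemble the 10-bit SDR SDRAM mode register value from its fields. */
    static unsigned sdram_mode_word(unsigned write_burst_single, /* M9: 1 = single-location writes */
                                    unsigned cas_latency,        /* M6-M4: 2 or 3 */
                                    unsigned interleaved,        /* M3: 1 = interleaved bursts */
                                    unsigned burst_len_code)     /* M2-M0: 0=1, 1=2, 2=4, 3=8, 7=full row */
    {
        /* M8 and M7 (operating mode) must remain 00. */
        return (write_burst_single << 9)
             | (cas_latency        << 4)
             | (interleaved        << 3)
             |  burst_len_code;
    }

    /* sdram_mode_word(0, 2, 0, 3) -> 0x023: CL2, sequential bursts of 8 */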

Auto refresh


It is possible to refresh a RAM chip by opening and closing (activating and precharging) each
row in each bank. However, to simplify the memory controller, SDRAM chips support an "auto
refresh" command, which performs these operations to one row in each bank simultaneously.
The SDRAM also maintains an internal counter, which iterates over all possible rows. The
memory controller must simply issue a sufficient number of auto refresh commands (one per
row, 4096 in the example we have been using) every refresh interval (tREF = 64 ms is a common
value). All banks must be idle (closed, precharged) when this command is issued.
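As a worked example of the scheduling load this leaves with the controller (my own arithmetic, using the 4,096-row figure and the common tREF above):

    $64\,\text{ms} / 4096\,\text{rows} \approx 15.6\,\mu\text{s}$

so the controller must issue one auto refresh command roughly every 15.6 microseconds on average, though commands may also be batched, as long as all 4,096 complete within each 64 ms window.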
Low power modes
As mentioned, the clock enable (CKE) input can be used to effectively stop the clock to an
SDRAM. The CKE input is sampled each rising edge of the clock, and if it is low, the following
rising edge of the clock is ignored for all purposes other than checking CKE. As long as CKE is
low, it is permissible to change the clock rate, or even stop the clock entirely.
If CKE is lowered while the SDRAM is performing operations, it simply "freezes" in place until
CKE is raised again.
If the SDRAM is idle (all banks precharged, no commands in progress) when CKE is lowered,
the SDRAM automatically enters power-down mode, consuming minimal power until CKE is
raised again. This must not last longer than the maximum refresh interval tREF, or memory
contents may be lost. It is legal to stop the clock entirely during this time for additional power
savings.
Finally, if CKE is lowered at the same time as an auto-refresh command is sent to the SDRAM,
the SDRAM enters self-refresh mode. This is like power down, but the SDRAM uses an on-chip
timer to generate internal refresh cycles as necessary. The clock may be stopped during this time.
While self-refresh mode consumes slightly more power than power-down mode, it allows the
memory controller to be disabled entirely, which commonly more than makes up the difference.
SDRAM designed for battery-powered devices offers some additional power-saving options.
One is temperature-dependent refresh; an on-chip temperature sensor reduces the refresh rate at
lower temperatures, rather than always running it at the worst-case rate. Another is selective
refresh, which limits self-refresh to a portion of the DRAM array. The fraction which is
refreshed is configured using an extended mode register. The third, implemented in Mobile DDR
(LPDDR) and LPDDR2, is "deep power down" mode, which invalidates the memory and requires
a full reinitialization to exit. This is activated by sending a "burst terminate" command while
lowering CKE.
Generations of SDRAM
SDRAM (synchronous DRAM)
This type of SDRAM is slower than the DDR variants, because only one word of data is
transmitted per clock cycle (single data rate).
DDR SDRAM (sometimes called DDR1)
While the access latency of DRAM is fundamentally limited by the DRAM array, DRAM has
very high potential bandwidth because each internal read is actually a row of many thousands of
bits. To make more of this bandwidth available to users, a double data rate interface was
developed. This uses the same commands, accepted once per cycle, but reads or writes two
words of data per clock cycle. The DDR interface accomplishes this by reading and writing data
on both the rising and falling edges of the clock signal. In addition, some minor changes to the
SDR interface timing were made in hindsight, and the supply voltage was reduced from 3.3 to
2.5 V. As a result, DDR SDRAM is not backwards compatible with SDR SDRAM.[2]
DDR SDRAM (sometimes called DDR1 for greater clarity) doubles the minimum read or write
unit; every access refers to at least two consecutive words.
Typical DDR SDRAM clock rates are 133, 166 and 200 MHz (7.5, 6, and 5 ns/cycle), generally
described as DDR-266, DDR-333 and DDR-400 (3.75, 3, and 2.5 ns per beat). Corresponding
184-pin DIMMs are known as PC-2100, PC-2700 and PC-3200. Performance up to DDR-550
(PC-4400) is available for a price.
DDR2 SDRAM
DDR2 SDRAM is very similar to DDR SDRAM, but doubles the minimum read or write unit
again, to 4 consecutive words. The bus protocol was also simplified to allow higher performance
operation. (In particular, the "burst terminate" command is deleted.) This allows the bus rate of
the SDRAM to be doubled without increasing the clock rate of internal RAM operations; instead,
internal operations are performed in units 4 times as wide as SDRAM. Also, an extra bank
address pin (BA2) was added to allow 8 banks on large RAM chips.
Typical DDR2 SDRAM clock rates are 200, 266, 333 or 400 MHz (periods of 5, 3.75, 3 and
2.5 ns), generally described as DDR2-400, DDR2-533, DDR2-667 and DDR2-800 (periods of
2.5, 1.875, 1.5 and 1.25 ns). Corresponding 240-pin DIMMs are known as PC2-3200 through
PC2-6400. DDR2 SDRAM is now available at a clock rate of 533 MHz, generally described as
DDR2-1066; the corresponding DIMMs are known as PC2-8500 (also named PC2-8600
depending on the manufacturer). Performance up to DDR2-1250 (PC2-10000) is available for a
price.
Note that because internal operations are at 1/2 the clock rate, DDR2-400 memory (internal
clock rate 100 MHz) has somewhat higher latency than DDR-400 (internal clock rate 200 MHz).
DDR3 SDRAM
DDR3 continues the trend, doubling the minimum read or write unit to 8 consecutive words.
This allows another doubling of bandwidth and external bus rate without having to change the
clock rate of internal operations, just the width. To maintain 800–1600 M transfers/s (both edges
of a 400–800 MHz clock), the internal RAM array has to perform 100–200 M fetches per
second.
Again, with every doubling, the downside is the increased latency. As with all DDR SDRAM
generations, commands are still restricted to one clock edge and command latencies are given in
terms of clock cycles, which are half the speed of the usually quoted transfer rate (a CAS latency
of 8 with DDR3-800 is 8/(400 MHz) = 20 ns, exactly the same latency of CAS2 on PC100 SDR
SDRAM).
DDR3 memory chips are being made commercially,[3] and computer systems are available that
use them as of the second half of 2007,[4] with expected significant usage in 2008.[5] Initial clock
rates were 400 and 533 MHz, which are described as DDR3-800 and DDR3-1066 (PC3-6400
and PC3-8500 modules), but 667 and 800 MHz, described as DDR3-1333 and DDR3-1600
(PC3-10600 and PC3-12800 modules) are now common.[6] Performance up to DDR3-2200 is
available for a price.[7]
DDR4 SDRAM
DDR4 SDRAM will be the successor to DDR3 SDRAM. It was revealed at the Intel Developer
Forum in San Francisco in 2008, is currently in the design stage, and was originally expected to
be released in 2012.[8] It is now expected to be released in 2015.[9]
The new chips are expected to run at 1.2 V or less,[10][11] versus the 1.5 V of DDR3 chips, and
have in excess of 2 billion data transfers per second. They are expected to be introduced at
transfer rates of 2133 MT/s, estimated to rise to a potential 4266 MT/s (4.26 GT/s) [12] and
lowered voltage of 1.0 V [13] by 2013.
In February 2009, Samsung validated 40 nm DRAM chips, considered a "significant step"
towards DDR4 development.[14] As of 2009, current DRAM chips are only migrating to a 50 nm
process.[15]
In January 2011, Samsung announced the development of a 2 GB DDR4 DRAM module. It has a
maximum bandwidth of 2.13 Gbps at 1.2 V, uses pseudo open drain technology on a 30 nm-class
process, and draws 40% less power than an equivalent DDR3 module.[16][17]
Feature map

Type    Feature changes
SDRAM   Vcc = 3.3 V
        Signal: LVTTL
DDR1    Access is ≥2 words
        Double clocked
        Vcc = 2.5 V
        2.5–7.5 ns per cycle
        Signal: SSTL_2 (2.5 V)[18]
DDR2    Access is ≥4 words
        "Burst terminate" removed
        4 units used in parallel
        1.25–5 ns per cycle
        Internal operations are at 1/2 the clock rate
        Signal: SSTL_18 (1.8 V)[18]
DDR3    Access is ≥8 words
        Signal: SSTL_15 (1.5 V)[18]
        Much longer CAS latencies
DDR4    Vcc ≤ 1.2 V
Failed successors


In addition to DDR, there were several other proposed memory technologies to succeed SDR
SDRAM.
Rambus DRAM (RDRAM)
RDRAM was a proprietary technology that competed against DDR. Its relatively high price and
disappointing performance (resulting from high latencies and a narrow 16-bit data channel,
versus DDR's 64-bit channel) caused it to lose the race to succeed SDR SDRAM.
Synchronous-Link DRAM (SLDRAM)
SLDRAM boasted higher performance and competed against RDRAM. It was developed during
the late 1990s by the SLDRAM Consortium, which consisted of about 20 major computer
industry manufacturers. It is an open standard and does not require licensing fees. The
specifications called for a 64-bit bus running at a 200 MHz clock frequency. This is achieved by
all signals being on the same line and thereby avoiding the synchronization time of multiple
lines. Like DDR SDRAM, SLDRAM uses a double-pumped bus, giving it an effective speed of
400 MT/s.[19]
Virtual Channel Memory (VCM) SDRAM
VCM was a proprietary type of SDRAM that was designed by NEC, but released as an open
standard with no licensing fees. VCM creates a state in which the various system processes can
be assigned their own virtual channel, thus increasing the overall system efficiency by avoiding
the need to have processes share buffer space. This is accomplished by creating different
"blocks" of memory, allowing each individual memory block to interface separately with the
memory controller and have its own buffer space. VCM has higher performance than SDRAM
because it has significantly lower latencies. The technology was a potential competitor of
RDRAM because VCM was not nearly as expensive as RDRAM was. A VCM module is
mechanically and electrically compatible with standard SDRAM, but must be recognized by the
memory controller. Few motherboards were ever produced with VCM support.
See also
• SDRAM latency
• List of device bandwidths
References
1. "SDRAM Part Catalog". Micron. http://www.micron.com/products/dram/sdram/partlist. Retrieved 2007-09-28.
2. [1]
3. "What is DDR memory?". http://www.simmtester.com/page/news/showpubnews.asp?num=145.
4. Thomas Soderstrom (June 5, 2007). "Pipe Dreams: Six P35-DDR3 Motherboards Compared". Tom's Hardware. http://www.tomshardware.com/2007/06/05/pipe_dreams_six_p35-ddr3_motherboards_compared/.
5. "AMD to Adopt DDR3

DDR2
DDR2 SDRAM is a double data rate synchronous dynamic random access memory interface. It
supersedes the original DDR SDRAM specification and has itself been superseded by DDR3
SDRAM. DDR2 is neither forward nor backward compatible with either DDR or DDR3,
meaning that DDR2 memory modules will not work in DDR or DDR3 equipped motherboards
and vice versa.
In addition to double pumping the data bus as in DDR SDRAM (transferring data on the rising
and falling edges of the bus clock signal), DDR2 allows higher bus speed and requires lower
power by running the internal clock at half the speed of the data bus. The two factors combine to
produce a total of four data transfers per internal clock cycle. With data being transferred 64 bits
at a time, DDR2 SDRAM gives a transfer rate of (memory clock rate) × 2 (for bus clock
multiplier) × 2 (for dual rate) × 64 (number of bits transferred) / 8 (number of bits/byte). Thus
with a memory clock frequency of 100 MHz, DDR2 SDRAM gives a maximum transfer rate of
3200 MB/s.
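The same formula can be written out directly; a small Python sketch (the function name is ours)
reproducing the figures above:

    # MB/s = memory clock (MHz) x 2 (bus clock multiplier)
    #        x 2 (double data rate) x 64 bits / 8 bits-per-byte
    def ddr2_peak_mb_per_s(memory_clock_mhz):
        return memory_clock_mhz * 2 * 2 * 64 / 8

    print(ddr2_peak_mb_per_s(100))  # 3200.0 -> PC2-3200
    print(ddr2_peak_mb_per_s(266))  # 8512.0 -> marketed as PC2-8500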
Since the DDR2 internal clock runs at half the DDR external clock rate, DDR2 memory
operating at the same external data bus clock rate as DDR results in DDR2 being able to provide
the same bandwidth but with higher latency. Consequently, DDR2 RAM possesses inferior
performance. Alternatively, DDR2 memory operating at twice the external data bus clock rate as
DDR may provide twice the bandwidth with the same latency. The best-rated DDR2 memory
modules are at least twice as fast as the best-rated DDR memory modules.
Contents
• 1 Overview
• 2 Specification standards
○ 2.1 Chips and modules
• 3 Debut
• 4 Backward compatibility
• 5 See also
• 6 References
• 7 Further reading
• 8 External links
Overview
Like all SDRAM implementations, DDR2 stores memory in memory cells that are activated with
the use of a clock signal to synchronize their operation with an external data bus. Like DDR
before it, the DDR2 I/O buffer transfers data both on the rising and falling edges of the clock
signal (a technique called "double pumping"). The key difference between DDR and DDR2 is
that for DDR2 the memory cells are clocked at 1 quarter (rather than half) the rate of the bus.
This requires a 4-bit-deep prefetch queue, but, without changing the memory cells themselves,
DDR2 can effectively operate at twice the bus speed of DDR.
DDR2's bus frequency is boosted by electrical interface improvements, on-die termination,
prefetch buffers and off-chip drivers. However, latency is greatly increased as a trade-off. The
DDR2 prefetch buffer is 4 bits deep, whereas it is two bits deep for DDR and eight bits deep for
DDR3. While DDR SDRAM has typical read latencies of between 2 and 3 bus cycles, DDR2
may have read latencies between 4 and 6 cycles. Thus, DDR2 memory must be operated at twice
the data rate to achieve the same latency.
Another cost of the increased bandwidth is the requirement that the chips are packaged in a more
expensive and more difficult to assemble BGA package as compared to the TSSOP package of
the previous memory generations such as DDR SDRAM and SDR SDRAM. This packaging
change was necessary to maintain signal integrity at higher bus speeds.
Power savings are achieved primarily due to an improved manufacturing process through die
shrinkage, resulting in a drop in operating voltage (1.8 V compared to DDR's 2.5 V). The lower
memory clock frequency may also enable power reductions in applications that do not require
the highest available data rates.
According to JEDEC[1] the maximum recommended voltage is 1.9 volts and should be
considered the absolute maximum when memory stability is an issue (such as in servers or other
mission critical devices). In addition, JEDEC states that memory modules must withstand up to
2.3 volts before incurring permanent damage (although they may not actually function correctly
at that level).
Specification standards
Chips and modules
For use in computers, DDR2 SDRAM is supplied in DIMMs with 240 pins and a single locating
notch. Laptop DDR2 SO-DIMMs have 200 pins and often come identified by an additional S in
their designation. DIMMs are identified by their peak transfer capacity (often called bandwidth).
Standard name   Memory clock (MHz)   Cycle time (ns)   I/O bus clock (MHz)   Data rate (MT/s)   Module name   Peak transfer rate (MB/s)   Timings (CL-tRCD-tRP)[2][3]
DDR2-400B       100                  10                200                   400                PC2-3200      3200                        3-3-3
DDR2-400C       100                  10                200                   400                PC2-3200      3200                        4-4-4
DDR2-533B       133                  7.5               266                   533                PC2-4200*     4266                        3-3-3
DDR2-533C       133                  7.5               266                   533                PC2-4200*     4266                        4-4-4
DDR2-667C       166                  6                 333                   667                PC2-5300*     5333                        4-4-4
DDR2-667D       166                  6                 333                   667                PC2-5300*     5333                        5-5-5
DDR2-800C       200                  5                 400                   800                PC2-6400      6400                        4-4-4
DDR2-800D       200                  5                 400                   800                PC2-6400      6400                        5-5-5
DDR2-800E       200                  5                 400                   800                PC2-6400      6400                        6-6-6
DDR2-1066E      266                  3.75              533                   1066               PC2-8500*     8533                        6-6-6
DDR2-1066F      266                  3.75              533                   1066               PC2-8500*     8533                        7-7-7

* Some manufacturers label their DDR2 modules as PC2-4300, PC2-5400 or PC2-8600 instead
of the respective names suggested by JEDEC. At least one manufacturer has reported that this
reflects successful testing at a higher-than-standard data rate,[4] whilst others simply round up
for the name.
Note: DDR2-xxx denotes data transfer rate, and describes raw DDR chips, whereas PC2-xxxx
denotes theoretical bandwidth (with the last two digits truncated), and is used to describe
assembled DIMMs. Bandwidth is calculated by taking transfers per second and multiplying by
eight. This is because DDR2 memory modules transfer data on a bus that is 64 data bits wide,
and since a byte comprises 8 bits, this equates to 8 bytes of data per transfer.
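This naming rule can be sketched in a few lines of Python (a hypothetical helper; JEDEC of
course publishes the names rather than computing them):

    # PC2 module name: bandwidth in MB/s with the last two digits truncated.
    def pc2_name(data_rate_mt_s):
        bandwidth_mb_s = data_rate_mt_s * 8   # 64-bit bus = 8 bytes per transfer
        return f"PC2-{bandwidth_mb_s // 100 * 100}"

    print(pc2_name(400))   # PC2-3200
    print(pc2_name(533))   # PC2-4200 (4264 truncated)
    print(pc2_name(1066))  # PC2-8500 (8528 truncated)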
In addition to bandwidth and capacity variants, modules can
1. Optionally implement ECC, which is an extra data byte lane used for
correcting minor errors and detecting major errors for better reliability.
Modules with ECC are identified by an additional ECC in their designation.
PC2-4200 ECC is a PC2-4200 module with ECC.
2. Be "registered", which improves signal integrity (and hence potentially clock
rates and physical slot capacity) by electrically buffering the signals at a cost
of an extra clock of increased latency. Those modules are identified by an
additional R in their designation, whereas non-registered (a.k.a. "unbuffered")
RAM may be identified by an additional U in the designation. PC2-4200R is a
registered PC2-4200 module, PC2-4200R ECC is the same module but with
additional ECC.
3. Be fully buffered modules, which are designated by F or FB and do not have
the same notch position as other classes. Fully buffered modules cannot be
used with motherboards that are made for registered modules, and the
different notch position physically prevents their insertion.
Note: registered and un-buffered SDRAM generally cannot be mixed on the same channel.
Note that the highest-rated DDR2 modules in 2009 operate at 533 MHz (1066 MT/s), compared
to the highest-rated DDR modules operating at 200 MHz (400 MT/s). At the same time, the CAS
latency of 11.25 ns = 6 / (533 MHz) for the best PC2-8500 modules is comparable to the
10 ns = 2 / (200 MHz) of the best PC-3200 modules.
Debut
DDR2 was introduced in the second quarter of 2003 at two initial clock rates: 200 MHz (referred
to as PC2-3200) and 266 MHz (PC2-4200). Both performed worse than the original DDR
specification due to higher latency, which made total access times longer. However, the original
DDR technology tops out at a clock rate around 200 MHz (400 MT/s). Higher performance DDR
chips exist, but JEDEC has stated that they will not be standardized. These modules are mostly
manufacturer optimizations of highest-yielding chips, drawing significantly more power than
slower-clocked modules, and usually do not offer much, if any, greater real-world performance.
DDR2 started to become competitive with the older DDR standard by the end of 2004, as
modules with lower latencies became available.[5]
Backward compatibility

[Graphic: DDR, DDR2 and DDR3 for desktop PCs comparison]


DDR2 DIMMs are not designed to be backward compatible with DDR DIMMs. The notch on
DDR2 DIMMs is in a different position from DDR DIMMs, and the pin density is higher than
DDR DIMMs in desktops. DDR2 is a 240-pin module, DDR is a 184-pin module. Notebooks
have 200-pin modules for DDR and DDR2, however the notch on DDR modules is in a slightly
different position than that on DDR2 modules.
Higher performance DDR2 DIMMs are compatible with lower performance DDR2 DIMMs;
however, the higher performance module runs at the lower module's frequency. Using lower
performing DDR2 memory in a system capable of higher performance results in the bus running
at the rate of the lowest performance memory in use; however, in many systems this performance
hit can be mitigated to some extent by setting the timings of the memory to a lower latency
setting.

Difference between DDR and DDR2
Overview of DDR vs. DDR2 (Tom's Hardware, March 1, 2004, by Patrick Schmid)

Cache memory
Cache memory is random access memory (RAM) that a computer microprocessor
can access more quickly than it can access regular RAM. As the microprocessor
processes data, it looks first in the cache memory and if it finds the data there (from
a previous reading of data), it does not have to do the more time-consuming
reading of data from larger memory.

Cache memory is sometimes described in levels of closeness and accessibility to the
microprocessor. An L1 cache is on the same chip as the microprocessor. (For example, the
PowerPC 601 processor has a 32 kilobyte level-1 cache built into its chip.) L2 is usually a
separate static RAM (SRAM) chip. The main RAM is usually a dynamic RAM (DRAM) chip.
In addition to cache memory, one can think of RAM itself as a cache of memory for hard disk
storage since all of RAM's contents come from the hard disk initially when you turn your
computer on and load the operating system (you are loading it into RAM) and later as you start
new applications and access new data. RAM can also contain a special area called a disk cache
that contains the data most recently read in from the hard disk.

"Cache memory" redirects here. For the general use, see cache.

A CPU cache is a cache used by the central processing unit of a computer to reduce the average
time to access memory. The cache is a smaller, faster memory which stores copies of the data
from the most frequently used main memory locations. As long as most memory accesses are
cached memory locations, the average latency of memory accesses will be closer to the cache
latency than to the latency of main memory.
When the processor needs to read from or write to a location in main memory, it first checks
whether a copy of that data is in the cache. If so, the processor immediately reads from or writes
to the cache, which is much faster than reading from or writing to main memory.
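The usual back-of-the-envelope model for this effect is the average memory access time
(AMAT). A minimal Python sketch, with illustrative numbers of our choosing:

    # AMAT = hit time + miss rate x miss penalty
    def amat_ns(hit_ns, miss_rate, miss_penalty_ns):
        return hit_ns + miss_rate * miss_penalty_ns

    # A 2 ns cache in front of 100 ns main memory at a 95% hit rate:
    print(amat_ns(2.0, 0.05, 100.0))  # 7.0 ns -- far closer to 2 ns than to 100 ns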
Most modern desktop and server CPUs have at least three independent caches: an instruction
cache to speed up executable instruction fetch, a data cache to speed up data fetch and store,
and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation
for both executable instructions and data.
Contents
• 1 Details of operation
• 2 Cache entry structure
• 3 Associativity
  ○ 3.1 2-way set associative cache
  ○ 3.2 Speculative execution
  ○ 3.3 2-way skewed associative cache
  ○ 3.4 Pseudo-associative cache
• 4 Cache misses
• 5 Address translation
  ○ 5.1 Virtual indexing and virtual aliases
  ○ 5.2 Homonym and synonym problems
  ○ 5.3 Virtual tags and vhints
  ○ 5.4 Page coloring
• 6 Cache hierarchy in a modern processor
  ○ 6.1 Specialized caches
    ▪ 6.1.1 Victim cache
    ▪ 6.1.2 Trace cache
  ○ 6.2 Multi-level caches
    ▪ 6.2.1 Exclusive versus inclusive
  ○ 6.3 Example: the K8
  ○ 6.4 More hierarchies
• 7 Implementation
  ○ 7.1 History
    ▪ 7.1.1 First TLB implementations
    ▪ 7.1.2 First data cache
    ▪ 7.1.3 In x86 microprocessors
    ▪ 7.1.4 Current research
• 8 See also
• 9 Notes
• 10 References
• 11 External links
Details of operation

This section describes a typical data cache and some instruction caches; a TLB may have more
complexity and an instruction cache may be simpler. The diagram on the right shows two
memories. Each location in each memory contains data (a cache line), which in different designs
may range in size from 8 to 512 bytes.[citation needed] The size of the cache line is usually larger than
the size of the usual access requested by a CPU instruction[citation needed], which ranges from 1 to 16
bytes[citation needed] (the largest addresses and data handled by current 32 bit and 64 bit architectures
being 128 bits long, i.e. 16 bytes).[citation needed] Each location in each memory also has an index,
which is a unique number used to refer to that location. The index for a location in main memory
is called an address. Each location in the cache has a tag that contains the index of the datum in
main memory that has been cached. In a CPU's data cache these entries are called cache lines or
cache blocks.
When the processor needs to read or write a location in main memory, it first checks whether
that memory location is in the cache. This is accomplished by comparing the address of the
memory location to all tags in the cache that might contain that address. If the processor finds
that the memory location is in the cache, we say that a cache hit has occurred; otherwise, we
speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the
data in the cache line. The proportion of accesses that result in a cache hit is known as the hit
rate, and is a measure of the effectiveness of the cache for a given program or algorithm.
In the case of a miss, the cache allocates a new entry, which comprises the tag just missed and a
copy of the data. The reference can then be applied to the new entry just as in the case of a hit.
Read misses delay execution because they require data to be transferred from a much slower
memory than the cache itself. Write misses may occur without such penalty since the data can be
copied in the background. Instruction caches are similar to data caches but the CPU only
performs read accesses (instruction fetch) to the instruction cache. Instruction and data caches
can be separated for higher performance with Harvard CPUs but they can also be combined to
reduce the hardware overhead.
In order to make room for the new entry on a cache miss, the cache has to evict one of the
existing entries. The heuristic that it uses to choose the entry to evict is called the replacement
policy. The fundamental problem with any replacement policy is that it must predict which
existing cache entry is least likely to be used in the future. Predicting the future is difficult,
especially for hardware caches that use simple rules amenable to implementation in circuitry, so
there are a variety of replacement policies to choose from and no perfect way to decide among
them. One popular replacement policy, least recently used (LRU), replaces the entry that has
gone unused the longest. Marking some memory ranges as non-cacheable avoids polluting the
cache with information that is never or seldom re-used; misses to non-cacheable data are simply
ignored. Cache entries may also be disabled or locked depending on the context.
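To make the LRU policy concrete, here is a toy model in Python (a sketch of the policy only,
not of real cache hardware; all names are ours):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()      # tag -> data, oldest first

        def access(self, tag, load_from_memory):
            if tag in self.entries:           # hit: mark most recently used
                self.entries.move_to_end(tag)
                return self.entries[tag]
            data = load_from_memory(tag)      # miss: fetch from slower memory
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # evict least recently used
            self.entries[tag] = data
            return data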
If data are written to the cache, they must at some point be written to main memory as well. The
timing of this write is controlled by what is known as the write policy. In a write-through cache,
every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-
back cache, writes are not immediately mirrored to the main memory. Instead, the cache tracks
which locations have been written over (these locations are marked dirty). The data in these
locations are written back to the main memory when that data is evicted from the cache. For this
reason, a miss in a write-back cache may sometimes require two memory accesses to service: one
to first write the dirty location to memory and then another to read the new location from
memory.
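A write-back policy can be sketched with a dirty bit per line (a toy model of the bookkeeping
described above, not of hardware; the names are ours):

    class CacheLine:
        def __init__(self, tag, data):
            self.tag, self.data, self.dirty = tag, data, False

    def write(line, data):
        line.data = data
        line.dirty = True        # main memory is now stale; the write is deferred

    def evict(line, memory):
        if line.dirty:           # the extra memory access on a dirty eviction
            memory[line.tag] = line.data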
There are intermediate policies as well. The cache may be write-through, but the writes may be
held in a store data queue temporarily, usually so that multiple stores can be processed together
(which can reduce bus turnarounds and so improve bus utilization).
The data in main memory being cached may be changed by other entities (e.g. peripherals using
direct memory access, or another core in a multi-core processor), in which case the copy in the
cache may become out-of-date or stale. Alternatively, when a CPU in a multi-core processor
updates data in its cache, copies of that data in caches associated with other cores become stale.
Communication protocols between the cache managers which keep the data consistent are
known as cache coherence protocols. Another possibility is to share only non-cacheable data.
The time taken to fetch one datum from memory (read latency) matters because the CPU will run
out of things to do while waiting for the datum. When a CPU reaches this state, it is called a
stall. As CPUs become faster, stalls due to cache misses displace more potential computation;
modern CPUs can execute hundreds of instructions in the time taken to fetch a single datum from
the main memory. Various techniques have been employed to keep the CPU busy during this
time. Out-of-order CPUs (Pentium Pro and later Intel designs, for example) attempt to execute
independent instructions after the instruction that is waiting for the cache miss data. Another
technology, used by many processors, is simultaneous multithreading (SMT), or, in Intel's
terminology, hyper-threading (HT), which allows an alternate thread to use the CPU core while a
first thread waits for data to come from main memory.
Cache entry structure
Cache row entries usually have the following structure:
tag data blocks valid bit

The data blocks (cache line) contain the actual data fetched from the main memory. The valid bit
denotes that this particular entry contains valid data.
An effective memory address is split (MSB to LSB) into the tag, the index and the displacement
(offset),
tag index displacement

The index length is ⌈log2(number of cache rows)⌉ bits and describes which row the data has
been put in. The displacement length is ⌈log2(bytes per cache line)⌉ bits and specifies which
block of the ones we have stored we need. The tag length is address_length − index_length −
displacement_length and contains the most significant bits of the address, which are checked
against the current row (the row has been retrieved by index) to see if it is the one we need or
another, irrelevant memory location that happened to have the same index bits as the one we
want.
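As an example, with 64-byte lines, 64 rows and 32-bit addresses, the split can be computed
directly; a small Python sketch (parameter names of our choosing):

    CACHE_LINE_BYTES = 64   # displacement: log2(64) = 6 bits
    CACHE_ROWS = 64         # index:        log2(64) = 6 bits

    def split_address(addr):
        offset = addr & (CACHE_LINE_BYTES - 1)
        index = (addr >> 6) & (CACHE_ROWS - 1)
        tag = addr >> 12            # the remaining high 20 bits
        return tag, index, offset

    t, i, o = split_address(0x12345678)
    print(hex(t), hex(i), hex(o))   # 0x12345 0x19 0x38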
Associativity

[Figure: which memory locations can be cached by which cache locations]

Associativity is a trade-off. If there are ten places to which the replacement policy could have
mapped a memory location, then to check if that location is in the cache, ten cache entries must
be searched. Checking more places takes more power, chip area, and potentially time. On the
other hand, caches with more associativity suffer fewer misses (see conflict misses, below), so
that the CPU wastes less time reading from the slow main memory. The rule of thumb is that
doubling the associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the
same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have
much less effect on the hit rate, and are generally done for other reasons (see virtual aliasing,
below).[citation needed]
In order of increasing (worse) hit times and decreasing (better) miss rates,
• direct mapped cache—the best (fastest) hit times, and so the best tradeoff
for "large" caches
• 2-way set associative cache
• 2-way skewed associative cache – "the best tradeoff for .... caches whose
sizes are in the range 4K-8K bytes" – André Seznec[1]
• 4-way set associative cache
• fully associative cache – the best (lowest) miss rates, and so the best tradeoff
when the miss penalty is very high

2-way set associative cache


If each location in main memory can be cached in either of two locations in the cache, one
logical question is: which two? The simplest and most commonly used scheme, shown in the
right-hand diagram above, is to use the least significant bits of the memory location's index as
the index for the cache memory, and to have two entries for each index. One benefit of this
scheme is that the tags stored in the cache do not have to include that part of the main memory
address which is implied by the cache memory's index. Since the cache tags are fewer bits, they
take less area [on the microprocessor chip] and can be read and compared faster.
Speculative execution
One of the advantages of a direct mapped cache is that it allows simple and fast speculation.
Once the address has been computed, the one cache index which might have a copy of that
datum is known. That cache entry can be read, and the processor can continue to work with that
data before it finishes checking that the tag actually matches the requested address.
The idea of having the processor use the cached data before the tag match completes can be
applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just
one of the possible cache entries mapping to the requested address. This datum can then be used
in parallel with checking the full tag. The hint technique works best when used in the context of
address translation, as explained below.
2-way skewed associative cache
Other schemes have been suggested, such as the skewed cache,[1] where the index for way 0 is
direct, as above, but the index for way 1 is formed with a hash function. A good hash function
has the property that addresses which conflict with the direct mapping tend not to conflict when
mapped with the hash function, and so it is less likely that a program will suffer from an
unexpectedly large number of conflict misses due to a pathological access pattern. The downside
is extra latency from computing the hash function.[2] Additionally, when it comes time to load a
new line and evict an old line, it may be difficult to determine which existing line was least
recently used, because the new line conflicts with data at different indexes in each way; LRU
tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-
associative caches have major advantages over conventional set-associative ones.[3]
Pseudo-associative cache
A true set-associative cache tests all the possible ways simultaneously, using something like a
content addressable memory. A pseudo-associative cache tests each possible way one at a time.
A hash-rehash cache is one kind of pseudo-associative cache.
In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast
as a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache,
closer to the miss rate of a fully associative cache. [2]
Cache misses
A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results
in a main memory access with much longer latency. There are three kinds of cache misses:
instruction read miss, data read miss, and data write miss.
A cache read miss from an instruction cache generally causes the most delay, because the
processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched
from main memory.
A cache read miss from a data cache usually causes less delay, because instructions not
dependent on the cache read can be issued and continue execution until the data is returned from
main memory, and the dependent instructions can resume execution.
A cache write miss to a data cache generally causes the least delay, because the write can be
queued and there are few limitations on the execution of subsequent instructions. The processor
can continue until the queue is full.
In order to lower cache miss rate, a great deal of analysis has been done on cache behavior in an
attempt to find the best combination of size, associativity, block size, and so on. Sequences of
memory references performed by benchmark programs are saved as address traces. Subsequent
analyses simulate many different possible cache designs on these long address traces. Making
sense of how the many variables affect the cache hit rate can be quite confusing. One significant
contribution to this analysis was made by Mark Hill, who separated misses into three categories
(known as the Three Cs):
• Compulsory misses are those misses caused by the first reference to a
datum. Cache size and associativity make no difference to the number of
compulsory misses. Prefetching can help here, as can larger cache block
sizes (which are a form of prefetching). Compulsory misses are sometimes
referred to as cold misses.
• Capacity misses are those misses that occur regardless of associativity or
block size, solely due to the finite size of the cache. The curve of capacity
miss rate versus cache size gives some measure of the temporal locality of a
particular reference stream. Note that there is no useful notion of a cache
being "full" or "empty" or "near capacity": CPU caches almost always have
nearly every line filled with a copy of some line in main memory, and nearly
every allocation of a new line requires the eviction of an old line.
• Conflict misses are those misses that could have been avoided, had the
cache not evicted an entry earlier. Conflict misses can be further broken
down into mapping misses, that are unavoidable given a particular amount of
associativity, and replacement misses, which are due to the particular victim
choice of the replacement policy.
[Graph: miss rate versus cache size on the Integer portion of SPEC CPU2000]

The graph to the right summarizes the cache performance seen on the Integer portion of the
SPEC CPU2000 benchmarks, as collected by Hill and Cantin.[4] These benchmarks are intended
to represent the kind of workload that an engineering workstation computer might see on any
given day. The reader should keep in mind that finding benchmarks which are even usefully
representative of many programs has been very difficult, and there will always be important
programs with very different behavior than what is shown here.
We can see the different effects of the three Cs in this graph.
At the far right, with cache size labelled "Inf", we have the compulsory misses. If we wish to
improve a machine's performance on SpecInt2000, increasing the cache size beyond 1 MB is
essentially futile. That's the insight given by the compulsory misses.
The fully associative cache miss rate here is almost representative of the capacity miss rate. The
difference is that the data presented is from simulations assuming an LRU replacement policy.
Showing the capacity miss rate would require a perfect replacement policy, i.e. an oracle that
looks into the future to find a cache entry which is actually not going to be hit.
Note that our approximation of the capacity miss rate falls steeply between 32 KB and 64 KB.
This indicates that the benchmark has a working set of roughly 64 KB. A CPU cache designer
examining this benchmark will have a strong incentive to set the cache size to 64 KB rather than
32 KB. Note that, on this benchmark, no amount of associativity can make a 32 KB cache
perform as well as a 64 KB 4-way, or even a direct-mapped 128 KB cache.
Finally, note that between 64 KB and 1 MB there is a large difference between direct-mapped
and fully associative caches. This difference is the conflict miss rate. The insight from looking at
conflict miss rates is that secondary caches benefit a great deal from high associativity.
This benefit was well known in the late 80s and early 90s, when CPU designers could not fit
large caches on-chip, and could not get sufficient bandwidth to either the cache data memory or
cache tag memory to implement high associativity in off-chip caches. Desperate hacks were
attempted: the MIPS R8000 used expensive off-chip dedicated tag SRAMs, which had
embedded tag comparators and large drivers on the match lines, in order to implement a 4 MB 4-
way associative cache. The MIPS R10000 used ordinary SRAM chips for the tags. Tag access
for both ways took two cycles. To reduce latency, the R10000 would guess which way of the
cache would hit on each access.
Address translation
Main article: Translation lookaside buffer

Most general purpose CPUs implement some form of virtual memory. To summarize, each
program running on the machine sees its own simplified address space, which contains code and
data for that program only. Each program uses this virtual address space without regard for
where it exists in physical memory.
Virtual memory requires the processor to translate virtual addresses generated by the program
into physical addresses in main memory. The portion of the processor that does this translation is
known as the memory management unit (MMU). The fast path through the MMU can perform
those translations stored in the translation lookaside buffer (TLB), which is a cache of mappings
from the operating system's page table.
For the purposes of the present discussion, there are three important features of address
translation:
• Latency: The physical address is available from the MMU some time,
perhaps a few cycles, after the virtual address is available from the address
generator.
• Aliasing: Multiple virtual addresses can map to a single physical address.
Most processors guarantee that all updates to that single physical address
will happen in program order. To deliver on that guarantee, the processor
must ensure that only one copy of a physical address resides in the cache at
any given time.
• Granularity: The virtual address space is broken up into pages. For instance,
a 4 GB virtual address space might be cut up into 1048576 pages of 4 KB
size, each of which can be independently mapped. There may be multiple
page sizes supported; see virtual memory for elaboration.
A historical note: some early virtual memory systems were very slow, because they required an
access to the page table (held in main memory) before every programmed access to main
memory.[NB 1] With no caches, this effectively cut the speed of the machine in half. The first
hardware cache used in a computer system was not actually a data or instruction cache, but rather
a TLB.
Caches can be divided into 4 types, based on whether the index or tag correspond to physical or
virtual addresses:
• Physically indexed, physically tagged (PIPT) caches use the physical
address for both the index and the tag. While this is simple and avoids
problems with aliasing, it is also slow, as the physical address must be looked
up (which could involve a TLB miss and access to main memory) before that
address can be looked up in the cache.
• Virtually indexed, virtually tagged (VIVT) caches use the virtual address
for both the index and the tag. This caching scheme can result in much faster
lookups, since the MMU doesn't need to be consulted first to determine the
physical address for a given virtual address. However, VIVT suffers from
aliasing problems, where several different virtual addresses may refer to the
same physical address. The result is that such addresses would be cached
separately despite referring to the same memory, causing coherency
problems. Another problem is homonyms, where the same virtual address
maps to several different physical addresses. It is not possible to distinguish
these mappings by only looking at the virtual index, though potential
solutions include: flushing the cache after a context switch, forcing address
spaces to be non-overlapping, tagging the virtual address with an address
space ID (ASID), or using physical tags. Additionally, there is a problem that
virtual-to-physical mappings can change, which would require flushing cache
lines, as the VAs would no longer be valid.
• Virtually indexed, physically tagged (VIPT) caches use the virtual address
for the index and the physical address in the tag. The advantage over PIPT is
lower latency, as the cache line can be looked up in parallel with the TLB
translation, however the tag can't be compared until the physical address is
available. The advantage over VIVT is that since the tag has the physical
address, the cache can detect homonyms. VIPT requires more tag bits, as the
index bits no longer represent the same address.
• Physically indexed, virtually tagged caches are only theoretical as they
would basically be useless.[7]
The speed of this recurrence (the load latency) is crucial to CPU performance, and so most
modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to
proceed in parallel with fetching the data from the cache RAM.
But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual
aliases grows with cache size, and as a result most level-2 and larger caches are physically
indexed.
Caches have historically used both virtual and physical addresses for the cache tags, although
virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup,
then the physical address is available in time for tag compare, and there is no need for virtual
tagging. Large caches, then, tend to be physically tagged, and only small, very low latency
caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded
by vhints, as described below.
Virtual indexing and virtual aliases
The usual way the processor guarantees that virtually aliased addresses act as a single storage
location is to arrange that only one virtual alias can be in the cache at any given time.
Whenever a new entry is added to a virtually indexed cache, the processor searches for any
virtual aliases already resident and evicts them first. This special handling happens only during a
cache miss. No special work is necessary during a cache hit, which helps keep the fast path fast.
The most straightforward way to find aliases is to arrange for them all to map to the same
location in the cache. This happens, for instance, if the TLB has e.g. 4 KB pages, and the cache
is direct mapped and 4 KB or less.
Modern level-1 caches are much larger than 4 KB, but virtual memory pages have stayed that
size. If the cache is e.g. 16 KB and virtually indexed, for any virtual address there are four cache
locations that could hold the same physical location, but aliased to different virtual addresses. If
the cache misses, all four locations must be probed to see if their corresponding physical
addresses match the physical address of the access that generated the miss.
These probes are the same checks that a set associative cache uses to select a particular match.
So if a 16 KB virtually indexed cache is 4-way set associative and used with 4 KB virtual
memory pages, no special work is necessary to evict virtual aliases during cache misses because
the checks have already happened while checking for a cache hit.
Using the AMD Athlon as an example again: it has a 64 KB level-1 data cache, 4 KB pages, and
2-way set associativity. When the level-1 data cache suffers a miss, 2 of the 16 (= 64 KB / 4 KB)
possible virtual aliases have already been checked, and seven more cycles through the tag check
hardware are necessary to complete the check for virtual aliases.
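The alias-probe arithmetic generalizes; a quick Python sketch using the numbers from the
Athlon example above (the helper name is ours):

    def extra_alias_check_cycles(cache_kb, page_kb, ways):
        aliases = cache_kb // page_kb   # cache locations one physical line may occupy
        already_checked = ways          # probed during the normal hit check
        return (aliases - already_checked) // ways   # 'ways' tags checked per cycle

    print(extra_alias_check_cycles(64, 4, 2))   # 7 extra cycles, as above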
Homonym and synonym problems
A cache that relies on virtual indexing and tagging becomes inconsistent when the same virtual
address is mapped to different physical addresses (the homonym problem). This can be solved by
using the physical address for tagging, or by storing an address space identifier in each cache
line. The latter approach does not help with the synonym problem, in which several cache lines
end up storing data for the same physical address; writing to such a location may update only
one copy in the cache, leaving the others inconsistent. The problem may be solved by using
non-overlapping memory layouts for different address spaces; otherwise, the cache (or part of it)
must be flushed when the mapping changes.[8]
Virtual tags and vhints
Virtual tagging is possible too. The great advantage of virtual tags is that, for associative caches,
they allow the tag match to proceed before the virtual to physical translation is done. However,
• Coherence probes and evictions present a physical address for action. The
hardware must have some means of converting the physical addresses into a
cache index, generally by storing physical tags as well as virtual tags. For
comparison, a physically tagged cache does not need to keep virtual tags,
which is simpler.
• When a virtual to physical mapping is deleted from the TLB, cache entries
with those virtual addresses will have to be flushed somehow. Alternatively, if
cache entries are allowed on pages not mapped by the TLB, then those
entries will have to be flushed when the access rights on those pages are
changed in the page table.
It is also possible for the operating system to ensure that no virtual aliases are simultaneously
resident in the cache. The operating system makes this guarantee by enforcing page coloring,
which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It
has not been used recently, as the hardware cost of detecting and evicting virtual aliases has
fallen and the software complexity and performance penalty of perfect page coloring has risen.
It can be useful to distinguish the two functions of tags in an associative cache: they are used to
determine which way of the entry set to select, and they are used to determine if the cache hit or
missed. The second function must always be correct, but it is permissible for the first function to
guess, and get the wrong answer occasionally.
Some processors (e.g. early SPARCs) have caches with both virtual and physical tags. The
virtual tags are used for way selection, and the physical tags are used for determining hit or miss.
This kind of cache enjoys the latency advantage of a virtually tagged cache, and the simple
software interface of a physically tagged cache. It bears the added cost of duplicated tags,
however. Also, during miss processing, the alternate ways of the cache line indexed have to be
probed for virtual aliases and any matches evicted.
The extra area (and some latency) can be mitigated by keeping virtual hints with each cache
entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for
selecting the way of the cache from which to get data and a physical tag. Like a virtually tagged
cache, there may be a virtual hint match but physical tag mismatch, in which case the cache entry
with the matching hint must be evicted so that cache accesses after the cache fill at this address
will have just one hint match. Since virtual hints have fewer bits than virtual tags distinguishing
them from one another, a virtually hinted cache suffers more conflict misses than a virtually
tagged cache.
Perhaps the ultimate reduction of virtual hints can be found in the Pentium 4 (Willamette and
Northwood cores). In these processors the virtual hint is effectively 2 bits, and the cache is 4-way
set associative. Effectively, the hardware maintains a simple permutation from virtual address to
cache index, so that no content-addressable memory (CAM) is necessary to select the right one
of the four ways fetched.
Page coloring
Main article: Cache coloring

Large physically indexed caches (usually secondary caches) run into a problem: the operating
system rather than the application controls which pages collide with one another in the cache.
Differences in page allocation from one program run to the next lead to differences in the cache
collision patterns, which can lead to very large differences in program performance. These
differences can make it very difficult to get a consistent and repeatable timing for a benchmark
run.
To understand the problem, consider a CPU with a 1 MB physically indexed direct-mapped
level-2 cache and 4 KB virtual memory pages. Sequential physical pages map to sequential
locations in the cache until after 256 pages the pattern wraps around. We can label each physical
page with a color of 0–255 to denote where in the cache it can go. Locations within physical
pages with different colors cannot conflict in the cache.
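The color of a physical page in this example is just its page number modulo 256; a short Python
sketch (the constant names are ours):

    PAGE_SIZE = 4 * 1024
    NUM_COLORS = (1024 * 1024) // PAGE_SIZE   # 1 MB cache / 4 KB pages = 256

    def page_color(physical_addr):
        return (physical_addr // PAGE_SIZE) % NUM_COLORS

    # Physical locations exactly 1 MB apart share a color and can conflict:
    print(page_color(0x000000), page_color(0x100000))   # 0 0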
A programmer attempting to make maximum use of the cache may arrange his program's access
patterns so that only 1 MB of data need be cached at any given time, thus avoiding capacity
misses. But he should also ensure that the access patterns do not have conflict misses. One way
to think about this problem is to divide up the virtual pages the program uses and assign them
virtual colors in the same way as physical colors were assigned to physical pages before. The
programmer can then arrange the access patterns of his code so that no two pages with the same
virtual color are in use at the same time. There is a wide literature on such optimizations (e.g.
loop nest optimization), largely coming from the High Performance Computing (HPC)
community.
The snag is that while all the pages in use at any given moment may have different virtual colors,
some may have the same physical colors. In fact, if the operating system assigns physical pages
to virtual pages randomly and uniformly, it is extremely likely that some pages will have the
same physical color, and then locations from those pages will collide in the cache (this is the
birthday paradox).
The solution is to have the operating system attempt to assign different physical color pages to
different virtual colors, a technique called page coloring. Although the actual mapping from
virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep
track of and have little benefit, so most approaches to page coloring simply try to keep physical
and virtual page colors the same.
If the operating system can guarantee that each physical page maps to only one virtual color,
then there are no virtual aliases, and the processor can use virtually indexed caches with no need
for extra virtual alias probes during miss handling. Alternatively, the O/S can flush a page from
the cache whenever it changes from one virtual color to another. As mentioned above, this
approach was used for some early SPARC and RS/6000 designs.
Cache hierarchy in a modern processor
Modern processors have multiple interacting caches on chip.
Specialized caches
Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-
physical address translation, and data fetch (see classic RISC pipeline). The natural design is to
use different physical caches for each of these points, so that no one physical resource has to be
scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least
three separate caches (instruction, TLB, and data), each specialized to its particular role.
Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard
architecture. Originally, this phrase referred to machines with separate instruction and data
memories, which proved not at all popular. Most modern CPUs have a single-memory von
Neumann architecture.
Victim cache
A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The
victim cache lies between the main cache and its refill path, and only holds blocks that were
evicted from the main cache. The victim cache is usually fully associative, and is intended to
reduce the number of conflict misses. Many commonly used programs do not require an
associative mapping for all the accesses. In fact, only a small fraction of the memory accesses of
the program require high associativity. The victim cache exploits this property by providing high
associativity to only these accesses. It was introduced by Norman Jouppi in 1990.
Trace cache
One of the more extreme examples of cache specialization is the trace cache found in the Intel
Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch
bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of
instructions that have already been fetched and decoded.
The earliest widely acknowledged academic publication of trace cache was by Eric Rotenberg,
Steve Bennett, and Jim Smith in their 1996 paper "Trace Cache: a Low Latency Approach to
High Bandwidth Instruction Fetching."
An earlier publication is US Patent 5,381,533, "Dynamic flow instruction cache memory
organized around trace segments independent of virtual address line", by Alex Peleg and Uri
Weiser of Intel Corp., patent filed March 30, 1994, a continuation of an application filed in 1992,
later abandoned.
A trace cache stores instructions either after they have been decoded, or as they are retired.
Generally, instructions are added to trace caches in groups representing either individual basic
blocks or dynamic instruction traces. A dynamic trace ("trace path") contains only instructions
whose results are actually used, and eliminates instructions following taken branches (since they
are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows
the instruction fetch unit of a processor to fetch several basic blocks, without having to worry
about branches in the execution flow.
Trace lines are stored in the trace cache based on the program counter of the first instruction in
the trace and a set of branch predictions. This allows for storing different trace paths that start on
the same address, each representing different branch outcomes. In the instruction fetch stage of a
pipeline, the current program counter along with a set of branch predictions is checked in the
trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to
a regular cache or to memory for these instructions. The trace cache continues to feed the fetch
unit until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a
new trace starts to be built.
Trace caches are also used in processors like the Intel Pentium 4 to store already decoded micro-
operations, or translations of complex x86 instructions, so that the next time an instruction is
needed, it does not have to be decoded again.
See the full text of Smith, Rotenberg and Bennett's paper at Citeseer.
Multi-level caches
Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have
better hit rates but longer latency. To address this tradeoff, many computers use multiple levels
of cache, with small fast caches backed up by larger slower caches.
Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits,
the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is
checked, and so on, before external memory is checked.
As the latency difference between main memory and the fastest cache has become larger, some
processors have begun to utilize as many as three levels of on-chip cache. For example, the
Alpha 21164 (1995) had 1 to 64 MB of off-chip L3 cache; the IBM POWER4 (2001) had a 256 MB
off-chip L3 cache, shared among several processors; the Itanium 2 (2003) had a 6 MB unified
level 3 (L3) cache on-die; the Itanium 2 (2003) MX 2 module incorporated two Itanium 2
processors along with a shared 64 MB L4 cache on an MCM that was pin compatible with a
Madison processor; Intel's Xeon MP product code-named "Tulsa" (2006) features 16 MB of on-
die L3 cache shared between two processor cores; the AMD Phenom II (2008) has up to 6 MB
on-die unified L3 cache; and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is
inclusive, shared by all cores. The benefits of an L3 cache depend on the application's access
patterns.
Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered
the smallest, fastest cache in the system, with the special characteristic that it is scheduled in
software—typically by a compiler, as it allocates registers to hold values retrieved from main
memory. (See especially loop nest optimization.) Register files sometimes also have hierarchy:
The Cray-1 (circa 1976) had 8 address "A" and 8 scalar data "S" registers that were generally
usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to
access, but were faster than main memory. The "B" and "T" registers were provided because the
Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)
Exclusive versus inclusive
Multi-level caches introduce new design decisions. For instance, in some processors, all data in
the L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive.
Other processors (like the AMD Athlon) have exclusive caches — data is guaranteed to be in at
most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II,
III, and 4), do not require that data in the L1 cache also reside in the L2 cache, although it may
often do so. There is no universally accepted name for this intermediate policy, although the term
mainly inclusive has been used.[citation needed]
The advantage of exclusive caches is that they store more data. This advantage is larger when the
exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times
larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line
in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just
copying a line from L2 to L1, which is what an inclusive cache does.
One advantage of strictly inclusive caches is that when external devices or other processors in a
multiprocessor system wish to remove a cache line from the processor, they need only have the
processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache
must be checked as well. As a drawback, there is a correlation between the associativities of L1
and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the
effective associativity of the L1 caches is restricted.
Another advantage of inclusive caches is that the larger cache can use larger cache lines, which
reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the
same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit.) If the secondary
cache is an order of magnitude larger than the primary, and the cache data is an order of
magnitude larger than the cache tags, this tag area saved can be comparable to the incremental
area needed to store the L1 cache data in the L2.
Example: the K8
To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8
core in the AMD Athlon 64 CPU.[9]

[Diagram: example of the K8 cache hierarchy]

The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a
data cache. Each of these caches is specialized:
• The instruction cache keeps copies of 64-byte lines of memory, and fetches
16 bytes each cycle. Each byte in this cache is stored in ten bits rather than
8, with the extra bits marking the boundaries of instructions (this is an
example of predecoding). The cache has only parity protection rather than
ECC, because parity is smaller and any damaged data can be replaced by
fresh data fetched from memory (which always has an up-to-date copy of
instructions).
• The instruction TLB keeps copies of page table entries (PTEs). Each cycle's
instruction fetch has its virtual address translated through this TLB into a
physical address. Each entry is either 4 or 8 bytes in memory. Each of the
TLBs is split into two sections, one to keep PTEs that map 4 KB, and one to
keep PTEs that map 4 MB or 2 MB. The split allows the fully associative match
circuitry in each section to be simpler. The operating system maps different
sections of the virtual address space with different size PTEs.
• The data TLB has two copies which keep identical entries. The two copies
allow two data accesses per cycle to translate virtual addresses to physical
addresses. Like the instruction TLB, this TLB is split into two kinds of entries.
• The data cache keeps copies of 64-byte lines of memory. It is split into 8
banks (each storing 8 KB of data), and can fetch two 8-byte data each cycle
so long as those data are in different banks. There are two copies of the tags,
because each 64-byte line is spread among all 8 banks. Each tag copy
handles one of the two accesses per cycle.
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which
store only PTEs mapping 4 KB. Both instruction and data caches, and the various TLBs, can fill
from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data
caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1
data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE
which is also in one of the TLBs—the operating system is responsible for keeping the TLBs
coherent by flushing portions of them when the page tables in memory are updated.
The K8 also caches information that is never stored in memory—prediction information. These
caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly
complex branch prediction, with tables that help predict whether branches are taken and other
tables which predict the targets of branches and jumps. Some of this information is associated
with instructions, in both the level 1 instruction cache and the unified secondary cache.
The K8 uses an interesting trick to store prediction information with instructions in the secondary
cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an
alpha particle strike) by either ECC or parity, depending on whether those lines were evicted
from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC
code, lines from the instruction cache have a few spare bits. These bits are used to cache branch
prediction information associated with those instructions. The net result is that the branch
predictor has a larger effective history table, and so has better accuracy.
More hierarchies
Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in the
DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.
These predictors are caches in that they store information that is costly to compute. Some of the
terminology used when discussing predictors is the same as that for caches (one speaks of a hit
in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
The K8 keeps the instruction and data caches coherent in hardware, which means that a store
into an instruction closely following the store instruction will change that following instruction.
Other processors, like those in the Alpha and MIPS families, have relied on software to keep the
instruction cache coherent: stores are not guaranteed to show up in the instruction stream until a
program calls an operating system facility to ensure coherency.
Implementation
Cache reads are the most common CPU operation that takes more than a single cycle. Program
execution time tends to be very sensitive to the latency of a level-1 data cache hit. A great deal of
design effort, and often power and silicon area are expended making the caches as fast as
possible.
The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated
with an adder, the relevant portion of the address extracted and used to index an SRAM, which
returns the loaded data. The data is byte aligned in a byte shifter, and from there is bypassed to
the next operation. There is no need for any tag checking in the inner loop — in fact, the tags
need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for
the loaded data must be read, and checked against the virtual address to make sure there was a
cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is
restarted.
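A minimal sketch of that read path in C, assuming a 4 KB direct-mapped cache with 64-byte lines; for simplicity the tag check that the hardware defers until later in the pipeline is performed inline here:

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64
    #define NUM_LINES  64               /* 4 KB direct-mapped: assumed geometry */

    struct line { uint32_t tag; uint8_t data[LINE_BYTES]; int valid; };
    static struct line cache[NUM_LINES];

    /* Index with the low address bits and return the loaded word; on a
       tag mismatch the caller would refill the line and restart. */
    static int load32(uint32_t vaddr, uint32_t *out)
    {
        uint32_t index = (vaddr / LINE_BYTES) % NUM_LINES;
        uint32_t tag   = vaddr / (LINE_BYTES * NUM_LINES);
        struct line *l = &cache[index];
        if (!l->valid || l->tag != tag)
            return 0;                   /* miss */
        memcpy(out, &l->data[vaddr % LINE_BYTES], 4);
        return 1;                       /* hit */
    }

    int main(void)
    {
        uint32_t v;
        return load32(0x12345678, &v) ? 0 : 1;  /* cold cache: reports a miss */
    }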
An associative cache is more complicated, because some form of tag must be read to determine
which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N
possible tags and N data in parallel, and then chooses the data associated with the matching tag.
Level-2 caches sometimes save power by reading the tags first, so that only one data element is
read from the data SRAM.
[Figure: Read path for a 2-way associative cache]
The diagram above is intended to clarify the manner in which the various fields of the
address are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows
the SRAMs, indexing, and multiplexing for a 4 KB, 2-way set-associative, virtually indexed and
virtually tagged cache with 64 B lines, a 32b read width and 32b virtual address.
Because the cache is 4 KB and has 64 B lines, there are just 64 lines in the cache, and we read
two at a time from a Tag SRAM which has 32 rows, each with a pair of 21-bit tags. Although
any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs,
it is simplest to use the least significant bits.
Similarly, because the cache is 4 KB and has a 4 B read path, and reads two ways for each
access, the Data SRAM is 512 rows by 8 bytes wide.
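For exactly this geometry, the field carving can be written down directly. The bit positions (offset in bits 5:0, index in bits 10:6, a 21-bit tag in bits 31:11) follow from the 64 B lines and 32 sets described above; this is only a sketch of the arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t vaddr = 0xDEADBEEF;             /* any 32-bit virtual address */

        uint32_t offset = vaddr        & 0x3F;   /* bits  5:0, byte within the 64 B line */
        uint32_t index  = (vaddr >> 6) & 0x1F;   /* bits 10:6, one of the 32 sets        */
        uint32_t tag    = vaddr >> 11;           /* bits 31:11, the 21-bit tag           */

        /* Both ways of set 'index' are read in parallel, and 'tag' is
           compared against the pair of 21-bit tags stored in that row. */
        printf("offset=%u index=%u tag=0x%X\n", offset, index, tag);
        return 0;
    }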
A more modern cache might be 16 KB, 4-way set-associative, virtually indexed, virtually hinted,
and physically tagged, with 32 B lines, 32b read width and 36b physical addresses. The read path
recurrence for such a cache looks very similar to the path above. Instead of tags, vhints are read,
and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is
translated into a physical address by the TLB, and the physical tag is read (just one, as the vhint
supplies which way of the cache to read). Finally the physical address is compared to the
physical tag to determine if a hit has occurred.
Some SPARC designs have improved the speed of their L1 caches by a few gate delays by
collapsing the virtual address adder into the SRAM decoders. See Sum addressed decoder.
History
The early history of cache technology is closely tied to the invention and use of virtual memory.
Because of the scarcity and cost of semiconductor memories, early mainframe computers in the
1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory used by
programs. The memory technologies spanned semiconductor, magnetic core, drum and disc. The
virtual memory seen and used by programs was flat, and caching was used to fetch data and
instructions into the fastest memory ahead of processor access.
Extensive studies were done to optimize cache sizes. Optimal values were found to depend
greatly on the programming language used, with Algol needing the smallest and Fortran and
Cobol needing the largest cache sizes.
In the early days of microcomputer technology, memory access was only slightly slower than
register access. But since the 1980s [10] the performance gap between processor and memory has
been growing. Microprocessors have advanced much faster than memory, especially in terms of
their operating frequency, so memory became a performance bottleneck. While it was technically
possible to have all the main memory as fast as the CPU, a more economically viable path has
been taken: use plenty of low-speed memory, but also introduce a small high-speed cache
memory to alleviate the performance gap. This provided an order of magnitude more capacity—
for the same price—with only a slightly reduced combined performance.
First TLB implementations
The first documented uses of a TLB were on the GE 645[11] and the IBM 360/67[12], both of
which used an associative memory as a TLB.
First data cache
The first documented use of a data cache was on the IBM System/360 Model 85[13].
In x86 microprocessors
As the x86 microprocessors reached clock rates of 20 MHz and above in the 386, small amounts
of fast cache memory began to be featured in systems to improve performance. This was because
the DRAM used for main memory had significant latency, up to 120 ns, as well as refresh cycles.
The cache was constructed from more expensive, but significantly faster, SRAM, which at the
time had latencies around 10 ns. The early caches were external to the processor and typically
located on the motherboard in the form of eight or nine DIP devices placed in sockets to enable
the cache as an optional extra or upgrade feature.
Some versions of the Intel 386 processor could support 16 to 64 KB of external cache.
With the 486 processor, an 8 KB cache was integrated directly into the CPU die. This cache was
termed Level 1 or L1 cache to differentiate it from the slower on-motherboard, or Level 2 (L2)
cache. These on-motherboard caches were much larger, with the most common size being 256
KB. The popularity of on-motherboard cache continued on through the Pentium MMX era but
was made obsolete by the introduction of SDRAM and the growing disparity between bus clock
rates and CPU clock rates, which caused on-motherboard cache to be only slightly faster than
main memory.
The next evolution in cache implementation in the x86 microprocessors began with the Pentium
Pro, which brought the secondary cache onto the same package as the microprocessor, clocked at
the same frequency as the microprocessor.
On-motherboard caches enjoyed prolonged popularity thanks to the AMD K6-2 and AMD K6-III
processors that still used the venerable Socket 7, which was previously used by Intel with on-
motherboard caches. The K6-III included 256 KB of on-die L2 cache and took advantage of the
on-board cache as a third-level cache, named L3 (motherboards with up to 2 MB of on-board
cache were produced). After Socket 7 became obsolete, on-motherboard cache disappeared from
x86 systems.
Three-level caches were used again with the introduction of multiple processor cores, where the
L3 cache was added to the CPU die. It became common for each cache level to be larger than the
one before it, so that Level 3 caches of eight megabytes became unremarkable. This trend appears
likely to continue for the foreseeable future.
Current research
Early cache designs focused entirely on the direct cost of cache and RAM and average execution
speed. More recent cache designs also consider energy efficiency, fault tolerance, and other
goals.[14]
There are several tools available to computer architects to help explore tradeoffs between cache
cycle time, energy, and area. These tools include the open-source CACTI cache simulator[15] and
the open-source SimpleScalar instruction set simulator.
See also
• Cache coherency
• Cache algorithms
• Dinero (Cache simulator by University of Wisconsin System)
• No-write allocation
• Memoization, briefly defined in List of computer term etymologies
• Scratchpad RAM
• Write buffer
Internal Cache
The internal cache is memory used by the CPU to reduce the average time needed to access main
memory. It is a small amount of faster memory that holds copies of the data the processor uses
most often. When the CPU reads or writes the same data again and again, that data is kept in the
internal cache, and the processor can then read or write the cache directly, which is much faster
than reading or writing main memory (RAM).
External Cache
This setting enables or disables the external cache on your processor, also known as the L2 or
level 2 cache. Most 486 or later motherboards include this cache memory. Like the internal
cache setting, this should be enabled at all times unless you are disabling it for troubleshooting
purposes. Disabling the external cache will cause your system to slow down dramatically, but
doing so can be useful if you are having system crashes and suspect a problem with the cache chips.
On some BIOSes you may see three choices: "Disabled", "Write Through" and "Write Back".
These refer to the cache's write policy. The write-back policy will produce the best
performance, as shown in the sketch below.
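The difference between the two policies can be sketched as follows. This is a deliberately simplified single-line model, not any particular chipset's implementation, and the DRAM-write helpers are stand-ins:

    #include <stdint.h>
    #include <stdio.h>

    struct line { uint32_t tag; uint8_t data[64]; int valid, dirty; };

    /* Stand-ins for the memory bus, so the sketch runs. */
    static void write_to_dram(const struct line *l, uint32_t off)
    { (void)l; printf("DRAM write: byte %u\n", off); }
    static void write_line_to_dram(const struct line *l)
    { (void)l; printf("DRAM write: whole line\n"); }

    enum policy { WRITE_THROUGH, WRITE_BACK };

    static void store_byte(struct line *l, enum policy p, uint32_t off, uint8_t v)
    {
        l->data[off] = v;
        if (p == WRITE_THROUGH)
            write_to_dram(l, off);  /* every store also goes to memory        */
        else
            l->dirty = 1;           /* write back: memory is updated on evict */
    }

    static void evict(struct line *l, enum policy p)
    {
        if (p == WRITE_BACK && l->dirty)
            write_line_to_dram(l);  /* the deferred stores reach DRAM here */
        l->valid = l->dirty = 0;
    }

    int main(void)
    {
        struct line l = {0};
        store_byte(&l, WRITE_BACK, 3, 0xAB);  /* no bus traffic yet */
        evict(&l, WRITE_BACK);                /* one line write     */
        return 0;
    }

Write back performs best because repeated stores to the same line cost a single line write at eviction rather than one memory write per store.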
Note: There are some motherboards out on the market, particularly PCI-based 486
motherboards, that have fake level 2 cache on the board. One way to test for this is to
disable the external cache and see if there is a performance decrease in the system. If there
isn't, you never had any level 2 cache to begin with. In addition, some systems will report (in the
System Configuration Summary) the presence of enabled level 2 cache even when it is disabled.
This is a BIOS that has been "doctored" and is a sign of fake cache on the motherboard as well.
Amazing the trouble people will go to, to cheat someone out of a few bucks, isn't it?
Level 1 Cache
Pronounced cash, a special high-speed storage mechanism. It can be either a reserved
section of main memory or an independent high-speed storage device. Two types of caching
are commonly used in personal computers: memory caching and disk caching.
A memory cache, sometimes called a cache store or RAM cache, is a portion of memory
made of high-speed static RAM (SRAM) instead of the slower and cheaper dynamic RAM
(DRAM) used for main memory. Memory caching is effective because most programs access
the same data or instructions over and over. By keeping as much of this information as
possible in SRAM, the computer avoids accessing the slower DRAM.
Some memory caches are built into the architecture of microprocessors. The Intel® 80486
microprocessor, for example, contains an 8K memory cache, and the Pentium has 16K
caches. Such internal caches are often called Level 1 (L1) caches. Most modern PCs also
come with external cache memory, called Level 2 (L2) caches. These caches sit between the
CPU and the DRAM. Like L1 caches, L2 caches are composed of SRAM but they are much
larger.
Disk caching works under the same principle as memory caching, but instead of using
high-speed SRAM, a disk cache uses conventional main memory. The most recently accessed
data from the disk (as well as adjacent sectors) is stored in a memory buffer. When a
program needs to access data from the disk, it first checks the disk cache to see if the data is
there. Disk caching can dramatically improve the performance of applications, because
accessing a byte of data in RAM can be thousands of times faster than accessing a byte on a
hard disk.
When data is found in the cache, it is called a cache hit, and the effectiveness of caching is
judged by its hit rate. Many cache systems use a technique known as smart caching, in
which the system can recognize certain types of frequently used data. The strategies for
determining which information should be kept in the cache constitute some of the more
interesting problems in computer science.
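One of the simplest such strategies is least-recently-used (LRU) replacement. For a 2-way cache a single bit per set suffices; here is a minimal sketch in C, illustrative only and not tied to any product:

    #include <stdint.h>
    #include <stdio.h>

    #define SETS 32

    /* One LRU bit per set is enough for a 2-way cache: it names
       the way that was NOT touched most recently. */
    static uint8_t lru[SETS];            /* lru[s] == way to evict next */

    static void touch(unsigned set, unsigned way)
    {
        lru[set] = !way;                 /* the other way is now least recent */
    }

    static unsigned victim(unsigned set)
    {
        return lru[set];
    }

    int main(void)
    {
        touch(5, 0);                          /* way 0 of set 5 was just used */
        printf("evict way %u\n", victim(5));  /* prints: evict way 1 */
        return 0;
    }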
Level 3 Cache
Level 3 or L3 cache is specialized memory that works hand-in-hand with the L1 and L2 caches
to improve computer performance. L1, L2 and L3 are central processing unit (CPU) caches, as
opposed to other types of caches in the system such as the hard disk cache. A CPU cache caters
to the needs of the microprocessor by anticipating data requests so that processing instructions
are provided without delay. CPU cache is faster than random access memory (RAM) and is
designed to prevent bottlenecks in performance.
Expanded Memory (EMS)
In modern systems, the memory that is above 1 MB is used as extended memory (XMS).
Extended memory is the most "natural" way to use memory beyond the first megabyte, because it
can be addressed directly and efficiently. This is what is used by all protected-mode operating
systems (including all versions of Microsoft Windows) and programs such as DOS games that use
protected mode. There is, however, an older standard for accessing memory above 1 MB which is
called expanded memory. It uses a protocol called the Expanded Memory Specification or EMS.
EMS was originally created to overcome the 1 MB addressing limitations of the first generation
8088 and 8086 CPUs. In the mid-80s, when these early systems were still in common use, many
users became distressed that they were constantly running out of memory. This happened
particularly when using large spreadsheets in Lotus 1-2-3, then probably the most popular
application on the PC. However, there really was no easy way to make more memory available
due to the 1 MB barrier.
To address this problem, a new standard was created by Lotus, Intel and Microsoft called the LIM
EMS standard, where "LIM" of course is the initials of the companies involved. (I wonder how
they ever agreed to get Microsoft to put its name last. I'll bet they couldn't pull that off today. ;^) )
To use EMS, a special adapter board was added to the PC containing additional memory and
hardware switching circuits. The memory on the board was divided into 16 KB logical memory
blocks, called pages or banks. Both of these terms are used in other contexts as well, so don't
confuse them.
What the circuitry on the board does is to make use of a 64 KB block of real memory in the UMA,
which is called the EMS Page Frame. This frame, or window, is normally located at addresses
D0000-DFFFFh, and is capable of holding four 16 KB EMS pages. When the contents of a
particular part of expanded memory are needed by the PC, they are switched into one of these
areas, where they can be accessed by programs supporting the LIM specification. When a
program is finished with the contents of a page, the page is swapped out and a new one swapped
in. Pages that have been swapped out cannot be seen by the program until they are swapped back in.
This concept is called bank-switched memory, and in a way is not all that different from virtual
memory, except here the swapping isn't being done to disk but rather to other areas of memory.
But here, a lot more swapping is being done than in virtual memory. Because of all the swapping,
EMS is horribly inefficient. If you have 4 MB of EMS memory, you can only access 64 KB of it
at a time, or about 1.5%. This means a great deal of time is spent just shuffling memory around.
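The address arithmetic of the page frame is easy to write down. This sketch follows the description above (a frame at D0000h holding four 16 KB slots); the final line reproduces the only-64-KB-of-4-MB-visible figure:

    #include <stdio.h>

    #define PAGE_FRAME_BASE 0xD0000L   /* typical location, per the text */
    #define EMS_PAGE_BYTES  0x4000L    /* 16 KB logical pages            */

    /* Physical address of page-frame slot 0..3. */
    static long slot_address(int slot) { return PAGE_FRAME_BASE + slot * EMS_PAGE_BYTES; }

    int main(void)
    {
        for (int s = 0; s < 4; s++)
            printf("slot %d: %05lXh-%05lXh\n",
                   s, slot_address(s), slot_address(s) + EMS_PAGE_BYTES - 1);

        /* With 4 MB of EMS, only four 16 KB pages are visible at once. */
        printf("visible fraction: %.2f%%\n",
               100.0 * 4 * EMS_PAGE_BYTES / (4L << 20));
        return 0;
    }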
With the creation of newer processors that support extended memory above 1 MB, expanded
memory is essentially obsolete. Note that EMS and XMS are physically different; an expanded memory
card cannot be used as extended memory, and extended memory cannot be used directly as
expanded memory. These EMS cards have been obsolete for over 10 years now and are rarely if
ever seen except in left-over relics from the mid-80s. However, the software that uses them
persists more stubbornly (of course).
Included in MS-DOS is a driver called EMM386.EXE that you can use in your CONFIG.SYS file
to allocate a portion of extended memory to be used to emulate expanded memory, for programs
that still use it. Running EMM386.EXE (as long as you don't use the "NOEMS" parameter) will
set up a page frame and dedicate a portion of extended memory for use as expanded memory.
EMM386 has a host of options that can be used to change how much memory is set aside, the
location of the page frame, etc.
It is recommended that if at all possible, you avoid using programs that still require EMS. In
addition to being slow and cumbersome, using extended memory for EMS makes it unavailable
for use as extended memory by other applications, and the page frame wastes 64 KB of the upper
memory area that could be used for drivers. This means that indirectly, up to 64 KB of
conventional memory can be lost when using EMS. At the present time, most programs have been
updated to use more modern, faster extended memory, but some persist in using EMS. The biggest
culprits in this area seem to be games whose authors are still using EMS so they can run in DOS
real mode. Even these are few and far between now.
Extended memory
In computing, extended memory refers to memory above the first megabyte of address space in
an IBM PC or compatible with an 80286 or later processor. The term is mainly used under the
DOS and Windows operating systems. DOS programs, running in real mode or virtual x86 mode,
cannot directly access this memory, but are able to do so through an application programming
interface called the eXtended Memory Specification (XMS). This API is implemented by a driver
(such as HIMEM.SYS) or the operating system, which takes care of memory management and
copying memory between conventional and extended memory, by temporarily switching the
processor into protected mode. In this context the term "extended memory" may refer to either the
whole of the extended memory or only the portion available through this API.
Extended memory can also be accessed directly by DOS programs running in protected mode
using VCPI or DPMI, two (different and incompatible) methods of using protected mode under
DOS.
Extended memory should not be confused with expanded memory, an earlier method for
expanding the IBM PC's memory capacity beyond 640 KB using an expansion card with bank-
switched memory modules. Because of the available support for expanded memory in popular
applications, device drivers were developed that emulated expanded memory using extended
memory. Later, two additional methods were developed allowing direct access to a small portion
of extended memory from real mode. These memory areas are referred to as the high memory area
(HMA) and the upper memory area (UMA; also referred to as upper memory blocks or UMBs).
Overview
On x86-based PCs, extended memory is only available with an Intel 80286 processor or higher.
Only these chips can address more than 1 megabyte of RAM. The earlier 8086/8088 processors
can make use of more than 1 MB of RAM, if one employs special hardware to make selectable
parts of it appear at addresses below 1 MB.
On a 286 or better PC equipped with more than 640 KB of RAM, the additional memory would
generally be re-mapped above the 1 MB boundary, since the IBM PC architecture reserves
addresses between 640 KB and 1 MB for system ROM and peripherals.
Extended memory is not accessible in real mode (except for a small portion called the high
memory area). Only applications executing in protected mode can use extended memory directly.
A supervising protected-mode operating system such as Microsoft Windows manages application
programs' access to memory. The processor makes this memory available through the Global
Descriptor Table (GDT) and one or more Local Descriptor Tables (LDTs). The memory is
"protected" in the sense that memory segments assigned a local descriptor cannot be accessed by
another program because that program uses a different LDT, and memory segments assigned a
global descriptor can have their access rights restricted, causing a processor exception (e.g., a
general protection fault or GPF) on violation. This prevents programs running in protected mode
from interfering with each other's memory.
A protected-mode operating system such as Microsoft Windows can also run real-mode programs
and provide expanded memory to them. The DOS Protected Mode Interface (DPMI) is Microsoft's
prescribed method for an MS-DOS program to access extended memory under a multitasking
environment.
eXtended Memory Specification (XMS)
The eXtended Memory Specification or XMS is the specification describing the use of IBM PC
extended memory in real mode for storing data (but not for running executable code in it).
Memory is made available by extended memory manager (XMM) software such as
HIMEM.SYS. The XMM functions are accessible through interrupt 2Fh.
XMS version 2.0 allows for up to 64 MiB of memory; with XMS version 3.0 this increased to 4
GiB. To differentiate between the possibly different amounts of memory that might be available
to applications, depending on which version of the specification they were developed for, the
latter may be referred to as super extended memory or SXMS.
The extended memory manager is also responsible for managing allocations in the high memory
area (HMA) and the upper memory area (UMA; also referred to as upper memory blocks or
UMBs). In practice the upper memory area will be provided by the expanded memory manager
(EMM), after which DOS will typically try to allocate all the upper memory blocks and manage
them itself.
Memory above and beyond the standard 1 MB (megabyte) of main memory that DOS supports.
Extended memory is only available in PCs with an Intel 80286 or later microprocessor.
Two types of memory can be added to a PC to increase memory beyond 1MB: expanded memory and
extended memory. Expanded memory conforms to a published standard called EMS that enables DOS
programs to take advantage of it. Extended memory, on the other hand, is not configured in any special
manner and is therefore unavailable to most DOS programs. However, MS-Windows and OS/2 can use
extended memory.
Conventional Memory
The first 640 KB of system memory is called conventional memory. The name refers to the fact
that this is where DOS, and DOS programs, conventionally run. Originally, this was the only place
that programs could run; today, despite much more memory being added to the PC, this 640 KB
area remains the most important in many cases. The reason is that without special software
support, DOS cannot run programs that are not in this special area. Conventional memory
occupies addresses 00000h to 9FFFFh.
Why is there a 640 KB "barrier"? The reason is the ill-fated decision by IBM to put the reserved
space for system functions above the memory dedicated for user programs, instead of below it, in
combination with two decades of preserving compatibility with the first processors, which couldn't
address over 1 MB of memory. This has caused the conventional memory area to be separated
from the rest of the usable memory in the PC (extended memory). If the reserved area had been
placed below conventional memory, it may have been possible to simply expand the conventional
memory space, but it still may not have happened (who knows).
As programs grew in size and complexity in the 1980s, programs strained to fit into their
increasingly tight 640 KB "clothing". The problem was exacerbated by the fact that the 640 KB
wasn't all available: parts of it are used by DOS routines themselves, along with interrupt vector
tables, buffers for file access, and drivers for CD-ROM drives, disk compression, and network
access. Because of this, many PC users found that after DOS had booted up, they typically had as
little as 550 or even 500 KB free for running programs. Meanwhile, program developers were
straining and squeezing to try to get their programs small enough for people to be able to use
them. The end result was usually a lot of problems. I have seen PCs that, without optimization,
boot up with less than 400 KB free of usable memory. Many programs will refuse to load with
less than 500 KB free.
To counter some of these problems, Microsoft made two enhancements to MS-DOS. The first is
the ability to load DOS itself into the high memory area, the first 64 KB of extended memory. Due
to a quirk in how the PC works, this area can be accessed by DOS. This saves about 45 KB of
conventional memory. The second is that it is possible to load some device drivers (not user
programs generally) into open portions of the upper memory area (the reserved 384 KB between
conventional memory and extended memory), so they aren't taking up space in conventional
memory. This has its own set of issues associated with it, as discussed here.
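The "quirk" behind the high memory area is real-mode address arithmetic: a physical address is formed as segment * 16 + offset, and with a segment of FFFFh this sum overflows just past the 1 MB mark when the A20 address line is enabled. A sketch of the arithmetic, for illustration only:

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode address formation: 16 * segment + offset. */
    static uint32_t phys(uint16_t seg, uint16_t off)
    {
        return ((uint32_t)seg << 4) + off;
    }

    int main(void)
    {
        /* Top of conventional memory: 9FFF:000F -> 9FFFFh (just under 640 KB). */
        printf("%05Xh\n", (unsigned)phys(0x9FFF, 0x000F));

        /* With segment FFFFh, offsets 10h..FFFFh reach 100000h..10FFEFh:
           the high memory area, almost 64 KB addressable from real mode
           when the A20 line is enabled. */
        printf("%05Xh-%05Xh\n",
               (unsigned)phys(0xFFFF, 0x0010), (unsigned)phys(0xFFFF, 0xFFFF));
        return 0;
    }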
Maximizing conventional memory has become one of the "fine arts" of setting up a new PC that
uses DOS programs. Utilizing the high memory area and upper memory area, combined with
special tricks and experience, knowledgeable computer people can get all the drivers necessary to
run the system loaded with still as much as 620 KB free for programs. (You can never get all 640
KB free, as some space is always used for system functions). See here for ideas on optimizing
conventional memory.
Conventional memory is the first 640 kilobytes (640 × 1024 bytes) of the memory on IBM PC or
compatible systems. It is the read-write memory usable by the operating system and application
programs. As memory prices rapidly declined, this design decision became a limitation in the use
of large memory capacities until the introduction of operating systems and processors that made it
irrelevant.
640 KB barrier
The 640 KB barrier is an architectural limitation of IBM and IBM PC compatible PCs. The Intel
8088 CPU, used in the original IBM PC, was able to address 1 MB (2^20 bytes), as the chip
offered 20 address lines. In the design of the PC, blocks of address space were reserved for the ROM
BIOS, additional read-only memory, BIOS extensions for fixed disk drives and video adapters,
video adapter memory, and other memory-mapped input and output devices. As originally
conceived, the memory below 640 KB was for random-access memory, and the 384 KB above
reserved for system use. Much of the 384 KB was unused but reserved in the event that it was
needed by optional devices. A similar reserved memory space exists in other system designs as
well; for the original Apple II computer, the CPU could address a total of 64 KB, with 48 KB
intended for RAM and the upper 16 KB reserved for system use.
At the time of the PC's release in 1981, 640 KB was more memory than the first IBM PC
motherboards could support, which were available with 16 KB to 64 KB or 64 KB to 256 KB
installed. It took a few years until most new PCs had even that much memory installed through
memory expansion cards. The most popular existing microcomputer when the PC appeared, the
Apple II+, had 64 KB in the most common configuration and could not be easily expanded
beyond this.
The design of the original IBM PC placed the Color Graphics Adapter memory map and other
hardware in the 384 KB upper memory area (UMA). The need for more RAM grew faster than the
needs of hardware to utilize the reserved addresses, which resulted in RAM eventually being
mapped into these unused upper areas to utilize all available addressable space. This introduced a
reserved "hole" (or several holes) into the set of addresses occupied by hardware that could be
used for arbitrary data. Avoiding such a hole was difficult and ugly and not supported by MS-
DOS or most programs that MS-DOS could run. Later, space between the holes would be used as
upper memory blocks (UMBs).
To maintain compatibility with older operating systems and applications, the 640 KB barrier
remained part of the PC design even after the 8088 had been replaced with the Intel 286 processor,
which could address up to 16 MB of memory. The 1 MB barrier also remained as long as the 286
was running in compatibility mode, as MS-DOS forced assumptions about how the segment and
offset registers overlapped such that addresses with more than 20 bits were unsupported. It is still
present in IBM PC compatibles today if they are running MS-DOS, and even in the most modern
Windows-based PCs the RAM still has a "hole" in the area between 640 KB and 1024 KB, which
however is invisible to application programs thanks to paging and virtual memory.
A similar 3 GB barrier exists, which reduces 32-bit addressing from 4 GB to 3 GB on
motherboards that use memory mapped I/O. However, due to applications not assuming that the
3–4 GB range is reserved, there is no need to retain this addressing for compatibility, and thus the
barrier is easily removed by using a separate address bus for hardware, and only affects a
relatively small number of computers.
Additional memory
One technique used on early IBM XT computers was to ignore the Color Graphics Adapter (CGA)
memory map and push the limit up to the start of the Monochrome Display Adapter (MDA).
Sometimes software or a custom address decoder was used so that attempts to use the CGA went
to the MDA memory. This moved the barrier to 704 KB[1].
Memory managers on 386-based systems (such as QEMM or MemoryMax in DR-DOS) could
achieve the same effect, adding conventional memory at 640 KB and moving the barrier to 704
KB or 736 KB. Only CGA could be used in this situation, because Enhanced Graphics Adapter
(EGA) used this memory for the bit-mapped graphics buffer.
The AllCard, an add-on memory management unit for XT-class computers, allowed normal
memory to be mapped into the A0000-EFFFF (hex) address range, giving up to 952 KB for DOS
programs. Programs such as Lotus 1-2-3, which accessed video memory directly, needed to be
patched to handle this memory layout. Therefore, the 640 KB barrier was removed at the cost of
hardware compatibility.
It was also possible to use DOS's utility for console redirection, CTTY, to direct output to a dumb
terminal or another computer running a terminal emulator. The video card could then be removed
completely, and assuming the BIOS still permitted the machine to boot, the system could achieve
a total memory of 960 KB of RAM. This also required that the system have at least 2 MB of
physical memory in the machine. This procedure was tested on a 486 with IBM PC DOS 7.0.
The total operating system footprint was around 20 KB, with most of DOS residing in the high
memory area (HMA).
Memory optimization
As DOS applications grew larger and more complex in the late 1980s, it became common practice
to free up conventional memory by moving device drivers and Terminate and Stay Resident (TSR)
programs into upper memory blocks (UMBs) in the upper memory area (UMA) at boot, in order
to maximize the conventional memory available for applications. This had the advantage of not
requiring hardware changes, and preserved application compatibility. This feature began with DR-
DOS 5 and was later implemented in MS-DOS 5. The ability of MS-DOS versions 5.0 and later to
move their own system core code into the high memory area (HMA) through the DOS=HIGH
command gave another boost to free memory.
Most users used the accompanying EMM386 driver provided in DOS 5, but third-party memory
managers such as Quarterdeck's QEMM also proved popular. Optimization sometimes required
manual tuning to produce the best results, in the sense of the greatest amount of free
conventional memory.
In MS-DOS 6, Microsoft introduced MemMaker, which automated this process of block
matching, matching the functionality third-party memory managers offered, though it still did
not always provide the same result as doing the job by hand.
DOS extenders
The barrier was only overcome with the arrival of DOS extenders, which allowed DOS
applications to run in extended memory, but these were not very widely used outside the computer
game area. The first PC operating systems to integrate such technology were Compaq DOS 3.31
(via CEMM) and Windows/386 2.1, both released in 1988. Since the 80286 version of Windows
2.0 (Windows/286), Windows applications did not suffer from the 640 KB barrier.
Prior to DOS extenders, if a user installed additional memory and wished to use it under DOS,
they would first have to install and configure drivers to support either expanded memory
specification (EMS) or extended memory specification (XMS).
EMS was a specification available on all PCs, including those based on the Intel 8086 and Intel
8088, which allowed add-on hardware to page small chunks of memory in and out of the "real
mode" addressing space (0x0400–0xFFFF). This required that a hole in real memory be available,
typically (0xE000–0xEFFF). A program would then have to explicitly request the page to be
accessed before using it. These memory locations could then be used arbitrarily until replaced by
another page. This is very similar to modern virtual memory. However, in a virtual memory
system, the operating system handles all paging operations: the programmer, for the most part,
does not have to consider this.
XMS provided a basic protocol which allowed the client program to load a custom protected mode
kernel. This was available on the Intel 80286 and newer processors. The problem with this
approach is that while in protected mode, DOS calls could not be made. The workaround was to
implement a callback mechanism. On the 286, this was a major problem. The Intel 80386, which
introduced "Virtual86 mode", allowed the guest kernel to emulate the 8086 and run the host
operating system without having to actually force the processor back into "real mode".
The most recent DOS extender interface is the DOS Protected Mode Interface (DPMI), a more
advanced version of XMS which provided many of the services of a modern kernel, obviating
the need to write a
custom kernel. It also permitted multiple protected mode clients. This is the standard target
environment for the DOS port of the GCC compilers.
There are a number of other common DOS extenders, the most notable of which is the runtime
environment for the Watcom compilers, DOS/4GW, which was very common in games for DOS.
Such a game would consist of either a DOS/4GW 32-bit kernel, or a stub which loaded a
DOS/4GW kernel located in the path or in the same directory and a 32-bit "linear executable".
Utilities are available which can strip DOS/4GW out of such a program and allow the user to
experiment with any of the several, and perhaps improved, DOS/4GW clones.