1.1. Computers & Processors:: Chapter-1

CHAPTER-1 INTRODUCTION
1.1. COMPUTERS & PROCESSORS: Computers are machines that perform tasks or calculations according to a set of instructions, or programs. The first fully electronic computer ENIAC (Electronic Numerical Integrator and Computer) introduced in the 1946s was a huge machine that required teams of people to operate. Compared to those early machines, today's computers are amazing. Not only are they thousands of times faster, they can fit on our desk, on our lap, or even in our pocket.
Computers work through an interaction of hardware and software. Hardware refers to the parts of a computer that we can see and touch, including the case and everything inside it. The most important piece of hardware is a tiny rectangular chip inside our computer called the central processing unit (CPU), or microprocessor. It's the "brain" of your computer the part that translates instructions and performs calculations. Hardware items such as your monitor, keyboard, mouse, printer, and other components are often called hardware devices, or devices. Software refers to the instructions, or programs, that tell the hardware what to do. A word-processing program that you can use to write letters on your computer is a type of software. The operating system (OS) is software that manages your computer and the devices connected to it. Windows is a well-known operating system.
Processors: Processors are said to be the brain of a computer system which tells the entire system what to do and what not to. It is made up of large number of transistors typically integrated onto a single die. In computing, a processor is the unit that reads and executes program instructions, which are fixed-length (typically 32 or 64 bit) or variablelength chunks of data.
The data in the instruction tells the processor what to do. The instructions are very basic things like reading data from memory or sending data to the user display, but they are processed so rapidly that we experience the results as the smooth operation of a program. Processors were originally developed with only one core.
Multi-Core Processor Technology
Page 1
A core is a part of processor that actually performs 1. Fetching 2. Decoding 3. Executing an instruction as shown in Fig 1.
Fig. 1.1(Single Core Computer) A single core processor can process only one instruction at a time. To improve the efficiency, processor commonly utilizes pipelines internally, which allow several instructions to be processed together. However they are still consumed into the pipeline at a time
. Fig.1.2 (Single Core) 1.2. A BRIEF HISTORY OF MICROPROCESSORS: Intel manufactured the first microprocessor, the 4-bit 4004, in the early 1970s which was basically just a number-crunching machine. Shortly afterwards they developed the 8008 and 8080, both 8-bit, and Motorola followed suit with their 6800 which was equivalent
Multi-Core Processor Technology Page 2
to Intels 8080. The companies then fabricated 16-bit microprocessors, Motorola had their 68000 and Intel the 8086 and 8088; the former would be the basis for Intels 80386 32-bit and later their popular Pentium line-up which were in the first consumer-based PCs.
Fig 1.3. Worlds first Single Core CPU Moores Law: One of the guiding principles of computer architecture is known as Moores Law. In 1965 Gordon Moore stated that the number of Transistors on a chip wills roughly double each year (he later refined this, in 1975, to every two years). What is often quoted as Moores Law is Dave Houses revision that computer performances will double every 18 months. The graph in Figure 1.3 plots many of the early microprocessors briefly discussed in here:
Fig 1.4. Depiction of Moores Law
Page 3
As shown in Figure 1.4, the number of transistors has roughly doubled every 2 years. Moores law continues to reign; for example, Intel is set to produce the worlds first 2 billion transistor microprocessor Tukwila later in 2008. Houses prediction, however, needs another correction. Throughout the 1990s and the earlier part of this decade microprocessor frequency was synonymous with performance; higher frequency meant a faster, more capable computer. Since processor frequency has reached a plateau, we must now consider other aspects of the overall performance of a system: power consumption, temperature dissipation, frequency, and number of cores. Multicore processors are often run at slower frequencies, but have much better performance than a single-core processor because two heads are better than one.
Fig 1.5 Evolutions of Microprocessors
Page 4
CHAPTER-2 SINGLE CORE PROCESSORS: A STEP BEHIND

A single core processor is a processor which contains only one core. This kind of processor was the trend of early computing system. At a high level, the single core processor architecture consists of several parts: the processor core, two levels of cache, a memory controller (MCT), three coherent Hyper Transport (cHT) links, and a non-blocking crossbar switch that connects the parts together. A single-core Opteron processor design is illustrated in Figure 2.1. The cHT links may be connected to another processor or to peripheral devices. The NUMA design is apparent from the diagram, as each processor in a system has its own local memory, memory to which it is closer than any other processor. Memory commands may come from the local core or from another processor or a device over a cHT link. In the latter case the
command comes from the cHT link to the crossbar and from there to the MCT.
Fig 2.1 Single core processors block diagram The local processor core does not see or have to process outside memory commands, although some commands may cause data in cache to be invalidated or flushed from cache.
2.1 PAST EFFORTS TO INCREASE EFFIENCY: As touched upon above, from the introduction of Intels 8086 through the Pentium 4 an increase in performance, from one generation to the next, was seen as an increase in processor frequency. For example, the Pentium 4 ranged in speed (frequency) from 1.3 to 3.8
GHz over its 8 year lifetime. The physical size of chips decreased while the number of transistors per chip increased; clock speeds in-creased which boosted the heat dissipation across the chip to a dangerous level. To gain performance within a single core many techniques are used. Superscalar processors with the ability to issue multiple instructions concurrently are the standard. In these pipelines, instructions are pre-fetched, split into subcomponents and executed out-of-order. A major focus of computer architects is the branch instruction. Branch instructions are the equivalent of a fork in the road; the processor has to gather all necessary information before making a decision. In order to speed up this process, the processor predicts which path will be taken; if the wrong path is chosen the processor must throw out any data computed while taking the wrong path and backtrack to take the correct path. Often even when an incorrect branch is taken the effect is equivalent to having waited to take the correct path. Branches are also removed using loop unrolling and sophisticated neural network-based predict-tors are used to minimize the miss prediction rate. Other techniques used for performance enhancement include register renaming, trace caches, reorder buffers, dynamic/software scheduling, and data value prediction. There have also been advances in power- and temperature-aware architectures. There are two flavours of power-sensitive architectures: low-power and power-aware designs. Low-power architectures minimize power consumption while satisfying performance constraints, e.g. embedded systems where low-power and real-time performance are vital. Power-aware architectures maximize performance parameters while satisfying power constraints. Temperature aware design uses simulation to determine where hot spots lie on the chips and revises the architecture to decrease the number and effect of hot spots.
FIG 2.2 single core processor
Page 6
CHAPTER-3 NEED OF MULTI-CORE PROCESSORS

It is well-recognized that computer processors have increased in speed and decreased in cost at a tremendous rate for a very long time. This observation was first made popular by Gordon Moore in 1965, and is commonly referred to as Moores Law. Specifically, Moores Law states that the advancement of electronic manufacturing technology makes it possible to double the number of transistors per unit area about every 12 to 18 months. It is this advancement that has fueled the phenomenal growth in computer speed and accessibility over more than four decades. Smaller transistors have made it possible to increase the number of transistors that can be applied to processor functions and reduce the distance signals must travel, allowing processor clock frequencies to soar. This simultaneously increases system performance and reduces system cost.
All of this is well-understood. But lately Moores Law has begun to show signs of failing. It is not actually Moores Law that is showing weakness, but the performance increases people expect and which occur as a side effect of Moores Law. One often associates performance with high processor clock frequencies. In the past, reducing the size of transistors has meant reducing the distances between the transistors and decreasing transistor switching times. Together, these two effects have contributed significantly to faster processor clock frequencies. Another reason processor clocks could increase is the number of 2 transistors available to implement processor functions. Most processor functions, for example, integer addition, can be implemented in multiple ways. One method uses very few transistors, but the path from start to finish is very long. Another method shortens the longest path, but it uses many more transistors. Clock frequencies are limited by the time it takes a clock signal to cross the longest path within any stage. Longer paths require slower clocks.
Having more transistors to work with allows more sophisticated implementations that can be clocked more rapidly. But there is a down side. As processor frequencies climb, the amount of waste heat produced by the processor climbs with it. The ability to cool the processor inexpensively within the last few years has become a major factor limiting how fast a processor can go. This is offset, somewhat, by reducing the transistor size because smaller transistors can operate on lower voltages, which allows the chip to produce less heat. Unfortunately, transistors are now so small that the quantum behavior of electrons can
affect their operation. According to quantum mechanics, very small particles such as electrons are able to spontaneously tunnel, at random, over short distances. The transistor base and emitter are now close enough together that a measurable number of electrons can tunnel from one to the other, causing a small amount of leakage current to pass between them, which causes a small short in the transistor.
As transistors decrease in size, the leakage current increases. If the operating voltages are too low, the difference between a logic one and a logic zero becomes too close to the voltage due to quantum tunnelling, and the processor will not operate. In the end, this complicated set of problems allows the number of transistors per unit area to increase, but the operating frequency must go down in order to be able to keep the processor cool.
This issue of cooling the processor places processor designers in a dilemma. The approach toward making higher performance has changed. The market has high expectations that each new generation of processor will be faster than the previous generation; if not, why buy it? But quantum mechanics and thermal constraints may actually make successive generations slower. On the other hand, later generations will also have more transistors to work with and they will require less power.
Speeding up processor frequency had run its course in the earlier part of this decade; computer architects needed a new approach to improve performance. Adding an additional processing core to the same chip would, in theory, result in twice the performance and dissipate less heat; though in practice the actual speed of each core is slower than the fastest single core processor. In September 2005 the IEE Review noted that power consumption increases by 60% with every 400MHz rise in clock speed.
So, what is a designer to do? Manufacturing technology has now reached the point where there are enough transistors to place two processor cores - a dual core processor on a single chip. The trade-off that must now be made is that each processor core is slower than a single-core processor, but there are two cores, and together they may be able to provide greater throughput even though the individual cores are slower. Each following generation will likely increase the number of cores and decrease the clock frequency.
The slower clock speed has significant implications for processor performance, especially in the case of the AMD Opteron processor. The fastest dual-core Opteron
processor will have higher throughput than the fastest single-core Opteron, at least for workloads that are processor-core limited, but each task may be completed more slowly. The application does not spend much time waiting for data to come from memory or from disk, but finds most of its data in registers or cache. Since each core has its own cache, adding the second core doubles the available cache, making it easier for the working set to fit.
For dual-core to be effective, the work load must also have parallelism that can use both cores. When an application is not multi-threaded, or it is limited by memory performance or by external devices such as disk drives, dual-core may not offer much benefit, or it may even deliver less performance. Opteron processors use a memory controller that is integrated into the same chip and is clocked at the same frequency as the processor. Since dual-core processors use a slower clock, memory latency will be slower for dual-core Opteron processors than for single-core, because commands take longer to pass through the memory controller.
Applications that perform a lot of random access read and write operations to memory, applications that are latency-bound, may see lower performance using dual-core. On the other hand, memory bandwidth increases in some cases. Two cores can provide more sequential requests to the memory controller than can a single core, which allows the controller to intern eave commands to memory more efficiently.
Another factor that affects system performance is the operating system. The memory architecture is more complex, and an operating system not only has to be aware that the system is NUMA (that is, it has NonUniform Memory Access), but it must also be prepared to deal with the more complex memory arrangement. It must be dual-core-aware. The performance implications of operating systems that are dual-core-aware will not be explored here, but we state without further justification that operating systems without such awareness show considerable variability when used with dual-core processors. Operating systems that are dual-core-aware show better performance, though there is still room for improvement
CHAPTER-4 WHAT IS MULTICORE PROCESSOR?

4.1 TERMINOLOGY: The terms multi-core and dual-core most commonly refer to some sort of central processing unit (CPU), but are sometimes also applied to digital signal processors (DSP) and system-ona-chip (SoC). Additionally, some use these terms to refer only to multi-core microprocessors that are manufactured on the same integrated circuit die. These people generally refer to separate microprocessor dies in the same package by another name, such as multi-chip module. This article uses both the terms "multi-core and "dual-core" to reference microelectronic CPUs manufactured on the same integrated circuit, unless otherwise noted. In contrast to multi-core systems, the term multi-CPU refers to multiple physically separate processing-units (which often contain special circuitry to facilitate communication between each other). The terms many-core and massively multi-core sometimes occur to describe multi-core architectures with an especially high number of cores (tens or hundreds). Some systems use many soft microprocessor cores placed on a single FPGA. Each "core" can be considered a "semiconductor intellectual property core" as well as a CPU core. 4.2 MULTI-CORE BASICS: The following isnt specific to any one multicore design, but rather is a basic overview of multi-core architecture. Although manufacturer designs differ from one another, multicore architectures need to adhere to certain aspects. The basic configuration of a microprocessor is seen in Figure 4. Closest to the processor is Level 1 (L1) cache; this is very fast memory used to store data frequently used by the processor. Level 2 (L2) cache is just off-chip, slower than L1 cache, but still much faster than main memory; L2 cache is larger than L1 cache and used for the same purpose. Main memory is very large and slower than cache and is used, for example, to store a file currently being edited in Microsoft Word. Most systems have between 1GB to 4GB of main memory compared to approximately 32KB of L1 and 2MB of L2 cache. Finally, when data isnt located in cache or main memory the system must retrieve it From the hard disk, which takes exponentially more time than reading from the memory system. If we set two cores side-by-side, one can see that a method of communication between the cores, and to main memory, is necessary. This is usually accomplished
either using a single communication bus or an interconnection network. The bus approach is used with a shared memory model, whereas the interconnection network approach is used with a distributed memory model.
Fig 4.1. Generic modern Processor Configuration After approximately 32 cores the bus is overloaded with the amount of processing, communication, and competition, which leads to diminished performance; therefore, a communication bus has a limited scalability Thus in order to continue delivering regular performance improvements for general-purpose processors, manufacturers such as Intel and AMD have turned to multi-core designs, sacrificing lower manufacturing-costs for higher performance in some applications and systems. Multi-core architectures are being developed, but so are the alternatives. An especially strong contender for established markets is the further integration of peripheral functions into the chip.
Fig 4.2. Multi-core processor design The above two figures shows the actual implementation of multi-core processor with shared memory and distributed memory.
SUPER SCALAR PIPELINING A superscalar CPU architecture implements a form of parallelism called Instruction-level parallelism within a single processor. It thereby allows faster CPU throughput than would otherwise be possible at the same clock rate. A superscalar architecture executes more than one instruction during a single pipeline stage by pre-fetching multiple instructions and simultaneously dispatching them to redundant functional units on the processor.
Fig 4.3 Simple superscalar pipeline Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed. Processor board of a CRAY T3e parallel computer with four superscalar Alpha processors. THREAD LEVEL PARALLELISM Thread level parallelism is a main applicable when we run multiple threads at once. This parallelism is mainly found in applications written for commercial severs like databases. Due to multi-threading the applications can tolerate high amounts of I/O and memory systems latency. While one thread is accessing the disk the other threads will do useful work. The advent of multicore processors is mainly due to the inroads made in thread level parallelism.
Page 12
CHAPTER-5 MULTI-CORE IMPLEMENTATION

5.1 ARCHITECTURE AND FLOOR PLANS OF MULTI CORES AThe processor or cores are implemented on a single die or chip .Each core has its own complete set of resources, and may share the on-die cache layers.
Fig 5.1(architecture of multicore processor) CORE: The individual processors that are implemented on the integrated die or chip are called the cores. The core is the part of the processor that actually performs the reading and executing of instructions. REGISTER FILE: A register file is an array of processor registers in a central processing unit (CPU). Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multiport SRAMs will usually read and write through the same ports. BUS: Back side bus connects the processor with cache memory. CACHE: Closest to the processor is Level 1 (L1) cache; this is very fast memory used to store data frequently used by the processor. Level2 (L2) cache is just off-chip, slower than L1 cache, but still much faster than main memory; L2 cache is larger than L1 cache and used for the same purpose. The cores not necessarily share the cache.
Page 13
CROSS BAR: Cross bar switch is a switch connecting multiple inputs to multiple outputs in a matrix manner. Here the cross bar switch is used to connect the system request queue (SRQ) and the integrated memory controller. Here it directly connects both CPU cores to the HyperTransport link, as well as the integrated memory controller, for I/O to and from the outside world. Think of it like a train-track switch - signals can pass to/from either core and the outside world, but not at the same time. HYPER TRANSPORT LINK: Hyper-transport technology is a technology for interconnection of computer processors.
It is a bidirectional serial/parallel high-bandwidth, low-latency point-to-point link. This is replacement for front side bus.
Fig 5.2[multicore processor implemented with 4 independent processor]
INTEGRATED MEMORY CONTROLLER: The memory controller is a digital circuit which manages the flow of data going to and from the main memory. SYSTEM REQUEST QUEUE: The System Request Queue provides an interface for the CPU cores to the crossbar, and it is what keeps things operating smoothly. The System Request Queue manages and prioritizes both CPU cores' access to the crossbar switch, minimizing contention for the system bus. The result is a very efficient use of system resources.
Page 14
CORE COMPONENTS PIPELINE: One widely accepted technique for improving the performance of serial software tasks is pipelining. Simply put, pipelining is the process of dividing a serial task into concrete stages that can be executed in assembly-line fashion. In order to gain the most performance increase possible from pipelining, individual stages must be carefully balanced so that no single stage takes a much longer time to complete than other stages. Deeper pipeline buys frequency at expense of increased cache miss penalty and lower instructions per clock. Shallow pipeline gives better instructions per clock at the expense of frequency scaling. Max. frequency per core requires deeper pipelines CACHE: With the rising gap between processor and memory speed, maximizing on-chip cache capacity is crucial to attaining good performance. Memory system designers employ hierarchies of caches to manage latency. Many of todays multicore processors assume
private L1 caches and a shared L2 cache. At some point, however, a single shared L2 cache will require additional levels in the hierarchy. One option designers can consider is implementing a physical hierarchy that consists of multiple clusters, where each cluster consists of a group of processor cores that share an L2 cache. The effectiveness of such a physical hierarchy, however, may depend on how well the applications map to the hierarchy. Cache size buys performance at expense of die size. Deep pipeline cache miss penalties are reduced by larger caches.
Floorplans for 4,8 and 16 core processors [assuming private caches]
NCU NCU
NCU NCU
NCU NCU
NCU NCU MC NCU NCU IOX
NCU NCU
SBF
NCU NCU
MC
MC
MC
IOX SBF IOX
MC
MC
MC
MC
IOX
SBF
IOX
MC
MC NCU NCU MC IOX NCU NCU NCU NCU
SBF
NCU NCU
MC
NCU NCU
NCUNCU
NCUNCU
Core
L2 Data
L2 Tag
P2P Link
Note that there are two SBFs for 16 core processor
Page 15
As with any technology, multicore architectures from different manufacturers vary greatly. Along with differences in communication and memory configuration another variance comes in the form of how many cores the microprocessor has. And in some multicore architecture different cores have different Devices Athlon TILE64. 5.2 INTEL & AMD DUAL-CORE PROCESSOR: Intel and AMD are the mainstream manufacturers of microprocessors. Intel produces many different flavours of multicore processors: the Pentium D is used in desktops, Core 2 Duo is used in both laptop and desktop environments, and the Xeon processor is used in servers. AMD has the Althon line-up for desktops, Turion for laptops, and Opteron for servers/workstations. Although the Core 2 Duo and Athlon 64 X2 run on the same platforms their architectures differ greatly. functions, hence they are heterogeneous.
Differences in architectures are discussed below for Intels Core 2 Duo, Advanced Micro 64 X2, Sony-Toshiba- IBMs CELL Processor, and finally Tileras
Fig 5.3 (a) Intel Core 2 Duo (b) AMD Athlon 64 X2 Figure 5 shows block diagrams for the Core 2 Duo and Athlon 64 X2, respectively. Both
the Intel and AMD popular in the market of Microprocessors. Both architectures are homogenous dual-core processors. The Core 2 Duo adheres to a shared memory model with private L1 caches and a shared L2 cache which provides a peak transfer rate of 96 GB/sec. If a L1 cache miss occurs both the L2 cache and the second cores L1 cache are traversed in parallel before sending a request to main memory. In contrast, the Athlon follows a distributed memory model with discrete L2 caches. These L2 caches share a system request interface, eliminating the need for a bus. The system request interface also connects the cores with an on-chip memory controller and an interconnect called Hyper Transport.
Hyper Transport effectively reduces the number of buses required in a system, reducing bottlenecks and increasing bandwidth. The Core 2 Duo instead uses a bus interface. The Core 2 Duo also has explicit thermal and power control units on-chip. There is no definitive performance advantage of a bus vs. an interconnect, and the Core 2 Duo and Athlon 64 X2 achieve similar communication protocol. 5.3 THE CELL PROCESSOR: A Sony-Toshiba-IBM partnership (STI) built the CELL processor for use in Sonys PlayStation 3,therefore, CELL is highly customized for gaming/graphics performance measures, each using a different
rendering which means superior processing power for gaming applications.
Fig 5.4. CELL processor The CELL Is a heterogeneous multicore processor consisting of nine cores, one Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs), as can be seen in Figure 5.With CELLs real-time broadband architecture, 128 concurrent
transactions to memory per processor are possible. The PPE is an extension of the 64-bit PowerPC architecture and manages the operating system and control functions. Each SPE has simplified instruction sets which use 128-bit SIMD instructions and have 256KB of local storage. Direct Memory Access is used to transfer data between local storage and main memory which allows for the high number of concurrent memory transactions. The PPE and SPEs are connected via the Element Interconnect Bus providing internal
communication. Other interesting features of the CELL are the Power Management
Unit and Thermal Management Unit. Power and temperature are fundamental concerns in
Page 17
microprocessor design. The PMU allows for power reduction in the form of slowing, pausing, or completely stopping a unit. The TMU consists of one linear sensor and ten digital thermal sensors used to monitor temperature throughout the chip and provide an early warning if temperatures are rising in a certain area of the chip. The ability to measure and account for power and temperature changes has a great advantage in that the processor should never overheat or draw too much power. TILERA TILE64: Tilera has developed a multicore chip with 64 homogeneous cores set up in a grid, shown in Figure 5.
Fig 5.5..Tilera An application that take advantage of additional cores will written to these run
far faster than if it were run on a single core. Imagine having a project to finish, but instead of having to work on it alone you have 64 people to work for you. Each processor has its own L1 and L2 cache for a total of 5MB on-chip and a switch that connects the core into the mesh network rather than a bus or interconnects. The TILE64 also includes on-chip memory and I/O controllers. Like the CELL processor, unused tiles (cores) can be put into a sleep mode to further decrease power consumption. The TILE64 uses a 3way VLIW (very long
instruction word) pipeline to deliver 12 times the instructions as a single-issue, singlecore processor. When VLIW is combined with the MIMD (multiple instructions, multiple data) processors, multiple operating systems can be run simultaneously and advanced Multimedia applications such as conferencing and video-on-demand can run efficiently.
5.4 SCALABILITY POTENTIAL OF MULTICORE PROCESSORS: Processors plug into the system board through a socket. Current technology allows for one processor socket to provide access to one logical core. But this approach is expected to change, enabling one processor socket to provide access to two, four, or more processor cores. Future processors will be designed to allow multiple processor cores to be contained inside a single processor module. For example, a tightly coupled set of dual processor cores could be designed to compute independently of each otherallowing applications to interact with the processor cores as two separate processors even though they share a single socket. This design would allow the OS to thread the application across the multiple processor cores and could help improve processing efficiency. A multicore structure would also include cache modules. These modules could either be shared or independent. Actual implementations of multicore processors would vary depending on manufacturer and product development over time. Variations may include shared or independent cache modules, bus implementations, and additional threading capabilities such as Intel HyperThreading (HT) Technology. A multicore arrangement that provides two or more low-clock speed cores could be designed to provide excellent performance while minimizing power consumption and delivering lower heat output than configurations that rely on a single high-clock-speed core. The following example shows how multicore technology could manifest in a standard server configuration and how multiple low-clock-speed cores could deliver greater performance than a single highclock-speed core for networked applications. This example uses some simple math and basic assumptions about the scaling of multiple processors and is included for demonstration purposes only. Until multicore processors are available, scaling and performance can only be estimated based on technical models. The example described in this article shows one possible method of addressing relative performance levels as the industry begins to move from platforms based on single-core processors to platforms based on multicore processors. Other methods are possible, and actual processor performance and processor scalability are tied to a variety of platform variables, including the specific configuration and application environment. Several factors can potentially affect the internal scalability of multiple cores, such as the system compiler as well as architectural considerations including memory, I/O, front side bus (FSB), chip set, and so on. For instance, enterprises can buy a dual-processor server today to run Microsoft Exchange and provide e-mail, calendaring, and messaging functions.
Dual-processor servers are designed to deliver excellent price/performance for messaging applications. A typical configuration might use dual 3.6 GHz 64-bit Intel Xeon processors supporting HT Technology. In the future, organizations might deploy the same application on a similar server that instead uses a pair of dual-core processors at a clock speed lower than 3.6 GHz. The four cores in this example configuration might each run at 2.8 GHz. The following simple example can help explain the relative performance of a low-clockspeed, dual-core processor versus a high-clock-speed, dual-processor counterpart. Dualprocessor systems available today offer a scalability of roughly 80 present for the second processor, depending on the OS, application, compiler, and other factors. That means the first processor may deliver 100 present of its processing power, but the second processor typically suffers some overhead from multiprocessing activities. As a result, the two processors do not scale linearlythat is, a dual system does not achieve a 200 present performance increase over a single-processor system but instead
provides approximately 180 present of the performance that a single-processor system provides.
Fig 5.6. Sample core speed and anticipated total relative power in a system using two single-core processors In this article, the single-core scalability factor is referred to as external, or socket-tosocket, scalability. When comparing two single-core processors in two individual sockets, the dual 3.6 GHz processors would result in an effective performance level of approximately 6.48 GHz (see Figure 5). For multicore processors, administrators must take into account not only socket-tosocket scalability but also internal, or core-to-core, scalability the scalability between multiple cores that reside within the same processor module. In this example, core-to-core
scalability is estimated at 70 present, meaning that the second core delivers 70 present of its processing power. Thus, in the example system using 2.8 GHz dual-core processors, each dualcore processor would behave more like a 4.76 GHz processor when the performance of the two cores2.8 GHz plus 1.96 GHzis combined. For demonstration purposes, this example assumes that, in a server that combines two such dual-core processors within the same system architecture, the socket to-socket scalability of the two dual core processors would be similar to that in a server containing
two single present scalability. This would lead to an effective performance level of 8.57 GHz (see Figure 5.).
Fig 5.. Sample core speed and anticipated total relative power in a system Using two dual-core processors To continue the example comparison by postulating that socket to socket scalability would be the same for these two architectures, a multicore architecture could enable greater performance than a single-core processor architecture, even if the processor cores in the multicore architecture are running at a lower clock speed than the processor cores in the single-core architecture. In this way, a multicore architecture has the potential to deliver higher performance than single-core architecture for enterprise applications. On-going progress in processor designs has enabled servers to continue delivering increased performance, which in turn helps fuel the powerful applications that support rapid business growth. However, increased performance incurs a corresponding increase in processor power consumptionand heat is a consequence of power use. As a result, administrators must determine not only how to supply large amounts of power to systems, but also how to
contend with the large amounts of heat that these systems generate in the data centre.
CHAPTER-6 MULTI-CORE CHALLENGES

Having multiple cores on a single chip gives rise to some problems and challenges. Power and temperature management are two concerns that can increase exponentially with the addition of multiple cores. Memory/cache coherence is another challenge, since all designs discussed above have distributed L1 and in some cases L2 caches which must be coordinated. And finally, using a multicore processor to its full potential is another issue. If programmers dont write applications that take advantage of multiple cores there is no gain, and in some cases there is a loss of performance. Application need to be written so that different parts can be run concurrently (without any ties to another part of the application that is being run simultaneously). 6.1 POWER AND TEMPERATURE: If two cores were placed on a single chip without any modification, the chip would, in theory, consume twice as much power and generate a large amount of heat. In the extreme case, if a processor overheats your computer may even combust. To account for this each design above runs the multiple cores at a lower frequency to reduce power consumption. To combat unnecessary power consumption many designs also incorporate a power control unit that has the authority to shut down unused cores or limit the amount of power. By powering off unused cores and using clock gating the amount of leakage in the chip is reduced. To lessen the heat generated by multiple cores on a single chip, the chip is architected so that the number of hot spots doesnt1 g row too large and the heat is spread out across the chip. As seen in Figure 5.5, the majority of the heat in the CELL processor is dissipated in the Power Processing Element and the rest is spread across the Synergistic Processing Elements. The CELL processor follows a common trend to build temperature monitoring into the system, with its one linear sensor and ten internal digital sensors 6.2 CACHE COHERENCE: Cache coherence is a concern in a multicore environment because of distributed L1 and L2 cache. Since each core has its own cache, the copy of the data in that cache may not always be the most up-to-date version. For example, imagine a dual-core processor where each core brought a block of memory into its private cache.
One core writes a value to a specific location; when the second core attempts to read that value from its cache it wont have the updated copy unless its cache entry is invalidated and a cache miss occurs. This cache miss forces the second cores cache entry to be updated. If this coherence policy wasnt in place garbage data would be read and invalid results would be produced, possibly crashing the program or the entire computer. In general there are two schemes for cache coherence, a snooping protocol and a directorybased protocol. The snooping protocol only works with a bus based system, and uses a number of states to determine whether or not it needs to update cache entries and if it has control over writing to the block. The directory-based protocol can be used on an arbitrary network and is, there-fore, scalable to many processors or cores, in contrast to snooping which isnt scalable. In this scheme a directory is used that holds information about which memory locations are being shared in multiple caches and which are used exclusively by one cores cache. The directory knows when a block needs to be updated or invalidated. Intels Core 2 Duo tries to speed up cache coherence by being able to query the second cores L1 cache and the shared L2 cache simultaneously. Having a shared L2 cache also has an added benefit since a coherence protocol doesnt need to be set for this level. AMDs Athlon 64 X2, however, has to monitor cache coherence in both L1 and L2 caches. This is sped up using the Hyper Transport connection, but still has more overhead than Intels model. 6.3 MULTITHREADING: The last, and most important, issue is using multithreading or other parallel processing techniques to get the most performance out of the multicore processor. With the possible exception of Java, there are no widely used commercial development languages with [multithreaded] ex-tensions. Also to get the full functionality we have to have program that support the feature of TLP. Rebuilding applications to be multithreaded means a complete rework by programmers in most cases. Programmers have to write applications with subroutines able to be run in different cores, meaning that data dependencies will have to be resolved or accounted for (e.g. latency in communication or using a shared cache). Applications should be balanced. If one core is being used much more than another, the programmer is not taking full advantage of the multi-core system. Some companies have heard the call and designed new products with multicore capabilities; Microsoft and Apples newest operating systems can run on up to 4 cores, for example
CHAPTER-7 OPEN ISSUES

There are some issues related to the multi-core CPUs 7.1 IMPROVED MEMORY SYSTEM With numerous cores on a single chip there is an enormous need for increased memory. 32-bit processors, such as the Pentium 4, can address up to 4GB of main memory. With cores now using 64-bit addresses the amount of addressable memory is almost infinite. An improved memory system is a necessity; more main memory and larger caches are needed for multithreaded multiprocessors. 7.2 SYSTEM BUS AND INTERCONNECTION NETWORKS: Extra memory will be useless if the amount of time required for memory requests doesnt improve as well. Redesigning the interconnection network between cores is a major focus of chip manufacturers. A faster network means communication and a lower latency in inter-core
memory transactions. Intel is developing their Quick path
interconnect, which is a 20bit wide bus running between 4.8 and 6.4 GHz; AMD s new Hyper Transport 3.0 is a 32-bit wide bus and runs at 5.2 GHz. A different kind of interconnect is seen in the TILE64s iMesh, which consists of five networks used to fulfill I/O and off-chip memory communication. Using five mesh networks gives the Tile architecture a per tile (or core) bandwidth of up to 1.28 Tbps (terabits per second). 7.3 PARALLEL PROGRAMMING In May 2007, Intel fellow Shekhar Borkar stated that The software has to also start following Moores Law, software has to double the amount of parallelism that it can support every two years. Since the number of cores in a processor is set to double every 18 months, it only makes sense that the software running on these cores takes this into account. Ultimately, programmers need to learn how to write parallel programs that can be split up and run concurrently on multiple cores instead of trying to exploit single-core hardware to increase parallelism of sequential programs. Developing software for multicore processors brings up some latent concerns. How does a programmer ensure that a high-priority task gets priority across the processor, not just a core? In theory even if a thread had the highest priority within the core on which it is running it might not have a high priority in the system as a whole. Another necessary tool for developers is debugging. However, how do we guarantee that the entire system stops and not just the core on which an application is running? These issues need to
be addressed along with teaching good parallel programming practices for developers. Once programmers have a basic grasp on how to multithread and program in parallel, instead of sequentially, ramping up to follow Moores law will be easier. STARVATION: If a program isnt developed correctly for use in a multicore processor one or more of the cores may starve for data. This would be seen if a single threaded application is run in a multicore system. The thread would simply run in one of the cores while the other cores sat idle. This is an extreme case, but illustrates the problem. With a shared cache, for example Intel Core 2 Duos shared L2 cache, if a proper replacement policy isnt in place one core may starve for cache usage and continually make costly calls out to main memory. The replacement policy should include stipulations for evicting cache entries that other cores have recently loaded. This becomes more difficult with an increased number of cores effectively reducing the amount of evict able cache space without increasing cache misses 7.4 HOMOGENOUS VS. HETEROGENEOUS CORE: Architects have debated whether the cores in a multicore environment should be homogeneous or heterogeneous, and there is no definitive answeryet.
Homogenous cores are all exactly the same: equivalent frequencies, cache sizes, functions, etc. However, each core in a heterogeneous system may have a different function, frequency, memory model, etc. There is an apparent trade-off between processor complexity and customization. All of the designs discussed above have used homogeneous cores except for the CELL processor, which has one Power Processing Element and eight Synergistic Processing Elements. Homogeneous cores are easier to produce since the same instruction set is used across all cores and each core contains the same hardware. But are they the most efficient use of multicore technology? Each core in a heterogeneous environment could have a specific function and run its own specialized instruction set. Building on the CELL example, a heterogeneous model could have a large centralized core built for generic processing and running an OS, a core for graphics, a communications core, an enhanced mathematics core, an audio core, a cryptographic core, and the list goes on. This model is more complex, but may have efficiency, power, and thermal benefits that outweigh its complexity. With major manufacturers on both sides of this issue, this debate will stretch on for years to come; it will be interesting to see which side comes out on top.
CHAPTER-8 MULTI-CORE ADVANTAGES

Although the most important advantage of having multi-core architecture is already been discussed i.e. better performance there are many more advantages of multi-core processors as: 8.1 POWER AND COOLING ADVANTAGES OF MULTICORE PROCESSORS: Although the preceding example explains the scalability potential of multicore processors, scalability is only part of the challenge for IT organizations. High server density in the data centre can create significant power consumption and cooling requirements. A multicore architecture can help alleviate the environmental challenges created by high-clock-speed, single core processors. Heat is a function of several factors, two of which are processor density and clock speed. Other drivers include cache size and the size of the core itself. In traditional architectures, heat generated by each new generation of processors has increased at a greater rate than clock speed. In contrast, by using a shared cache (rather than separate dedicated caches for each processor core) and low-clock-speed processors, multicore processors may help administrators minimize heat while maintaining high overall performance. This capability may help make future multicore processors attractive for IT deployments in which density is a key factor, such as high-performance computing (HPC) clusters, Web farms, and large clustered applications. Environments in which 1U servers or blade servers are being deployed today could be enhanced by potential power savings and potential heat reductions from multicore processors. Currently, technologies such as demand-based switching (DBS) are beginning to enter the mainstream, helping organizations reduce the utility power and cooling costs of
computing. DBS allows a processor to reduce power consumption (by lowering frequency and voltage) during periods of low computing demand. In addition to potential performance advances, multicore designs also hold great promise for reducing the power and cooling costs of computing, given DBS technology. DBS is available in single-core processors today, and its inclusion in multicore processors may add capabilities for managing power consumption and, ultimately, heat output. This potential utility cost savings could help accelerate the movement from proprietary platforms to energy-efficient industry-standard platforms.
8.2 SIGNIFICANCE OF SOCKETS IN A MULTICORE ARCHITECTURE: As they become available, multicore processors will require IT organizations to consider system architectures for industry-standard servers from a different perspective. For example, administrators currently segregate applications into single-processor, dualprocessor, and quad-processor classes. However, multicore processors will call for a
new mind-set that considers processor cores as well as sockets. Single-threaded applications that perform best today in a single-processor environment will likely continue to be
deployed on single-processor, single-core system architectures. For single-threaded applications, which cannot make use of multiple processors in a system, moving to a multiprocessor, multicore architecture may not necessarily enhance performance. Most of todays leading operating systems, including Microsoft Windows Server System and Linux variants, are multithreaded, so multiple single-threaded applications can run on a multicore architecture even though they are not inherently multithreaded. However, for multithreaded applications that is currently deployed on single-processor architectures because of cost constraints, moving to a single-processor, dual core architecture has the potential to offer performance benefits while helping to keep costs low. For the bulk of the network infrastructure and business applications that organizations run today on dual-processor servers, the computing landscape is expected to change over time. However, while it may initially seem that applications running on a dual-processor, single-core system architecture can migrate to a single-processor, dual-core system architecture as a cost-saving initiative, this is not necessarily the case. To maintain equivalent performance or achieve a greater level of performance, the dual processor applications of today will likely have to migrate to dual-socket, dual-core systems. Two sockets can be designed to deliver superior performance relative to a dual-socket, single-core system architecture, while also delivering potential power and cooling savings to the data centre. The potential to gradually migrate a large number of older dual-socket, single core servers to energy-efficient dual-socket, multicore systems could enable significant savings in power and cooling costs over time. Because higher-powered, dual-socket systems typically run applications that are more mission-critical than those running on lesspowerful, single-processor systems, organizations may continue to expect more
availability, scalability, and performance features to be designed for dual-socket systems relative to single-socket systemsjust as they do today. For applications running today on high-performing quad processor systems, a
transition to multicore technology is not necessarily an opportunity to move from four-
socket, four-core systems to dual-socket, four-core systems. Rather, the architectural change suggests that todays four-processor applications may migrate to four-socket systems with eight or potentially more processor coreshelping to extend the range of costeffective, industry standard alternatives to large, proprietary symmetric multiprocessing (SMP) systems. Because quad-processor systems tend to run more mission-critical applications in the data centre as compared to dual processor systems and single-processor systems, administrators can expect quad-processor platforms to be designed with the widest range of performance, availability, and scalability features across Dell
PowerEdge server offerings. When comparing relative processing performance of one generation of servers to the next, a direct comparison should not focus on the number of processor cores but rather on the number of sockets. However, the most effective comparison is ultimately not one of processors or sockets alone, but a thorough comparison of the entire platformincluding scalability, availability, memory, I/O, and other features. By considering the entire platform and all the computing components that participate in it, organizations can best match a platform to their specific application and business needs. 8.3 EVOLUTION OF SOFTWARE TOWARD MULTICORE TECHNOLOGY: Multicore processing continues to exert a significant impact on software evolution. Before the advent of multicore processor technology, both SMP systems and HT Technology motivated many OS and application vendors to design software that could take advantage of multithreading capabilities. As multicore processor-based systems enter the mainstream and evolve, it is likely that OS and application vendors will optimize their offerings for multicore architectures, resulting in potential performance increases over time through enhanced software efficiency. Most application vendors will likely continue to develop on industry-standard processor platforms, considering the power, flexibility, and huge installed base of these systems. Currently, 64-bit. Intel Xeon processors have the capability to run both 32-bit applications and 64-bit applications through the use of Intel Extended Memory 64 Technology (EM64T). The industry is gradually making the transition from a 32-bit standard to a 64-bit standard, and similarly, software can be expected to make the transition to take advantage of multicore processors over time. Applications that are designed for a multiprocessor or multithreaded environment can currently take advantage of multicore processor architectures. However, as
Page 28 Multi-Core Processor Technology
software becomes optimized for multicore processors, organizations can expect to see
overall application performance enhancements deriving from software innovations that take advantage of multicore-processor-based system architecture instead of increased clock speed. In addition, compilers and application development tools will likely become available to optimize software code for multi core processors, enabling long-term optimization and enhanced efficiency for multicore processorswhich also may help realize
performance improvements through highly tuned software design rather than a brute-force increase in clock speed. Intel is working toward introducing software tools and compilers to help optimize threading performance for both single-core and multicore architectures. Organizations that begin to optimize their software today for multicore system architecture may gain significant business advantages as these systems become mainstream over the next few years. For instance, todays dual Intel Xeon processor-based system with HT Technology can support four concurrent threads (two per processor). With the advent of dual-core Intel Xeon processors with HT Technology, these four threads would double to eight. An OS would then have eight concurrent threads to distribute and manage workloads, leading to potential performance increases in processor utilization and processing efficiency. 8.4 SINGLE CORE VS. MULTI-CORE: The table below shows a comparison of a single and multicore (8 cores in this case) processor used by the Packaging Research Centre at Georgia Tech. With the same source voltage and multiple cores run at a lower frequency we see an al-most tenfold increase in bandwidth while the total power consumption is reduced by a factor of four.
TABLE 8.1Single core vs. Dual Core
Page 29
8.5 COMMERCIAL INCENTIVES: Now-a-days the multi-core processors are becoming very popular. Here are some lists of multi-core processors that are being highly adopted
TABLE 8.2Market Reviews as on January, 2011 DISADVANTAGES OF MULTI CORE PROCESSING: Adjustments must be made to the operating systems and other software to accommodate the structure and function of CPU. The only efficient way to operate multiple applications is multi-threading, and some applications do not function well with this concept. Sharing of the same bus and memory causes increased bandwidth, limiting the speed of the process.
USES OF MULTI CORE PROCESSING: Multi-tasking, Multi-threading, virtually eliminates high latency while running several programs, or background applications such as anti-virus software. The use of silicon surface area is increased and hence making better use of supplies and driving down costs. The speed of normal processes is increased. More processing power than standard CPUs used.
Page 30
CHAPTER-9 CONCLUSIONTECHNOLOGY
Before multicore processors the performance increase from generation to generation was easy to see, an increase in frequency. This model broke when the high frequencies caused processors to run at speeds that caused increased power consumption and heat dissipation at detrimental levels. Adding multiple cores within a processor gave the solution of running at lower frequencies, but added interesting new problems. Multicore processors are architected to adhere to reasonable power consumption, heat dissipation, and cache coherence protocols. However, many issues remain unsolved. In order to use a multicore processor at full capacity the applications run on the system must be multithreaded. There are relatively few applications (and more importantly few programmers with the know-how) written parallelism. The memory systems with any level of
SHIFT
IN
FOCUS
TOWARD
MULTI-CORE
and interconnection net-works also need
improvement. And finally, it is still unclear whether homogeneous or heterogeneous cores are more efficient. With so many different designs (and potential for even more) it is nearly impossible to set any standard for cache coherence, interconnections, and layout. The greatest difficulty remains in teaching parallel programming techniques (since most programmers are so versed in sequential programming) and in redesigning
current applications to run optimally on a multicore system. Multicore processors are an important innovation in the microprocessor timeline. With skilled programmers capable of writing parallelized applications multicore efficiency could be increased dramatically. In years to come we will see much in the way of improvements to these systems. These improvements will provide faster programs and a better computing experience.
Page 31
10. REFERENCES
[1] Advanced microprocessors-A.K RAY & K.M.BHURCHANDI
[2] Web references:
http://en.wikipedia.org http://www.ieee.org http://www.intel.com/products/processor/ http://www.dualcoretechnology.com http://multicore.amd.com
R. Merritt, CPU Designers Debate Multi-core Future, EETimes Online, February 2008, R. Merritt, X86 http://www.eetimes.com/showArticle.jhtml?articleID=206105179 Cuts to the Cores, EETimes Online, September 2007,
http://www.eetimes.com/showArticle.jtml?articleID=202100022
Page 32

1.1. Computers & Processors:: Chapter-1

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1.1. Computers & Processors:: Chapter-1

Uploaded by

Copyright:

Available Formats

CHAPTER-1 INTRODUCTION

Multi-Core Processor Technology

Multi-Core Processor Technology

Fig 1.4. Depiction of Moores Law

Fig 1.5 Evolutions of Microprocessors

Multi-Core Processor Technology

CHAPTER-2 SINGLE CORE PROCESSORS: A STEP BEHIND

FIG 2.2 single core processor

Multi-Core Processor Technology

CHAPTER-3 NEED OF MULTI-CORE PROCESSORS

CHAPTER-4 WHAT IS MULTICORE PROCESSOR?

Multi-Core Processor Technology

CHAPTER-5 MULTI-CORE IMPLEMENTATION

Multi-Core Processor Technology

Fig 5.2[multicore processor implemented with 4 independent processor]

Multi-Core Processor Technology

Floorplans for 4,8 and 16 core processors [assuming private caches]

NCU NCU MC NCU NCU IOX

IOX SBF IOX

MC NCU NCU MC IOX NCU NCU NCU NCU

Note that there are two SBFs for 16 core processor

Multi-Core Processor Technology

rendering which means superior processing power for gaming applications.

CHAPTER-6 MULTI-CORE CHALLENGES

CHAPTER-7 OPEN ISSUES

memory transactions. Intel is developing their Quick path

CHAPTER-8 MULTI-CORE ADVANTAGES

transition to multicore technology is not necessarily an opportunity to move from four-

Multi-Core Processor Technology

TABLE 8.1Single core vs. Dual Core

Multi-Core Processor Technology

and interconnection net-works also need

Multi-Core Processor Technology

[2] Web references:

http://en.wikipedia.org http://www.ieee.org http://www.intel.com/products/processor/ http://www.dualcoretechnology.com http://multicore.amd.com

Multi-Core Processor Technology

You might also like