What Is a Pipeline Processor?

All regular ChipGeek readers have undoubtedly read about the number of pipeline stages each processor has. Their number and use are big factors in overall performance, and they can really speed up or slow down certain types of code. But what is a pipeline, and why is it useful?

The pipeline itself comprises a whole task that has been broken out into smaller sub-tasks. The concept actually has its roots in mass-production manufacturing plants, such as Ford Motor Company. Henry Ford determined long ago that even though it took several hours to physically build a car, he could produce a car a minute if he broke out all of the steps required to put a car together into different physical stations on an assembly line. One station was responsible for putting in the engine, another the tires, another the seats, and so on. When the assembly line was first turned on, it still took several hours for the first car to come off the end and be finished, but since everything was being done in steps or stages, the second car was right behind it and was almost complete when the first one rolled off. The same followed with the third, fourth, and so on. Thus the assembly line was formed, and mass production became a reality.

In computers, the same basic logic applies, but rather than producing something physical on an assembly line, it is the workload itself (required to carry out the task at hand) that gets broken down into smaller stages, called the pipeline.

Consider a simple operation. Suppose the need exists to take two numbers, multiply them together, and then store the result. As humans, we would just look at the numbers and multiply them (or, if they're too big, punch them into a calculator) and then write down the result. We wouldn't give much thought to the process; we would just do it. Computers aren't that smart; they have to be told exactly how to do everything. So, a programmer would have to tell the computer where the first number was, where the second number was, what operation to perform (a multiply), and then where to store the result. This logic can be broken down into the following (greatly simplified) steps, or stages, of the pipeline:

1. Fetch the first number
2. Fetch the second number
3. Multiply the two numbers
4. Store the result
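To make those stages concrete, here is a minimal sketch in Python (the function names and the tiny "memory" dictionary are illustrative assumptions, not a real instruction set):

```python
# A (greatly simplified) four-stage breakdown of "multiply two numbers and store the result".
# Stage names and the memory model are illustrative, not taken from any real CPU.
memory = {"a": 6, "b": 7, "result": None}

def fetch_first(state):        # stage 1: load the first operand
    state["x"] = memory["a"]

def fetch_second(state):       # stage 2: load the second operand
    state["y"] = memory["b"]

def multiply(state):           # stage 3: perform the multiply
    state["product"] = state["x"] * state["y"]

def store_result(state):       # stage 4: write the result back
    memory["result"] = state["product"]

stages = [fetch_first, fetch_second, multiply, store_result]

state = {}
for stage in stages:           # one stage per clock cycle, for a single instruction
    stage(state)
print(memory["result"])        # 42
```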

This pipeline has four stages. Now suppose that each of these logical operations takes one clock cycle to complete (which is fairly typical in modern computers). That would mean the complete task of multiplying two numbers together takes four clock cycles. However, with the ability to do things at the same time (in parallel) rather than one after another, the result is that while the task itself physically takes four clock cycles to complete, it can appear to complete in fewer, because each of those stages can also be doing work immediately before and after the first task's needs are met. As a result, after each clock cycle the output of one of those operations is retired, or completed, meaning that task is done. And since we're doing things in a pipeline, each task, while taking four clock cycles to complete, can appear to retire at a rate of one per clock cycle.

This concept can be visualized with colors added to the previous image and the stages broken out for each clock. Imagine each color representing an instruction being processed, with each instruction taking four clock cycles to complete. The red, green, and dark blue instructions would've had other stages above our block, and the yellow, purple, and brown instructions would need additional clock cycles after our block to complete. But, as you can see, even with all of this going on simultaneously, after every single clock cycle an instruction (which actually took four clocks to execute) is completed! This is the big advantage of processing data via a pipeline.
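To put rough numbers on that, here is a small back-of-the-envelope sketch in Python, assuming a four-stage pipeline, one cycle per stage, and no stalls:

```python
# Cycles needed to run n instructions through a pipeline of the given depth,
# assuming every stage takes exactly one clock cycle and the pipeline never stalls.
def cycles_unpipelined(n, depth=4):
    return n * depth            # each instruction runs start-to-finish on its own

def cycles_pipelined(n, depth=4):
    return depth + (n - 1)      # fill the pipeline once, then retire one per clock

for n in (1, 10, 100):
    print(n, cycles_unpipelined(n), cycles_pipelined(n))
# 1 instruction:    4 vs 4 cycles (no gain until the pipeline is full)
# 10 instructions:  40 vs 13 cycles
# 100 instructions: 400 vs 103 cycles -- throughput approaches one per clock
```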

This may seem a little confusing, so try to look at it this way. There are four units, and in every clock cycle each unit is doing something. You can visualize each unit doing its own bit of work with the following breakout:
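Here is a minimal sketch of that breakout in Python, assuming the same four stages as above and labeling the instructions A through F (the labels stand in for the colored instructions in the original figure):

```python
# Print which instruction each pipeline unit is working on in every clock cycle.
# Four units (one per stage); a "-" means the unit is still filling up or
# draining at the edges of the diagram.
stages = ["fetch 1st", "fetch 2nd", "multiply", "store"]
instructions = ["A", "B", "C", "D", "E", "F"]

for clock in range(len(instructions) + len(stages) - 1):
    busy = []
    for unit, stage in enumerate(stages):
        i = clock - unit                     # instruction currently in this unit
        name = instructions[i] if 0 <= i < len(instructions) else "-"
        busy.append(f"{stage}: {name}")
    print(f"clock {clock + 1}: " + " | ".join(busy))
# From clock 4 onward every unit is busy, and one instruction retires per clock.
```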

Every clock cycle, each unit has something to do. And because each sub-task is known to take only one clock cycle, by the time the data from the first clock cycle is ready to be processed next, the following stage knows the data will be there because, by definition, each unit has to complete its work in one clock cycle. If it doesn't, then the processor isn't working like it's supposed to (this is one reason why you can only overclock CPUs so far and no further, even with great cooling). And because all of that machinery works together, four-step instructions (or tasks) can be completed at a rate of one per clock.

The advantage of this as a speed-up potential should be obvious, especially when you consider how many stages modern processors have (from 8 in Itanium 2 all the way up to 31 in Prescott!). The terms "super-pipelined" and "hyper-pipelined" have become commonplace to describe the extent to which this breakout has been employed.

Below is the pipeline for the Itanium 2. Each stage represents something that IA64 can do, and once everything gets rolling the Itanium 2 is able to process data really, really quickly. The problem with IA64 is that the compiler or assembly language programmer has to be extremely thorough in figuring out the best way to keep all of those pipeline stages filled all of the time, because when they're not filled, the Itanium's performance drops significantly.
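One way to see why stage count and clock rate are tied together: the clock period can be no shorter than the slowest stage plus a bit of latch overhead. Here is a rough sketch with made-up illustrative numbers, not measurements of any real chip:

```python
# Rough model: a fixed amount of work per instruction gets split across N stages.
# Each stage also pays a small fixed latch/register overhead, so you can't make
# the clock arbitrarily fast just by adding stages. All numbers are illustrative.
def max_clock_ghz(total_work_ns=4.0, stages=4, latch_overhead_ns=0.05):
    stage_time_ns = total_work_ns / stages + latch_overhead_ns  # slowest stage sets the clock
    return 1.0 / stage_time_ns

for n in (4, 8, 20, 31):
    print(f"{n:2d} stages -> roughly {max_clock_ghz(stages=n):.2f} GHz")
# More stages allow a faster clock, but each stage must still finish in one cycle --
# push the clock past that point and the pipeline breaks, which is why overclocking has limits.
```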

I was hoping to find an image showing Prescott's 31 stages, but I couldn't. The closest I found was a black-and-white comparison of the P6 core (Pentium III) and the original P7 core (Willamette). If anyone has a link showing Prescott's 31 stages, please let us know.

Here is an Opteron pipeline shown through actual logic units as they exist on the chip. This will help you visualize how the logical stages shown above for Itanium 2 might relate to physical units on the CPU die itself:

As you can see, there are different parts to the pipeline all working together, just like on an assembly line. They all relate to one another to do some real quantity of work. Some of it is front-end preparation, some of it is actual execution; and once everything is completed, parts are dedicated to retiring data, or putting it back wherever it needs to go (main memory/cache, something called an internal register, which is like a super-fast cache inside of the processor itself, an external data port, etc.).

It's worth noting that the hyper-pipelined design of Intel's NetBurst (used in Willamette through Prescott) turned out to be a dead end when pushed to its extreme 31-stage pipeline in Prescott. The reason is the penalty that comes from mis-predicting where the computer program will go next. If the processor guesses wrong, it has to refill the pipeline, and that takes many clock cycles before any real work can start flowing again (just as it takes several hours to make the first car). Another penalty is extreme heat generation at the high clock rates seen in Prescott-based P4s.

As a result, real-world experience has shown that there is a trade-off between how deep your pipeline can be and how deep it should be, given the type of processing you're doing. Even though on paper it might seem a better idea to have a 50-stage pipeline with a 50GHz clock rate, a designer cannot simply go and build it, even though it would allow extremely complex tasks to be completed 50 billion times per second (though with GaAs chips on the way, that might now be possible). Chip designers can't do it because there are real-world constraints that mandate a happy medium between that ideal and the real-world actual. The biggest factor is how a computer program jumps around constantly: calling sub-routines or functions, going over if..else..endif branches, looping, and so on. The processor is constantly running the risk of guessing a branch wrong, and when it does it must invalidate everything it guessed on in the pipeline and begin to refill it completely, and that takes time and lowers performance.

The imposed limitations on pipeline depth are simply a side-effect of running code via the facilities available within a processor to carry out the workload. A processor just can't do stuff the way a person can. Everything inside a CPU has to be programmed exactly as it needs to be, with absolutely no margin for error or guesswork. Any error, any error whatsoever, no matter how small, means the processor becomes totally and completely useless; it might as well not even exist.

I hope this article has been informative. It should've given you a way to visualize a processor pipeline, understand why it is important to performance, and help you put together how it all works. You should be able to see why designs like Prescott (which take the pipeline depth to an extreme) often come at a real-world performance cost. You should also appreciate why slower-clocked processors (such as Itanium 2 at 1.8GHz) are able to do more work than much higher-clocked processors (like Pentium 4 at 4GHz).
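To make that trade-off concrete, here is the usual back-of-the-envelope math as a small Python sketch (the branch frequency, prediction accuracy, and refill penalty are assumed for illustration, not taken from any particular CPU):

```python
# Average clocks per instruction (CPI) when a mispredicted branch forces a pipeline refill.
# Assumed numbers: ~20% of instructions are branches, the predictor is right 95% of the
# time, and the refill penalty grows roughly with the number of pipeline stages flushed.
def effective_cpi(depth, branch_fraction=0.20, predict_accuracy=0.95):
    refill_penalty = depth                            # clocks lost per mispredicted branch
    mispredict_rate = branch_fraction * (1 - predict_accuracy)
    return 1.0 + mispredict_rate * refill_penalty

for depth in (8, 20, 31):
    print(f"{depth:2d}-stage pipeline -> about {effective_cpi(depth):.2f} clocks per instruction")
# Deeper pipelines allow higher clock rates, but every misprediction costs more,
# so the useful work done per clock drops as the pipeline gets deeper.
```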

It's exactly because of the number of pipeline stages, coupled with the number of available units inside the chip that can do things in parallel. The pipeline allows things to be done in parallel, and that means a CPU's logic units are kept as busy as possible as often as possible, making sure that instructions keep flying off the end at the highest rate possible.

Keep in mind that there are several other factors that speed up processing: processor concepts such as OoO (Out of Order) execution, speculative execution, the benefits of cache, etc. Stay tuned to ChipGeek for coverage of those, and keep your inner geek close by. Post your questions and comments below.

Also, for your reading pleasure, here are some other online articles relating to pipelines and pipeline stages: Ars Technica on pipelining in general and Opteron's pipeline; some info on Prescott's die; and a history of Intel chips and their pipeline depths. This closing graphic summarizes the trend from the original 8086 through today's Pentium 4. Enjoy!
