Computer Architecture Slides 1

LECTURE 1
INTRODUCTION TO
COMPUTER ORGANISATION
Turbo Majumder
turbo@ee.iitd.ac.in
About Instructor and Course
Instructor: Dr. Turbo Majumder
Department of Electrical Engineering
Office: III-335
Email: turbo@ee.iitd.ac.in
Phone: 1073
Course webpage: http://web.iitd.ac.in/~turbo/EEL308_1302.htm
TAs: See course page for full list.
Textbook:
Computer Organization and Design: The Hardware/Software Interface,
ARM Edition, David A. Patterson, John L. Hennessy, Morgan
Kaufmann (Source of most material and figures in the lecture slides)
Class hours: Slot F: Tue, Thu, Fri: 11:00-11:50 am
Tutorial hours: 1:00-1:50 pm
Grading policy
Minor1: 20
Minor2: 20
Major: 30
Class participation
Term paper: 5
Quizzes: 15
Tutorial: 10
Attendance policy: As per Institute rules
Collaboration is good when it is open, honest and given
due credit. Clandestine collaboration invites an F.
Why learn computer architecture?
It is a core course, duh.
Computers (or if you will, microprocessors) are
everywhere. You will probably be designing or using one
in whatever job you do.
To design well, of course.
To use it well (e.g. programming), you need to know what is inside.
Plus, knowing this stuff gives you a geeky edge!
Where are computers used?
What can you hope to learn?
A. How does my computer understand the C program I
have written?
B. Where do software and hardware interface in a
microprocessor? What does the interface look like?
C. What is performance? How can I characterise it? How
can I improve upon it?
D. Briefly, why do we need multicore processors and
parallel processing?
What impacts program performance?
Algorithm
Programming language, compiler and architecture
Processor and memory design
I/O interface design
We will look at all of these in terms of A, B, C and D
(previous slide).
How does my computer understand my program?
Layers: Applications software (browser, word processor, media player), Systems software (OS), Computer hardware (microprocessor)
High-level language: c = a + b;
Compiler
Assembly language: ADD RC, RA, RB
Assembler
Machine language: 0x40af8020
Computer hardware: "Only binary language, please!"
Instruction set architecture
Basic components of a computer
Processor
Datapath
Control
Memory
Volatile
Non-volatile
I/O
Input
Output
Networking
LAN, WAN, WLAN
Moore's Law
Source: Wikimedia Commons
Moore's Law again
Rachel Courtland, The Status of Moore's Law: It's Complicated, IEEE Spectrum, 28 Oct 2013 (based on data from
Global Foundries)
Moore's Law: New dimensions?
AMD's Barcelona Architecture
Quad-core, 65 nm process, 2007
(Courtesy: AnandTech)
Effect on number of cores in a microprocessor
Multicores
More on this later
Performance
Mostly concerned with time performance
Execution time
Performance = 1/(Execution time)
Important for individual applications/tasks
Improves (decreases) with faster processors
What is faster?
Higher clock speed?
Greater parallelism?
Computation throughput
Performance = No. of tasks/operations performed per second
Usually from different applications
Measured typically in GFLOPS, TFLOPS, ExaFLOPS
Important for server/cloud applications
Parallelism is key to getting these benefits.
Performance: Deep Dive
Relative performance:
Perf(X)/Perf(Y) = ExTime(Y)/ExTime(X)
Total execution time
Wall clock time, response time or elapsed time
CPU time
User CPU time
System CPU time
Difficult to separate these components
Use top command in Linux shell or Task Manager in Windows.
CPU Performance
CPU execution time
= CPU clock cycles per program × clock cycle time (clock period)
= CPU clock cycles per program / clock frequency
Program: set of (assembly/machine language) instructions
CPU clock cycles per program
= Instructions per program × average clock cycles per instruction
= Instruction count (IC) × cycles per instruction (CPI)
CPU execution time
= IC × CPI × T_clk
= IC × CPI / f_clk
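The relations above can be checked numerically. This is a minimal Python sketch, not part of the original slides: the function name `cpu_time` and the two machines X and Y (instruction count, CPI, clock frequency) are assumed values chosen only for illustration.

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """CPU execution time in seconds: IC x CPI / f_clk."""
    return instruction_count * cpi / clock_hz


# Hypothetical machines X and Y running the same program (assumed numbers).
time_x = cpu_time(instruction_count=1_000_000, cpi=2.0, clock_hz=2e9)
time_y = cpu_time(instruction_count=1_000_000, cpi=1.2, clock_hz=1e9)

# Relative performance: Perf(X)/Perf(Y) = ExTime(Y)/ExTime(X)
relative_perf = time_y / time_x

print(f"X: {time_x * 1e3:.3f} ms, Y: {time_y * 1e3:.3f} ms")
print(f"Perf(X)/Perf(Y) = {relative_perf:.2f}")
```

With these assumed numbers, X has the higher CPI but still finishes first because its clock is twice as fast, so clock speed and CPI have to be read together.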
CPU Performance: An example
Two programs: foo and faa
Instruction types:
Instr_0: 1 cycle
Instr_1: 2 cycles
Instr_2: 5 cycles
foo: total 10 instructions
Instr_0: 7
Instr_1: 2
Instr_2: 1
Total clock cycles (foo) = 7*1 + 2*2 + 1*5 = 16
faa: total 8 instructions
Instr_0: 4
Instr_1: 2
Instr_2: 2
Total clock cycles (faa) = 4*1 + 2*2 + 2*5 = 18
CPI details
Different instructions have different individual CPIs
Overall CPI is given by using a weighted average
foo: CPI = 1.6
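faa: CPI = 2.25 (higher relative frequency of Instr_2)
$$\mathrm{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \mathrm{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}$$
where $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of instruction $i$.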
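The foo and faa numbers can be reproduced with a short Python sketch of the weighted-average CPI. The cycle counts and instruction mixes are taken from the example; the names `CYCLES`, `PROGRAMS`, `total_cycles` and `overall_cpi` are illustrative choices, not from the slides.

```python
# Cycles per instruction type, as given in the example.
CYCLES = {"Instr_0": 1, "Instr_1": 2, "Instr_2": 5}

# Instruction counts per program, as given in the example.
PROGRAMS = {
    "foo": {"Instr_0": 7, "Instr_1": 2, "Instr_2": 1},
    "faa": {"Instr_0": 4, "Instr_1": 2, "Instr_2": 2},
}


def total_cycles(counts):
    """Sum of (cycles per instruction type) x (count of that type)."""
    return sum(CYCLES[name] * n for name, n in counts.items())


def overall_cpi(counts):
    """Weighted average of per-type CPI, weighted by relative frequency."""
    instruction_count = sum(counts.values())
    return sum(CYCLES[name] * (n / instruction_count) for name, n in counts.items())


for program, counts in PROGRAMS.items():
    print(f"{program}: cycles = {total_cycles(counts)}, CPI = {overall_cpi(counts):.2f}")
```

This gives 16 cycles and a CPI of 1.6 for foo, and 18 cycles and a CPI of 2.25 for faa, matching the slide. The weighted average equals total cycles divided by instruction count, which is the first equality in the CPI formula.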
Power: Problem with Moore's Law
[Figure: processor power (Watts, log scale from 0.1 to 10,000) against year ('71 to '08) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium processors. Annotation: "Power Projections Too High!", with reference levels Hot Plate, Nuclear Reactor, Rocket Nozzle and Sun's Surface. Source: Intel]
Circumventing the power wall
$P = C V^2 f$
V: 5 V → 1 V; f: 30 MHz → 3 GHz
We can reduce voltage and capacitive load
by only so much.
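A quick calculation shows why voltage scaling alone could not hold power down. The sketch below uses the voltage and frequency endpoints from the slide; keeping the capacitive load C constant is an assumption made only to isolate the effect of V and f, and the function name `dynamic_power` is illustrative.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Assumption: capacitive load held constant (arbitrary units); it cancels in the ratio.
C = 1.0

p_old = dynamic_power(C, voltage=5.0, frequency=30e6)  # 5 V, 30 MHz
p_new = dynamic_power(C, voltage=1.0, frequency=3e9)   # 1 V, 3 GHz

ratio = p_new / p_old
print(f"Power ratio (new/old) at constant C: {ratio:.1f}")
```

Under that assumption the 25x saving from the lower voltage is outweighed by the 100x higher frequency, so power still rises by a factor of 4.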
Other limitations in uniprocessors
Constrained by power, instruction-level parallelism (ILP) and memory latency
Moore's Law: New approach
AMD's Barcelona Architecture
Quad-core, 65 nm process, 2007
(Courtesy: AnandTech)
Increasing number of cores in a processor to be better prepared for the power wall.
More processing done in parallel at the same clock frequency.
Age of multicore processors
Multiprocessor trends
Larger number of cores
Better performance (speed, energy)
Greater complexity in design and application porting
[Figure: single-core, dual-core and 8-core processors, GPU, NoC]
Benchmarking for performance
Standard Performance Evaluation Corporation (SPEC)
Integer (CINT2006) or Floating point (CFP2006)
Reference: Sun UltraSparc II system at 296MHz
Standard Performance Evaluation Corporation (info@spec.org, http://www.spec.org/)
SPEC CINT2006 Result
Copyright 2006-2013 Standard Performance Evaluation Corporation
Cisco Systems: Cisco UCS C220 M3 (Intel Xeon E5-2667 v2 @ 3.30 GHz)
SPECint2006 = 68.1
SPECint_base2006 = 63.0
CPU2006 license: 9019
Test date: Sep-2013
Test sponsor: Cisco Systems
Hardware Availability: Sep-2013
Tested by: Cisco Systems
Software Availability: Aug-2013
Results Table
Benchmark        Base run 1    Base run 2    Base run 3    Peak run 1    Peak run 2    Peak run 3
                 Sec.   Ratio  Sec.   Ratio  Sec.   Ratio  Sec.   Ratio  Sec.   Ratio  Sec.   Ratio
400.perlbench    263    37.2   263    37.2   264    37.1   210    46.5   210    46.5   210    46.5
401.bzip2        349    27.6   349    27.6   349    27.6   346    27.9   346    27.9   346    27.9
403.gcc          214    37.6   215    37.4   216    37.3   210    38.4   210    38.4   210    38.4
429.mcf          119    76.7   119    76.7   119    76.5   119    76.7   119    76.7   119    76.5
445.gobmk        367    28.6   367    28.6   366    28.7   331    31.7   331    31.6   331    31.7
456.hmmer        133    70.1   133    70.1   133    70.1   133    70.0   133    70.0   135    69.1
458.sjeng        359    33.7   359    33.7   388    31.2   352    34.4   352    34.4   352    34.4
462.libquantum   5.48   3780   5.88   3520   5.48   3780   5.48   3780   5.88   3520   5.48   3780
464.h264ref      400    55.4   399    55.4   398    55.6   327    67.6   327    67.7   327    67.7
471.omnetpp      165    37.8   174    35.9   168    37.2   116    53.7   115    54.3   116    53.8
473.astar        188    37.4   188    37.3   188    37.4   188    37.4   188    37.3   188    37.4
483.xalancbmk    103    67.1   103    67.2   103    67.3   104    66.3   104    66.5   103    66.7
Results appear in the order in which they were run. Bold underlined text indicates a median measurement.
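The headline score can be related to the table with a small Python sketch. It assumes the usual SPEC CPU2006 convention that each ratio is the reference machine's time divided by the measured time, and that the overall metric is the geometric mean of the per-benchmark median ratios. The base ratios are copied from the results table; the names `BASE_RATIOS` and `geometric_mean` are illustrative.

```python
import math
from statistics import median

# Base ratios for the three runs of each benchmark, copied from the results table.
BASE_RATIOS = {
    "400.perlbench": (37.2, 37.2, 37.1),
    "401.bzip2": (27.6, 27.6, 27.6),
    "403.gcc": (37.6, 37.4, 37.3),
    "429.mcf": (76.7, 76.7, 76.5),
    "445.gobmk": (28.6, 28.6, 28.7),
    "456.hmmer": (70.1, 70.1, 70.1),
    "458.sjeng": (33.7, 33.7, 31.2),
    "462.libquantum": (3780, 3520, 3780),
    "464.h264ref": (55.4, 55.4, 55.6),
    "471.omnetpp": (37.8, 35.9, 37.2),
    "473.astar": (37.4, 37.3, 37.4),
    "483.xalancbmk": (67.1, 67.2, 67.3),
}


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Take the median of the three runs per benchmark, then the geometric mean.
median_ratios = [median(runs) for runs in BASE_RATIOS.values()]
spec_base = geometric_mean(median_ratios)

print(f"Geometric mean of median base ratios: {spec_base:.1f}")
```

The result rounds to the reported SPECint_base2006 of 63.0. A geometric mean keeps one outlier such as 462.libquantum from dominating the score the way it would in an arithmetic mean.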
Operating System Notes
Stack size set to unlimited using "ulimit -s unlimited"
Platform Notes
BIOS Settings:
Intel HT Technology = Disabled
CPU performance set to HPC
Power Technology set to Custom
CPU Power State C6 set to Enabled
CPU Power State C1 Enhanced set to Disabled
Energy Performance policy set to Performance
Memory RAS configuration set to Maximum Performance
DRAM Clock Throttling Set to Performance
LV DDR Mode set to Performance-mode
DRAM Refresh Rate Set to 1x
Sysinfo program /opt/cpu2006-1.2/config/sysinfo.rev6818
$Rev: 6818 $ $Date:: 2012-07-17 #$ e86d102572650a6e4d596a3cee98f191
running on linux-ygey Sat Aug 31 14:55:05 2013
This section contains SUT (System Under Test) info as seen by
some common utilities. To remove or add to this section, see:
http://www.spec.org/cpu2006/Docs/config.html#sysinfo
From /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
2 "physical id"s (chips)
16 "processors"
Source: www.spec.org
Benchmarking for power
Target Load   Actual Load   Performance (ssj_ops)   Average Active Power (W)   Performance to Power Ratio
100%          99.2%         28,593,082              3,239                      8,828
90%           89.9%         25,917,132              2,765                      9,372
80%           80.0%         23,041,812              2,499                      9,221
70%           70.0%         20,156,576              2,289                      8,805
60%           60.0%         17,296,778              2,061                      8,393
50%           49.9%         14,392,213              1,848                      7,787
40%           40.0%         11,531,023              1,671                      6,902
30%           30.0%         8,645,852               1,494                      5,788
20%           20.0%         5,766,436               1,322                      4,363
10%           10.0%         2,879,100               1,151                      2,501
Active Idle                 0                       688                        0
∑ssj_ops / ∑power = 7,525
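The overall figure of 7,525 can be recomputed from the table. This is a minimal Python sketch using the table's own numbers; it takes the overall metric to be the sum of ssj_ops over all load levels divided by the sum of average power over all levels, with active idle included, which is what the last line of the table indicates. The name `LEVELS` is illustrative.

```python
# (target load, ssj_ops, average active power in W), copied from the table.
LEVELS = [
    ("100%", 28_593_082, 3_239),
    ("90%", 25_917_132, 2_765),
    ("80%", 23_041_812, 2_499),
    ("70%", 20_156_576, 2_289),
    ("60%", 17_296_778, 2_061),
    ("50%", 14_392_213, 1_848),
    ("40%", 11_531_023, 1_671),
    ("30%", 8_645_852, 1_494),
    ("20%", 5_766_436, 1_322),
    ("10%", 2_879_100, 1_151),
    ("Active Idle", 0, 688),
]

# Per-level performance to power ratio.
for load, ops, watts in LEVELS:
    print(f"{load:>11}: {ops / watts:8.0f} ssj_ops/W")

# Overall score: total operations over total power, idle power included.
total_ops = sum(ops for _, ops, _ in LEVELS)
total_power = sum(watts for _, _, watts in LEVELS)
overall = total_ops / total_power

print(f"Overall: {overall:.0f} ssj_ops/W")
```

Because active idle adds 688 W to the denominator and nothing to the numerator, the overall score sits below most of the per-level ratios.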
SPECpower_ssj2008
Copyright 2007-2013 Standard Performance Evaluation Corporation
Dell Inc. PowerEdge M620 (Intel Xeon E5-2660 v2, 2.20 GHz)
SPECpower_ssj2008 = 7,525 overall ssj_ops/watt
Test Sponsor: Dell Inc.
SPEC License #: 55
Test Method: Multi Node
Tested By: Dell Inc.
Test Location: Round Rock, TX, USA
Test Date: Sep 12, 2013
Hardware Availability: Sep-2013
Software Availability: Sep-2012
Publication: Oct 16, 2013
System Source: Single Supplier
System Designation: Server
Power Provisioning: Line-powered
Benchmark Results Summary
Aggregate SUT Data
# of Nodes: 16
# of Chips: 32
# of Cores: 320
# of Threads: 640
Total RAM (GB): 384
# of OS Images: 16
# of JVM Instances: 320
System Under Test
Shared Hardware
Enclosure: Dell PowerEdge M1000e
Form Factor: 10U
Power Supply Quantity and Rating (W): 3 x 2700
Power Supply Details: Dell PN G803N
Network Switch: Dell PowerConnect 6248
Network Switch Details: 48 Port 1Gb Ethernet Switch
KVM Switch: None
KVM Switch Details: N/A
Other Hardware: Dell M1000e Chassis Management Controller, Dell P/N: JV95D; Dell 16-Port Gigabit Ethernet Pass-Through Module, Dell P/N: WW060
Comment: Network switch not measured for power
Set: 'sut'
Set Identifier: sut
Set Description: M620
# of Identical Nodes: 16
Source: www.spec.org
Improving performance
Make certain parts faster: acceleration
e.g. graphics acceleration using GPU while playing computer games
How much can we improve?
Amdahl's Law:
Let $T_o$ = original execution time = $T_a$ (time that is subject to acceleration) + $T_u$ (time unaffected by acceleration);
Acceleration = $f$; improved total time = $T_i$; overall speedup = $S$
$$T_i = \frac{T_a}{f} + T_u$$
$$S = \frac{T_o}{T_i} = \frac{T_a + T_u}{T_a/f + T_u}$$
If $f \to \infty$, $S \to 1 + \frac{T_a}{T_u}$ = 1/(fraction of total runtime unaffected by acceleration)
Corollary: Make the common case faster.
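Amdahl's Law is easy to explore numerically. The Python sketch below follows the slide's definitions of T_a, T_u and f; the function name `amdahl_speedup` and the 80 s / 20 s split are assumptions chosen only for illustration.

```python
def amdahl_speedup(t_a, t_u, f):
    """Overall speedup S = (T_a + T_u) / (T_a / f + T_u)."""
    t_original = t_a + t_u
    t_improved = t_a / f + t_u
    return t_original / t_improved


# Hypothetical program: 80 s can be accelerated, 20 s cannot (assumed numbers).
T_A, T_U = 80.0, 20.0

for f in (2, 10, 100, 1e6):
    print(f"f = {f:>9}: S = {amdahl_speedup(T_A, T_U, f):.3f}")

# Upper bound as f grows without limit: 1 + T_a / T_u
limit = 1 + T_A / T_U
print(f"Limit: {limit:.1f}")
```

However large f becomes, the speedup in this example stays below 5, because the unaccelerated 20% of the runtime sets the ceiling. The bigger the fraction that benefits, the higher that ceiling, which is the reasoning behind the corollary.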
What about idle power?
SPECPower results: 10% workload consumes more than
25% of peak power
Leakage power is a major concern with smaller
technologies.
Green data centres need energy-proportional
computing
Barroso, L.A.; Hölzle, U., "The Case for Energy-Proportional
Computing," Computer, vol. 40, no. 12, pp. 33-37, Dec. 2007
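The gap between measured and energy-proportional behaviour can be read off the SPECpower table shown earlier in these slides. This Python sketch compares the measured power at each load level with an ideal system whose power scales linearly with load; the power values are copied from that table, and the name `POWER_W` and the linear ideal are illustrative assumptions.

```python
# Average active power (W) by target load (%), from the SPECpower table; 0 = active idle.
POWER_W = {
    100: 3239, 90: 2765, 80: 2499, 70: 2289, 60: 2061,
    50: 1848, 40: 1671, 30: 1494, 20: 1322, 10: 1151, 0: 688,
}

peak = POWER_W[100]

for load in sorted(POWER_W, reverse=True):
    fraction = POWER_W[load] / peak
    # Assumption: an ideal energy-proportional system draws load% of peak power.
    print(f"{load:3d}% load: {fraction:6.1%} of peak power (ideal: {load}%)")

fraction_at_10 = POWER_W[10] / peak
fraction_idle = POWER_W[0] / peak
```

At 10% load the system still draws more than a third of its peak power, and more than a fifth when idle, which is the behaviour energy-proportional computing aims to remove.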
Conclusion
Cost-performance-power tradeoff: Architects and designers
are slowly winning the game.
Hierarchical layers of abstraction
Both in software and hardware
Most important example of such abstraction:
Hardware-software interface: instruction set architecture
Performance
Execution time (seconds/program)
= (instructions/program)*(clock cycles/instruction)*(seconds/clock cycle)
Most critical resource: Energy
No longer area
Paradigm shift to multicores
Other parameters: reliability, scalability
