
Comparing the Performance of the FLUENT application on

Quad-core Processors vs. Dual-core Processors


Dave Field, Don Mize
Hewlett-Packard, 3000 Waterview Parkway, Richardson, TX 75080, USA

Introduction
Configurations measured
Performance assumptions
Test results
Performance using partially-populated servers
Performance Conclusions
Power Consumption vs. Performance
Conclusions
Acknowledgments
For more information

Introduction
For many years, the advances in computer design have followed Moore's Law, which states that the
number of transistors on a single chip doubles at a roughly fixed interval. Recently, all of the major
developers of high-performance computers have adopted two architectural approaches to implement
Moore's Law: multi-core processor chips and larger caches on the processor chip.
A multi-core processor contains more than one CPU (also known as a core). To the operating system
and to application software, each core functions as an independent CPU.
At the same time that multi-core processors have become universal, processor power consumption has
become a severe design constraint. Consequently, as the number of cores on a processor increases,
the clock speed remains flat or declines.
These new design approaches have ended a long-term trend. Before the advent of multi-core
processors, we could expect an application to run faster, CPU for CPU, on each succeeding
generation of a processor. Now, application developers and users must perform additional software
engineering in order to achieve application performance improvement.
Application performance is measured in many ways. These measurements include the runtime of a
single job in serial or in parallel, multi-job throughput performance, performance per core or per
processor, and price/performance ratios of compute clusters and applications.
Early in 2007, the HP High Performance Computing Division (HPCD) launched its Multi-core
Optimization Program. The program's goal is to investigate and implement performance improvement
techniques for HPC applications on HP servers that use multi-core processors. This performance
analysis of FLUENT, from ANSYS, Inc., is a part of the HPCD program.
We studied the performance of the FLUENT application on dual-core processors vs. quad-core
processors. We assembled clusters of servers with these two processor types. We connected the
compute servers with the two most popular network interconnects: Gigabit Ethernet (GigE) and
InfiniBand (IB).
Terminology: In this document, "processor" describes the physical chip, and "cores" are the CPUs on
each processor.

Configurations measured
We chose Intel Xeon processors to demonstrate multi-core processor effects, since both a dual-core
Xeon processor and a quad-core Xeon processor exist with nearly identical functionality other than the
number of cores. We tested compute clusters of HP ProLiant BL460c blade servers using these Intel
Xeon processors. The specific configurations follow:
• BL460c, two dual-core Xeon 5160 3.0GHz processors, eight 1GB memory DIMMs; in the text, this
server is referred to as "2p4c" (2 processors with a total of 4 cores).
• BL460c, two quad-core Xeon 5355 2.66GHz processors, eight 2GB memory DIMMs; referred to as
"2p8c".
The blade enclosure contained both GigE and InfiniBand (IB) DDR (double data rate) switches.
ANSYS provided its scalable Cavity benchmark, which includes data sets in eight
different sizes: 100K, 250K, 500K, 1M, 2M, 4M, 8M, and 16M cells.
For each benchmark, we ran FLUENT at different levels of parallelism on the four configurations:
dual-core processors with GigE, dual-core processors with IB, quad-core processors with GigE, and
quad-core processors with IB. Our objective was to uncover trends that will assist customers in
selecting the best compute cluster configuration for various workloads.

Performance assumptions
One assumption about today's multi-core processors is that the dual-core processor runs
high-performance computing (HPC) applications faster than the quad-core processor, core for core.
There are two reasons for this assumption:
1) the amount of memory bandwidth per core is higher for the dual-core processor, and
2) the maximum clock speed of the dual-core processor is usually higher than that of the quad-core
processor.
When we ran these tests, we used the fastest clock speeds available for each processor: 3.0GHz on
dual-core and 2.66GHz on quad-core. These speeds will change over time, but for any one processor
architecture, we expect that the clock speed will decrease as the number of cores increases. One
obvious goal of our testing was to find conditions in which a cluster of quad-core-based servers
outperforms a cluster of dual-core-based servers.

Test results
Many of our tests confirmed our expectations. The tests using the IB network showed that dual-core-based
servers outperformed quad-core-based servers on a core-for-core basis. Also, FLUENT ran
faster using the IB network than using the GigE network.
However, we hypothesized that quad-core-based server clusters might have performance advantages
in clusters networked with GigE. In this configuration, each quad-core-based server is slower than its
dual-core counterpart, but only half the number of quad-core-based servers is required to achieve a
given number of CPUs. Parallel application performance depends on the relationship between
computation and communication, and the communication component is likely to be the performance
barrier on a GigE network. Using a smaller number of servers to achieve a given number of cores will
result in less communication on the GigE network.
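For example, a 16-way-parallel job spans four 2p4c servers but only two 2p8c servers; with fewer
servers, more neighboring partitions communicate through shared memory inside a server instead of
across the GigE network.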
The majority of HPC compute clusters use GigE networking, so this is an important area to study. In
fact, we demonstrated this advantage on many of the benchmarks: a single FLUENT job running in
parallel on a GigE-networked cluster achieved higher performance on the quad-core-based cluster
than on the dual-core-based cluster. Single-job performance is not always the most important metric
for a computing facility, but it can be critical to meeting project deadlines.

We ran a single job at different levels of parallelism, from 1-way (serial) to 64-way. For the reasons
above, the dual-core-based cluster was faster at low levels of parallelism. (We used all the available
cores in this test. Performance can be increased by partially populating, that is, leaving some cores on
each processor idle. This issue is analyzed later in the paper.)
One characteristic of all applications when run in parallel is that there is a level of parallelism that
provides the maximum performance. Past this point, at higher levels of parallelism, the performance
decreases. For any workload, it is valuable to know this maximum performance and to know the
shape of the performance curve. Given this information, the workload can be optimized for either
multi-job throughput performance or single-job performance.
Figure 1 is an example of the quad-core superiority for the 500K-cell Cavity benchmark on
GigE-networked configurations. The result is dramatic. Up to 16-way-parallel, the dual-core-based cluster
has the higher performance, but it reaches its maximum performance at 16-way-parallel, achieving a
maximum FLUENT performance of about 11,000.
However, the quad-core-based server continues to increase in performance up to 32-way-parallel,
achieving a maximum FLUENT performance of 14,500. The disadvantage of this approach is that
more license tokens are used to achieve these higher performance values. But if a project's schedule
requires rapid completion of a job, the quad-core-based server makes this performance possible.

Figure 1. FLUENT performance vs. number of cores used to run a single job on a GigE network:
2p8c vs. 2p4c GigE-networked clusters for the 500K-cell Cavity data set. (Chart: FLUENT
performance rating, bigger is better, scale 0 to 16,000, vs. number of cores from 16 to 64; series
BL460c/2p4c/3.0GHz/GigE and BL460c/2p8c/2.66GHz/GigE.)

This effect occurs for many small and medium-sized data sets, but not every benchmark showed this
result. Figure 2 shows the results for each of the seven Cavity benchmarks we ran. For each
benchmark, the maximum performance for the dual-core-based server cluster (2p4c servers) is
listed, along with the number of cores required to achieve this performance.
For the quad-core-based cluster (2p8c servers), the table shows the performance at this same number
of cores and also shows additional performance vs. number of cores up to the 2p8c maximum
performance. For example, in the 500K-cell data set: at 16-way-parallel, the 2p4c cluster outperforms
the 2p8c cluster. But the 2p8c cluster continues to increase in performance at 32-way-parallel. (Notice
that 32-way-parallel requires four 2p8c servers and that 16-way-parallel also requires four 2p4c
servers.)

Figure 2. Maximum performance vs. number of cores for 2p4c and 2p8c GigE clusters

Cavity - 100K cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
16          36000         29288
32          44883

Cavity - 250K cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
16          20093         18383
32          22588

Cavity - 500K cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
16          9818          11042
32          14521

Cavity - 1M cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
32          7053          5468

Cavity - 2M cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
32          3089          3355

Cavity - 4M cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
32          1175          2049

Cavity - 8M cells
Nbr cores   BL460c/2p8c   BL460c/2p4c
32          530           1082

Figure 3 shows the ratio of the maximum quad-core-based cluster performance versus the maximum
dual-core-based cluster performance. The number is > 1 if the quad-core-based cluster has the higher
maximum.

Figure 3. Ratio of maximum performance of 2p8c cluster vs. 2p4c cluster, using GigE network.
Max 2p8c Performance vs. Max 2p4c Ratio
Cavity - 100K cells    1.53
Cavity - 250K cells    1.23
Cavity - 500K cells    1.32
Cavity - 1M cells      1.29
Cavity - 2M cells      0.92
Cavity - 4M cells      0.57
Cavity - 8M cells      0.49
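The ratios in Figure 3 follow directly from the maxima in Figure 2. As a sanity check, the short
Python sketch below recomputes them from the transcribed Figure 2 values:

# Maximum FLUENT rating per cluster, transcribed from Figure 2
# (GigE network; 2p8c = quad-core servers, 2p4c = dual-core servers).
max_perf = {
    "100K": (44883, 29288),
    "250K": (22588, 18383),
    "500K": (14521, 11042),
    "1M":   (7053,  5468),
    "2M":   (3089,  3355),
    "4M":   (1175,  2049),
    "8M":   (530,   1082),
}

for cells, (p8c, p4c) in max_perf.items():
    print(f"Cavity - {cells} cells: {p8c / p4c:.2f}")  # matches Figure 3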

Performance using partially-populated servers


Since the Xeon quad-core processors cannot provide optimum scalability on FLUENT, it is useful to
measure the performance when some cores on the quad-core processor are idle.
If 2 of the 4 cores of each processor are used and two are idle, the FLUENT performance is
nearly identical to that of the dual-core processors. If 3 of the 4 cores of each processor
are used and one is idle, FLUENT delivers distinctive performance and price-to-performance results.
We measured performance of the Cavity data sets on three GigE-networked cluster configurations.
For each of the Cavity data sets, the maximum FLUENT performance was obtained running
32-way-parallel for the fully-populated 2p4c and 2p8c clusters, and running 24-way-parallel for the
¾-populated 2p8c cluster.
For 32-way-parallel jobs on fully-populated clusters, eight 2p4c servers and four 2p8c servers are
required. For 24-way-parallel jobs on the ¾-populated 2p8c cluster, four servers are also required. We
will name this configuration "2p6c" in the following analysis. In Figure 4, for each data set, the chart
is normalized to 1.0 for the fully-populated 2p8c cluster.
First, compare the fully-populated 2p4c and 2p8c clusters. For smaller data sets, the 2p8c cluster
outperforms the 2p4c cluster on a core-for-core basis, since for small data sets the ratio of
computation to communication is low. On GigE-networked clusters, 2p8c clusters outperform in this
situation because fewer servers are used for a given parallel level, which places less demand on the
network. As the data set size increases, the 2p4c cluster gains considerably in performance relative
to the 2p8c cluster.
Next, compare the partially-populated 2p6c cluster. This configuration provides more memory
bandwidth per core than the fully-populated 2p8c configuration, and the 2p6c per-core performance
is equal to or better than the fully-populated 2p8c configuration. The per-core performance of 2p6c
competes well with the 2p4c cluster for all but the largest data set.

Figure 4. Comparing FLUENT performance on fully-populated vs. partially-populated GigE clusters.
(Chart: relative performance, bigger is better, normalized to 1.0 for the fully-populated 2p8c cluster,
for Cavity data sets from 250K to 16M cells; series: 2p4c 32-way-parallel, 2p8c 32-way-parallel, and
2p6c 24-way-parallel using 3 of 4 cores on each processor.)

Two components determine the price-to-performance ratio of a compute cluster: the compute cluster
itself and the application licenses.
First, consider the price-to-performance ratio of the compute cluster. The 32-core list price of the 2p4c
cluster is approximately 1.5 times that of the 2p8c cluster. Therefore, whenever the 2p8c cluster
delivers at least 1/1.5 (about two-thirds) of the 2p4c FLUENT performance, the 2p8c cluster has the
better price-to-performance ratio.
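Reading Figure 3 against this threshold: the 2p8c cluster wins on hardware price-to-performance
whenever its ratio exceeds roughly 0.67; by that measure, every Cavity size shown except the 4M-cell
(0.57) and 8M-cell (0.49) data sets favors the 2p8c cluster.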
Next, consider the FLUENT license cost. Since the cost of a FLUENT job increases with the level of
parallelism, it is useful to compute FLUENT performance on a per-core basis (performance divided by
the level of parallelism), as shown in Figure 5. The partially-populated 2p6c cluster outperforms both
fully-populated configurations for all but the largest data set.
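As an illustration of the per-core metric, using the Figure 2 maxima for the 500K-cell data set: the
2p8c cluster's rating of 14521 at 32-way-parallel is 14521 / 32, or about 454 per core, while the
2p4c cluster's rating of 11042 at 16-way-parallel is 11042 / 16, or about 690 per core.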

Figure 5. Comparing FLUENT performance per core (performance divided by the level of parallelism)
on fully-populated vs. partially-populated GigE clusters. (Chart: relative performance per core, bigger
is better, for Cavity data sets from 250K to 16M cells; series: 2p4c 32-way-parallel, 2p8c
32-way-parallel, and 2p6c 24-way-parallel using 3 of 4 cores on each processor.)

Performance Conclusions
Based on the above data, we can formulate several conclusions.
• For large jobs that can scale out to a large number of processors, servers using dual-core
processors are the fastest. The large Cavity benchmarks demonstrate this.
• For smaller jobs, the fastest configuration is the server using quad-core processors with all cores
in use. The Cavity benchmarks up to one million cells demonstrate this.
• A large set of mid-sized jobs performs best on servers using quad-core processors with 3 of the 4
cores per processor in use. This class of job covers the Cavity benchmarks from about one million
cells to four million cells.
Keep in mind, though, that with different data sets and types of analysis, your mileage may vary.

Power Consumption vs. Performance


Since power consumption is a primary reason for declines in clock speeds on multi-core processors, it
is useful to measure the power used to run applications. We examined several power-related
characteristics.
First, we measured the power for five of the FL5xx FLUENT standard performance benchmarks, run at
different levels of parallelism. The goal was to determine whether the specific benchmark determined
the amount of power used. As shown in Figure 6, the amount of power used by a given configuration
(2p4c or 2p8c) depended on the level of parallelism, but it was roughly the same across the various
data sets. As a result, it is possible to determine a FLUENT power factor for each cluster configuration
that predicts power usage versus the level of parallelism, independent of the data set. (The
maximum difference between any one data set and the average of all data sets was ±4%.)
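The paper does not give the functional form of this power factor. A minimal sketch of how such a
predictor might look, assuming average power grows roughly linearly with the number of active
cores; the function name and constants below are hypothetical, not HP's measurements:

# Hypothetical sketch: predict FLUENT job power from a per-configuration
# power factor. The constants are illustrative, not measured values.
def predicted_power(n_cores: int, base_watts: float, watts_per_core: float) -> float:
    """Estimate average power (W) for an n-way-parallel FLUENT job."""
    return base_watts + watts_per_core * n_cores

# Example: a 2p8c configuration assumed to idle at 250 W and add 20 W per core.
print(predicted_power(16, base_watts=250.0, watts_per_core=20.0))  # 570.0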

Figure 6. Power used (arbitrary units) by FLUENT standard benchmarks. (Chart: average power,
scale 0 to 900, vs. number of cores from 1 to 16, for the FL5M1, FL5M2, FL5M3, FL5L1, and FL5L2
benchmarks on 2p4c and 2p8c configurations; for a given configuration, power tracks the level of
parallelism and is nearly identical across data sets.)

If power utilization is important to a customer's computing facility, then power measurements can be
used to make two decisions (shown in Figure 7):
• to determine the most power-efficient server configuration, and
• to determine the most power-efficient job workload.
If FLUENT performance is divided by the average power used during the job, the resulting
performance-to-power ratio indicates which configuration consumes the least power for the work
accomplished over the job duration.
In the figure, two benchmarks were run on GigE clusters of 2p4c and 2p8c servers. For each
benchmark, it is possible to determine which server configuration provides the best
performance-to-power ratio. Also, for a given configuration, the figure shows the optimum levels of
parallelism. Because the FLUENT rating is inversely proportional to runtime, the level of parallelism
with the highest bar uses the smallest amount of electrical energy (runtime × average power) and
results in the smallest electric bill.
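To make the two metrics concrete, a small sketch with assumed sample values (the paper's figures
use arbitrary units, and no absolute measurements are published here):

# Assumed sample values for illustration only.
rating = 5000.0       # FLUENT performance rating for the job (assumed)
runtime_s = 3600.0    # wall-clock runtime in seconds (assumed)
avg_power_w = 650.0   # average power draw in watts (assumed)

perf_per_power = rating / avg_power_w          # the ratio plotted in Figure 7
energy_kwh = runtime_s * avg_power_w / 3.6e6   # runtime x average power
print(f"{perf_per_power:.2f} rating/W, {energy_kwh:.2f} kWh")  # 7.69, 0.65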

Figure 7. FLUENT performance/power-utilization for parallel benchmarks on 2p4c and 2p8c clusters
using GigE (arbitrary units); higher is better. (Chart: performance rating divided by average power,
scale 0.00 to 10.00, for the FL5M1 and FL5L1 benchmarks on BL460c 2p4c and 2p8c clusters, at
levels of parallelism from serial to 32-way.)

Conclusions
Power measurements and performance-per-power-utilization results can be useful in selecting
power-efficient cluster configurations and in determining power-efficient workloads. Since power
consumption does not depend greatly on the FLUENT data set, it is possible to make these decisions
and predict power usage for a wide range of FLUENT workloads.

Acknowledgments
The idea for this project originated in HP's High Performance Computing Division. It is one of the
results of HP's Multi-Core Optimization Program, which seeks ways to improve total application
performance and per-core application performance on servers using multi-core processors.

For more information


www.hp.com/go/hpc
www.fluent.com

© 2007 Hewlett-Packard Development Company, L.P. The information contained
herein is subject to change without notice. The only warranties for HP products and
services are set forth in the express warranty statements accompanying such
products and services. Nothing herein should be construed as constituting an
additional warranty. HP shall not be liable for technical or editorial errors or
omissions contained herein.
AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc. Intel and
Xeon are registered trademarks of Intel Corporation or its subsidiaries in the United
States and other countries. Itanium is a trademark or registered trademark of Intel
Corporation or its subsidiaries in the United States and other countries.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
Linux is a U.S. registered trademark of Linus Torvalds.
4AA1-6093ENW, November 2007
