Multiprogramming Performance of the Pentium 4 with
Hyper-Threading
James R. Bulpin and Ian A. Pratt

University of Cambridge Computer Laboratory
J J Thomson Avenue, Cambridge, UK, CB3 0FD.
Tel: +44 1223 331859.
[email protected]


Abstract

Simultaneous multithreading (SMT) is a very fine grained form of hardware
multithreading that allows simultaneous execution of more than one thread
without the notion of an internal context switch. The fine grained sharing
of processor resources means that threads can impact each others'
performance.

Tuck and Tullsen first published measurements of the performance of the SMT
Pentium 4 processor with Hyper-Threading [12]. Of particular interest is
their evaluation of the multiprogrammed performance of the processor by
concurrently running pairs of single-threaded benchmarks. In this paper we
present experiments and results obtained independently that confirm their
observations. We extend the measurements to consider the mutual fairness of
simultaneously executing threads (an area hinted at but not covered in
detail by Tuck and Tullsen) and compare the multiprogramming performance of
pairs of benchmarks running on the Hyper-Threaded SMT system and on a
comparable SMP system.

We show that there can be considerable bias in the performance of
simultaneously executing pairs and investigate the reasons for this. We
show that the performance gap between SMP and Hyper-Threaded SMT for
multiprogrammed workloads is often lower than might be expected, an
interesting result given the obvious economic and energy consumption
advantages of the latter.

James Bulpin is funded by a CASE award from Marconi Corporation plc. and
EPSRC

1 Introduction

Intel Corporation's "Hyper-Threading" technology [6] introduced into the
Pentium 4 [3] line of processors is the first commercial implementation of
simultaneous multithreading (SMT). SMT is a form of hardware multithreading
building on dynamic issue superscalar processor cores [15, 14, 1, 5]. The
main advantage of SMT is its ability to better utilise processor resources
and to hide memory hierarchy latency by being able to provide more
independent work to keep the processor busy. Other architectures for
simultaneous multithreading and hardware multithreading in general are
described elsewhere [16].

Hyper-Threading currently supports two heavy weight threads (processes) per
processor, presenting the abstraction of two independent logical
processors. The physical processor contains a mixture of duplicated
(per-thread) resources such as the instruction queue; shared resources
tagged by thread number such as the DTLB and trace cache; and dynamically
shared resources such as the execution units. The resource partitioning is
summarised in table 1. The scheduling of instructions to execution units is
process independent although there are limits on how many instructions each
process can have queued to try to maintain fairness.

Whilst the logical processors are functionally independent, contention for
resources will affect the progress of the processes. Compute-bound
processes will suffer contention for execution units while processes making
more use of memory will contend for use of the cache with the possible
result of increased capacity and conflict misses. With cooperating
processes the sharing of the cache may be useful but for two arbitrary
processes the contention may have a
negative effect.

                   Duplicated             Shared                  Tagged/Partitioned
Fetch              ITLB                   Microcode ROM           Trace cache
                   Streaming buffers
Branch prediction  Return stack buffer                            Global history array
                   Branch history buffer
Decode             State                  Logic                   uOp queue (partitioned)
Execute            Register rename        Instruction schedulers  Retirement
                                                                  Reorder buffer
                                                                  (up to 50% use per thread)
Memory                                    Caches                  DTLB

Table 1: Resource division on Hyper-Threaded P4 processors.
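
The three kinds of resource sharing described in the introduction can be written down as a small lookup. The following is a minimal illustrative sketch, not part of the original measurements: the names `Sharing`, `RESOURCES` and `contended` are hypothetical, only the resources named in the introduction's prose are listed, and which resources a given workload uses is an assumption made for the example.

```python
from enum import Enum


class Sharing(Enum):
    """How a processor resource is divided between the two logical processors."""
    DUPLICATED = "duplicated"  # one private copy per thread
    TAGGED = "tagged"          # shared structure, entries tagged by thread number
    DYNAMIC = "dynamic"        # dynamically shared between threads


# Only the examples named in the introduction are listed (hypothetical table).
RESOURCES = {
    "instruction queue": Sharing.DUPLICATED,
    "DTLB": Sharing.TAGGED,
    "trace cache": Sharing.TAGGED,
    "execution units": Sharing.DYNAMIC,
}


def contended(used_a, used_b):
    """Return the resources both threads use that are not duplicated per thread."""
    common = set(used_a) & set(used_b)
    return sorted(r for r in common if RESOURCES[r] is not Sharing.DUPLICATED)


# Assumed resource usage for two illustrative workloads.
compute_bound = ["instruction queue", "execution units"]
memory_bound = ["instruction queue", "DTLB", "execution units"]

print(contended(compute_bound, memory_bound))
print(contended(memory_bound, memory_bound))
```

Under these assumptions a compute-bound and a memory-bound process overlap only on the execution units, while two memory-bound processes also overlap on the DTLB. The sketch says nothing about how much each overlap costs; that is what the measurements address.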