Simultaneous multithreading has already had an impact on both the academic
and commercial communities. The project has produced numerous papers,
most of which have been published in journals or in the top, journal-quality
architecture conferences; one of them was the most recent paper
selected for the 25th Anniversary Anthology of the International Symposium
on Computer Architecture, a competition in which the criterion for
acceptance was impact.
The SMT project at the University of Washington has also spawned
other university projects in simultaneous multithreading. Lastly, several
U.S. chip manufacturers (Intel, Sun, and Compaq, when it still supported the
Alpha microprocessor line) are currently designing SMT processors for
future generations of their microprocessors. In addition, Clearwater Networks
is building an SMT network processor.
Our current SMT research reexamines operating system design in the face of
several architectural features that are unique to SMT --
its cycle-by-cycle sharing of hardware resources among threads and its
hardware support for lightweight synchronization --
and extremely demanding request-driven parallel workloads, such as web
servers.
The research sits squarely between architecture and operating systems,
examining (1) the design and performance of SMT processors with respect to
their support for OS needs, and (2) the structure of operating systems in
light of the capabilities of multithreaded processors.
Overview and Impact
Simultaneous multithreading is a processor design that combines
hardware
multithreading with superscalar processor technology to allow multiple
threads to issue instructions each cycle. Unlike other hardware
multithreaded architectures (such as the Tera MTA), in which only a single
hardware context (i.e., thread) is active on any given cycle, SMT permits
all thread contexts to simultaneously compete for and share processor
resources. Unlike conventional superscalar processors, which suffer from
a lack of per-thread instruction-level parallelism, simultaneous
multithreading uses multiple threads to compensate for low single-thread
ILP. The performance consequence is significantly higher instruction
throughput and program speedups on a variety of workloads that include
commercial databases, web servers and scientific applications in both
multiprogrammed and parallel environments.
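The throughput argument can be made concrete with a small, self-contained sketch. The toy issue model below is purely illustrative (the issue width, ILP distribution, and cycle count are assumptions, not figures from our simulator); it shows how threads with modest individual ILP can together fill issue slots that a single thread would leave empty.

import random

ISSUE_WIDTH = 8          # issue slots per cycle (assumed, for illustration)
CYCLES = 10_000

def ready_instructions(avg_ilp):
    """Instructions a single thread has ready this cycle (toy model)."""
    return min(ISSUE_WIDTH, max(0, round(random.gauss(avg_ilp, 1.0))))

def throughput(num_threads, avg_ilp):
    """Average instructions issued per cycle when num_threads share the core."""
    issued = 0
    for _ in range(CYCLES):
        slots = ISSUE_WIDTH
        for _ in range(num_threads):
            take = min(slots, ready_instructions(avg_ilp))
            issued += take
            slots -= take
            if slots == 0:
                break
        # any leftover slots go unused -- the waste that SMT attacks
    return issued / CYCLES

if __name__ == "__main__":
    random.seed(0)
    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): {throughput(n, avg_ilp=2.5):.2f} instructions/cycle")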
People
Faculty
Graduate students
Undergraduate students
Alumni
Publications
Dean Tullsen,
Susan Eggers, and
Henry Levy,
Proceedings of the 22nd Annual International Symposium on Computer
Architecture, June 1995, pages 392-403.
This paper demonstrated the feasibility of simultaneous multithreading with
simulation-based speedups on several SMT machine models.
It was selected to appear in the 25th Anniversary Anthology of the
International Symposium
on Computer Architecture.
Dean Tullsen,
Susan Eggers,
Joel Emer,
Henry Levy,
Jack Lo,
and Rebecca Stamm
Proceedings of the 23rd Annual International Symposium on Computer
Architecture, May 1996, pages 191-202.
We designed SMT's microarchitecture, including a novel instruction fetch unit
that fetched instructions from the two "most profitable" threads each cycle.
In designing the microarchitecture, we met all three of our original design
goals: (1) that SMT exhibit increased throughput when executing multiple
threads; (2) that SMT not degrade single-thread performance; and (3) that
SMT's implementation be a straightforward extension of current wide-issue,
out-of-order processor technology. The latter two criteria were necessary
to ensure a smooth transition to the commercial world.
This paper was selected for Readings in Computer Architecture, ed.
M.D.
Hill, N.P. Jouppi, and G.S. Sohi, Morgan Kaufmann, 1999.
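As a rough illustration of the fetch heuristic (a sketch only, not the simulator code; the field names and the two-thread fetch width are assumptions made for readability): the paper's best-performing policy, ICOUNT, favors the threads with the fewest instructions waiting in the pre-issue stages of the pipeline, on the grounds that they are making the fastest progress and are least likely to clog the instruction queues.

from dataclasses import dataclass

@dataclass
class ThreadContext:
    tid: int
    unissued_count: int        # instructions in decode/rename/queue stages
    stalled: bool = False      # e.g., blocked on an instruction cache miss

def select_fetch_threads(contexts, num_to_fetch=2):
    """Pick the threads to fetch from this cycle, ICOUNT-style:
    unstalled threads with the fewest unissued instructions win."""
    candidates = [c for c in contexts if not c.stalled]
    candidates.sort(key=lambda c: c.unissued_count)
    return [c.tid for c in candidates[:num_to_fetch]]

if __name__ == "__main__":
    ctxs = [ThreadContext(0, 14), ThreadContext(1, 3),
            ThreadContext(2, 9, stalled=True), ThreadContext(3, 5)]
    print(select_fetch_threads(ctxs))   # -> [1, 3]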
Jack Lo,
Susan Eggers,
Henry Levy, and
Dean Tullsen
Proceedings of the First SUIF Compiler Workshop, January 1996,
pages 146-7.
Jack Lo,
Susan Eggers,
Joel Emer,
Henry Levy,
Rebecca Stamm, and
Dean Tullsen
ACM Transactions on Computer Systems, August 1997, pages 322-354.
Single-chip multiprocessors (CMP, or 2 to 4 superscalar processors on a
single chip) are another emerging processor design that will likely compete
with simultaneous multithreading in the commercial market 3 to 4 years
from now. Our experiments indicate that SMT processors can outperform
CMPs when executing coarse-grain parallel programs (the natural workload
for CMPs) by an average of 60%. SMT has the performance advantage, because
it dynamically allocates hardware resources to whatever threads need them
at the time; the CMP, on the other hand, statically partitions these same
hardware resources for all threads, for all time. This paper carefully
quantifies CMP's loss of performance due to the static partitioning, on
a resource-by-resource basis. In the end, we show that even giving each
processor on the CMP the hardware resources of an entire SMT processor
does not overcome the advantage that SMT gains from dynamic partitioning.
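The intuition behind the dynamic-versus-static result can be shown with a few lines of arithmetic (a toy model with made-up demand figures, not the paper's methodology): when demand is bursty, a statically partitioned resource strands capacity in the partitions of idle threads, while a shared pool lets the busy threads soak it up.

def static_use(demands, total_slots):
    """Each thread may use only its fixed 1/N share of the slots (CMP-style)."""
    share = total_slots // len(demands)
    return sum(min(d, share) for d in demands)

def dynamic_use(demands, total_slots):
    """All threads draw from one shared pool (SMT-style sharing)."""
    return min(sum(demands), total_slots)

if __name__ == "__main__":
    # Four threads competing for 8 issue slots in one cycle; demand is bursty.
    demands = [6, 0, 1, 0]
    print("static :", static_use(demands, 8))   # 3 slots used
    print("dynamic:", dynamic_use(demands, 8))  # 7 slots used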
Susan Eggers,
Joel Emer,
Henry Levy,
Jack Lo,
Rebecca Stamm, and
Dean Tullsen
IEEE Micro, September/October 1997, pages 12-18.
This paper evaluates SMT with the experience of two years of design and
performance analysis under our belts. It describes the microarchitecture,
including the extended pipeline for accessing the multi-context register
file and the custom instruction fetch unit, and presents our most
current performance studies at the time, including a comparison to
wide-issue superscalars, traditional multithreading, and chip multiprocessors,
executing both a parallel and a multiprogrammed workload composed of the
SPEC95 and SPLASH-2 applications. The paper was part of the IEEE Computer
special series on "how to use a billion transistors".
Jack Lo,
Susan Eggers,
Henry Levy,
Sujay Parekh, and
Dean Tullsen
Proceedings of the 30th Annual International Symposium on
Microarchitecture,
December 1997, pages 114-124.
Simultaneous multithreading changes several fundamental architectural
assumptions on which many machine-dependent compiler optimizations are based,
such as the extent to which threads share the cache hierarchy and the
importance of hiding latencies via code scheduling. It therefore stands to
reason that compiler optimizations that rely on these assumptions may need
to be applied differently on an SMT. We validated this hypothesis for three
optimizations that are commonly used and normally very profitable:
loop distribution, loop tiling, and software speculation. We found that,
when compiling programs for SMT, these optimizations either had to be
applied with radically different policies than are currently used or not
applied at all.
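Loop tiling illustrates the point. A tile is normally sized so that one thread's working set fits in the cache, but on an SMT the cache is shared cycle-by-cycle by every running thread, so a tile sized for the whole cache can thrash. The sketch below is only an illustration of that reasoning (the cache size, the three-tiles-per-thread footprint estimate, and the division of the cache budget by the number of co-scheduled threads are assumptions, not the policies evaluated in the paper).

import math

def tile_size(cache_bytes, elem_bytes, threads_sharing):
    """Pick a square tile so each thread's tiles fit in its share of the cache
    (rough heuristic: ~3 tiles live per thread in a tiled matrix multiply)."""
    budget = cache_bytes // threads_sharing
    return max(8, int(math.sqrt(budget / (3 * elem_bytes))))

def tiled_matmul(A, B, C, n, T):
    """C += A * B with T x T tiling (plain Python for clarity, not speed)."""
    for ii in range(0, n, T):
        for kk in range(0, n, T):
            for jj in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += a * B[k][j]

if __name__ == "__main__":
    # 64 KB of cache and 8-byte elements: ~52x52 tiles running alone,
    # but only ~26x26 when four threads share the same cache.
    print(tile_size(64 * 1024, 8, threads_sharing=1))
    print(tile_size(64 * 1024, 8, threads_sharing=4))
    n = 4
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    tiled_matmul(A, B, C, n, T=2)
    print(C[0][0])   # 8.0 = sum over k of 1.0 * 2.0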
Jack Lo,
Luiz Barroso,
Susan Eggers,
Kourosh Gharachorloo,
Henry Levy, and
Sujay Parekh
Proceedings of the 25th Annual International Symposium on Computer
Architecture (ISCA'98), June 1998, pages 39-50.
In addition to the multiprogramming and parallel workloads, we also studied
SMT executing a commercial database workload (Oracle) to gauge its performance
as a database server. Commercial workloads are a challenge for all computers,
because they have notoriously poor memory subsystem performance. We devised
operating-system- and application-level mechanisms that improve SMT's
performance on databases by reducing inter-thread conflicts in the memory
hierarchy. The result was a 3-fold improvement over a wide-issue superscalar
when executing Oracle transactions.
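The flavor of inter-thread conflict being attacked can be seen with a little cache arithmetic. The toy below is a generic illustration of the idea, not the specific mechanisms described in the paper (the cache geometry, region size, and per-thread offsets are all assumptions): when every thread lays out the same hot data at the same cache alignment, all threads fight over the same cache sets, while a small per-thread offset spreads them across the cache.

LINE = 64          # cache line size in bytes (assumed)
SETS = 512         # number of sets in a shared cache (assumed)

def sets_touched(base_addrs, region_bytes):
    """Distinct cache sets touched when each thread scans region_bytes
    starting at its own base address."""
    touched = set()
    for base in base_addrs:
        for addr in range(base, base + region_bytes, LINE):
            touched.add((addr // LINE) % SETS)
    return len(touched)

if __name__ == "__main__":
    n_threads, region = 8, 4096
    # Identically aligned regions (1 MB apart) all map to the same sets.
    aligned = [t * 1024 * 1024 for t in range(n_threads)]
    # Adding a per-thread offset staggers them across the cache.
    staggered = [t * 1024 * 1024 + t * region for t in range(n_threads)]
    print("same alignment:", sets_touched(aligned, region), "sets")
    print("with offsets  :", sets_touched(staggered, region), "sets")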
Dean Tullsen,
Jack Lo,
Susan Eggers, and
Henry Levy
Proceedings of the 5th International Symposium on High Performance
Computer Architecture,
January 1999, pages 54-58.
The efficiency of a processor's synchronization mechanism determines the
granularity of parallelism of programs that run on it. Synchronization
on conventional multiprocessors is fairly costly, because communication
among the parallel threads must take place via memory. Consequently,
applications must be parallelized on a fairly coarse-grain level. Because
parallel threads all reside on the same SMT processor, synchronizing them can
be done locally (on the processor) rather than through memory. SMT
synchronization is sufficiently lightweight that it both improves the
synchronization performance of current, coarse-grain parallel programs
and permits fine-grain parallelization of new codes that cannot be
parallelized with current synchronization mechanisms.
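The granularity claim is, at bottom, simple arithmetic: if a synchronization operation costs S cycles and each parallel task performs W cycles of useful work, the overhead fraction is S / (W + S), so the smallest profitable task size grows directly with synchronization cost. The sketch below uses made-up cycle counts purely for illustration.

def min_task_cycles(sync_cost_cycles, max_overhead=0.10):
    """Smallest per-task work W (in cycles) that keeps synchronization
    overhead at or below max_overhead:  S / (W + S) <= max_overhead."""
    return sync_cost_cycles * (1.0 - max_overhead) / max_overhead

if __name__ == "__main__":
    # Illustrative costs: hundreds of cycles for memory-based synchronization
    # on a multiprocessor versus tens of cycles for an on-chip SMT primitive.
    for name, cost in [("memory-based lock", 200), ("SMT in-processor sync", 20)]:
        print(f"{name:>22}: tasks need >= {min_task_cycles(cost):.0f} cycles of work")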
Jack Lo,
Sujay Parekh,
Susan Eggers,
Henry Levy, and
Dean Tullsen
IEEE Transactions on Parallel and Distributed Systems,
September 1999, pages 922-933.
We designed operating-system and compiler-directed architectural techniques
to deallocate SMT registers earlier than can be done with current register
renaming hardware. The mechanisms free idle hardware contexts (those whose
thread has terminated) and free registers in active contexts after their last
use. The performance consequence is either a reduction in register file size
(useful if register file access time determines the processor cycle time) or
an increase in performance for a given file size. The compiler-based
techniques have applicability well beyond SMT processors -- they also improve
performance on any out-of-order processor.
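The compiler's half of the idea can be sketched as a simple last-use analysis: find the final read of each value and place a hint there so the hardware can reclaim the register immediately, instead of waiting for the renaming hardware's usual trigger. The "free_reg" hint and the tiny instruction format below are hypothetical, chosen only to illustrate the analysis.

def insert_free_hints(instructions):
    """instructions: list of (dest, [sources]) tuples over virtual registers.
    Returns the stream with a hypothetical ("free_reg", [r]) hint placed
    immediately after each register's last use."""
    last_use = {}
    for idx, (_, srcs) in enumerate(instructions):
        for r in srcs:
            last_use[r] = idx

    out = []
    for idx, inst in enumerate(instructions):
        out.append(inst)
        for r, last in last_use.items():
            if last == idx:
                out.append(("free_reg", [r]))
    return out

if __name__ == "__main__":
    code = [("r1", ["a", "b"]),    # last uses of a and b
            ("r2", ["r1", "c"]),   # last use of c
            ("r3", ["r2", "r1"])]  # last uses of r1 and r2
    for inst in insert_free_hints(code):
        print(inst)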
Patrick Crowley,
Marc E. Fiuczynski,
Jean-Loup Baer, and
Brian N. Bershad.
Proceedings of the 2000 International Conference on Supercomputing,
May 2000.
This paper characterizes current and future network processor application
workloads and concludes that SMT is better suited to a network node
environment than are aggressive out-of-order superscalars, fine-grain
multithreaded processors, and chip multiprocessors (CMPs).
Josh Redstone,
Susan Eggers, and
Henry Levy.
Proceedings of the 9th International Conference on Architectural Support
for Programming Languages and Operating Systems,
November 2000.
This paper presents our first analysis of operating system execution on a
simultaneous multithreaded processor. To carry out this study, we modified
the Digital Unix 4.0 operating system to run on a simulated SMT CPU based
on a Compaq Alpha processor. We executed this environment by integrating
our SMT Alpha instruction set simulator into the SimOS machine simulator.
As our principal workload, we executed the Apache Web server running on
an 8-context SMT under Digital Unix. Our results demonstrate the
microarchitectural impact of an OS-intensive workload on a simultaneous
multithreaded processor and provide insight into the operating system
demands of the Apache Web server.
Sujay Parekh,
Susan Eggers, and
Henry Levy.
University of Washington Technical Report, 2000.
This paper examines thread-sensitive scheduling for SMT processors. When more
threads exist than hardware execution contexts, the operating system is
responsible for selecting which threads to execute at any instant, inherently
deciding which threads will compete for resources. Thread-sensitive scheduling
uses thread-behavior feedback to choose the best set of threads to execute
together, in order to maximize processor throughput. We introduce several
thread-sensitive scheduling schemes and compare them to traditional oblivious
schemes, such as round-robin. Our measurements show how these scheduling
algorithms impact performance and the utilization of low-level hardware
resources. We also demonstrate how thread-sensitive scheduling algorithms can
be tuned to trade off performance and fairness. For the workloads we measured,
we show that an IPC-based thread-sensitive scheduling algorithm can achieve
speedups over oblivious schemes of 7% to 15%, with minimal hardware costs.
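A minimal sketch of the IPC-based scheme, under simplifying assumptions (per-thread IPC measured in earlier quanta is reused as the prediction, and the combined throughput of a candidate set is approximated as the sum of its members' IPCs, which ignores interference):

from itertools import combinations

def round_robin(ready, contexts, start):
    """Oblivious baseline: rotate through the ready list."""
    return [ready[(start + i) % len(ready)] for i in range(contexts)]

def ipc_sensitive(ready, contexts, observed_ipc):
    """Thread-sensitive choice: pick the runnable set with the highest
    predicted combined IPC (here, simply the sum of per-thread feedback)."""
    best = max(combinations(ready, contexts),
               key=lambda group: sum(observed_ipc[t] for t in group))
    return list(best)

if __name__ == "__main__":
    ready = ["gcc", "vortex", "swim", "mgrid", "apache"]
    ipc = {"gcc": 1.1, "vortex": 1.4, "swim": 0.6, "mgrid": 0.7, "apache": 1.2}
    print("round-robin  :", round_robin(ready, 4, start=0))
    print("IPC-sensitive:", ipc_sensitive(ready, 4, ipc))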
Josh Redstone,
Susan Eggers, and
Henry Levy.
Proceedings of the International Conference on High-Performance Computer
Architecture,
February 2003, pages 19-30.
This paper presents the mini-thread architectural model for increasing
thread-level parallelism on SMT processors, particularly small-scale
implementations which, because of their size, can be thread-starved. It also
empirically demonstrates the performance trade-off for one implementation of
mini-threads, in which the architectural register file is partitioned
among all executing threads in a hardware context. We show that the benefits
of the additional TLP far outweigh the cost of the additional spill code
generated because each thread has fewer architectural registers available
to it.
Luke K. McDowell,
Susan Eggers, and
Steven D. Gribble.
Symposium on Principles and Practice of Parallel Programming,
June 2003.
This paper evaluates how SMT's hardware affects traditional support for server
software, in particular, memory allocation and synchronization, for three
different server models. The results demonstrate how a few simple changes to the
run-time libraries can dramatically boost multi-threaded server performance on
SMT, without requiring modifications to the applications themselves.
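One example of the kind of run-time library change in question (a generic illustration in the same spirit, not the paper's code) is replacing a single-lock heap with per-thread free lists, so allocation-heavy server threads stop serializing on one shared lock and bouncing its cache lines between contexts.

import threading

class PerThreadPool:
    """Toy allocator: each thread recycles blocks from its own free list and
    falls back to a lock-protected shared pool only when that list is empty."""
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.local = threading.local()        # per-thread free list lives here
        self.shared = []                      # overflow pool
        self.shared_lock = threading.Lock()

    def alloc(self):
        free = getattr(self.local, "free", None)
        if free:
            return free.pop()                 # common case: no lock taken
        with self.shared_lock:
            if self.shared:
                return self.shared.pop()
        return bytearray(self.block_size)

    def free(self, block):
        if not hasattr(self.local, "free"):
            self.local.free = []
        self.local.free.append(block)         # returned to this thread's list

if __name__ == "__main__":
    pool = PerThreadPool()
    block = pool.alloc()
    pool.free(block)
    print(pool.alloc() is block)              # True: reuse without the lock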
Steve Swanson,
Luke McDowell,
Michael Swift,
Susan Eggers, and
Henry Levy.
Submitted for publication.
Jack Lo's thesis, 1998.
Josh Redstone's thesis, 2002.
Funding
Our current SMT research is funded by an NSF ITR grant
CCR-0085670.
Past SMT research was supported primarily by NSF grant MIP-9632977,
with contributions from NSF grants CCR-9200832 and CCR-9632769,
DARPA grant F30602-97-2-0226,
ONR grants N00014-92-J-1395 and N00014-94-1-1136, and
the Washington Technology Center.
Industrial sponsors included Compaq Computer Corp., which donated both
workstations for simulations and the source to the Multiflow compiler,
and International Business Machines, Inc., which provided a Faculty
Partnership Award.
Commercial Machines
MemoryLogix has announced an SMT
processor for mobile devices.
Related Projects
Other institutions have also done research in simultaneous multithreading.
These include:

We have undoubtedly omitted groups doing SMT research.
If you would like your institution listed here or can supply us with a URL,
please email the contact below.
This page is maintained by Susan Eggers
[email protected]