Simultaneous multithreading has already had an impact on both the academic
and commercial communities. The project has produced numerous papers,
most of which have been published in journals or in the top, journal-quality
architecture conferences; one of them was the most recent paper
selected for the 25th Anniversary Anthology of the International Symposium
on Computer Architecture, a competition in which the criterion for
acceptance was impact.
The SMT project at the University of Washington has also spawned
other university projects in simultaneous multithreading. Lastly, several
U.S. chip manufacturers (Intel, Sun, and Compaq, when it still supported the
Alpha microprocessor line) are currently designing SMT processors for
future generations of their microprocessors. In addition, Clearwater Networks
is building an SMT network processor.
Our current SMT research reexamines operating system design in the face of
several architectural features that are unique to SMT --
its cycle-by-cycle sharing of hardware resources among threads and its
hardware support for lightweight synchronization --
and extremely demanding request-driven parallel workloads, such as web
servers.
The research sits squarely between architecture and operating systems,
examining (1) the design and performance of SMT processors with respect to
their support for OS needs, and (2) the structure of operating systems in
light of the capabilities of multithreaded processors.
Overview and Impact
Simultaneous multithreading is a processor design that combines
hardware
multithreading with superscalar processor technology to allow multiple
threads to issue instructions each cycle. Unlike other hardware
multithreaded architectures (such as the Tera MTA), in which only a single
hardware context (i.e., thread) is active on any given cycle, SMT permits
all thread contexts to simultaneously compete for and share processor
resources. Unlike conventional superscalar processors, which suffer from
a lack of per-thread instruction-level parallelism, simultaneous
multithreading uses multiple threads to compensate for low single-thread
ILP. The performance consequence is significantly higher instruction
throughput and program speedups on a variety of workloads that include
commercial databases, web servers and scientific applications in both
multiprogrammed and parallel environments.
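The throughput argument can be made concrete with a small, self-contained sketch. The toy issue model below is purely illustrative (the issue width, ILP distribution, and cycle count are assumptions, not figures from our simulator); it shows how threads with modest individual ILP can together fill issue slots that a single thread would leave empty.

import random

ISSUE_WIDTH = 8          # issue slots per cycle (assumed, for illustration)
CYCLES = 10_000

def ready_instructions(avg_ilp):
    """Instructions a single thread has ready this cycle (toy model)."""
    return min(ISSUE_WIDTH, max(0, round(random.gauss(avg_ilp, 1.0))))

def throughput(num_threads, avg_ilp):
    """Average instructions issued per cycle when num_threads share the core."""
    issued = 0
    for _ in range(CYCLES):
        slots = ISSUE_WIDTH
        for _ in range(num_threads):
            take = min(slots, ready_instructions(avg_ilp))
            issued += take
            slots -= take
            if slots == 0:
                break
        # any leftover slots go unused -- the waste that SMT attacks
    return issued / CYCLES

if __name__ == "__main__":
    random.seed(0)
    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): {throughput(n, avg_ilp=2.5):.2f} instructions/cycle")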
People
Faculty
Graduate students
Undergraduate students
Alumni
Publications
Dean Tullsen,
Susan Eggers, and
Henry Levy,
Proceedings of the 22nd Annual International Symposium on Computer
Architecture, June 1995, pages 392-403.
This paper demonstrated the feasibility of simultaneous multithreading with
simulation-based speedups on several SMT machine models.
It was selected to appear in the 25th Anniversary Anthology of the
International Symposium
on Computer Architecture.
Dean Tullsen,
Susan Eggers,
Joel Emer,
Henry Levy,
Jack Lo,
and Rebecca Stamm
Proceedings of the 23rd Annual International Symposium on Computer
Architecture, May 1996, pages 191-202.
We designed SMT's microarchitecture, including a novel instruction fetch unit
that fetched instructions from the two "most profitable" threads each cycle.
In designing the microarchitecture, we met all three of our original design
goals: (1) that SMT exhibit increased throughput when executing multiple
threads; (2) that SMT not degrade single-thread performance; and (3) that
SMT's implementation be a straightforward extension of current wide-issue,
out-of-order processor technology. The latter two criteria were necessary
to ensure a smooth transition to the commercial world.
This paper was selected for Readings in Computer Architecture, ed.
M.D.
Hill, N.P. Jouppi, and G.S. Sohi, Morgan Kaufmann, 1999.
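As a rough illustration of the fetch heuristic (a sketch only, not the simulator code; the field names and the two-thread fetch width are assumptions made for readability): the paper's best-performing policy, ICOUNT, favors the threads with the fewest instructions waiting in the pre-issue stages of the pipeline, on the grounds that they are making the fastest progress and are least likely to clog the instruction queues.

from dataclasses import dataclass

@dataclass
class ThreadContext:
    tid: int
    unissued_count: int        # instructions in decode/rename/queue stages
    stalled: bool = False      # e.g., blocked on an instruction cache miss

def select_fetch_threads(contexts, num_to_fetch=2):
    """Pick the threads to fetch from this cycle, ICOUNT-style:
    unstalled threads with the fewest unissued instructions win."""
    candidates = [c for c in contexts if not c.stalled]
    candidates.sort(key=lambda c: c.unissued_count)
    return [c.tid for c in candidates[:num_to_fetch]]

if __name__ == "__main__":
    ctxs = [ThreadContext(0, 14), ThreadContext(1, 3),
            ThreadContext(2, 9, stalled=True), ThreadContext(3, 5)]
    print(select_fetch_threads(ctxs))   # -> [1, 3]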
Jack Lo,
Susan Eggers,
Henry Levy, and
Dean Tullsen
Proceedings of the First SUIF Compiler Workshop, January 1996,
pages 146-7.
Jack Lo,
Susan Eggers,
Joel Emer,
Henry Levy,
Rebecca Stamm, and
Dean Tullsen
ACM Transactions on Computer Systems, August 1997, pages 322-354.
Single-chip multiprocessors (CMP, or 2 to 4 superscalar processors on a
single chip) are another emerging processor design that will likely compete
with simultaneous multithreading in the commercial market 3 to 4 years
from now. Our experiments indicate that SMT processors can outperform
CMPs when executing coarse-grain parallel programs (the natural workload
for CMPs) by an average of 60%. SMT has the performance advantage, because
it dynamically allocates hardware resources to whatever threads need them
at the time; the CMP, on the other hand, statically partitions these same
hardware resources for all threads, for all time. This paper carefully
quantifies CMP's loss of performance due to the static partitioning, on
a resource-by-resource basis. In the end, we show that even giving each
processor on the CMP the hardware resources of an entire SMT processor
does not overcome the advantage that SMT gains from dynamic partitioning.
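The intuition behind the dynamic-versus-static result can be shown with a few lines of arithmetic (a toy model with made-up demand figures, not the paper's methodology): when demand is bursty, a statically partitioned resource strands capacity in the partitions of idle threads, while a shared pool lets the busy threads soak it up.

def static_use(demands, total_slots):
    """Each thread may use only its fixed 1/N share of the slots (CMP-style)."""
    share = total_slots // len(demands)
    return sum(min(d, share) for d in demands)

def dynamic_use(demands, total_slots):
    """All threads draw from one shared pool (SMT-style sharing)."""
    return min(sum(demands), total_slots)

if __name__ == "__main__":
    # Four threads competing for 8 issue slots in one cycle; demand is bursty.
    demands = [6, 0, 1, 0]
    print("static :", static_use(demands, 8))   # 3 slots used
    print("dynamic:", dynamic_use(demands, 8))  # 7 slots used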
Susan Eggers,
Joel Emer,
Henry Levy,
Jack Lo,
Rebecca Stamm, and
Dean Tullsen
IEEE Micro, September/October 1997, pages 12-18.
This paper evaluates SMT with the experience of two years of design and
performance analysis under our belts. It describes the microarchitecture,
including the extended pipeline for accessing the multi-context register
file and the custom instruction fetch unit, and presents our most
current performance studies at the time, including a comparison to
wide-issue superscalars, traditional multithreading, and chip multiprocessors,
executing both a parallel and a multiprogrammed workload composed of the
SPEC95 and SPLASH-2 applications. The paper was part of the IEEE Computer
special series on "how to use a billion transistors".
Jack Lo,
Susan Eggers,
Henry Levy,
Sujay Parekh, and
Dean Tullsen
Proceedings of the 30th Annual International Symposium on
Microarchitecture,
December 1997, pages 114-124.
Simultaneous multithreading changes several fundamental architectural
assumptions on which many machine-dependent compiler optimizations are based,
such as the extent to which threads share the cache hierarchy and the
importance of hiding latencies via code scheduling. It therefore stands to
reason that compiler optimizations that rely on these assumptions may need
to be applied differently on an SMT. We validated this hypothesis for three
optimizations that are commonly used and normally very profitable:
loop distribution, loop tiling, and software speculation. We found that,
when compiling programs for SMT, these optimizations either had to be
applied with radically different policies than are currently used or not
applied at all.
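Loop tiling illustrates the point. A tile is normally sized so that one thread's working set fits in the cache, but on an SMT the cache is shared cycle-by-cycle by every running thread, so a tile sized for the whole cache can thrash. The sketch below is only an illustration of that reasoning (the cache size, the three-tiles-per-thread footprint estimate, and the division of the cache budget by the number of co-scheduled threads are assumptions, not the policies evaluated in the paper).

import math

def tile_size(cache_bytes, elem_bytes, threads_sharing):
    """Pick a square tile so each thread's tiles fit in its share of the cache
    (rough heuristic: ~3 tiles live per thread in a tiled matrix multiply)."""
    budget = cache_bytes // threads_sharing
    return max(8, int(math.sqrt(budget / (3 * elem_bytes))))

def tiled_matmul(A, B, C, n, T):
    """C += A * B with T x T tiling (plain Python for clarity, not speed)."""
    for ii in range(0, n, T):
        for kk in range(0, n, T):
            for jj in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += a * B[k][j]

if __name__ == "__main__":
    # 64 KB of cache and 8-byte elements: ~52x52 tiles running alone,
    # but only ~26x26 when four threads share the same cache.
    print(tile_size(64 * 1024, 8, threads_sharing=1))
    print(tile_size(64 * 1024, 8, threads_sharing=4))
    n = 4
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    tiled_matmul(A, B, C, n, T=2)
    print(C[0][0])   # 8.0 = sum over k of 1.0 * 2.0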
Jack Lo,
Luiz Barroso,
Susan Eggers,
Kourosh Gharachorloo,
Henry Levy, and
Sujay Parekh
Proceedings of the 25th Annual International Symposium on Computer
Architecture (ISCA'98), June 1998, pages 39-50.
In addition to the multiprogramming and parallel workloads, we also studied
SMT executing a commercial database workload (Oracle) to gauge its performance
as a database server. Commercial workloads are a challenge for all computers,
because they have notoriously poor memory subsystem performance. We devised
operating-system- and application-level mechanisms that improve SMT's
performance on databases by reducing inter-thread conflicts in the memory
hierarchy. The result was a 3-fold improvement over a wide-issue superscalar
when executing Oracle transactions.
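The flavor of inter-thread conflict being attacked can be seen with a little cache arithmetic. The toy below is a generic illustration of the idea, not the specific mechanisms described in the paper (the cache geometry, region size, and per-thread offsets are all assumptions): when every thread lays out the same hot data at the same cache alignment, all threads fight over the same cache sets, while a small per-thread offset spreads them across the cache.

LINE = 64          # cache line size in bytes (assumed)
SETS = 512         # number of sets in a shared cache (assumed)

def sets_touched(base_addrs, region_bytes):
    """Distinct cache sets touched when each thread scans region_bytes
    starting at its own base address."""
    touched = set()
    for base in base_addrs:
        for addr in range(base, base + region_bytes, LINE):
            touched.add((addr // LINE) % SETS)
    return len(touched)

if __name__ == "__main__":
    n_threads, region = 8, 4096
    # Identically aligned regions (1 MB apart) all map to the same sets.
    aligned = [t * 1024 * 1024 for t in range(n_threads)]
    # Adding a per-thread offset staggers them across the cache.
    staggered = [t * 1024 * 1024 + t * region for t in range(n_threads)]
    print("same alignment:", sets_touched(aligned, region), "sets")
    print("with offsets  :", sets_touched(staggered, region), "sets")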
Dean Tullsen,
Jack Lo,
Susan Eggers, and
Henry Levy
Proceedings of the 5th International Symposium on High Performance
Computer Architecture,
January 1999, pages 54-58.
The efficiency of a processor's synchronization mechanism determines the
granularity of parallelism of programs that run on it. Synchronization
on conventional multiprocessors is fairly costly, because communication
among the parallel threads must take place via memory. Consequently,
applications must be parallelized on a fairly coarse-grain level. Because
parallel threads all reside on the same SMT processor, synchronizing them can
be done locally (on the processor) rather than through memory. SMT
synchronization is sufficiently lightweight that it both improves the
synchronization performance of current, coarse-grain parallel programs
and permits fine-grain parallelization of new codes that cannot be
parallelized with current synchronization mechanisms.
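The granularity claim is, at bottom, simple arithmetic: if a synchronization operation costs S cycles and each parallel task performs W cycles of useful work, the overhead fraction is S / (W + S), so the smallest profitable task size grows directly with synchronization cost. The sketch below uses made-up cycle counts purely for illustration.

def min_task_cycles(sync_cost_cycles, max_overhead=0.10):
    """Smallest per-task work W (in cycles) that keeps synchronization
    overhead at or below max_overhead:  S / (W + S) <= max_overhead."""
    return sync_cost_cycles * (1.0 - max_overhead) / max_overhead

if __name__ == "__main__":
    # Illustrative costs: hundreds of cycles for memory-based synchronization
    # on a multiprocessor versus tens of cycles for an on-chip SMT primitive.
    for name, cost in [("memory-based lock", 200), ("SMT in-processor sync", 20)]:
        print(f"{name:>22}: tasks need >= {min_task_cycles(cost):.0f} cycles of work")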
Jack Lo,
Sujay Parekh,
Susan Eggers,
Henry Levy, and
Dean Tullsen
IEEE Transactions on Parallel and Distributed Systems,
September 1999, pages 922-933.
We designed operating-system and compiler-directed architectural techniques
to deallocate SMT registers earlier than can be done with current register
renaming hardware. The mechanisms free idle hardware contexts (those whose
thread has terminated) and free registers in active contexts after their last
use. The performance consequence is either a reduction in register file size
(useful if register file access time determines the processor cycle time) or
an increase in performance for a given file size. The compiler-based
techniques have applicability well beyond SMT processors -- they also improve
performance on any out-of-order processor.
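The compiler's half of the idea can be sketched as a simple last-use analysis: find the final read of each value and place a hint there so the hardware can reclaim the register immediately, instead of waiting for the renaming hardware's usual trigger. The "free_reg" hint and the tiny instruction format below are hypothetical, chosen only to illustrate the analysis.

def insert_free_hints(instructions):
    """instructions: list of (dest, [sources]) tuples over virtual registers.
    Returns the stream with a hypothetical ("free_reg", [r]) hint placed
    immediately after each register's last use."""
    last_use = {}
    for idx, (_, srcs) in enumerate(instructions):
        for r in srcs:
            last_use[r] = idx

    out = []
    for idx, inst in enumerate(instructions):
        out.append(inst)
        for r, last in last_use.items():
            if last == idx:
                out.append(("free_reg", [r]))
    return out

if __name__ == "__main__":
    code = [("r1", ["a", "b"]),    # last uses of a and b
            ("r2", ["r1", "c"]),   # last use of c
            ("r3", ["r2", "r1"])]  # last uses of r1 and r2
    for inst in insert_free_hints(code):
        print(inst)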
Patrick Crowley,
Marc E. Fiuczynski,
Jean-Loup Baer, and
Brian N. Bershad.
Proceedings of the 2000 International Conference on Supercomputing,
May 2000.
This paper characterizes current and future network processor application
workloads and concludes that SMT is better suited to a network node
environment than are aggressive out-of-order superscalars, fine-grain
multithreaded processors, and chip multiprocessors (CMPs).
Josh Redstone,
Susan Eggers, and
Henry Levy.
Proceedings of the 9th International Conference on Architectural Support
for Programming Languages and Operating Systems,
November 2000.
This paper presents our first analysis of operating system execution on a
simultaneous multithreaded processor. To carry out this study, we modified
the Digital Unix 4.0 operating system to run on a simulated SMT CPU based
on a Compaq Alpha processor. We executed this environment by integrating
our SMT Alpha instruction set simulator into the SimOS machine simulator.
As our principal workload, we executed the Apache Web server running on
an 8-context SMT under Digital Unix. Our results demonstrate the
microarchitectural impact of an OS-intensive workload on a simultaneous
multithreaded processor and provide insight into the operating system
demands of the Apache Web server.
Sujay Parekh,
Susan Eggers, and
Henry Levy.
University of Washington Technical Report, 2000.
This paper examines thread-sensitive scheduling for SMT processors. When more
threads exist than hardware execution contexts, the operating system is
responsible for selecting which threads to execute at any instant, inherently
deciding which threads will compete for resources. Thread-sensitive scheduling
uses thread-behavior feedback to choose the best set of threads to execute
together, in order to maximize processor throughput. We introduce several
thread-sensitive scheduling schemes and compare them to traditional oblivious
schemes, such as round-robin. Our measurements show how these scheduling
algorithms impact performance and the utilization of low-level hardware
resources. We also demonstrate how thread-sensitive scheduling algorithms can
be tuned to trade off performance and fairness. For the workloads we measured,
we show that an IPC-based thread-sensitive scheduling algorithm can achieve
speedups over oblivious schemes of 7% to 15%, with minimal hardware costs.
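A minimal sketch of the IPC-based scheme, under simplifying assumptions (per-thread IPC measured in earlier quanta is reused as the prediction, and the combined throughput of a candidate set is approximated as the sum of its members' IPCs, which ignores interference):

from itertools import combinations

def round_robin(ready, contexts, start):
    """Oblivious baseline: rotate through the ready list."""
    return [ready[(start + i) % len(ready)] for i in range(contexts)]

def ipc_sensitive(ready, contexts, observed_ipc):
    """Thread-sensitive choice: pick the runnable set with the highest
    predicted combined IPC (here, simply the sum of per-thread feedback)."""
    best = max(combinations(ready, contexts),
               key=lambda group: sum(observed_ipc[t] for t in group))
    return list(best)

if __name__ == "__main__":
    ready = ["gcc", "vortex", "swim", "mgrid", "apache"]
    ipc = {"gcc": 1.1, "vortex": 1.4, "swim": 0.6, "mgrid": 0.7, "apache": 1.2}
    print("round-robin  :", round_robin(ready, 4, start=0))
    print("IPC-sensitive:", ipc_sensitive(ready, 4, ipc))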
Josh Redstone,
Susan Eggers, and
Henry Levy.
Proceedings of the International Conference on High-Performance Computer
Architecture,
February 2003, pages 19-30.
This paper presents the mini-thread architectural model for increasing
thread-level parallelism on SMT processors, particularly small-scale
implementations which, because of their size, can be thread-starved. It also
empirically demonstrates the performance trade-off for one implementation of
mini-threads, in which the architectural register file is partitioned
among all executing threads in a hardware context. We show that the benefits
of the additional TLP far outweigh the cost of the additional spill code
generated because each thread has fewer architectural registers available
to it.
Luke K. McDowell,
Susan Eggers, and
Steven D. Gribble.
Symposium on Principles and Practice of Parallel Programming,
June 2003.
This paper evaluates how SMT's hardware affects traditional support for server
software, in particular, memory allocation and synchronization, for three
different server models. The results demonstrate how a few simple changes to the
run-time libraries can dramatically boost multi-threaded server performance on
SMT, without requiring modifications to the applications themselves.
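One example of the kind of run-time library change in question (a generic illustration in the same spirit, not the paper's code) is replacing a single-lock heap with per-thread free lists, so allocation-heavy server threads stop serializing on one shared lock and bouncing its cache lines between contexts.

import threading

class PerThreadPool:
    """Toy allocator: each thread recycles blocks from its own free list and
    falls back to a lock-protected shared pool only when that list is empty."""
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.local = threading.local()        # per-thread free list lives here
        self.shared = []                      # overflow pool
        self.shared_lock = threading.Lock()

    def alloc(self):
        free = getattr(self.local, "free", None)
        if free:
            return free.pop()                 # common case: no lock taken
        with self.shared_lock:
            if self.shared:
                return self.shared.pop()
        return bytearray(self.block_size)

    def free(self, block):
        if not hasattr(self.local, "free"):
            self.local.free = []
        self.local.free.append(block)         # returned to this thread's list

if __name__ == "__main__":
    pool = PerThreadPool()
    block = pool.alloc()
    pool.free(block)
    print(pool.alloc() is block)              # True: reuse without the lock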
Steve Swanson,
Luke McDowell,
Michael Swift,
Susan Eggers, and
Henry Levy.
Submitted for publication.
Jack Lo's thesis, 1998.
Josh Redstone's thesis, 2002.
Funding
Our current SMT research is funded by an NSF ITR grant
CCR-0085670.
Past SMT research was supported primarily by NSF grant MIP-9632977,
with contributions from NSF grants CCR-9200832 and CCR-9632769,
DARPA grant F30602-97-2-0226,
ONR grants N00014-92-J-1395 and N00014-94-1-1136, and
the Washington Technology Center.
Industrial sponsors included Compaq Computer Corp., which donated both
workstations for simulations and the source to the Multiflow compiler,
and International Business Machines, Inc., which provided a Faculty
Partnership Award.
Commercial Machines
MemoryLogix has announced an SMT
processor for mobile devices.
Related Projects
Other institutions have also done research in simultaneous multithreading.
These include:

We have undoubtedly omitted groups doing SMT research.
If you would like your institution listed here or can supply us with a URL,
please email the contact below.
This page is maintained by Susan Eggers
[email protected]