Historical background for EPIC
Mark Smotherman.
Last updated: March 2012
Summary: The design style of EPIC (explicitly parallel instruction computing)
did not appear instantaneously, like Athena springing from Zeus' head. Instead,
EPIC is a compendium of ideas that have been percolating in computer
architecture for years.
See a partial writeup of this material in
M. Smotherman, "Understanding EPIC Architectures and Implementations" (pdf)
from ACM Southeast Conference, 2002.
Intel/HP EPIC - Explicitly Parallel Instruction Computing
There are several principles behind EPIC:
- start loads early
- predication to eliminate many conditional branches
- register rich
- independence architecture
- uncoupled branch architecture
- rotating register file
In the HP/Intel Itanium (IA-64), these influences are seen in the following
ways.
- start loads early
- advance loads - move above stores when alias analysis is incomplete
- speculative loads - move above branches
- predication to eliminate many conditional branches
- 64 predicate registers
- almost every instruction is predicated
- register rich
- 128 integer registers (64 bits each)
- 128 floating-point registers
- independence architecture
- VLIW flavor, but fully interlocked (i.e., no delay slots)
- three 41-bit instruction syllables per 128-bit "bundle"
- each bundle contains 5 "template bits" which specify independence
of following syllables (within bundle and between bundles)
- uncoupled branch architecture
- eight branch registers
- multiway branches
- rotating register files
- upper 48 of the predicate registers rotate (p16-p63)
- upper 96 of the integer registers can rotate (r32-r127)
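The bundle arithmetic above (three 41-bit syllables plus 5 template bits filling 128 bits) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and placing the template in the low-order bits with the syllables following in ascending order is an assumption of the sketch, not something stated in these notes.

```python
TEMPLATE_BITS = 5   # template field width, from the list above
SLOT_BITS = 41      # width of one instruction syllable
SLOTS_PER_BUNDLE = 3

def pack_bundle(template, slots):
    """Pack a template and three syllables into one 128-bit integer.

    Assumed layout (for illustration): template in the low 5 bits,
    then slot 0, slot 1, slot 2 in ascending bit order.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three syllables")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each syllable must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```

A round trip such as `unpack_bundle(pack_bundle(0x10, [1, 2, 3]))` returns the original template and syllables, and the field widths sum to 5 + 3 * 41 = 128 bits.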
Sidebar: IA-64 History
- IA-64 joint ACM committee (architecture, compilers, microarchitecture)
- five Intel members:
- John Crawford (chief architect for overall effort)
- Hans Mulder (architecture)
- Harsh Sharangpani (microarchitecture and x86 floating point
compatibility)
- Kent Fielden (compilers)
- Jack Mills (architecture and performance evaluation)
- five HP members:
- Jerry Huck (lead architect for HP)
- Rajiv Gupta (architecture, Wide-Word background)
- David Fotland (microarchitecture, PA-RISC background)
- Dale Morris (architecture, PA-RISC background)
- Carol Thompson (compilers)
- Timeline
- 1981 - Bob Rau leads Polycyclic Architecture project at TRW/ESL
- 1983 - Josh Fisher describes ELI-512 VLIW design and trace scheduling
- 1983-1988 - Rau at Cydrome works on VLIW design called the Cydra-5,
but the company folds in 1988
- 1984-1990 - Fisher at Multiflow works on VLIW design called the Trace,
but the company folds in 1990
- 1988 - Dick Lampman at HP hires Bob Rau and
Mike Schlansker from Cydrome and also gets IP rights from Cydrome
- 1989 - Rau and Schlansker begin the FAST (Fine-grained Architecture
and Software Technologies) research project at HP; they later develop
the HP PlayDoh architecture
- 1990-1993 - Bill Worley leads PA-WW (Precision Architecture Wide-Word)
effort at HP Labs to be the successor to the PA-RISC architecture;
it was also called SP-PA (Super-Parallel Processor Architecture)
and SWS (Super WorkStation)
- HP hires Josh Fisher, input to PA-WW
- input to PA-WW from Hitachi team, led by Yasuyuki Okada
- November 1991 - Hans Mulder joins Intel to start work on a 64-bit
architecture
- July 1992 - Worley recommends HP seek a semiconductor manufacturing
partner
- 1993 - HP starts effort to develop PA-WW as a product
- December 1993 - HP investigates partnership with Intel
- June 1994 - announcement of cooperation between HP and Intel;
PA-WW used as starting point for joint design; John Crawford of Intel
leads the joint team
- 1997 - the term EPIC is coined
- October 1997 - Microprocessor Forum presentations by Intel and HP
- July 1998 - Carole Dulong of Intel publishes
"The IA-64 Architecture at Work," IEEE Computer, pp. 24-32.
- February 1999 - release of ISA details of IA-64
- 2001 - Intel marketing prefers IPF (Itanium Processor Family) to IA-64
- May 2001 - Itanium (Merced)
- July 2002 - Itanium 2 (McKinley)
- References
- Itanium history page at HPL
- Russ Britt,
"The Birth of a New Processor,"
Electronic Business, January 2000.
- Mike Schlansker and Bob Rau,
"EPIC: Explicitly Parallel Instruction Computing" (pdf),
IEEE Computer, February 2000, pp. 37-45.
- Mike Schlansker and Bob Rau,
"EPIC: An Architecture for Instruction-Level Parallel Processors"
(pdf),
HP Labs Technical Report HPL-1999-111, February 2000.
- John Crawford,
"Introducing the Itanium Processors,"
IEEE Micro, September-October 2000, pp. 9-11.
- Itanium home page at Intel
- Wen-mei Hwu, et al., "Itanium Performance Insights"
(Univ. of Illinois, IMPACT compiler project),
Microprocessor Forum, 2001.
- Jay Bharadwaj, et al.,
"The Intel IA-64 Compiler Code Generator",
IEEE Micro, September-October 2000, pp. 44-53.
- Rumi Zahir, Dale Morris, Jonathan Ross, and Drew Hess,
"OS and Compiler Considerations in the Design of the
IA-64 Architecture,"
ASPLOS-IX, November 2000, pp. 212-221.
- Charles Gray, Matthew Chapman, Peter Chubb, David Mosberger-Tang,
and Gernot Heiser,
"Itanium - A System Implementor's Tale" (pdf),
USENIX Annual Technical Conference, April 2005, pp. 264-278.
- John Sias,
"A Systematic Approach to Delivering Instruction-Level
Parallelism in EPIC Systems" (pdf), Ph.D. Dissertation,
Univ. Illinois at Urbana-Champaign, 2005.
- see also the set of links (some now dead) collected for
"Itanium: An EPIC Architecture,"
CS 854 Advanced Computer Architecture class project,
Univ. of Virginia, 2001.
Historical precedents for load speculation
- several designs recognized the need for early initiation of loads
- Zuse Z4, 1940s - the instruction stream was read two instructions
in advance, and if a load was detected it was started early to reduce
the impact of the slow cycle time of the memory.
- IBM Stretch, 1961 - a separate indexing unit pre-processes the
instruction stream to decode arithmetic instructions and start memory
loads early (it also executes index-register-related operations and
branches); decoded instructions and data values loaded from memory
are placed in a four-element lookahead buffer between the indexing
unit and the arithmetic unit
- IBM S/360 Model 91, 1967; IBM S/370 Model 165, 1970; IBM 3033, 1978;
IBM 3090, 1986 - these IBM mainframes use overlapped I and E units in a
manner similar to Stretch: the I unit fetches and decodes instructions,
performs address calculations, and starts memory loads
- decoupled access/execute (DAE) architectures split the instruction
stream and allow the memory-access machine to run ahead of the execute
machine and to pre-load memory data into FIFO buffers
- the Multiflow /500 design recognized a need to start loads as early
as possible but also to discard ("crush") remaining, unnecessary
loads when "long loops are winding down" - Colwell, et al., paper
in Supercomputing '90
- speculative loads are started earlier than the control flow would
normally allow - the question is what to do about exceptions
- Smith/Lam/Horowitz "Boosting ..." papers (ISCA, 1990; ASPLOS, 1992)
- Rogers/Li "Software Support ..." paper (ASPLOS, 1992)
- Mahlke/Chen/Hwu/Rau/Schlansker "Sentinel Scheduling ..." paper
(ASPLOS, 1992)
- pollution bits in HARP (1993)
- poison bits in Tera (199_)
- speculative loads on UltraSPARC (1995)
- DEC Alpha claims for doing speculative loads using normal
loads and OS mods (1997)
- see HP patents
- 5,692,169 - Method and system for deferring exceptions
generated during speculative execution
- 5,596,733 - System for exception recovery using a conditional
substitution instruction which inserts a replacement result in
the destination of the excepting instruction
- Smith, Lam, and Horowitz -- "Boosting Beyond Static Scheduling in a
Superscalar Processor," ISCA 1990
- "best aspects of static and dynamic scheduling"
- static branch prediction, encoded in branch op codes
- shadow register file and shadow store buffer
- move instructions up before one branch, mark as boosted, and access
shadow registers; any exceptions are deferred until boosted instruction
commits
- if branch prediction is correct:
- move results from shadow registers into registers
- any boosted instructions still in pipe are unmarked and accesses
to shadow registers are then changed to corresponding registers
- if branch prediction is incorrect:
- flush shadow structures
- squash any boosted instructions still in pipe
- speedups
benchmark | "max speedup"       | basic block     | fetch 2 (no load/store reorg.) | fetch 4 (load/store reorg.)
          | (1 ld/st per cycle) | scheduling only | dynamic sched. | boosting      | dynamic sched. | boosting
awk       | 3.91                | 1.17            | 1.41           | 1.49          | 1.86           | 1.52
ccom      | 3.03                | 1.11            | 1.41           | 1.52          | 1.97           | 1.57
espresso  | 4.19                | 1.22            | 1.51           | 1.70          | 1.97           | 1.79
irsim     | 2.84                | 1.11            | 1.42           | 1.55          | 1.94           | 1.64
latex     | 2.88                | 1.16            | 1.43           | 1.56          | 1.95           | 1.63
- IBM VLIW efforts
- 33rd bit (to indicate the "bottom value") in registers
- extra bit in the instruction opcode to indicate a speculative or
non-speculative version; exceptions occurred when a non-speculative
instruction computed a bottom value
- K. Ebcioglu, "Some Design Ideas for a VLIW Architecture for Sequential
Natured Software," in Parallel Processing (Proceedings of IFIP WG 10.3
Working Conference on Parallel Processing), pp. 3-21, M. Cosnard et
al. (eds.), North Holland, 1988.
- similar design proposed an extra field in register to hold the address
of the excepting instruction
- K. Ebcioglu and R. Groves, "Some Global Compiler Optimizations and
Architectural Features for Improving Performance of Superscalars,"
Research Report no. RC16145, IBM T.J. Watson Research Center,
Yorktown Heights, NY, 1990.
- IBM patents
- 5,542,075 - Method and apparatus for improving performance of out
of sequence load operations in a computer system
- 5,625,835 - Method and apparatus for reordering memory operations
in a superscalar or very long instruction word processor
- 5,799,179 - Handling of exceptions in speculative instructions
- precise exceptions from compiler techniques
- G.M. Silberman and K. Ebcioglu, "An Architectural Framework for
Supporting Heterogeneous Instruction-Set Architectures," IEEE
Computer, Vol. 26, No. 6, June 1993, pp. 39-56. (First version
of the above was published in: G.M. Silberman and K. Ebcioglu,
"An Architectural Framework for Migration from CISC to Higher
Performance Platforms," Proc. 1992 International Conference on
Supercomputing, pp. 198-215, ACM Press, 1992.)
- K. Ebcioglu and E.R. Altman: "DAISY: Dynamic compilation for 100%
Architectural Compatibility" 24th Annual International Symposium on
Computer Architecture, Denver, Colorado, June 2-4, 1997, pp. 26-37.
- M. Gschwind, E. Altman, S. Sathaye, P. Ledak, D. Appenzeller,
"Dynamic and Transparent Binary Translation", IEEE Computer, March
2001.
- K. Ebcioglu, E.R. Altman, M. Gschwind, S. Sathaye, "Dynamic Binary
Translation and Optimization," IEEE Transactions on Computers,
Volume 50, Issue 6, pp. 529 - 548, June 2001.
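The deferred-exception idea that runs through this section (poison or pollution bits, and the IBM "bottom value" bit with speculative and non-speculative opcode variants) can be sketched as a toy model. Everything here is a hypothetical illustration: the `Value` class, the dictionary standing in for memory, and the function names are assumptions of the sketch and do not describe any of the cited machines in detail.

```python
from dataclasses import dataclass

class DeferredFault(Exception):
    """Raised when a non-speculative operation computes a poisoned value."""

@dataclass
class Value:
    data: int = 0
    poisoned: bool = False  # plays the role of the extra "bottom value" bit

def speculative_load(memory, address):
    # A speculative load never traps. A faulting access (modeled here as
    # an address missing from the dict) returns a poisoned value instead.
    if address not in memory:
        return Value(poisoned=True)
    return Value(memory[address])

def speculative_add(a, b):
    # Speculative operations propagate the poison bit rather than trap.
    if a.poisoned or b.poisoned:
        return Value(poisoned=True)
    return Value(a.data + b.data)

def nonspeculative_add(a, b):
    # The exception is finally reported when a non-speculative
    # operation would compute a poisoned (bottom) value.
    if a.poisoned or b.poisoned:
        raise DeferredFault("poisoned operand reached a non-speculative use")
    return Value(a.data + b.data)
```

In this model a load hoisted above a branch costs nothing if its result is never consumed on the path actually taken; the fault surfaces only if a non-speculative instruction later uses the value.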
Historical precedents for predication (conditional execution)
- predication dates back to the early 1950s
- Wilkes lecture on control unit design, 1951 - "some of the
micro-orders can be made conditional in their action as well
as (or instead of) conditional as regards the switching of
micro-control"
- IBM 604, 1952 - each instruction had a suppression bit, which
controls whether it is executed or not
- Zemanek's MAILÜFTERL, 1954 - each instruction could be
made dependent on one of 15 conditions (e.g., if the value in
the ACC is negative) specified by a four-bit field in the
instruction format
- Zuse Z22, 1955 - each instruction could be made dependent on
a condition specified in a five-bit field in the instruction
- van der Poel's ZEBRA, 1958 - each instruction could be made
dependent on a condition specified by a three-bit field in
the instruction (this is a refinement of his 1952 ZERO
instruction set in which non-branch instructions could be
made conditional as a side effect of an unusual branching
scheme)
- Electrologica X-1, 1959 - the basic instruction format had two
"precondition bits" that specify whether the instruction should
be executed or not, and two "post condition" bits that specify
how the condition codes should be set after execution
- IBM ACS, 1967 - a set of 24 condition code registers allowed
precalculation of branch conditions and also supported logical
operations between condition codes; this is similar to the eight
independent condition codes in the IBM RS/6000 and PowerPC;
a 'skip flag' bit in each instruction was used along with a
conditional 'skip' instruction to replace regular conditional
branches
- CDC Flexible Processor, 1976 - each microinstruction is
conditionally executed based on three bits in the microinstruction
format (e.g., selecting among dozens of conditions including
sign of a result, arithmetic overflow, I/O conditions,
and loop control)
- ARM, ca. 1986 - each instruction is predicated
- Cydra 5, 1988 - each instruction is predicated
- HARP VLIW design, 1988 - each instruction is predicated
- Multiflow /500, 1990 - each floating-point operation or store
could be made conditional (Colwell, et al., Supercomputing '90)
- HP PlayDoh, 1993 - experimental predicated instruction set
architecture
- Mahlke, et al., "Comparison ..." paper (ISCA, 1995)
- TI VelociTI VLIW architecture, 1997 - each instruction is predicated
- (and lots of architectures have added a conditional move instruction)
- Mahlke, et al., "Comparison of Full and Partial Predicated Execution
Support," ISCA 1995
- limited branch resources restrict the number of branches handled per cycle
- imperfect branch prediction reduces performance by factor of 2 to 10
- eliminate branches by predication
- partial predication - conditional moves
- full predication - every instruction, but adds another source operand
- compiler performs if-conversion
- processor will fetch instructions from both paths but only allow
instructions with true predicates to issue/complete
- partial predication changes dynamic instruction count by .93 to 2.1
- full predication changes dynamic instruction count by .83 to 1.29
- advantages
- decreases # of branches so limited branch resources are not a problem
- decreases # of mispredicted branches so performance impact is lessened
- exposes multiple execution paths to hardware
- table of branches in millions (mispredicts in millions)
benchmark | superblock only | conditional move | full predication
grep      | .66 (.01)       | .17 (.02)        | .17 (.02)
yacc      | 12 (.52)        | 5.9 (.45)        | 5.9 (.43)
espresso  | 75 (3.4)        | 38 (2.1)         | 33 (1.0)
eqntott   | 315 (42)        | 53 (6.7)         | 51 (6.9)
ear       | 1539 (66)       | 443 (16)         | 442 (15)
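As a rough illustration of the if-conversion and full predication described above, the following toy interpreter runs a branch-free version of `if (x < y) z = x + y; else z = x - y;`. The instruction tuples, the opcode names, and the always-true guard `p0` are conventions invented for this sketch, not the encoding of any particular machine.

```python
def run_predicated(program, regs):
    """Execute a straight-line predicated program.

    Each instruction is (guard, op, dest, a, b). Every instruction is
    "fetched", but only those whose guard predicate is true complete.
    Returns the final registers and the number of completed instructions.
    """
    regs = dict(regs)          # work on a copy of the caller's registers
    preds = {"p0": True}       # p0 is treated as always true in this sketch
    completed = 0
    for guard, op, dest, a, b in program:
        if not preds.get(guard, False):
            continue           # nullified: fetched but has no effect
        completed += 1
        if op == "cmp_lt":
            true_pred, false_pred = dest
            outcome = regs[a] < regs[b]
            preds[true_pred] = outcome        # guards the "then" path
            preds[false_pred] = not outcome   # guards the "else" path
        elif op == "add":
            regs[dest] = regs[a] + regs[b]
        elif op == "sub":
            regs[dest] = regs[a] - regs[b]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs, completed

# Both arms of the original if/else appear in one block with no branch.
IF_CONVERTED = [
    ("p0", "cmp_lt", ("p1", "p2"), "x", "y"),
    ("p1", "add", "z", "x", "y"),
    ("p2", "sub", "z", "x", "y"),
]
```

With `x = 3, y = 5` the guarded add completes and the sub is nullified; with the operands swapped the opposite happens. In both cases three instructions are fetched and two complete, which is the trade that predication makes: more fetched instructions in exchange for fewer branches.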
Possible insight into register size choice
- Mahlke, Chen, Gyllenhaal, and Hwu, "Compiler code transformations for
superscalar-based high-performance systems," Supercomputing '92,
Minneapolis, Nov. 1992, pp. 808-817
- discusses 2-way-issue, 4-way-issue, and 8-way-issue
"superscalar/VLIW" processors running 40 loop nests from Perfect Club
benchmarks, SPEC-FP, and vector library functions
- to get maximum effectiveness of the ILP, several compiler optimizations
need to be performed (e.g., loop unrolling, variable renaming, variable
expansion, tree-height reduction)
- each optimization has the effect of increasing the number of registers
needed
- concluding sentence: "37 of the 40 loops require fewer than 128 total
registers after all transformations"
Historical precedents for independence architectures
- the name of this architectural category is due to Josh Fisher and Bob Rau
- explicitly encoded information on instruction independence is
placed in the instruction format by the compiler; difference between
independence architecture and VLIW (and esp. compressed VLIW) is that
in the former the hardware does the scheduling of which instructions
will execute together
- early examples
- NBS PILOT, ready signal, 1958 - bit 65 in the 68-bit instruction
format of the primary computer can be set to indicate that the
program in the primary computer should stop and wait until a
secondary computer has produced previously requested data,
A.L. Leiner, et al., "Concurrently operating computer systems,"
Proc. UNESCO Conference on Information Processing, Paris, June 1959,
pp. 353-361.
- Lee Higbie, concurrency control bits, 1978 -
bits added to instruction format and set by programmer or compiler to
indicate that the execution of an instruction should be delayed until
a specified function unit has produced an operand,
"Overlapped operation with microprogramming," IEEE Trans. on Computers,
March 1978, pp. 270-275. [written while he was at U. Mass. Amherst
about work on a signal processing computer at Sanders]
- Burton Smith's Horizon, lookahead, 1988 - field in instruction format
is set to minimum distance to next dependent instruction (over all
branch paths)
- LIW (long instruction word)
- original Stanford MIPS, 1984 - underpipelined and could pack an ALU op
and a load/store op together into a single machine instruction,
e.g., see Steven Przybylski, et al., "Organization and VLSI
implementation of MIPS," Advances in VLSI and Computer Systems, 1984
- Apollo DN10000, 1988 - "FP companion" bit is leftmost bit of integer
instruction format and is used to indicate if a paired floating-point
instruction follows and is to be issued in parallel (the integer/FP
pair must start on an 8-byte boundary, and an FP instruction cannot
appear without the paired integer instruction); the five-operand
version of the FP instruction format can specify both a multiply and
an independent add/sub/truncate (thus, with the integer operation,
the Apollo can execute a peak of three operations/cycle)
- Intel i860, 1988 - "dual instruction mode (DIM)" bit ("D-bit") in
floating-point instruction format to indicate if aligned pairs of
independent floating-point and integer instructions are to be issued
in parallel (see Kohn, US 5,241,636); because of pipelining the bit
has a two-cycle delayed effect and governs the dual issue of the
instruction pair two cycles later; also, the i860 allowed multiple
ways of specifying the execution of a floating-point addition and
multiply at the same time, thus up to three operations could be
performed per cycle ("dual operation", see Kohn, US 5,204,828)
- CMU iWarp, 1988 - two instruction formats: short (32 bits, loop-back
bit and one operation) and long (96 bits, loop-back bit and either
three floating-point operations or two floating-point operations and
two integer/memory-access operations); references to queue-pointer
registers implicitly resulted in memory loads and stores
- Stanford TORCH, 1990 - two instructions issued together (to "A side"
and "B side" with some slotting restrictions) unless a dynamic nop bit
is set in either instruction's extension byte (see
TORCH architectural specifications)
- Fujitsu VPP500 scalar processor, 1994 - up to three operations per
instruction word; the first four bits of the 64-bit instruction word
serve as the format selector (see Y. Nakashima, et al.,
"Scalar processor of the VPP500 parallel supercomputer,"
Proc. ICS, 1995)
- traditional VLIW
- roots of VLIW lie in horizontal microprogramming (e.g., Josh Fisher's
work in trace scheduling was done for horizontal microcode)
- see, e.g., van der Poel's "Microprogramming and Trickology", 1962
- other horizontal microcode history and VLIW "pre-history"
- Turing's ACE (1946)
- IBM SSEC (1948) - two instructions in a "line of sequence",
which could be used to specify two separate operations
within the same program or duplicate operations using separate
resources to provide checking [see US 2,636,672]
- Elliott 152 (1950) and Elliott 153 (1954) - the 153 had a
64-bit instruction specifying multiple register transfers
(ALU, multiplier, I/O, branching, and control of two
scratchpad memories)
- Wilkes and Stringer paper (1953) - suggesting horizontal
microcode
- array processors, including
IBM 2938 Array Processor (1969),
IBM 3838 Array Processor (1974),
and FPS AP-120B (1975)
- P.M. Melliar-Smith, "A design for a fast computer
for scientific calculations," in 1969 AFIPS FJCC,
pp. 201-208. He proposes "direct functional control"
for inner loops in array processing applications, by
which he means a noninterlocked VLIW design with
exposed pipelining. (He's writing in reaction to
execution resources "squandered" and "wasted" by a
Tomasulo-like E-box coupled with a
one-instruction-decode-per-cycle I-box.)
- Culler patent (1973) - "Data processor with parallel
operations per instruction" [US 3,771,141]
- Pomerene patent (1981) - "Machine for multiple instruction
execution" [US 4,295,193]
- Rau's Polycyclic Architecture project at TRW/ESL (1981)
- Fisher's ELI-512 design (1983)
- see J. Fisher, P. Faraboschi, and C. Young,
"VLIW processors: Once blue sky, now commonplace,"
IEEE Solid-State Circuits Magazine, vol. 1, no. 2, Spring 2009,
pp. 10-17.
- compressed VLIW / flexible VLIW
- variable-length encoding of VLIW programs
- Multiflow, 1988
- in-memory compression scheme - VLIW instructions are expanded
during i-cache miss and stored in VLIW format in the i-cache
- a single instruction encodes first-beat and second-beat operations
(slots in the instruction word have a fixed assignment to first
cycle of execution or second cycle of execution)
- mnop - multicycle nop to halt instruction fetch for specified
number of cycles to save space in the i-cache
- Colwell/et al., "A VLIW architecture for a trace scheduling
compiler," IEEE Trans. on Computers, August 1988, pp. 967-979.
- Colwell/et al., "Architecture and Implementation of a VLIW
Supercomputer," Proc. Supercomputing, 1990, pp. 910-919.
- Cydrome, 1988
- normal VLIW multi-op format (256-bit instruction word, seven fields)
- for compression, added a uni-op format which contained routing fields
to specify which function units were used (six 40-bit uni-op
instructions are held in a 256-bit instruction word)
- mnoop was a special uni-op instruction that halted instruction
execution for a specified number of cycles to allow for the cases
where no instructions were ready to execute (and thus avoid a
series of empty multi-ops or uni-ops)
- memory latency register specifies latency used by compiler when
scheduling; hardware buffers values from any loads that complete
earlier or stalls the processor if any loads complete later than
the specified number of cycles
- had plans for an in-memory compression scheme (vari-ops) for
second generation design (Cydra-10); similar to Multiflow since
vari-ops would be expanded into one or more multi-op instructions
during i-cache miss processing
- Beck/Yen/Anderson, "The Cydra 5 minisupercomputer: Architecture and
implementation," Journal of Supercomputing, May 1993, pp. 143-180.
- Intergraph Clipper 5 (U.S. patent 5,560,028, 1996)
- called a "software scheduled superscalar" architecture but more
accurately classified as a compressed VLIW scheme
- tags are added for multiple-issue group identification
along with routing tags for function unit assignment; the tags
control a crossbar switch
- later paper mentions use of a register scoreboard to determine
when to issue the next group
- Arya/Sachs/Duvvuru, "An architecture for high instruction level
parallelism," 28th Hawaii Intl. Conf. Syst. Sci., 1995, pp. 153-162.
- [Arya worked for Higbie while they were at Gould in 1980s]
- Philips Trimedia, 1996
- a compressed instruction format is stored in the cache as well
as memory and is expanded by a decompressor unit (decompression
takes place during one pipeline stage of instruction fetch)
- the encoding eliminates nops by using a header that includes a
count of operations in that instruction
- an uncompressed instruction has five operation slots, each of
which contains an execution unit identifier that is used to
route that operation to the appropriate execution unit
- three generations: TM-1, TM-1000, and CPU64
- TI VelociTI VLIW architecture, 1997
- design started in 1992, chief architect is Ray Simar
- fetch packet of 8 instructions
- one to eight variable-length, multiple-issue execute packets can
be contained within each fetch packet; they are delimited by
"parallel instruction" link bits in the instruction format
- 5 delay slots per branch and 4 per load
- multicycle nop
- TI C6x family: C62x, C64x, and C67x
- Starcore, 1998
- 16-bit instruction formats
- VLES (variable length execution set) - two options:
- serial - a two-bit field is allocated in the instruction format of
a subset of the instructions; "00" indicates that the current
instruction is included with the next, other values indicate a stop
- prefix - instructions can also be grouped using one or two prefix
words; the prefix contains a set count and also provides for
conditional execution, access to more registers, and looping
- SC140, 1998 - up to six instructions in an execution set
- SC110, 200x - up to three instructions in an execution set
- execution set determined from encoding during dispatch stage
in a 5-stage pipeline ( prefetch / fetch / dispatch / address
generation / execute )
- each execution set advances as a unit; thus, the longest running
instruction determines the number of cycles its execution set
occupies the execution pipeline stage
- TigerSHARC, 1998
- "static superscalar" - one to four 32-bit instructions can be executed
each cycle from a 128-bit instruction line, most significant bit of
each instruction acts as a stop bit
- minor slotting restrictions (e.g., a conditional or program sequencer
instruction must be placed in the first slot of a line)
- no memory alignment restrictions for instruction lines
- register scoreboard, stalls complete line
- Sun MAJC, 1999
- one to four instructions, count field in first instruction
- retains slotted assignment to function units, each of which is
general purpose
- function units have separate set of local registers and share
a common set of global registers
- load-use and long-latency-operation register scoreboard
- after fetch, align stage prepares for 1-to-4-way issue based
on count field in first unissued instruction
- Tremblay/Chan/Chaudhry/Conigliaro/Tse, "The MAJC architecture:
A synthesis of parallelism and scalability," IEEE Micro,
November-December 2000, pp. 12-25.
- Fujitsu FR-V family, 1999
- each 32-bit instruction has a 1-bit packing flag, acts as a stop bit
- up to four instructions in parallel, "nop insertion and slot
distribution" occur after fetching from the i-cache
- fairly general functional units so slot assignment is not a big issue
- Sukemura, "FR500 VLIW-architecture high-performance embedded
microprocessor," Fujitsu Sci. Tech. Jrnl., June 2000, pp. 31-38.
- Suga/Matsunami, "Introducing the FR500 embedded microprocessor,"
IEEE Micro, July-August 2000, pp. 21-27.
- Aditya/Mahlke/Rau, "Code size minimization and retargetable assembly
for custom EPIC and VLIW instruction formats," HP technical report
HPL-2000-141, Oct. 2000.
- IBM SCISM, early 1990s
- compound units "reflect the parallel issue of instructions"
- 3 instructions per compounded unit (provision made to jump into
the middle of a compound unit)
- compounding can be done by the compiler, at the time of a
page fault, or at the time of i-cache refill
- compound units include tag bits that can indicate dependency
info, e.g., for interlock-collapsing function units
- Vassiliadis/Blaner/Eickemeyer, "SCISM: A scalable compound
instruction set machine," IBM JRD, 38/1, Jan. 1994, pp. 59-78
- apparently never built, but lots of patents
- Transmeta Crusoe, 2000 - six instruction formats (2-4 instructions)
- AA - two ALU instructions
- AB - ALU instruction and branch
- AI - ALU instruction with 32-bit immediate value
- LA - load/store and ALU instruction
- LAAB - load/store, two ALU instructions, branch
- LAAI - load/store, two ALU instructions (one w/ 32-bit immediate)
(Several patents, including U.S. 6,031,992, 2000)
- retrofitting: examples of hardware marking of independence
internally via predecoding and retaining the marking within a decoded
i-cache (i.e., when you move the dependency detection out of the fetch
and decode pipeline stages but not all the way back to compile time due
to instruction set compatibility)
- NS Swordfish, 1991 - instruction pair dependency bit is contained
in each decoded i-cache entry; it is set on i-cache refill by predecode
hardware and yields LIW issue of independent instruction pairs; no bits
are used in the normal instruction format.
- Minagawa/Saito/Aikawa, 1991 - "Pre-decoding mechanism for superscalar
architecture," IEEE Pacific Rim Conf. on Comm., Comp., and Sig. Proc.,
pp. 22-24; on i-cache miss, a predecoder adds instruction grouping
("priority") and function unit assignment fields.
(see also US Patent 5,163,139, "Instruction preprocessor for
conditionally combining short memory instructions into virtual
long instructions")
- HP 7200, 1995 - six predecode bits are added for each double word in
the i-cache; they encode resource conflicts and data dependencies and
are set by a predecoder on i-cache refill.
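The retrofitting approach above, in which predecode hardware marks independence when a line is brought into the i-cache, can be sketched as follows. This is a simplified model under stated assumptions: instructions are reduced to one destination and a tuple of source registers, only read-after-write and write-after-write hazards within an aligned pair are checked, and the names `Instr`, `predecode_line`, and `issue_cycles` are hypothetical.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction

def pair_is_dependent(first, second):
    # Simplifying assumption: only RAW and WAW hazards block dual issue.
    raw = first.dest in second.srcs
    waw = first.dest == second.dest
    return raw or waw

def predecode_line(instrs):
    """Model an i-cache refill: attach a dependency bit to each aligned pair."""
    if len(instrs) % 2 != 0:
        raise ValueError("this sketch expects an even number of instructions")
    entries = []
    for i in range(0, len(instrs), 2):
        first, second = instrs[i], instrs[i + 1]
        entries.append((first, second, pair_is_dependent(first, second)))
    return entries

def issue_cycles(entries):
    # An independent pair issues together in one cycle;
    # a dependent pair is split across two cycles.
    return sum(2 if dependent else 1 for _, _, dependent in entries)
```

The point of the sketch is that the dependency check runs once per refill rather than on every fetch, and the normal instruction format carries no extra bits.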
Historical precedents for prepare-to-branch
- Four aspects of conditional branching
- (1) Condition setting
- condition storage
- single set of bits in PSR or flags register for integer conditions
- second set of bits in FP status register for flt. pt. conditions
- high-perf. implementation problem for inst. sets that use single,
serialized resources [cf. Sites on design of Alpha]
- multiple sets of bits (e.g., RS/6000)
- use of general registers (e.g., MC88110)
- specification of comparands
- explicit compare instruction (basically a subtraction)
- side effect of ALU operation (setting by ALU op is optional in SPARC)
- (2) Decision - logical relation between comparands (eq, ne, lt, le, gt, ge,
flt. pt. unordered)
- (3) Branch target address
- (4) Change PC
- (4A) immediate effect
- (4B) delayed effect - next one or so sequential instructions have already
been fetched and will be executed regardless of branch decision
- (4C) delayed effect with annulling/squashing - sequential instructions already
fetched and may be executed or optionally purged on untaken (e.g., SPARC)
- Packaging these aspects
- compare (1), then conditional branch (2+3+4)
- ALUop side effect (1), then conditional branch (2+3+4)
- compare and branch (1+2+3+4) - may need multiple comparand specifiers
plus the branch address field, although often use reg. vs. 0
- [IBM ACS, 1967] prepare to branch (1+2+3), then exit (4)
- ISA add-ons
- [TI ASC, 1972] prepare to branch --
redundant specification of 3, for prefetch
- [PIPE, 1985 (Pleszkun and Farrens)] prepare to branch --
specify 4B, intended as generalized delayed branch technique where
the PTB instruction would specify the number of delay slots after a
branch instruction (0-7)
Historical precedents for rotating register files
- "A different, programmatically controlled register renaming scheme is
obtained by providing rotating register files, that is, base-displacement
indexing into the register file using an instruction-provided displacement
off a dedicated base register.
Although applicable only for renaming registers across multiple
iterations of a loop, rotating registers have the advantage of being
considerably less expensive in their implementation than are other
renaming schemes." - Rau and Fisher, Jrnl. Supercomputing, 1993, p 22.
- scratch-pad in AP-120B/FPS-164, 1976 (Charlesworth)
- compacting FIFO structure in Polycyclic Architecture at TRW/ESL, 1981 (Rau)
- rotating registers in Cydrome Cydra-5, 1988 (Rau)
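The base-displacement indexing that Rau and Fisher describe can be sketched as a small class. This is a minimal model, not the Cydra 5 or Itanium implementation: the class name is hypothetical, and decrementing the base on each rotation (so that a value written as register d is seen as register d+1 in the next iteration) is an assumption of the sketch.

```python
class RotatingRegisterFile:
    """Register file addressed as (base + displacement) mod size."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0  # the dedicated base register

    def _physical(self, displacement):
        # Instruction-provided displacement off the base register.
        return (self.base + displacement) % len(self.regs)

    def read(self, displacement):
        return self.regs[self._physical(displacement)]

    def write(self, displacement, value):
        self.regs[self._physical(displacement)] = value

    def rotate(self):
        # Performed once per loop iteration. No data is copied; only the
        # base moves, which renames every register in the file at once.
        self.base = (self.base - 1) % len(self.regs)
```

For example, a value written to register 0 in one iteration is read as register 1 after one `rotate()` and as register 2 after two, so successive iterations of a software-pipelined loop can use the same register names without overwriting each other's live values.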
Historical precedents for register stack engine
- Dick Sites' dribble-back registers (1979)
- Hitachi SR2201 preload and poststore ("slide-windowed registers")
My thanks to Harsh Sharangpani for his help; Jason Eckhardt for help
with the i860 and AP-120B descriptions; and Norm Hardy for pointing
me to the Gray, et al., paper.