MIPS R10000 Microprocessor
User's Manual
Version 1.0
Copyright (C) 1995 MIPS Technologies, Inc.
ALL RIGHTS RESERVED
U.S. GOVERNMENT RESTRICTED RIGHTS LEGEND
Use, duplication or disclosure by the Government is subject to restrictions as set forth in FAR 52.227.19(c)(2) or subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and/or in similar or successor clauses in the FAR, or the DOD or NASA FAR Supplement. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94039-7311.
RISCompiler, RISC/os, R2000, R6000, R4000, R4400, and R10000 are trademarks of MIPS Technologies, Inc. MIPS and R3000 are registered trademarks of MIPS Technologies, Inc.
UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd.
MIPS Technologies, Inc.
2011 North Shoreline
Mountain View, California 94039-7311
http://www.mips.com
Acknowledgments
This book represents a consortium of efforts, and is principally derived from material provided by Randy Martin, Yung-Chin Chen, and Ken Yeager.
Thanks also to Randy for his many painstaking reviews of this manual.
Also providing invaluable service were the following:
Shabbir Latif, for once again running point between Engineering and Publications, answering questions, and presenting tutorials to clarify the complicated details of the R10000 processor operations.
Charlie Price, for use of his rejuvenated MIPS-4 Instruction Set Architecture.
Steve Proffitt, for both his technical assistance, and helping handle the multitude of niggling details involved in getting this manual printed.
The following also provided technical help in innumerable ways: Arun Mehta, Tim Layman, Greg Shippen, Yeffi Van Atta, John Brennan, Len Widra, Roy Johnson, Hector Sucar, Hong-Men Su, Mazin Khurshid, Steve Whitney, Doug Yanagawa (chip illustrations and socket pinouts), Mike Gupta, Steven Peltier, Rob Conrad, Hai Nguyen, Bill Voegtli, and Sharad Mehrotra at the University of Illinois.
Remediating a prior deficiency, thanks to Tom McReynolds.
In Production and Creative, thanks to Melissa Miller for her design of the cover; Yen Nguyen, for handling the printing; both Kay Maitz and Beth Fraker for resolving various design issues; and Michael Ritchie for tracking progress.
Joe Heinrich
June, 1995
Mt. View, California
About This Manual
This manual describes the MIPS R10000 RISC microprocessor (also referred to as the processor in this book).
Glossary
Certain specialized terms used in this book are defined in the Glossary at the end of this manual.
Stylistic Conventions
A brief note on some of the stylistic conventions used in this book: bits, fields, and registers of interest from a software perspective are italicized (such as the BE bit in the Config register).
Signal names of more importance from a hardware point of view are rendered in bold (such as Reset*). The asterisk appended to the signal name (as in Reset*) indicates the signal is low-active.
A range of bits uses a colon as a separator; for instance, (15:0) represents the 16-bit range that runs from bit 0, inclusive, through bit 15. In some places an ellipsis (15...0) or partial ellipsis (15..0) may used in place of a colon for visibility.
Unfamiliar terms presented for the first time are printed in bold letters, and are followed as closely as possible by a definition or description.
Getting MIPS Documents On-Line
The information in this manual, and other MIPS-related product information, is also available over the Word Wide Web at:
http://www.mips.com/
Requests can also be e-mailed to webmaster@mti.sgi.com.
MIPS documents are also available for retrieval on-line, through the file transport protocol (FTP). To retrieve them, follow the steps below. The text you are to type is shown in Courier Bold font; the computer's responses are in shown in Courier Regular font.
1. First, place yourself in the directory on your system within which you want to store the retrieved files. Do this by typing:
cd <directory_you_want_file_to_be_in>
2. Access the MIPS document server, sgigate, through FTP by typing:
ftp sgigate.sgi.com
3. The server tells you when you are connected for FTP by responding:
Connected to sgigate.sgi.com.
4. Next (after some announcements) the server asks you to log in by requesting a name and then a password.
Name (sgigate.sgi.com:<login_name>):
5. Login by typing anonymous for your name and your electronic mail address for your password.
Name (sgigate.sgi.com:<login_name>): anonymous
331 Guest login ok, type your name as password.
Password: your_email_address
6. The system indicates you have logged in by supplying an FTP prompt:
ftp>
7. Go to the pub/doc directory by typing:
ftp> cd pub/doc
8. You can take a look at the contents of the doc directory by listing them:
ftp> ls
9. You will find several subdirectories, such as R4200, R8000, and R10000. When you find the subdirectory you want, cd into that subdirectory and retrieve the file you want by typing:
get <filename>
10. This copies the file from sgigate back to your system.
11. When you have retrieved all the files you want, exit from ftp by typing quit:
ftp> quit
12. If the file was encoded for transmission, you must decode it, after retrieval, by typing:
uudecode <filename>
13. If the file was compressed for transmission, you must uncompress it, after retrieval, by typing:
uncompress <filename>
14. If you tarred the file, type:
tar xvof <filename>
Introduction to the R10000 Processor
1
This user's manual describes the R10000 superscalar microprocessor for the system designer, paying special attention to the external interface and the transfer protocols.
This chapter describes the following:
* MIPS ISA
* what makes a generic superscalar microprocessor
* specifics of the R10000 superscalar microprocessor
* implementation-specific CPU instructions
1.1 MIPS Instruction Set Architecture (ISA)
MIPS has defined an instruction set architecture (ISA), implemented in the following sets of CPU designs:
* MIPS I, implemented in the R2000 and R3000
* MIPS II, implemented in the R6000
* MIPS III, implemented in the R4400
* MIPS IV, implemented in the R8000 and R10000
The original MIPS I CPU ISA has been extended forward three times, as shown in Figure 1-1; each extension is backward compatible. The ISA extensions are inclusive; each new architecture level (or version) includes the former levels.*1
Figure 1-1 MIPS ISA with Extensions
The practical result is that a processor implementing MIPS IV is also able to run MIPS I, MIPS II, or MIPS III binary programs without change.
1.2 What is a Superscalar Processor?
A superscalar processor is one that can fetch, execute and complete more than one instruction in parallel.
Pipeline and Superpipeline Architecture
Previous MIPS processors had linear pipeline architectures; an example of such a linear pipeline is the R4400 superpipeline, shown in Figure 1-2. In the R4400 superpipeline architecture, an instruction is executed each cycle of the pipeline clock (PCycle), or each pipe stage.
Figure 1-2 R4400 Pipeline
Superscalar Architecture
The structure of 4-way superscalar pipeline is shown in Figure 1-3. At each stage, four instructions are handled in parallel. Note that there is only one EX stage for integers.
Figure 1-3 4-Way Superscalar Pipeline
1.3 What is an R10000 Microprocessor?
The R10000 processor is a single-chip superscalar RISC microprocessor that is a follow-on to the MIPS RISC processor family that includes, chronologically, the R2000, R3000, R6000, R4400, and R8000.
The R10000 processor uses the MIPS ANDES architecture, or Architecture with Non-sequential Dynamic Execution Scheduling.
The R10000 processor has the following major features (terms in bold are defined in the Glossary):
* it implements the 64-bit MIPS IV instruction set architecture (ISA)
* it can decode four instructions each pipeline cycle, appending them to one of three instruction queues
* it has five execution pipelines connected to separate internal integer and floating-point execution (or functional) units
* it uses dynamic instruction scheduling and out-of-order execution
* it uses speculative instruction issue (also termed "speculative branching")
* it uses a precise exception model (exceptions can be traced back to the instruction that caused them)
* it uses non-blocking caches
* it has separate on-chip 32-Kbyte primary instruction and data caches
* it has individually-optimized secondary cache and System interface ports
* it has an internal controller for the external secondary cache
* it has an internal System interface controller with multiprocessor support
The R10000 processor is implemented using 0.35-micron CMOS VLSI circuitry on a single 17 mm-by-18 mm chip that contains about 6.8 million transistors, including about 4.4 million transistors in its primary caches.
R10000 Superscalar Pipeline
The R10000 superscalar processor fetches and decodes four instructions in parallel each cycle (or pipeline stage). Each pipeline includes stages for fetching (stage 1 in Figure 1-4), decoding (stage 2) issuing instructions (stage 3), reading register operands (stage 3), executing instructions (stages 4 through 6), and storing results (stage 7).
Figure 1-4 Superscalar Pipeline Architecture in the R10000
Instruction Queues
As shown in Figure 1-4, each instruction decoded in stage 2 is appended to one of three instruction queues:
* integer queue
* address queue
* floating-point queue
Execution Pipelines
The three instruction queues can issue (see the Glossary for a definition of issue) one new instruction per cycle to each of the five execution pipelines:
* the integer queue issues instructions to the two integer ALU pipelines
* the address queue issues one instruction to the Load/Store Unit pipeline
* the floating-point queue issues instructions to the floating-point adder and multiplier pipelines
A sixth pipeline, the fetch pipeline, reads and decodes instructions from the instruction cache.
64-bit Integer ALU Pipeline
The 64-bit integer pipeline has the following characteristics:
* it has a 16-entry integer instruction queue that dynamically issues instructions
* it has a 64-bit 64-location integer physical register file, with seven read and three write ports (32 logical registers; see register renaming in the Glossary)
* it has two 64-bit arithmetic logic units:
- ALU1 contains an arithmetic-logic unit, shifter, and integer branch comparator
- ALU2 contains an arithmetic-logic unit, integer multiplier, and divider
Load/Store Pipeline
The load/store pipeline has the following characteristics:
* it has a 16-entry address queue that dynamically issues instructions, and uses the integer register file for base and index registers
* it has a 16-entry address stack for use by non-blocking loads and stores
* it has a 44-bit virtual address calculation unit
* it has a 64-entry fully associative Translation-Lookaside Buffer (TLB), which converts virtual addresses to physical addresses, using a 40-bit physical address. Each entry maps two pages, with sizes ranging from 4 Kbytes to 16 Mbytes, in powers of 4.
64-bit Floating-Point Pipeline
The 64-bit floating-point pipeline has the following characteristics:
* it has a 16-entry instruction queue, with dynamic issue
* it has a 64-bit 64-location floating-point physical register file, with five read and three write ports (32 logical registers)
* it has a 64-bit parallel multiply unit (3-cycle pipeline with 2-cycle latency) which also performs move instructions
* it has a 64-bit add unit (3-cycle pipeline with 2-cycle latency) which handles addition, subtraction, and miscellaneous floating-point operations
* it has separate 64-bit divide and square-root units which can operate concurrently (these units share their issue and completion logic with the floating-point multiplier)
A block diagram of the processor and its interfaces is shown in Figure 1-5, followed by a description of its major logical blocks.
Figure 1-5 Block Diagram of the R10000 Processor
Functional Units
The five execution pipelines allow overlapped instruction execution by issuing instructions to the following five functional units:
* two integer ALUs (ALU1 and ALU2)
* the Load/Store unit (address calculate)
* the floating-point adder
* the floating-point multiplier
There are also three "iterative" units to compute more complex results:
* Integer multiply and divide operations are performed by an Integer Multiply/Divide execution unit; these instructions are issued to ALU2. ALU2 remains busy for the duration of the divide.
* Floating-point divides are performed by the Divide execution unit; these instructions are issued to the floating-point multiplier.
* Floating-point square root are performed by the Square-root execution unit; these instructions are issued to the floating-point multiplier.
Primary Instruction Cache (I-cache)
The primary instruction cache has the following characteristics:
* it contains 32 Kbytes, organized into 16-word blocks, is 2-way set associative, using a least-recently used (LRU) replacement algorithm
* it reads four consecutive instructions per cycle, beginning on any word boundary within a cache block, but cannot fetch across a block boundary.
* its instructions are predecoded, its fields are rearranged, and a 4-bit unit select code is appended
* it checks parity on each word
* it permits non-blocking instruction fetch
Primary Data Cache (D-cache)
The primary data cache has the following characteristics:
* it has two interleaved arrays (two 16 Kbyte ways)
* it contains 32 Kbytes, organized into 8-word blocks, is 2-way set associative, using an LRU replacement algorithm.
* it handles 64-bit load/store operations
* it handles 128-bit refill or write-back operations
* it permits non-blocking loads and stores
* it checks parity on each byte
Instruction Decode And Rename Unit
The instruction decode and rename unit has the following characteristics:
* it processes 4 instructions in parallel
* it replaces logical register numbers with physical register numbers (register renaming)
- it maps integer registers into a 33-word-by-6-bit mapping table that has 4 write and 12 read ports
- it maps floating-point registers into a 32-word-by-6-bit mapping table that has 4 write and 16 read ports
* it has a 32-entry active list of all instructions within the pipeline.
Branch Unit
The branch unit has the following characteristics:
* it allows one branch per cycle
* conditional branches can be executed speculatively, up to 4-deep
* it has a 44-bit adder to compute branch addresses
* it has a 4-quadword branch-resume buffer, used for reversing mispredicted speculatively-taken branches
* the Branch Link Quadword contains four instructions following a subroutine call, for rapid use when returning from leaf subroutines
* it has program trace RAM that stores the program counter for each instruction in the pipeline
External Interfaces
The external interfaces have the following characteristics:
* a 64-bit System interface allows direct-connection for 2-way to
* a secondary cache interface with 128-bit data path and tag fields. 9-bit ECC Error Check and Correction is made on data quadwords, 7-bit ECC is made on tag words. It allows connection to an external secondary cache that can range from 512 Kbytes to 16 Mbytes, using external static RAMs. The secondary cache can be organized into either 16- or 32-word blocks, and is 2-way set associative.
Bit definitions are given in Chapter 3.
1.4 Instruction Queues
The processor keeps decoded instructions in three instruction queues, which dynamically issue instructions to the execution units. The queues allow the processor to fetch instructions at its maximum rate, without stalling because of instruction conflicts or dependencies.
Each queue uses instruction tags to keep track of the instruction in each execution pipeline stage. These tags set a Done bit in the active list as each instruction is completed.
Integer Queue
The integer queue issues instructions to the two integer arithmetic units: ALU1 and ALU2.
The integer queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded integer instructions are written into empty entries in no particular order. Instructions remain in this queue only until they have been issued to an ALU.
Branch and shift instructions can be issued only to ALU1. Integer multiply and divide instructions can be issued only to ALU2. Other integer instructions can be issued to either ALU.
The integer queue controls six dedicated ports to the integer register file: two operand read ports and a destination write port for each ALU.
Floating-Point Queue
The floating-point queue issues instructions to the floating-point multiplier and the floating-point adder.
The floating-point queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded floating-point instructions are written into empty entries in random order. Instructions remain in this queue only until they have been issued to a floating-point execution unit.
The floating-point queue controls six dedicated ports to the floating-point register file: two operand read ports and a destination port for each execution unit.
The floating-point queue uses the multiplier's issue port to issue instructions to the square-root and divide units. These instructions also share the multiplier's register ports.
The floating-point queue contains simple sequencing logic for multiple-pass instructions such as Multiply-Add. These instructions require one pass through the multiplier, then one pass through the adder.
Address Queue
The address queue issues instructions to the load/store unit.
The address queue contains 16 instruction entries. Unlike the other two queues, the address queue is organized as a circular First-In First-Out (FIFO) buffer. A newly decoded load/store instruction is written into the next available sequential empty entry; up to four instructions may be written during each cycle.
The FIFO order maintains the program's original instruction sequence so that memory address dependencies may be easily computed.
Instructions remain in this queue until they have graduated; they cannot be deleted immediately after being issued, since the load/store unit may not be able to complete the operation immediately.
The address queue contains more complex control logic than the other queues. An issued instruction may fail to complete because of a memory dependency, a cache miss, or a resource conflict; in these cases, the queue must continue to reissue the instruction until it is completed.
The address queue has three issue ports:
* First, it issues each instruction once to the address calculation unit. This unit uses a 2-stage pipeline to compute the instruction's memory address and to translate it in the TLB. Addresses are stored in the address stack and in the queue's dependency logic. This port controls two dedicated read ports to the integer register file. If the cache is available, it is accessed at the same time as the TLB. A tag check can be performed even if the data array is busy.
* Second, the address queue can re-issue accesses to the data cache. The queue allocates usage of the four sections of the cache, which consist of the tag and data sections of the two cache banks. Load and store instructions begin with a tag check cycle, which checks to see if the desired address is already in cache. If it is not, a refill operation is initiated, and this instruction waits until it has completed. Load instructions also read and align a doubleword value from the data array. This access may be either concurrent to or subsequent to the tag check. If the data is present and no dependencies exist, the instruction is marked done in the queue.
* Third, the address queue can issue store instructions to the data cache. A store instruction may not modify the data cache until it graduates. Only one store can graduate per cycle, but it may be anywhere within the four oldest instructions, if all previous instructions are already completed.
The access and store ports share four register file ports (integer read and write, floating-point read and write). These shared ports are also used for Jump and Link and Jump Register instructions, and for move instructions between the integer and register files.
1.5 Program Order and Dependencies
From a programmer's perspective, instructions appear to execute sequentially, since they are fetched and graduated in program order (the order they are presented to the processor by software). When an instruction stores a new value in its destination register, that new value is immediately available for use by subsequent instructions.
Internal to the processor, however, instructions are executed dynamically, and some results may not be available for many cycles; yet the hardware must behave as if each instruction is executed sequentially.
This section describes various conditions and dependencies that can arise from them in pipeline operation, including:
* instruction dependencies
* execution order and stalling
* branch prediction and speculative execution
* resolving operand dependencies
* resolving exception dependencies
Instruction Dependencies
Each instruction depends on all previous instructions which produced its operands, because it cannot begin execution until those operands become valid. These dependencies determine the order in which instructions can be executed.
Execution Order and Stalling
The actual execution order depends on the processor's organization; in a typical pipelined processor, instructions are executed only in program order. That is, the next sequential instruction may begin execution during the next cycle, if all of its operands are valid. Otherwise, the pipeline stalls until the operands do become valid.
Since instructions execute in order, stalls usually delay all subsequent instructions.
A clever compiler can improve performance by re-arranging instructions to reduce the frequency of these stall cycles.
* In an in-order superscalar processor, several consecutive instructions may begin execution simultaneously, if all their operands are valid, but the processor stalls at any instruction whose operands are still busy.
* In an out-of-order superscalar processor, such as the R10000, instructions are decoded and stored in queues. Each instruction is eligible to begin execution as soon as its operands become valid, independent of the original instruction sequence. In effect, the hardware rearranges instructions to keep its execution units busy. This process is called dynamic issuing.
Branch Prediction and Speculative Execution
Although one or more instructions may begin execution during each cycle, each instruction takes several (or many) cycles to complete. Thus, when a branch instruction is decoded, its branch condition may not yet be known. However, the R10000 processor can predict whether the branch is taken, and then continue decoding and executing subsequent instructions along the predicted path.
When a branch prediction is wrong, the processor must back up to the original branch and take the other path. This technique is called speculative execution. Whenever the processor discovers a mispredicted branch, it aborts all speculatively-executed instructions and restores to the processor to the state it held before the branch.
Resolving Operand Dependencies
Operands include registers, memory, and condition bits. Each operand type has its own dependency logic. In the R10000 processor, dependencies are resolved in the following manner:
* register dependencies are resolved by using register renaming and the associative comparator circuitry in the queues
* memory dependencies are resolved in the Load/Store Unit
* condition bit dependencies are resolved in the active list and instruction queues
Resolving Exception Dependencies
In addition to operand dependencies, each instruction is implicitly dependent upon any previous instruction that generates an exception. Exceptions are caused whenever an instruction cannot be properly completed, and are usually due to either an untranslated virtual address or an erroneous operand.
The processor design implements precise exceptions, by:
* identifying the instruction which caused the exception
* preventing the exception-causing instruction from graduating
* aborting all subsequent instructions
Thus, all register values remain the same as if instructions were executed singly. Effectively, all previous instructions are completed, but the faulting instruction and all subsequent instructions do not modify any values.
Strong Ordering
A multiprocessor system that exhibits the same behavior as a uniprocessor system in a multiprogramming environment is said to be strongly ordered.
The R10000 processor behaves as if strong ordering is implemented, although it does not actually execute all memory operations in strict program order.
In the R10000 processor, store operations remain pending until the store instruction is ready to graduate. Thus, stores are executed in program order, and memory values are precise following any exception.
For improved performance however, cached load operations my occur in any order, subject to memory dependencies on pending store instructions. To maintain the appearance of strong ordering, the processor detects whenever the reordering of a cached load might alter the operation of the program, backs up, and then re-executes the affected load instructions. Specifically, whenever a primary data cache block is invalidated due to an external coherency request, its index is compared with all outstanding load instructions. If there is a match and the load has been completed, the load is prevented from graduating. When it is ready to graduate, the entire pipeline is flushed, and the processor is restored to the state it had before the load was decoded.
An uncached or uncached accelerated load or store instruction is executed when the instruction is ready to graduate. This guarantees strong ordering for uncached accesses.
Since the R10000 processor behaves as if it implemented strong ordering, a suitable system design allows the processor to be used to create a shared-memory multiprocessor system with strong ordering.
An Example of Strong Ordering
Given that locations X and Y have no particular relationship--that is, they are not in the same cache block--an example of strong ordering is as follows:
* Processor A performs a store to location X and later executes a load from location Y.
* Processor B performs a store to location Y and later executes a load from location X.
The two processors are running asynchronously, and the order of the above two sequences is unknown.
For the system to be strongly ordered, either processor A must load the new value of Y, or processor B must load the new value of X, or both processors A and B must load the new values of Y and X, respectively, under all conditions.
If processors A and B both load old values of Y and X, respectively, under any conditions, the system is not strongly ordered.
1.6 R10000 Pipelines
This section describes the stages of the superscalar pipeline.
Instructions are processed in six partially-independent pipelines, as shown in Figure 1-4. The Fetch pipeline reads instructions from the instruction cache*2, decodes them, renames their registers, and places them in three instruction queues. The instruction queues contain integer, address calculate, and floating-point instructions. From these queues, instructions are dynamically issued to the five pipelined execution units.
In stage 1, the processor fetches four instructions each cycle, independent of their alignment in the instruction cache -- except that the processor cannot fetch across a 16-word cache block boundary. These words are then aligned in the 4-word Instruction register.
If any instructions were left from the previous decode cycle, they are merged with new words from the instruction cache to fill the Instruction register.
Stage 2
In stage 2, the four instructions in the Instruction register are decoded and renamed. (Renaming determines any dependencies between instructions and provides precise exception handling.) When renamed, the logical registers referenced in an instruction are mapped to physical registers. Integer and floating-point registers are renamed independently.
A logical register is mapped to a new physical register whenever that logical register is the destination of an instruction. Thus, when an instruction places a new value in a logical register, that logical register is renamed (mapped) to a new physical register, while its previous value is retained in the old physical register.
As each instruction is renamed, its logical register numbers are compared to determine if any dependencies exist between the four instructions decoded during this cycle. After the physical register numbers become known, the Physical Register Busy table indicates whether or not each operand is valid. The renamed instructions are loaded into integer or floating-point instruction queues.
Only one branch instruction can be executed during stage 2. If the instruction register contains a second branch instruction, this branch is not decoded until the next cycle.
The branch unit determines the next address for the Program Counter; if a branch is taken and then reversed, the branch resume cache provides the instructions to be decoded during the next cycle.
Stage 3
In stage 3, decoded instructions are written into the queues. Stage 3 is also the start of each of the five execution pipelines.
Stages 4-6
In stages 4 through 6, instructions are executed in the various functional units. These units and their execution process are described below.
Floating-Point Multiplier (3-stage Pipeline)
Single- or double-precision multiply and conditional move operations are executed in this unit with a 2-cycle latency and a 1-cycle repeat rate. The multiplication is completed during the first two cycles; the third cycle is used to pack and transfer the result.
Floating-Point Divide and Square-Root Units
Single- or double-precision division and square-root operations can be executed in parallel by separate units. These units share their issue and completion logic with the floating-point multiplier.
Floating-Point Adder (3-stage Pipeline)
Single- or double-precision add, subtract, compare, or convert operations are executed with a 2-cycle latency and a 1-cycle repeat rate. Although a final result is not calculated until the third pipeline stage, internal bypass paths set a 2-cycle latency for dependent add or multiply instructions.
Integer ALU1 (1-stage Pipeline)
Integer add, subtract, shift, and logic operations are executed with a 1-cycle latency and a 1-cycle repeat rate. This ALU also verifies predictions made for branches that are conditional on integer register values.
Integer ALU2 (1-stage Pipeline)
Integer add, subtract, and logic operations are executed with a 1-cycle latency and a 1-cycle repeat rate. Integer multiply and divide operations take more than one cycle.
Address Calculation and Translation in the TLB
A single memory address can be calculated every cycle for use by either an integer or floating-point load or store instruction. Address calculation and load operations can be calculated out of program order.
The calculated address is translated from a 44-bit virtual address into a 40-bit physical address using a translation-lookaside buffer. The TLB contains 64 entries, each of which can translate two pages. Each entry can select a page size ranging from 4 Kbytes to 16 Mbytes, inclusive, in increments of 4, as shown in Figure 1-6.
Figure 1-6 TLB Page Sizes
Load instructions have a 2-cycle latency if the addressed data is already within the data cache.
Store instructions do not modify the data cache or memory until they graduate.
1.7 Implications of R10000 Microarchitecture on Software
The R10000 processor implements the MIPS architecture by using the following techniques to improve throughput:
* superscalar instruction issue
* speculative execution
* non-blocking caches
These microarchitectural techniques have special implications for compilation and code scheduling.
Superscalar Instruction Issue
The R10000 processor has parallel functional units, allowing up to four instructions to be fetched and up to five instructions to be issued or completed each cycle. An ideal code stream would match the fetch bandwidth of the processor with a mix of independent instructions to keep the functional units as busy as possible.
To create this ideal mix, every cycle the hardware would select one instruction from each of the columns below. (Floating-point divide, floating-point square root, integer multiply and integer divide cannot be started on each cycle.) The processor can look ahead in the code, so the mix should be kept close to the ideal described below.
Column A Column B Column C Column D Column E
FPadd FP mul FPload add/sub add/sub
FPdiv FPstore shift mul
FPsqrt load branch div
store logical logical
Data dependencies are detected in hardware, but limit the degree of parallelism that can be achieved. Compilers can intermix instructions from independent code streams.
Speculative Execution
Speculative execution increases parallelism by fetching, issuing, and completing instructions even in the presence of unresolved conditional branches and possible exceptions. Following are some suggestions for increasing program efficiency:
* Compilers should reduce the number of branches as much as possible
* "Jump Register" instructions should be avoided.
* Aggressive use of the new integer and floating point conditional move instructions is recommended.
* Branch prediction rates may be improved by organizing code so that each branch goes the same direction most of the time, since a branch that is taken 50% of the time has higher average cost than one taken 90% of the time. The MIPS IV conditional move instructions may be effective in improving performance by replacing unpredictable branches.
Nonblocking Caches
As processor speed increases, the processor's data latency and bandwidth requirements rise more rapidly than the latency and bandwidth of cost-effective main memory systems. The memory hierarchy of the R10000 processor tries to minimize this effect by using large set-associative caches and higher bandwidth cache refills to reduce the cost of loads, stores, and instruction fetches. Unlike the R4400, the R10000 processor does not stall on data cache misses, instead defers execution of any dependent instructions until the data has been returned and continues to execute independent instructions (including other memory operations that may miss in the cache). Although the R10000 allows a number of outstanding primary and secondary cache misses, compilers should organize code and data to reduce cache misses. When cache misses are inevitable, the data reference should be scheduled as early as possible so that the data can be fetched in parallel with other unrelated operations.
As a further antidote to cache miss stalls, the R10000 processor supports prefetch instructions, which serve as hints to the processor to move data from memory into the secondary and primary caches when possible. Because prefetches do not cause dependency stalls or memory management exceptions, they can be scheduled as soon as the data address can be computed, without affecting exception semantics. Indiscriminate use of prefetch instructions can slow program execution because of the instruction-issue overhead, but selective use of prefetches based on compiler miss prediction can yield significant performance improvement for dense matrix computations.
1.8 R10000-Specific CPU Instructions
This section describes the processor-specific implementations of the following instructions:
* PREF
* LL/SC
* SYNC
Chapter 14
PREF
In the R1000 processor, the Prefetch instruction, PREF, attempts to fetch data into the secondary and primary data caches. The action taken by a Prefetch instruction is controlled by the instruction hint field, as decoded in Table 1-1.
Table 1-1 PREF Instruction Hint Field
For a "store" Prefetch, an Exclusive copy of the cache block must be obtained, in order that it may be written.
LL/SC
Load Linked and Store Conditional instructions are used together to implement a memory semaphore. Each LL/SC sequence has three sections:
1. The LL loads a word from memory.
2. A short sequence of instructions checks or modifies this word. This sequence must not contain any of the events listed below, or the Store Conditional will fail:
* exception
* execution of ERET
* load instruction
* store instruction
* SYNC instruction
* CACHE instruction
* PREF instruction
* external intervention exclusive or invalidate to the secondary cache block containing the linked address
3. The SC stores a new value into the memory word, unless the new value has been modified. If the word has not been modified, the store succeeds and a 1 is stored in the destination register. Otherwise the Store Conditional fails, memory is not modified, and a 0 is loaded into the destination register. Since the instruction format has only a single field to select a data register (rt), this destination register is the same as the register which was stored.
Load Linked and Store Conditional instructions (LL, LLD, SC, and SCD) do not implicitly perform SYNC operations in the R10000 processor.
SYNC
The SYNC instruction is implemented in a "lightweight" manner: after decoding a SYNC instruction, the processor continues to fetch and decode further instructions. It is allowed to issue load and store instructions speculatively and out-of-order, following a SYNC.
The R10000 processor only allows a SYNC instruction to graduate when the following conditions are met:
* all previous instructions have been successfully completed
* the uncached buffer does not contain any uncached stores
* the address cycle of a processor double/single/partial-word write request resulting from an uncached store was not issued to the System interface in any of the prior three SysClk cycles
* the SysGblPerf* signal is asserted
A SYNC instruction is not prevented from graduating if the uncached buffer contains any uncached accelerated stores.
1.9 Performance
As it executes programs, the R10000 superscalar processor performs many operations in parallel. Instructions can also be executed out of order. Together, these two facts greatly improve performance, but they also make it difficult to predict the time required to execute any section of a program, since it often depends on the instruction mix and the critical dependencies between instructions.
The processor has five largely independent execution units, each of which are individualized for a specific class of instructions. Any one of these units may limit processor performance, even as the other units sit idle. If this occurs, instructions which use the idle units can be added to the program without adding any appreciable delay.
User Instruction Latency and Repeat Rate
Table 1-2 shows the latencies and repeat rates for all user instructions executed in ALU1, ALU2, Load/Store, Floating-Point Add and Floating-Point Multiply functional units (definitions of latency and repeat rate are given in the Glossary). Kernel instructions are not included, nor are control instructions not issued to these execution units.
Table 1-2
Please note the following about Table 1-2:
* For integer instructions, conditional trap evaluation takes a single cycle, like conditional branches.
* Branches and conditional moves are not conditionally issued.
* The repeat rate above for Load/Store does not include Load Link and Store Conditional.
* Prefetch instruction is not included here.
* The latency for multiplication and division depends upon the next instruction.
* An instruction using register Lo can be issued one cycle earlier than one using Hi.
* For floating-point instructions, CP1 branches are evaluated in the Graduation Unit.
* Up to four branches can be evaluated at one cycle.*3
* The repeat pattern for the CVT.S.(W/L) is "I I x x I I x x ..."; the repeat rate given here, 2, is the average.
* The latency for MADD instructions is 2 cycles if the result is used as the operand specified by fr of the second MADD instruction.
* Load Linked and Store Conditional instructions (LL, LLD, SC, and SCD) do not implicitly perform SYNC operations in the R10000. Any of the following events that occur between a Load Linked and a Store Conditional will cause the Store Conditional to fail: an exception; execution of an ERET, a load, a store, a SYNC, a CacheOp, a prefetch, or an external intervention/invalidation on the block containing the linked address. Instruction cache misses do not cause the Store Conditional to fail.
For more information about implementations of the LL, SC, and SYNC instructions, please see the section titled, R10000-Specific CPU Instructions, in this chapter.
Other Performance Issues
Table 1-2 shows execution times within the functional units only. Performance may also be affected by instruction fetch times, and especially by the execution of conditional branches.
In an effort to keep the execution units busy, the processor predicts branches and speculatively executes instructions along the predicted path. When the branch is predicted correctly, this significantly improves performance: for typical programs, branch prediction is 85% to 90% correct. When a branch is mispredicted, the processor must discard instructions which were speculatively fetched and executed. Usually, this effort uses resources which otherwise would have been idle, however in some cases speculative instructions can delay previous instructions.
Cache Performance
The execution of load and store instructions can greatly affect performance. These instructions are executed quickly if the required memory block is contained in the primary data cache, otherwise there are significant delays for accessing the secondary cache or main memory. Out-of-order execution and non-blocking caches reduce the performance loss due to these delays, however.
The latency and repeat rates for accessing the secondary cache are summarized in Table 1-3. These rates depend on the ratio of the secondary cache's clock to the processor's internal pipeline clock. The best performance is achieved when the clock rates are equal; slower external clocks add to latency and repeat times.
The primary data cache contains 8-word blocks, which are refilled using 2-cycle transfers from the quadword-wide secondary cache. Latency runs to the time in which the processor can use the addressed data.
The primary instruction cache contains 16-word blocks, which are refilled using 4-cycle transfers.
Table 1-3 Latency and Repeat Rates for Secondary Cache Reads
The processor mitigates access delays to the secondary cache in the following ways:
* The processor can execute up to 16 load and store instructions speculatively and out-of-order, using non-blocking primary and secondary caches. That is, it looks ahead in its instruction stream to find load and store instructions which can be executed early; if the addressed data blocks are not in the primary cache, the processor initiates cache refills as soon as possible.
* If a speculatively executed load initiates a cache refill, the refill is completed even if the load instruction is aborted. It is likely the data will be referenced again.
* The data cache is interleaved between two banks, each of which contains independent tag and data arrays. These four sections can be allocated separately to achieve high utilization. Five separate circuits compete for cache bandwidth (address calculate, tag check, load unit, store unit, external interface.)
* The external interface gives priority to its refill and interrogate operations. The processor can execute tag checks, data reads for load instructions, or data writes for store instructions. When the primary cache is refilled, any required data can be streamed directly to waiting load instructions.
* The external interface can handle up to four non-blocking memory accesses to secondary cache and main memory.
Main memory typically has much longer latencies and lower bandwidth than the secondary cache, which make it difficult for the processor to mitigate their effect. Since main memory accesses are non-blocking, delays can be reduced by overlapping the latency of several operations. However, although the first part of the latency may be concealed, the processor cannot look far enough ahead to hide the entire latency.
Programmers may use pre-fetch instructions to load data into the caches before it is needed, greatly reducing main memory delays for programs which access memory in a predictable sequence.
faSystem Configurations
2
The R10000 processor provides the capability for a wide range of computer systems; this chapter describes some of the uni- and multiprocessor alternatives.
2.1 Uniprocessor Systems
In a typical uniprocessor system, the System interface of the R10000 processor connects in a point-to-point fashion with an external agent. Such a system is shown in Figure 2-1. The external agent is typically an ASIC that provides a gateway to the memory and I/O subsystems; in fact, this ASIC may incorporate the memory controller itself.
If hardware I/O coherency is desired, the external agent may use the multiprocessor primitives provided by the processor to maintain cache coherency for interventions and invalidations. External duplicate tags can be used by the external agent to filter external coherency requests.
Figure 2-1 Uniprocessor System Organization
2.2 Multiprocessor Systems
Two types of multiprocessor systems can be implemented with R10000 processor:
* a dedicated external agent interfaces with each R10000 processor
* up to four R10000 processors and an external agent reside on a cluster bus
Multiprocessor Systems Using Dedicated External Agents
A multiprocessor system may be created with R10000 processors by providing a dedicated external agent for each processor; such a system is shown in Figure 2-2. The external agent provides a path between the processor System interface and some type of coherent interconnect. In such a system, the processor provides support for three coherency schemes:
* snoopy-based
* snoopy-based with external duplicate tags and control
* directory-based with external directory structure and control
Figure 2-2 Multiprocessor System Organization using Dedicated External Agents
Multiprocessor Systems Using a Cluster Bus
A multiprocessor system may be created with R10000 processors by using a cluster bus configuration. Such a system is shown in Figure 2-3. A cluster bus is created by attaching the System interfaces of up to four R10000 processors with an external agent (the cluster coordinator). The cluster coordinator is responsible for managing the flow of data within the cluster.
This organization can reduce the number of ASICs and the pin count needed for a small multiprocessor systems.
The cluster bus protocol supports three coherency schemes:
* snoopy-based
* snoopy-based with external duplicate tags and control
* directory-based with external directory structure and control
Figure 2-3 Multiprocessor System Organization Using the Cluster Bus
Interface Signal Descriptions
3
This chapter gives a list and description of the interface signals.
The R10000 interface signals may be divided into the following groups:
* Power interface
* Secondary Cache interface
* System interface
* Test interface
The following sections present a summary of the external interface signals for each of these groups. An asterisk (*) indicates signals that are asserted as a logical 0.
3.1 Power Interface Signals
Table 3-1 presents the R10000 processor power interface signals.
Table 3-1
3.2 Secondary Cache Interface Signals
Table 3-2 presents the R10000 processor secondary cache interface signals.
Table 3-2
3.3 System Interface Signals
Table 3-3 presents the R10000 processor System interface signals.
Table 3-3 Default
![]()
![]()
![]()
![]()
![]()
4-way multiprocessor systems. 8-bit ECC Error Check and Correction is made on address and data transfers.![]()
Default
Stage 1![]()
![]()
![]()
Latencies and Repeat Rates for User InstructionsDefault
* CTC1 and CFC1 are not included in this table.![]()
![]()
![]()
![]()
![]()
Power Interface Signals![]()
Secondary Cache Interface Signals