Default



MIPS R10000 Microprocessor
User's Manual

Version 1.0

Copyright (C) 1995 MIPS Technologies, Inc.

ALL RIGHTS RESERVED

U.S. GOVERNMENT RESTRICTED RIGHTS LEGEND

Use, duplication or disclosure by the Government is subject to restrictions as set forth in FAR 52.227.19(c)(2) or subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and/or in similar or successor clauses in the FAR, or the DOD or NASA FAR Supplement. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94039-7311.

RISCompiler, RISC/os, R2000, R6000, R4000, R4400, and R10000 are trademarks of MIPS Technologies, Inc. MIPS and R3000 are registered trademarks of MIPS Technologies, Inc.

UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd.

MIPS Technologies, Inc.

2011 North Shoreline

Mountain View, California 94039-7311

http://www.mips.com

Acknowledgments

This book represents a consortium of efforts, and is principally derived from material provided by Randy Martin, Yung-Chin Chen, and Ken Yeager.

Thanks also to Randy for his many painstaking reviews of this manual.

Also providing invaluable service were the following:

Shabbir Latif, for once again running point between Engineering and Publications, answering questions, and presenting tutorials to clarify the complicated details of the R10000 processor operations.

Charlie Price, for use of his rejuvenated MIPS-4 Instruction Set Architecture.

Steve Proffitt, for both his technical assistance, and helping handle the multitude of niggling details involved in getting this manual printed.

The following also provided technical help in innumerable ways: Arun Mehta, Tim Layman, Greg Shippen, Yeffi Van Atta, John Brennan, Len Widra, Roy Johnson, Hector Sucar, Hong-Men Su, Mazin Khurshid, Steve Whitney, Doug Yanagawa (chip illustrations and socket pinouts), Mike Gupta, Steven Peltier, Rob Conrad, Hai Nguyen, Bill Voegtli, and Sharad Mehrotra at the University of Illinois.

Remediating a prior deficiency, thanks to Tom McReynolds.

In Production and Creative, thanks to Melissa Miller for her design of the cover; Yen Nguyen, for handling the printing; both Kay Maitz and Beth Fraker for resolving various design issues; and Michael Ritchie for tracking progress.

Joe Heinrich

June, 1995

Mt. View, California

About This Manual

This manual describes the MIPS R10000 RISC microprocessor (also referred to as the processor in this book).

Glossary

Certain specialized terms used in this book are defined in the Glossary at the end of this manual.

Stylistic Conventions

A brief note on some of the stylistic conventions used in this book: bits, fields, and registers of interest from a software perspective are italicized (such as the BE bit in the Config register).

Signal names of more importance from a hardware point of view are rendered in bold (such as Reset*). The asterisk appended to the signal name (as in Reset*) indicates the signal is low-active.

A range of bits uses a colon as a separator; for instance, (15:0) represents the 16-bit range that runs from bit 0, inclusive, through bit 15. In some places an ellipsis (15...0) or partial ellipsis (15..0) may used in place of a colon for visibility.

Unfamiliar terms presented for the first time are printed in bold letters, and are followed as closely as possible by a definition or description.

Getting MIPS Documents On-Line

The information in this manual, and other MIPS-related product information, is also available over the Word Wide Web at:

http://www.mips.com/

Requests can also be e-mailed to webmaster@mti.sgi.com.

MIPS documents are also available for retrieval on-line, through the file transport protocol (FTP). To retrieve them, follow the steps below. The text you are to type is shown in Courier Bold font; the computer's responses are in shown in Courier Regular font.

1. First, place yourself in the directory on your system within which you want to store the retrieved files. Do this by typing:

cd <directory_you_want_file_to_be_in>

2. Access the MIPS document server, sgigate, through FTP by typing:

ftp sgigate.sgi.com

3. The server tells you when you are connected for FTP by responding:

Connected to sgigate.sgi.com.

4. Next (after some announcements) the server asks you to log in by requesting a name and then a password.

Name (sgigate.sgi.com:<login_name>):

5. Login by typing anonymous for your name and your electronic mail address for your password.

Name (sgigate.sgi.com:<login_name>): anonymous

331 Guest login ok, type your name as password.

Password: your_email_address

6. The system indicates you have logged in by supplying an FTP prompt:

ftp>

7. Go to the pub/doc directory by typing:

ftp> cd pub/doc

8. You can take a look at the contents of the doc directory by listing them:

ftp> ls

9. You will find several subdirectories, such as R4200, R8000, and R10000. When you find the subdirectory you want, cd into that subdirectory and retrieve the file you want by typing:

get <filename>

10. This copies the file from sgigate back to your system.

11. When you have retrieved all the files you want, exit from ftp by typing quit:

ftp> quit

12. If the file was encoded for transmission, you must decode it, after retrieval, by typing:

uudecode <filename>

13. If the file was compressed for transmission, you must uncompress it, after retrieval, by typing:

uncompress <filename>

14. If you tarred the file, type:

tar xvof <filename>

Introduction to the R10000 Processor

1

This user's manual describes the R10000 superscalar microprocessor for the system designer, paying special attention to the external interface and the transfer protocols.

This chapter describes the following:

* MIPS ISA

* what makes a generic superscalar microprocessor

* specifics of the R10000 superscalar microprocessor

* implementation-specific CPU instructions

1.1 MIPS Instruction Set Architecture (ISA)

MIPS has defined an instruction set architecture (ISA), implemented in the following sets of CPU designs:

* MIPS I, implemented in the R2000 and R3000

* MIPS II, implemented in the R6000

* MIPS III, implemented in the R4400

* MIPS IV, implemented in the R8000 and R10000

The original MIPS I CPU ISA has been extended forward three times, as shown in Figure 1-1; each extension is backward compatible. The ISA extensions are inclusive; each new architecture level (or version) includes the former levels.*1

Default

Default



Figure 1-1 MIPS ISA with Extensions

The practical result is that a processor implementing MIPS IV is also able to run MIPS I, MIPS II, or MIPS III binary programs without change.

1.2 What is a Superscalar Processor?

A superscalar processor is one that can fetch, execute and complete more than one instruction in parallel.

Pipeline and Superpipeline Architecture

Previous MIPS processors had linear pipeline architectures; an example of such a linear pipeline is the R4400 superpipeline, shown in Figure 1-2. In the R4400 superpipeline architecture, an instruction is executed each cycle of the pipeline clock (PCycle), or each pipe stage.



Figure 1-2 R4400 Pipeline

Superscalar Architecture

The structure of 4-way superscalar pipeline is shown in Figure 1-3. At each stage, four instructions are handled in parallel. Note that there is only one EX stage for integers.



Figure 1-3 4-Way Superscalar Pipeline

1.3 What is an R10000 Microprocessor?

The R10000 processor is a single-chip superscalar RISC microprocessor that is a follow-on to the MIPS RISC processor family that includes, chronologically, the R2000, R3000, R6000, R4400, and R8000.

The R10000 processor uses the MIPS ANDES architecture, or Architecture with Non-sequential Dynamic Execution Scheduling.

The R10000 processor has the following major features (terms in bold are defined in the Glossary):

* it implements the 64-bit MIPS IV instruction set architecture (ISA)

* it can decode four instructions each pipeline cycle, appending them to one of three instruction queues

* it has five execution pipelines connected to separate internal integer and floating-point execution (or functional) units

* it uses dynamic instruction scheduling and out-of-order execution

* it uses speculative instruction issue (also termed "speculative branching")

* it uses a precise exception model (exceptions can be traced back to the instruction that caused them)

* it uses non-blocking caches

* it has separate on-chip 32-Kbyte primary instruction and data caches

* it has individually-optimized secondary cache and System interface ports

* it has an internal controller for the external secondary cache

* it has an internal System interface controller with multiprocessor support

The R10000 processor is implemented using 0.35-micron CMOS VLSI circuitry on a single 17 mm-by-18 mm chip that contains about 6.8 million transistors, including about 4.4 million transistors in its primary caches.

R10000 Superscalar Pipeline

The R10000 superscalar processor fetches and decodes four instructions in parallel each cycle (or pipeline stage). Each pipeline includes stages for fetching (stage 1 in Figure 1-4), decoding (stage 2) issuing instructions (stage 3), reading register operands (stage 3), executing instructions (stages 4 through 6), and storing results (stage 7).



Figure 1-4 Superscalar Pipeline Architecture in the R10000

Instruction Queues

As shown in Figure 1-4, each instruction decoded in stage 2 is appended to one of three instruction queues:

* integer queue

* address queue

* floating-point queue

Execution Pipelines

The three instruction queues can issue (see the Glossary for a definition of issue) one new instruction per cycle to each of the five execution pipelines:

* the integer queue issues instructions to the two integer ALU pipelines

* the address queue issues one instruction to the Load/Store Unit pipeline

* the floating-point queue issues instructions to the floating-point adder and multiplier pipelines

A sixth pipeline, the fetch pipeline, reads and decodes instructions from the instruction cache.

64-bit Integer ALU Pipeline

The 64-bit integer pipeline has the following characteristics:

* it has a 16-entry integer instruction queue that dynamically issues instructions

* it has a 64-bit 64-location integer physical register file, with seven read and three write ports (32 logical registers; see register renaming in the Glossary)

* it has two 64-bit arithmetic logic units:

- ALU1 contains an arithmetic-logic unit, shifter, and integer branch comparator

- ALU2 contains an arithmetic-logic unit, integer multiplier, and divider

Load/Store Pipeline

The load/store pipeline has the following characteristics:

* it has a 16-entry address queue that dynamically issues instructions, and uses the integer register file for base and index registers

* it has a 16-entry address stack for use by non-blocking loads and stores

* it has a 44-bit virtual address calculation unit

* it has a 64-entry fully associative Translation-Lookaside Buffer (TLB), which converts virtual addresses to physical addresses, using a 40-bit physical address. Each entry maps two pages, with sizes ranging from 4 Kbytes to 16 Mbytes, in powers of 4.

64-bit Floating-Point Pipeline

The 64-bit floating-point pipeline has the following characteristics:

* it has a 16-entry instruction queue, with dynamic issue

* it has a 64-bit 64-location floating-point physical register file, with five read and three write ports (32 logical registers)

* it has a 64-bit parallel multiply unit (3-cycle pipeline with 2-cycle latency) which also performs move instructions

* it has a 64-bit add unit (3-cycle pipeline with 2-cycle latency) which handles addition, subtraction, and miscellaneous floating-point operations

* it has separate 64-bit divide and square-root units which can operate concurrently (these units share their issue and completion logic with the floating-point multiplier)

A block diagram of the processor and its interfaces is shown in Figure 1-5, followed by a description of its major logical blocks.



Figure 1-5 Block Diagram of the R10000 Processor

Functional Units

The five execution pipelines allow overlapped instruction execution by issuing instructions to the following five functional units:

* two integer ALUs (ALU1 and ALU2)

* the Load/Store unit (address calculate)

* the floating-point adder

* the floating-point multiplier

There are also three "iterative" units to compute more complex results:

* Integer multiply and divide operations are performed by an Integer Multiply/Divide execution unit; these instructions are issued to ALU2. ALU2 remains busy for the duration of the divide.

* Floating-point divides are performed by the Divide execution unit; these instructions are issued to the floating-point multiplier.

* Floating-point square root are performed by the Square-root execution unit; these instructions are issued to the floating-point multiplier.

Primary Instruction Cache (I-cache)

The primary instruction cache has the following characteristics:

* it contains 32 Kbytes, organized into 16-word blocks, is 2-way set associative, using a least-recently used (LRU) replacement algorithm

* it reads four consecutive instructions per cycle, beginning on any word boundary within a cache block, but cannot fetch across a block boundary.

* its instructions are predecoded, its fields are rearranged, and a 4-bit unit select code is appended

* it checks parity on each word

* it permits non-blocking instruction fetch

Primary Data Cache (D-cache)

The primary data cache has the following characteristics:

* it has two interleaved arrays (two 16 Kbyte ways)

* it contains 32 Kbytes, organized into 8-word blocks, is 2-way set associative, using an LRU replacement algorithm.

* it handles 64-bit load/store operations

* it handles 128-bit refill or write-back operations

* it permits non-blocking loads and stores

* it checks parity on each byte

Instruction Decode And Rename Unit

The instruction decode and rename unit has the following characteristics:

* it processes 4 instructions in parallel

* it replaces logical register numbers with physical register numbers (register renaming)

- it maps integer registers into a 33-word-by-6-bit mapping table that has 4 write and 12 read ports

- it maps floating-point registers into a 32-word-by-6-bit mapping table that has 4 write and 16 read ports

* it has a 32-entry active list of all instructions within the pipeline.

Branch Unit

The branch unit has the following characteristics:

* it allows one branch per cycle

* conditional branches can be executed speculatively, up to 4-deep

* it has a 44-bit adder to compute branch addresses

* it has a 4-quadword branch-resume buffer, used for reversing mispredicted speculatively-taken branches

* the Branch Link Quadword contains four instructions following a subroutine call, for rapid use when returning from leaf subroutines

* it has program trace RAM that stores the program counter for each instruction in the pipeline

External Interfaces

The external interfaces have the following characteristics:

* a 64-bit System interface allows direct-connection for 2-way to
4-way multiprocessor systems. 8-bit ECC Error Check and Correction is made on address and data transfers.

* a secondary cache interface with 128-bit data path and tag fields. 9-bit ECC Error Check and Correction is made on data quadwords, 7-bit ECC is made on tag words. It allows connection to an external secondary cache that can range from 512 Kbytes to 16 Mbytes, using external static RAMs. The secondary cache can be organized into either 16- or 32-word blocks, and is 2-way set associative.

Bit definitions are given in Chapter 3.

1.4 Instruction Queues

The processor keeps decoded instructions in three instruction queues, which dynamically issue instructions to the execution units. The queues allow the processor to fetch instructions at its maximum rate, without stalling because of instruction conflicts or dependencies.

Each queue uses instruction tags to keep track of the instruction in each execution pipeline stage. These tags set a Done bit in the active list as each instruction is completed.

Integer Queue

The integer queue issues instructions to the two integer arithmetic units: ALU1 and ALU2.

The integer queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded integer instructions are written into empty entries in no particular order. Instructions remain in this queue only until they have been issued to an ALU.

Branch and shift instructions can be issued only to ALU1. Integer multiply and divide instructions can be issued only to ALU2. Other integer instructions can be issued to either ALU.

The integer queue controls six dedicated ports to the integer register file: two operand read ports and a destination write port for each ALU.

Floating-Point Queue

The floating-point queue issues instructions to the floating-point multiplier and the floating-point adder.

The floating-point queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded floating-point instructions are written into empty entries in random order. Instructions remain in this queue only until they have been issued to a floating-point execution unit.

The floating-point queue controls six dedicated ports to the floating-point register file: two operand read ports and a destination port for each execution unit.

The floating-point queue uses the multiplier's issue port to issue instructions to the square-root and divide units. These instructions also share the multiplier's register ports.

The floating-point queue contains simple sequencing logic for multiple-pass instructions such as Multiply-Add. These instructions require one pass through the multiplier, then one pass through the adder.

Address Queue

The address queue issues instructions to the load/store unit.

The address queue contains 16 instruction entries. Unlike the other two queues, the address queue is organized as a circular First-In First-Out (FIFO) buffer. A newly decoded load/store instruction is written into the next available sequential empty entry; up to four instructions may be written during each cycle.

The FIFO order maintains the program's original instruction sequence so that memory address dependencies may be easily computed.

Instructions remain in this queue until they have graduated; they cannot be deleted immediately after being issued, since the load/store unit may not be able to complete the operation immediately.

The address queue contains more complex control logic than the other queues. An issued instruction may fail to complete because of a memory dependency, a cache miss, or a resource conflict; in these cases, the queue must continue to reissue the instruction until it is completed.

The address queue has three issue ports:

* First, it issues each instruction once to the address calculation unit. This unit uses a 2-stage pipeline to compute the instruction's memory address and to translate it in the TLB. Addresses are stored in the address stack and in the queue's dependency logic. This port controls two dedicated read ports to the integer register file. If the cache is available, it is accessed at the same time as the TLB. A tag check can be performed even if the data array is busy.

* Second, the address queue can re-issue accesses to the data cache. The queue allocates usage of the four sections of the cache, which consist of the tag and data sections of the two cache banks. Load and store instructions begin with a tag check cycle, which checks to see if the desired address is already in cache. If it is not, a refill operation is initiated, and this instruction waits until it has completed. Load instructions also read and align a doubleword value from the data array. This access may be either concurrent to or subsequent to the tag check. If the data is present and no dependencies exist, the instruction is marked done in the queue.

* Third, the address queue can issue store instructions to the data cache. A store instruction may not modify the data cache until it graduates. Only one store can graduate per cycle, but it may be anywhere within the four oldest instructions, if all previous instructions are already completed.

The access and store ports share four register file ports (integer read and write, floating-point read and write). These shared ports are also used for Jump and Link and Jump Register instructions, and for move instructions between the integer and register files.

1.5 Program Order and Dependencies

From a programmer's perspective, instructions appear to execute sequentially, since they are fetched and graduated in program order (the order they are presented to the processor by software). When an instruction stores a new value in its destination register, that new value is immediately available for use by subsequent instructions.

Internal to the processor, however, instructions are executed dynamically, and some results may not be available for many cycles; yet the hardware must behave as if each instruction is executed sequentially.

This section describes various conditions and dependencies that can arise from them in pipeline operation, including:

* instruction dependencies

* execution order and stalling

* branch prediction and speculative execution

* resolving operand dependencies

* resolving exception dependencies

Instruction Dependencies

Each instruction depends on all previous instructions which produced its operands, because it cannot begin execution until those operands become valid. These dependencies determine the order in which instructions can be executed.

Execution Order and Stalling

The actual execution order depends on the processor's organization; in a typical pipelined processor, instructions are executed only in program order. That is, the next sequential instruction may begin execution during the next cycle, if all of its operands are valid. Otherwise, the pipeline stalls until the operands do become valid.

Since instructions execute in order, stalls usually delay all subsequent instructions.

A clever compiler can improve performance by re-arranging instructions to reduce the frequency of these stall cycles.

* In an in-order superscalar processor, several consecutive instructions may begin execution simultaneously, if all their operands are valid, but the processor stalls at any instruction whose operands are still busy.

* In an out-of-order superscalar processor, such as the R10000, instructions are decoded and stored in queues. Each instruction is eligible to begin execution as soon as its operands become valid, independent of the original instruction sequence. In effect, the hardware rearranges instructions to keep its execution units busy. This process is called dynamic issuing.

Branch Prediction and Speculative Execution

Although one or more instructions may begin execution during each cycle, each instruction takes several (or many) cycles to complete. Thus, when a branch instruction is decoded, its branch condition may not yet be known. However, the R10000 processor can predict whether the branch is taken, and then continue decoding and executing subsequent instructions along the predicted path.

When a branch prediction is wrong, the processor must back up to the original branch and take the other path. This technique is called speculative execution. Whenever the processor discovers a mispredicted branch, it aborts all speculatively-executed instructions and restores to the processor to the state it held before the branch.

Resolving Operand Dependencies

Operands include registers, memory, and condition bits. Each operand type has its own dependency logic. In the R10000 processor, dependencies are resolved in the following manner:

* register dependencies are resolved by using register renaming and the associative comparator circuitry in the queues

* memory dependencies are resolved in the Load/Store Unit

* condition bit dependencies are resolved in the active list and instruction queues

Resolving Exception Dependencies

In addition to operand dependencies, each instruction is implicitly dependent upon any previous instruction that generates an exception. Exceptions are caused whenever an instruction cannot be properly completed, and are usually due to either an untranslated virtual address or an erroneous operand.

The processor design implements precise exceptions, by:

* identifying the instruction which caused the exception

* preventing the exception-causing instruction from graduating

* aborting all subsequent instructions

Thus, all register values remain the same as if instructions were executed singly. Effectively, all previous instructions are completed, but the faulting instruction and all subsequent instructions do not modify any values.

Strong Ordering

A multiprocessor system that exhibits the same behavior as a uniprocessor system in a multiprogramming environment is said to be strongly ordered.

The R10000 processor behaves as if strong ordering is implemented, although it does not actually execute all memory operations in strict program order.

In the R10000 processor, store operations remain pending until the store instruction is ready to graduate. Thus, stores are executed in program order, and memory values are precise following any exception.

For improved performance however, cached load operations my occur in any order, subject to memory dependencies on pending store instructions. To maintain the appearance of strong ordering, the processor detects whenever the reordering of a cached load might alter the operation of the program, backs up, and then re-executes the affected load instructions. Specifically, whenever a primary data cache block is invalidated due to an external coherency request, its index is compared with all outstanding load instructions. If there is a match and the load has been completed, the load is prevented from graduating. When it is ready to graduate, the entire pipeline is flushed, and the processor is restored to the state it had before the load was decoded.

An uncached or uncached accelerated load or store instruction is executed when the instruction is ready to graduate. This guarantees strong ordering for uncached accesses.

Since the R10000 processor behaves as if it implemented strong ordering, a suitable system design allows the processor to be used to create a shared-memory multiprocessor system with strong ordering.

An Example of Strong Ordering

Given that locations X and Y have no particular relationship--that is, they are not in the same cache block--an example of strong ordering is as follows:

* Processor A performs a store to location X and later executes a load from location Y.

* Processor B performs a store to location Y and later executes a load from location X.

The two processors are running asynchronously, and the order of the above two sequences is unknown.

For the system to be strongly ordered, either processor A must load the new value of Y, or processor B must load the new value of X, or both processors A and B must load the new values of Y and X, respectively, under all conditions.

If processors A and B both load old values of Y and X, respectively, under any conditions, the system is not strongly ordered.



1.6 R10000 Pipelines

This section describes the stages of the superscalar pipeline.

Instructions are processed in six partially-independent pipelines, as shown in Figure 1-4. The Fetch pipeline reads instructions from the instruction cache*2, decodes them, renames their registers, and places them in three instruction queues. The instruction queues contain integer, address calculate, and floating-point instructions. From these queues, instructions are dynamically issued to the five pipelined execution units.

Default

Default

Stage 1

In stage 1, the processor fetches four instructions each cycle, independent of their alignment in the instruction cache -- except that the processor cannot fetch across a 16-word cache block boundary. These words are then aligned in the 4-word Instruction register.

If any instructions were left from the previous decode cycle, they are merged with new words from the instruction cache to fill the Instruction register.

Stage 2

In stage 2, the four instructions in the Instruction register are decoded and renamed. (Renaming determines any dependencies between instructions and provides precise exception handling.) When renamed, the logical registers referenced in an instruction are mapped to physical registers. Integer and floating-point registers are renamed independently.

A logical register is mapped to a new physical register whenever that logical register is the destination of an instruction. Thus, when an instruction places a new value in a logical register, that logical register is renamed (mapped) to a new physical register, while its previous value is retained in the old physical register.

As each instruction is renamed, its logical register numbers are compared to determine if any dependencies exist between the four instructions decoded during this cycle. After the physical register numbers become known, the Physical Register Busy table indicates whether or not each operand is valid. The renamed instructions are loaded into integer or floating-point instruction queues.

Only one branch instruction can be executed during stage 2. If the instruction register contains a second branch instruction, this branch is not decoded until the next cycle.

The branch unit determines the next address for the Program Counter; if a branch is taken and then reversed, the branch resume cache provides the instructions to be decoded during the next cycle.

Stage 3

In stage 3, decoded instructions are written into the queues. Stage 3 is also the start of each of the five execution pipelines.

Stages 4-6

In stages 4 through 6, instructions are executed in the various functional units. These units and their execution process are described below.

Floating-Point Multiplier (3-stage Pipeline)

Single- or double-precision multiply and conditional move operations are executed in this unit with a 2-cycle latency and a 1-cycle repeat rate. The multiplication is completed during the first two cycles; the third cycle is used to pack and transfer the result.

Floating-Point Divide and Square-Root Units

Single- or double-precision division and square-root operations can be executed in parallel by separate units. These units share their issue and completion logic with the floating-point multiplier.

Floating-Point Adder (3-stage Pipeline)

Single- or double-precision add, subtract, compare, or convert operations are executed with a 2-cycle latency and a 1-cycle repeat rate. Although a final result is not calculated until the third pipeline stage, internal bypass paths set a 2-cycle latency for dependent add or multiply instructions.

Integer ALU1 (1-stage Pipeline)

Integer add, subtract, shift, and logic operations are executed with a 1-cycle latency and a 1-cycle repeat rate. This ALU also verifies predictions made for branches that are conditional on integer register values.

Integer ALU2 (1-stage Pipeline)

Integer add, subtract, and logic operations are executed with a 1-cycle latency and a 1-cycle repeat rate. Integer multiply and divide operations take more than one cycle.

Address Calculation and Translation in the TLB

A single memory address can be calculated every cycle for use by either an integer or floating-point load or store instruction. Address calculation and load operations can be calculated out of program order.

The calculated address is translated from a 44-bit virtual address into a 40-bit physical address using a translation-lookaside buffer. The TLB contains 64 entries, each of which can translate two pages. Each entry can select a page size ranging from 4 Kbytes to 16 Mbytes, inclusive, in increments of 4, as shown in Figure 1-6.



Figure 1-6 TLB Page Sizes

Load instructions have a 2-cycle latency if the addressed data is already within the data cache.

Store instructions do not modify the data cache or memory until they graduate.

1.7 Implications of R10000 Microarchitecture on Software

The R10000 processor implements the MIPS architecture by using the following techniques to improve throughput:

* superscalar instruction issue

* speculative execution

* non-blocking caches

These microarchitectural techniques have special implications for compilation and code scheduling.

Superscalar Instruction Issue

The R10000 processor has parallel functional units, allowing up to four instructions to be fetched and up to five instructions to be issued or completed each cycle. An ideal code stream would match the fetch bandwidth of the processor with a mix of independent instructions to keep the functional units as busy as possible.

To create this ideal mix, every cycle the hardware would select one instruction from each of the columns below. (Floating-point divide, floating-point square root, integer multiply and integer divide cannot be started on each cycle.) The processor can look ahead in the code, so the mix should be kept close to the ideal described below.

Column A Column B Column C Column D Column E

FPadd FP mul FPload add/sub add/sub

FPdiv FPstore shift mul

FPsqrt load branch div

store logical logical

Data dependencies are detected in hardware, but limit the degree of parallelism that can be achieved. Compilers can intermix instructions from independent code streams.

Speculative Execution

Speculative execution increases parallelism by fetching, issuing, and completing instructions even in the presence of unresolved conditional branches and possible exceptions. Following are some suggestions for increasing program efficiency:

* Compilers should reduce the number of branches as much as possible

* "Jump Register" instructions should be avoided.

* Aggressive use of the new integer and floating point conditional move instructions is recommended.

* Branch prediction rates may be improved by organizing code so that each branch goes the same direction most of the time, since a branch that is taken 50% of the time has higher average cost than one taken 90% of the time. The MIPS IV conditional move instructions may be effective in improving performance by replacing unpredictable branches.

Nonblocking Caches

As processor speed increases, the processor's data latency and bandwidth requirements rise more rapidly than the latency and bandwidth of cost-effective main memory systems. The memory hierarchy of the R10000 processor tries to minimize this effect by using large set-associative caches and higher bandwidth cache refills to reduce the cost of loads, stores, and instruction fetches. Unlike the R4400, the R10000 processor does not stall on data cache misses, instead defers execution of any dependent instructions until the data has been returned and continues to execute independent instructions (including other memory operations that may miss in the cache). Although the R10000 allows a number of outstanding primary and secondary cache misses, compilers should organize code and data to reduce cache misses. When cache misses are inevitable, the data reference should be scheduled as early as possible so that the data can be fetched in parallel with other unrelated operations.

As a further antidote to cache miss stalls, the R10000 processor supports prefetch instructions, which serve as hints to the processor to move data from memory into the secondary and primary caches when possible. Because prefetches do not cause dependency stalls or memory management exceptions, they can be scheduled as soon as the data address can be computed, without affecting exception semantics. Indiscriminate use of prefetch instructions can slow program execution because of the instruction-issue overhead, but selective use of prefetches based on compiler miss prediction can yield significant performance improvement for dense matrix computations.

1.8 R10000-Specific CPU Instructions

This section describes the processor-specific implementations of the following instructions:

* PREF

* LL/SC

* SYNC

Chapter 14 describes the CP0-specific instructions, and Chapter 15 describes the FPU-specific instructions.

PREF

In the R1000 processor, the Prefetch instruction, PREF, attempts to fetch data into the secondary and primary data caches. The action taken by a Prefetch instruction is controlled by the instruction hint field, as decoded in Table 1-1.

Table 1-1 PREF Instruction Hint Field

For a "store" Prefetch, an Exclusive copy of the cache block must be obtained, in order that it may be written.

LL/SC

Load Linked and Store Conditional instructions are used together to implement a memory semaphore. Each LL/SC sequence has three sections:

1. The LL loads a word from memory.

2. A short sequence of instructions checks or modifies this word. This sequence must not contain any of the events listed below, or the Store Conditional will fail:

* exception

* execution of ERET

* load instruction

* store instruction

* SYNC instruction

* CACHE instruction

* PREF instruction

* external intervention exclusive or invalidate to the secondary cache block containing the linked address

3. The SC stores a new value into the memory word, unless the new value has been modified. If the word has not been modified, the store succeeds and a 1 is stored in the destination register. Otherwise the Store Conditional fails, memory is not modified, and a 0 is loaded into the destination register. Since the instruction format has only a single field to select a data register (rt), this destination register is the same as the register which was stored.

Load Linked and Store Conditional instructions (LL, LLD, SC, and SCD) do not implicitly perform SYNC operations in the R10000 processor.

SYNC

The SYNC instruction is implemented in a "lightweight" manner: after decoding a SYNC instruction, the processor continues to fetch and decode further instructions. It is allowed to issue load and store instructions speculatively and out-of-order, following a SYNC.

The R10000 processor only allows a SYNC instruction to graduate when the following conditions are met:

* all previous instructions have been successfully completed

* the uncached buffer does not contain any uncached stores

* the address cycle of a processor double/single/partial-word write request resulting from an uncached store was not issued to the System interface in any of the prior three SysClk cycles

* the SysGblPerf* signal is asserted

A SYNC instruction is not prevented from graduating if the uncached buffer contains any uncached accelerated stores.

1.9 Performance

As it executes programs, the R10000 superscalar processor performs many operations in parallel. Instructions can also be executed out of order. Together, these two facts greatly improve performance, but they also make it difficult to predict the time required to execute any section of a program, since it often depends on the instruction mix and the critical dependencies between instructions.

The processor has five largely independent execution units, each of which are individualized for a specific class of instructions. Any one of these units may limit processor performance, even as the other units sit idle. If this occurs, instructions which use the idle units can be added to the program without adding any appreciable delay.

User Instruction Latency and Repeat Rate

Table 1-2 shows the latencies and repeat rates for all user instructions executed in ALU1, ALU2, Load/Store, Floating-Point Add and Floating-Point Multiply functional units (definitions of latency and repeat rate are given in the Glossary). Kernel instructions are not included, nor are control instructions not issued to these execution units.

Table 1-2

Latencies and Repeat Rates for User Instructions

Please note the following about Table 1-2:

* For integer instructions, conditional trap evaluation takes a single cycle, like conditional branches.

* Branches and conditional moves are not conditionally issued.

* The repeat rate above for Load/Store does not include Load Link and Store Conditional.

* Prefetch instruction is not included here.

* The latency for multiplication and division depends upon the next instruction.

* An instruction using register Lo can be issued one cycle earlier than one using Hi.

* For floating-point instructions, CP1 branches are evaluated in the Graduation Unit.

* Up to four branches can be evaluated at one cycle.*3

Default

Default

* CTC1 and CFC1 are not included in this table.

* The repeat pattern for the CVT.S.(W/L) is "I I x x I I x x ..."; the repeat rate given here, 2, is the average.

* The latency for MADD instructions is 2 cycles if the result is used as the operand specified by fr of the second MADD instruction.

* Load Linked and Store Conditional instructions (LL, LLD, SC, and SCD) do not implicitly perform SYNC operations in the R10000. Any of the following events that occur between a Load Linked and a Store Conditional will cause the Store Conditional to fail: an exception; execution of an ERET, a load, a store, a SYNC, a CacheOp, a prefetch, or an external intervention/invalidation on the block containing the linked address. Instruction cache misses do not cause the Store Conditional to fail.

For more information about implementations of the LL, SC, and SYNC instructions, please see the section titled, R10000-Specific CPU Instructions, in this chapter.

Other Performance Issues

Table 1-2 shows execution times within the functional units only. Performance may also be affected by instruction fetch times, and especially by the execution of conditional branches.

In an effort to keep the execution units busy, the processor predicts branches and speculatively executes instructions along the predicted path. When the branch is predicted correctly, this significantly improves performance: for typical programs, branch prediction is 85% to 90% correct. When a branch is mispredicted, the processor must discard instructions which were speculatively fetched and executed. Usually, this effort uses resources which otherwise would have been idle, however in some cases speculative instructions can delay previous instructions.

Cache Performance

The execution of load and store instructions can greatly affect performance. These instructions are executed quickly if the required memory block is contained in the primary data cache, otherwise there are significant delays for accessing the secondary cache or main memory. Out-of-order execution and non-blocking caches reduce the performance loss due to these delays, however.

The latency and repeat rates for accessing the secondary cache are summarized in Table 1-3. These rates depend on the ratio of the secondary cache's clock to the processor's internal pipeline clock. The best performance is achieved when the clock rates are equal; slower external clocks add to latency and repeat times.

The primary data cache contains 8-word blocks, which are refilled using 2-cycle transfers from the quadword-wide secondary cache. Latency runs to the time in which the processor can use the addressed data.

The primary instruction cache contains 16-word blocks, which are refilled using 4-cycle transfers.

Table 1-3 Latency and Repeat Rates for Secondary Cache Reads

The processor mitigates access delays to the secondary cache in the following ways:

* The processor can execute up to 16 load and store instructions speculatively and out-of-order, using non-blocking primary and secondary caches. That is, it looks ahead in its instruction stream to find load and store instructions which can be executed early; if the addressed data blocks are not in the primary cache, the processor initiates cache refills as soon as possible.

* If a speculatively executed load initiates a cache refill, the refill is completed even if the load instruction is aborted. It is likely the data will be referenced again.

* The data cache is interleaved between two banks, each of which contains independent tag and data arrays. These four sections can be allocated separately to achieve high utilization. Five separate circuits compete for cache bandwidth (address calculate, tag check, load unit, store unit, external interface.)

* The external interface gives priority to its refill and interrogate operations. The processor can execute tag checks, data reads for load instructions, or data writes for store instructions. When the primary cache is refilled, any required data can be streamed directly to waiting load instructions.

* The external interface can handle up to four non-blocking memory accesses to secondary cache and main memory.

Main memory typically has much longer latencies and lower bandwidth than the secondary cache, which make it difficult for the processor to mitigate their effect. Since main memory accesses are non-blocking, delays can be reduced by overlapping the latency of several operations. However, although the first part of the latency may be concealed, the processor cannot look far enough ahead to hide the entire latency.

Programmers may use pre-fetch instructions to load data into the caches before it is needed, greatly reducing main memory delays for programs which access memory in a predictable sequence.

faSystem Configurations

2

The R10000 processor provides the capability for a wide range of computer systems; this chapter describes some of the uni- and multiprocessor alternatives.

2.1 Uniprocessor Systems

In a typical uniprocessor system, the System interface of the R10000 processor connects in a point-to-point fashion with an external agent. Such a system is shown in Figure 2-1. The external agent is typically an ASIC that provides a gateway to the memory and I/O subsystems; in fact, this ASIC may incorporate the memory controller itself.

If hardware I/O coherency is desired, the external agent may use the multiprocessor primitives provided by the processor to maintain cache coherency for interventions and invalidations. External duplicate tags can be used by the external agent to filter external coherency requests.



Figure 2-1 Uniprocessor System Organization

2.2 Multiprocessor Systems

Two types of multiprocessor systems can be implemented with R10000 processor:

* a dedicated external agent interfaces with each R10000 processor

* up to four R10000 processors and an external agent reside on a cluster bus

Multiprocessor Systems Using Dedicated External Agents

A multiprocessor system may be created with R10000 processors by providing a dedicated external agent for each processor; such a system is shown in Figure 2-2. The external agent provides a path between the processor System interface and some type of coherent interconnect. In such a system, the processor provides support for three coherency schemes:

* snoopy-based

* snoopy-based with external duplicate tags and control

* directory-based with external directory structure and control



Figure 2-2 Multiprocessor System Organization using Dedicated External Agents

Multiprocessor Systems Using a Cluster Bus

A multiprocessor system may be created with R10000 processors by using a cluster bus configuration. Such a system is shown in Figure 2-3. A cluster bus is created by attaching the System interfaces of up to four R10000 processors with an external agent (the cluster coordinator). The cluster coordinator is responsible for managing the flow of data within the cluster.

This organization can reduce the number of ASICs and the pin count needed for a small multiprocessor systems.

The cluster bus protocol supports three coherency schemes:

* snoopy-based

* snoopy-based with external duplicate tags and control

* directory-based with external directory structure and control



Figure 2-3 Multiprocessor System Organization Using the Cluster Bus

Interface Signal Descriptions

3

This chapter gives a list and description of the interface signals.

The R10000 interface signals may be divided into the following groups:

* Power interface

* Secondary Cache interface

* System interface

* Test interface

The following sections present a summary of the external interface signals for each of these groups. An asterisk (*) indicates signals that are asserted as a logical 0.

3.1 Power Interface Signals

Table 3-1 presents the R10000 processor power interface signals.

Table 3-1

Power Interface Signals

3.2 Secondary Cache Interface Signals

Table 3-2 presents the R10000 processor secondary cache interface signals.

Table 3-2

Secondary Cache Interface Signals

3.3 System Interface Signals

Table 3-3 presents the R10000 processor System interface signals.

Table 3-3 System Interface Signals

Table 3-3 (cont.) System Interface Signals

3.4 Test Interface Signals

Table 3-4 presents the R10000 processor test interface signals.

Table 3-4

Test Interface Signals

Cache Organization and Coherency

4

The processor implements a two-level cache structure consisting of separate primary instruction and data caches and a joint secondary cache.

Each cache is two-way set associative and uses a write back protocol; that is, two cache blocks are assigned to each set (as shown in Figure 4-1), and a cache store writes data into the cache instead of writing it directly to memory. Some time later this data is independently written to memory.

A write-invalidate cache coherency protocol (described later in this chapter) is supported through a set of cache states and external coherency requests.

4.1 Primary Instruction Cache

The processor has an on-chip 32-Kbyte primary instruction cache (also referred to simply as the instruction cache), which is a subset of the secondary cache. Organization of the instruction cache is shown in Figure 4-1.

The instruction cache has a fixed block size of 16 words and is two-way set associative with a least-recently-used (LRU) replacement algorithm.*4

Default

Default

The instruction cache is indexed with a virtual address and tagged with a physical address.



Figure 4-1 Organization of Primary Instruction Cache

Each instruction cache block is in one of the following two states:

* Invalid

* Valid

An instruction cache block can be changed from one state to the other as a result of any one of the following events:

* a primary instruction cache read miss

* subset property enforcement

* any of various CACHE instructions

* external intervention exclusive and invalidate requests

These events are illustrated in Figure 4-2, which shows the primary instruction cache state diagram.



Figure 4-2 Primary Instruction Cache State Diagram

4.2 Primary Data Cache

The processor has an on-chip 32-Kbyte primary data cache (also referred to simply as the data cache), which is a subset of the secondary cache. The data cache uses a fixed block size of 8 words and is two-way set associative (that is, two cache blocks are assigned to each set, as shown in Figure 4-3) with an LRU replacement algorithm.*5

Default

Default



Figure 4-3 Organization of Primary Data Cache

The data cache uses a write back protocol, which means a cache store writes data into the cache instead of writing it directly to memory. Sometime later this data is independently written to memory, as shown in Figure 4-4.



Figure 4-4 Write Back Protocol

Write back from the primary data cache goes to the secondary cache, and write back from the secondary cache goes to main memory, through the system interface. The primary data cache is written back to the secondary cache before the secondary cache is written back to the system interface.

The data cache is indexed with a virtual address and tagged with a physical address. Each primary cache block is in one of the following four states:

* Invalid

* CleanExclusive

* DirtyExclusive

* Shared

A primary data cache block is said to be Inconsistent when the data in the primary cache has been modified from the corresponding data in the secondary cache. The primary data cache is maintained as a subset of the secondary cache where the state of a block in the primary data cache always matches the state of the corresponding block in the secondary cache.

A data cache block can be changed from one state to another as a result of any one of the following events:

* primary data cache read/write miss

* primary data cache write hit

* subset enforcement

* a CACHE instruction

* external intervention shared request

* intervention exclusive request

* invalidate request

These events are illustrated in Figure 4-5, which shows the primary data cache state diagram.



Figure 4-5 Primary Data Cache State Diagram

4.3 Secondary Cache

The R10000 processor must have an external secondary cache, ranging in size from 512 Kbytes to 16 Mbytes, in powers of 2, as set by the SCSize mode bit. The SCBlkSize mode bit selects a block size of either 16 or 32 words.

The secondary cache is two-way set associative (that is, two cache blocks are assigned to each set, as shown in Figure 4-6) with an LRU replacement algorithm.*6

Default

Default

The secondary cache uses a write back protocol, which means a cache store writes data into the cache instead of writing it directly to memory. Some time later this data is independently written to memory.

The secondary cache is indexed with a physical address and tagged with a physical address.



Figure 4-6 Organization of Secondary Cache

Each secondary cache block is in one of the following four states:

* Invalid

* CleanExclusive

* DirtyExclusive

* Shared

A secondary cache block can be changed from one state to another as a result of any of the following events:

* primary cache read/write miss

* primary cache write hit to a Shared or CleanExclusive block

* secondary cache read miss

* secondary cache write hit to a Shared or CleanExclusive block

* a CACHE instruction

* external intervention shared request

* intervention exclusive request

* invalidate request

These events are illustrated in Figure 4-7, which shows the secondary cache state diagram.



Figure 4-7 Secondary Cache State Diagram

4.4 Cache Algorithms

The behavior of the processor when executing load and store instructions is determined by the cache algorithm specified for the accessed address. The processor supports five different cache algorithms:

* uncached

* cacheable noncoherent

* cacheable coherent exclusive

* cacheable coherent exclusive on write

* uncached accelerated

Cache algorithms are specified in three separate places, depending upon the access:

* the cache algorithm for the mapped address space is specified on a per-page basis by the 3-bit cache algorithm field in the TLB

* the cache algorithm for the kseg0 address space is specified by the 3-bit K0 field of the CP0 Config register

* the cache algorithm for the xkphys address space is specified by VA[61:59]

Table 4-1 presents the encoding of the 3-bit cache algorithm field used in the TLB; EntryLo0 and EntryLo1 registers; CP0 Config register K0 field for the kseg0 address space; and VA[61:59] for the xkphys address space.

Table 4-1

Cache Algorithm Field Encodings

Descriptions of the Cache Algorithms

This section describes the cache algorithms listed in Table 4-1.

Uncached

Loads and stores under the Uncached cache algorithm bypass the primary and secondary caches. They are issued directly to the System interface using processor double/single/partial-word read or write requests.

Cacheable Noncoherent

Under the Cacheable noncoherent cache algorithm, load and store secondary cache misses result in processor noncoherent block read requests. External agents containing caches need not perform a coherency check for such processor requests.

Cacheable Coherent Exclusive

Under the Cacheable coherent exclusive cache algorithm, load and store secondary cache misses result in processor coherent block read exclusive requests. Such processor requests indicate to external agents containing caches that a coherency check must be performed and that the cache block must be returned in an Exclusive state.

Cacheable Coherent Exclusive on Write

The Cacheable coherent exclusive on write cache algorithm is similar to the Cacheable coherent exclusive cache algorithm except that load secondary cache misses result in processor coherent block read shared requests. Such processor requests indicate to external agents containing caches that a coherency check must be performed and that the cache block may be returned in either a Shared or Exclusive state.

Store hits to a Shared block result in a processor upgrade request. This indicates to external agents containing caches that the block must be invalidated.

Uncached Accelerated

The R10000 processor implements a new cache algorithm, Uncached accelerated. This allows the kernel to mark the TLB entries for certain regions of the physical address space, or certain blocks of data, as uncached while signalling to the hardware that data movement optimizations are permissible. This permits the hardware implementation to gather a number of uncached writes together, either a series of writes to the same address or sequential writes to all addresses in the block, into an uncached accelerated buffer and then issue them to the system interface as processor block write requests. The uncached accelerated algorithm differs from the uncached algorithm in that block write gathering is not performed.

There is no difference between an uncached accelerated load and an uncached load. Only word or doubleword stores can take advantage of this mode.

Stores under the Uncached accelerated cache algorithm bypass the primary and secondary caches. Stores to identical or sequential addresses are gathered in the uncached buffer, described in Chapter 6, Uncached Buffer.

Completely gathered uncached accelerated blocks are issued to the System interface as processor block write requests. Incompletely gathered uncached accelerated blocks are issued to the System interface using processor double/single-word write requests; this is also described in Chapter 6, Uncached Buffer.

4.5 Relationship Between Cached and Uncached Operations

Uncached and uncached accelerated load and store instructions are executed in order, and non-speculatively. Such accesses are buffered in the uncached buffer by the processor until they can be issued to the System interface.

All uncached and uncached accelerated accesses retain program order within the uncached buffer. The processor continues issuing cached accesses while uncached accesses are queued in the uncached buffer.

NOTE: Cached accesses do not probe the uncached buffer for conflicts.

Buffered uncached stores prevent a SYNC instruction from graduating. However buffered uncached accelerated stores do not prevent a SYNC instruction from graduating. The processor continues issuing cached accesses speculatively and out of order beyond a SYNC instruction that is waiting to graduate.

An uncached load may be used to guarantee that the uncached buffer is flushed of all uncached and uncached accelerated accesses.

A SYNC instruction and the SysGblPerf* signal may be used to guarantee that all cache accesses and uncached stores have been globally performed as described in Chapter 6, SysGblPerf* Signal.

An uncached load followed by a SYNC instruction may be used to guarantee that all cache accesses, uncached accesses, and uncached accelerated accesses have been globally performed.

4.6 Cache Algorithms and Processor Requests

The cache algorithm determines the type of processor request generated for secondary cache load misses, secondary cache store misses, and store hits.
Table 4-2 presents the relationship between the cache algorithm and processor requests.

Table 4-2 Cache Algorithms and Processor Requests

4.7 Cache Block Ownership

The processor requires cache blocks to have a single owner at all times. The owner is responsible for providing the current contents of the cache block to any requestor.

The processor uses the following ownership rules:

* The processor assumes ownership of a cache block if the state of the cache block becomes DirtyExclusive. For a processor block read request, the processor assumes ownership of the block after receiving the last doubleword of a DirtyExclusive external block data response and an external ACK completion response. For a processor upgrade request, the processor assumes ownership of the block after receiving an external ACK completion response.

* The processor gives up ownership of a cache block if the state of the cache block changes to Invalid, CleanExclusive, or Shared.

* CleanExclusive and Shared cache blocks are always considered to be owned by memory.

Set

Virtual

Index

Secondary Cache Interface

5

The processor supports a mandatory secondary cache by providing an internal secondary cache controller with a dedicated secondary cache port.

The cache's tag and data arrays each consist of an external bank of industry-standard synchronous SRAM (SSRAM). This SSRAM must have registered inputs and outputs, asynchronous output enables, and use the late write protocol (data is expected one cycle after the address).

5.1 Tag and Data Arrays

The secondary cache consists of a 138-bit wide data array (128 data bits + 9 ECC bits + 1 parity bit) and a 33-bit wide tag array (26 tag bits + 7 ECC bits), as shown in Figure 5-1. ECC is supported for both the data and tag arrays to improve data integrity.



Figure 5-1 Secondary Cache Data and Tag Array

The secondary cache is implemented as a two-way set associative, combined instruction/data cache, which is physically addressed and physically tagged, as described in Chapter 4.

The SCSize mode bits specify the secondary cache size; minimum secondary cache size is 512 Kbytes and the maximum secondary cache size is 16 Mbytes, in increments (512 Kbytes, 1 Mbyte, 2 Mbytes, etc.).

The SCBlkSize mode bit specifies the secondary cache block size. When negated, the block size is 16 words, and when asserted, the block size is 32 words.

5.2 Secondary Cache Interface Frequencies

The secondary cache interface operates at the frequency of SCClk, which is derived from PClk. The SCClkDiv mode bits select a PClk to SCClk divisor of 1, 1.5, 2, 2.5, or 3, using the formula described in Chapter 7.

Synchronization between the PClk and SCClk is performed internally and is invisible to the system. The processor supplies six complementary copies of the secondary cache clock on SCClk(5:0) and SCClk(5:0)*.

5.3 Secondary Cache Indexing

The secondary cache data array width is one quadword, and therefore PA(3:0), which specify a byte within a quadword, are unused by the Secondary Cache interface.

Indexing the Data Array

Since the maximum secondary cache size is 16 Mbytes (8 Mbytes per way), each way requires a maximum of 23 bits to index a byte within a selected way, or 19 bits to index a quadword within a way. Consequently, the processor supplies PA(22:4) on SC(A,B)Addr(18:0) to index a quadword within a way. The processor selects a secondary cache data way with the SC(A,B)DWay signal.

Table 5-1 presents the secondary cache data array index for each secondary cache size; for instance, a 4 Mbyte cache uses the 17 address bits, PA(20:4) on SC(A,B)Addr(16:0), concatenated with the way bit, SC(A,B)DWay, to index a quadword within a 2 Mbyte way.

Table 5-1 Secondary Cache Data Array Index

Indexing the Tag Array

The processor supplies the secondary cache tag array's least significant index bit on SCTagLSBAddr to support two block sizes without system hardware changes. This signal functions normally as a least significant index bit when the secondary cache block size is 16 words. However, when the secondary cache block size is 32 words, this signal is always negated, since only half as many tags are required. The processor supplies the secondary cache tag way on SCTWay.

Table 5-2 presents the secondary cache tag array index for each secondary cache size; it shows each index is composed of a physical address loaded onto SC(A,B)Addr(), concatenated with SCTWay and SCTagLSBAddr.

Table 5-2 Secondary Cache Tag Array Index

For a system design that only supports a secondary cache block size of 32 words, the secondary cache tag array need not use SCTagLSBAddr as an index bit.

5.4 Secondary Cache Way Prediction Table

The primary and secondary caches are two-way set associative. However, the implementation of the secondary cache is different than the primary caches.

The primary caches read simultaneously from two separate tag arrays, corresponding to each way in the cache, and then select the data based on the result of two parallel tag compares.

The secondary cache does not use this implementation because it would either require too many pins to read in two full copies of the data and tags, or add latency to externally multiplex two banks of memory. Instead, a way prediction table is used to determine which way to read from first.

The way prediction table is internal to the processor and has 8K one-bit entries, each entry corresponding to a pair of secondary cache blocks. The bit entry indicates which way of the addressed set has been most-recently used (MRU). When the secondary cache is accessed, this prediction bit is used as an address bit; thus the two ways in the secondary cache are shared in the same SSRAM bank.

The secondary cache way prediction table is indexed with a subset of 11 to 13 bits of the physical address, based on both the secondary cache block size, and the secondary cache size, as shown in Table 5-3. "0 ||" indicates a zero bit concatenated to the address to pad the index out to a full 13-bits.

Table 5-3

Secondary Cache Way Prediction Table Index

Three states are possible in the way prediction table:

* the desired data is in the predicted way

* the desired data is in the non-predicted way

* the desired data is not in the secondary cache

The tags for both ways are read "underneath" the data access cycles in order to discern as rapidly as possible which of these states are valid. This reading is possible because it takes two accesses to read a primary data block (8 words) and 4 cycles to read a primary instruction block (16 words); thus the bandwidth needed to read the tag array twice exists in all cases. Only an extra address pin to the tag array is needed to make this operation parallel and this is implemented by the SCTWay pin.

The three possible states are handled in the following manner:

* If, after reading the tags for both ways, it is discovered that the data exists in the predicted way, the processor continues normally.

* If the data exists in the non-predicted way, the processor accesses this non-predicted way in the secondary cache and updates the way prediction table to point to this way.

* If the access misses in both ways of the secondary cache, the data is fetched from the System interface into the non-predicted way, and the way prediction table is updated to point to this way since it is now the most-recently-used.

The way prediction table can cover up to a 2 Mbyte secondary cache when the secondary cache block size is 32 words. If the secondary cache exceeds this size, the accuracy of the way prediction table diminishes slightly. However, the extremely large performance gain made by making the secondary cache larger far outstrips any performance loss in the way prediction table.

5.5 Secondary Cache Tag

The secondary cache tag, transferred on the SCTag(25:0) bus, is divided into three fields, as shown in Figure 5-2 below.



Figure 5-2 Secondary Cache Tag Fields

SCTag(25:4), Physical Tag

The minimum secondary cache size is 512 Kbytes (256 Kbytes per way), so a minimum of 18 bits are required to index a data byte within a selected way. Since the processor supports 40 physical bits, a maximum of 22 bits are required for the physical tag:

40 physical address bits - 18 minimum required = 22

Consequently, the processor supplies the 22 physical address bits, PA(39:18), on SCTag(25:4) for the physical tag.

When the secondary cache is larger than the minimum size, the secondary cache tag array must still maintain the full physical tag supplied by the processor, even though some bits are redundant.

SCTag(3:2), PIdx

Bits SCTag(3:2) of the secondary cache tag contain the primary cache index, PIdx.

The PIdx field contains VA(13:12), which are the two lowest virtual address bits above the minimum 4 Kbyte page size. This field is written into the secondary cache tag during a secondary cache refill. For each processor-initiated secondary cache access, the virtual address bits are compared with the PIdx field of the secondary cache tag. If a mismatch occurs, a virtual coherency condition exists and the value of the PIdx field is used by internal control logic to purge primary cache locations, so that all primary cache blocks holding valid data have indices known to the secondary cache. This mechanism, unlike that of the R4400 processor, is implemented in hardware. It helps preserve the integrity of cached accesses to a physical address using different virtual addresses, an occurrence called virtual aliasing. For each external coherency request, the PIdx field of the secondary cache tag provides a mechanism to locate subset lines in the primary caches.

SCTag(1:0), Cache Block State

The lower two bits of the secondary cache tag, SCTag(1:0), contain the cache block state, which can be Invalid, Shared, CleanExclusive, or DirtyExclusive as shown in Table 5-4.

Table 5-4 Secondary Cache Tag State Field Encoding

Since the secondary cache tags are updated immediately for stores to the primary data cache, and all caches use a write back protocol, the data in the secondary cache may not always be consistent with data in the primary cache even though the tags always reflect the correct state of a secondary cache block.

5.6 Read Sequences

There are five basic read sequences:

* a 4-word read

* an 8-word read

* a 16-word read

* a 32-word read

* a tag read

4-Word Read Sequence

A 4-word read sequence is performed by a CACHE Index Load Data (S) instruction to read a doubleword of data and 10 check bits from the secondary cache data array.

Figure 5-3 depicts a secondary cache 4-word read sequence. A quadword is read from the index specified by PA(23:6), and the way specified by VA(0) of the CACHE instruction.

The doubleword specified by VA(3) is then stored into the CP0 TagHi and TagLo registers, and the corresponding check bits are stored into the CP0 ECC(9:0) register. The data may be examined by copying the CP0 TagHi, TagLo, and ECC registers to the general registers with the MTC0 instruction.



Figure 5-3 4-Word Read Sequence

8-Word Read Sequence

An 8-word read sequence refills the primary data cache from the secondary cache after a primary data cache miss.

Figure 5-4 depicts a secondary cache 8-word read sequence. In it, SC(A,B)DWay and SCTWay are driven with value X on the first address cycle, which is obtained from the way prediction table.

On the next address cycle, SCTWay is complemented in order to read the tag from the non-predicted way of the addressed set. SC(A,B)DWay is not changed since it is assumed that the way prediction table is correct and the read is likely to hit in the predicted way.

The tag for the non-predicted way is returned to the processor in the same cycle as the second quadword of data. Reads that miss in the predicted way, but hit in the non-predicted way, are noted by the internal control logic and reissued to the secondary cache as soon as possible.



Figure 5-4 8-Word Read Sequence

16 or 32-Word Read Sequence

A 16-word read sequence refills the primary instruction cache from the secondary cache after a primary instruction cache miss. A 16-word read sequence is also performed when the secondary cache block size is 16 words, and a DirtyExclusive secondary cache block must be written back to the System interface.

A 32-word read sequence is performed when the secondary cache block size is 32 words, and a DirtyExclusive secondary cache block must be written back to the System interface.

Figure 5-5 depicts a secondary cache 16 or 32-word read sequence. This is similar to an 8-word read sequence except that more addresses must be issued, in order to read the appropriate number of quadwords.



Figure 5-5 16 or 32-Word Read Sequence

Tag Read Sequence

A tag read sequence is performed when the state of a secondary cache block is required, but it is not necessary to access the data array. This sequence is used for the CACHE Index Load Tag (S) instruction.

Figure 5-6 depicts a secondary cache tag read sequence.



Figure 5-6 Tag Read Sequence

5.7 Write Sequences

There are five basic write sequences:

* a 4-word write.

* an 8-word write

* a 16-word write

* a 32-word write

* a tag write

4-Word Write Sequence

A 4-word write sequence is performed by a CACHE Index Store Data (S) instruction to store a quadword of data and 10 check bits into the secondary cache data array.

Figure 5-7 depicts a secondary cache 4-word write sequence. A quadword is written to the index specified by PA(23:6), and the way specified by VA(0) of the CACHE instruction.

A doubleword specified by VA(3) is obtained from the CP0 TagHi and TagLo registers, and the other half of the doubleword is padded to zeros. Normal ECC and parity generation is bypassed and the check field of the data array is written with the contents of the CP0 ECC(9:0) register.



Figure 5-7 4-Word Write Sequence

8-Word Write Sequence

An 8-word write sequence writes back a dirty block from the primary data cache to the secondary cache.

Figure 5-8 depicts a secondary cache 8-word write sequence. SC(A,B)DWay are driven with the way bit obtained from the primary data cache tag. The secondary cache tag is not written since it was previously updated when the primary data cache block was modified.



Figure 5-8 8-Word Write Sequence

16 or 32-Word Write Sequence

A 16- or 32-word write sequence refills a secondary cache block from the System interface after a secondary cache miss. A 16-word write sequence is performed when the secondary cache block size is 16 words, and a 32-word write sequence is performed when the secondary cache block size is 32 words.

Figure 5-9 depicts a secondary cache 16 or 32-word write sequence.



Figure 5-9 16/ 32-Word Write Sequence

Tag Write Sequence

A tag write sequence updates the secondary cache tag array without affecting the data array. This sequence is used for the following:

* to reflect primary cache state changes in the secondary cache

* for external coherency requests

* for the CACHE Index Store Tag (S) instruction

Figure 5-10 depicts the secondary cache tag write protocol.



Figure 5-10 Tag Write Sequence

Generated with CERN WebMaker