This chapter gives an overview of system components and the management of physical and virtual memory in SGI Altix series systems, which are based on the Itanium Processor Family (IPF) of processors. This chapter also provides background information to help you understand the limitations and special conventions used by some kernel functions.
The following main topics are covered in this chapter:
The SGI Altix servers are a family of multiprocessor distributed shared memory (DSM) computer systems. The SGI Altix systems use a global-address-space cache-coherent multiprocessor that can scale up to 512 processors in a cache-coherent domain. The processors are housed in a 3-U high brick called the SC-brick. The SC-brick contains two processor nodes. A processor node consists of two processors, each with 1.5- or 3-MB on-chip, private tertiary (L3) cache, connected to the scalable hub (SHub) ASIC via the front side bus (FSB). The SHub ASIC acts as a crossbar between the processors, local SDRAM memory, the network interface, and the I/O interface. Each processor node is interconnected by a NUMAlink 4 channel. The modularity of the DSM approach combines the advantages of low entry-level cost with global scalability in processors, memory, and I/O. The SGI Altix systems are based on the Intel Itanium 2 processor. The Intel Itanium 2 processor is a 64-bit processor that is initially offered at 900 MHz clock speed with a 1.5 MB L3 cache size.
The SGI Altix has a PCI-X-based I/O system. (For more details on PCI-X devices, see Chapter 3, “PCI-X Device Attachment.”) The I/O components are housed in an I/O brick. Following are the two types of I/O bricks:
Figure 2-1 shows the links between the various bricks of the SGI Altix system.
The following sections provide additional information about the various system bricks. These sections describe the following system components:
Compute/processor node (SC-brick)
PCI-X with BaseIO (IX-brick)
PCI-X with expansion (PX-brick)
The SC-brick is a 3U rackmountable enclosure (5.25 inches high; 1U = 1.75 inches) that contains the following components:
Two processor nodes, each containing two 64-bit processors with 1.5- or 3-MB tertiary (L3) caches.
Two SHub chipsets.
Sixteen DIMM slots per SHub; one or two memory banks per four DIMMs.
Node electronics.
One L1 controller.
The node electronics, L1 controller, and power regulators are contained on a single half-panel power board (PCB). The two SHubs, four processors, and processor power pods are housed on separate half-panel boards. Four memory daughtercards house the memory DIMMs. Each daughtercard supports eight memory DIMMs. Figure 2-2 shows the block diagram of an SC-brick.
The SC-brick has the following features:
Two 64-bit processors
Contains one 1.5- or 3-MB tertiary (L3) cache per processor (integrated within the processor)
Configurable from 2.0 GB to 16 GB of main memory (minimum 8 DIMMs)
Contains two 6.4-GB/s (each direction) NUMAlink channels
Contains two 2.4-GB/s (each direction) Xtown2 channels
Contains one connection port to the L2 controller
Contains one DB9 console port
The IX-brick is actually a PX-brick with a BaseIO card in PCI-X bus Q, slot Q, plus a drive module. The BaseIO card consists of the following components:
IOC4 components:
| ATA bus connected to DVD-ROM |
| NVRAM |
| Real-time clock |
| Real-time input/output ports |
| Serial ports |
| PS/2 keyboard and mouse ports |
Ethernet network chipset
SCSI controller
Figure 2-3 shows an IX-brick.
The PX-brick contains six PCI-X buses with two slots per bus to make a total of 12 PCI-X slots. PX-bricks can be connected to the system via two Xtown2 links. The PX-brick PCI-X expansion is shown in Figure 2-4.
SGI Altix systems support 64-bit mode addressing. This section refers to the 64-bit address spaces provided by the SGI Altix system microprocessor (see Figure 2-8). This architecture uses addresses that are 64-bit unsigned integers from 0x0000 0000 0000 0000 to 0xFFFF FFFF FFFF FFFF. This is an immense span of numbers--if it were drawn to a scale of 1 millimeter per terabyte, the drawing would be 16.8 kilometers long (just over 10 miles).
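The scale comparison above can be checked with integer arithmetic. The following C sketch assumes that a terabyte means 2^40 bytes; the function name `span_mm_at_1mm_per_tb` is hypothetical and is used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length, in millimeters, of a drawing of a 2^addr_bits-byte address
 * space at a scale of 1 mm per terabyte (1 TB taken as 2^40 bytes).
 * Hypothetical helper, for illustration only.
 */
static inline uint64_t span_mm_at_1mm_per_tb(unsigned addr_bits)
{
    assert(addr_bits >= 40 && addr_bits <= 64);
    /* 2^addr_bits bytes / 2^40 bytes per TB = 2^(addr_bits - 40) TB */
    return UINT64_C(1) << (addr_bits - 40);
}
```

For a 64-bit address the result is 16,777,216 mm, or about 16.8 km. For the 50-bit physical address described later in this chapter, the same scale gives 1024 mm, or about one meter.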
The following types of space are described in this section:
Physical address
Global Memory mapped register (MMR)
Atomic memory operation (AMO)
Cacheable memory
SHub physical address map
This section provides physical address space information that is normally used by device drivers. SGI Altix systems support 50-bit physical addressing, as shown in Figure 2-5.
Fields in Figure 2-5 are defined as follows:
| Bits | Description |
| 63:50 | Unused and reserved for future use. The value of these bits should always be zero. Because bit 49 is also always 0, 49 address bits remain, which leaves 512 terabytes of addressing for SGI Altix systems implemented with SHub. |
| 49:38 | Node ID bits. SGI Altix systems implemented with SHub support up to 1024 processor nodes (2048 CPUs per system). Bit 38 indicates the node type. A value of 0 indicates a processor node. Bit 49 is always 0. |
| 37:36 | Address space (AS). Each SHub is allocated 256 GB of physical address space. Bits 37:36 divide the 256 GB into four 64-GB spaces. Of these, the sections that follow describe 00 (local resource space and global MMR space), 10 (AMO space), and 11 (cacheable memory space). The AS bits are analogous to the uncached attribute bits of the SGI Origin series systems; however, since Itanium 2 processors do not support uncached attribute bits in the translation lookaside buffer (TLB), physical address bits are used to perform the equivalent function. |
| 35:0 | Node offset. These bits point to a specific byte location within one of the four 64-GB spaces of the SHub. When the value of bits 37:36 is 0b00, the 64-GB space is split into two 32-GB regions: 32 GB of local resource space and 32 GB of global MMR space. Bit 35 selects between these two regions. When the value of bits 37:35 is 0b000, the request targets the local resource space. When the value of bits 37:35 is 0b001, the request targets the global MMR space. |
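The field layout in the table above can be expressed as shifts and masks. The following C fragment is a minimal sketch, not code from the SGI kernel sources; the type `struct shub_paddr` and the functions `shub_decode` and `shub_is_processor_node` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a 50-bit SHub physical address (hypothetical type). */
struct shub_paddr {
    unsigned node_id;   /* bits 49:38 (bit 49 is always 0) */
    unsigned as_bits;   /* bits 37:36, the address space (AS) field */
    uint64_t offset;    /* bits 35:0, the node offset */
};

/* Split a physical address into the fields listed in the table. */
static inline struct shub_paddr shub_decode(uint64_t pa)
{
    struct shub_paddr f;

    assert((pa >> 50) == 0);    /* bits 63:50 are reserved and must be zero */
    f.node_id = (unsigned)((pa >> 38) & 0xFFFu);
    f.as_bits = (unsigned)((pa >> 36) & 0x3u);
    f.offset  = pa & ((UINT64_C(1) << 36) - 1);
    return f;
}

/* Bit 38 is the node type; a value of 0 indicates a processor node. */
static inline int shub_is_processor_node(uint64_t pa)
{
    return ((pa >> 38) & 1u) == 0;
}
```

For example, an address built from node ID 2, AS bits 11, and offset 0x1000 decodes to exactly those three values, and the node is reported as a processor node because bit 38 is 0.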
The following sections describe global MMR space, AMO space, and cacheable memory space.
A node's global memory mapped register (MMR) space provides all processor nodes in the system with access to a node's MMRs (see Figure 2-6). Notice the position of the global MMR space in the physical address map shown in Figure 2-8. Following are the values of the bits for global MMR space:
| Bit | Value |
| 49 | 0 |
| 48:38 | Node ID (remember, SHubs are even nodes) |
| 37:36 | 00 (AS bits) |
| 35 | 1 |
| Note: Programmable I/O addresses reside in this space (for example, SHub systems, registers set, PCI configuration space, PCI I/O and memory space, I/O brick registers, and so on). |
When the address space (AS) bits are set to 10, the reference is to atomic memory operation (AMO) space. An AMO read operation (AMOR) or AMO write operation (AMOW) request is issued to the SHub that is identified by the number in the node ID (see Figure 2-7). Notice the position of the AMO space in the physical address map shown in Figure 2-8. The node offset bits specify a 36-bit offset within the SHub address space, as follows:
| Bit | Value |
| 49 | 0 |
| 48:38 | Node ID (remember, SHubs are even nodes) |
| 37:36 | 10 (AS bits) |
| 35:0 | Node offset |
A number of fetch-and-op style AMOs are supported to optimize common synchronization primitives such as locks, tickets, and barriers. These AMOs operate on a read-modify-write basis. AMOs are defined only for word and doubleword data sizes and are performed using uncached loads and stores to the AMO address space. In addition, operations are allowed only on the first doubleword of each 64-byte block (half cache line) in memory. The AMO variable can be accessed either as one 64-bit AMO variable or as two 32-bit AMO variables.
In the AMO address space, bits 5:3 of the node offset (the three address bits above the doubleword offset) determine the type of AMO to perform.
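The address arithmetic described above can be sketched in C as follows. This example assumes that the AMO variable sits at a 64-byte-aligned node offset and that the operation is chosen by writing a 3-bit selector into bits 5:3 of that offset. The text does not give the encoding of the selector, so the sketch treats it as an opaque value from 0 to 7. The function names `shub_amo_addr` and `shub_amo_op_select` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a physical address in AMO space (AS bits = 10).
 * var_offset is the node offset of the AMO variable, which must be the
 * first doubleword of a 64-byte block. op_select is the value for
 * bits 5:3; its encoding is NOT specified here, so it is passed through
 * unchanged.
 */
static inline uint64_t shub_amo_addr(unsigned node_id, uint64_t var_offset,
                                     unsigned op_select)
{
    assert(node_id < (1u << 11));               /* bits 48:38; bit 49 stays 0 */
    assert(var_offset < (UINT64_C(1) << 36));   /* 36-bit node offset */
    assert((var_offset & 0x3Fu) == 0);          /* 64-byte aligned */
    assert(op_select < 8);                      /* three selector bits */

    return ((uint64_t)node_id << 38)
         | (UINT64_C(2) << 36)                  /* AS = 0b10: AMO space */
         | var_offset
         | ((uint64_t)op_select << 3);
}

/* Recover the operation selector (bits 5:3) from an AMO address. */
static inline unsigned shub_amo_op_select(uint64_t pa)
{
    return (unsigned)((pa >> 3) & 0x7u);
}
```

A driver would not normally compute these addresses by hand; the sketch only illustrates how the node ID, AS bits, variable offset, and operation selector share one physical address.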
The following AMO read operations are supported:
| Fetch | Simple uncached read of the location. |
| Fetch and Increment | The location's current value is returned and then the location's value is incremented. This operation is followed by a write operation. |
| Fetch and Decrement | The location's current value is returned and then the location's value is decremented. This operation is followed by a write operation. |
| Fetch and Clear | The location's current value is returned and then the location's value is cleared. This operation is followed by a write operation. |
The following AMO write operations are supported:
| Initialize | Simple uncached write of the location. |
| Increment | The location's value is incremented. |
| Decrement | The location's value is decremented. |
| Logical AND | Stored data is logically AND'd with the location's current value. |
| Logical OR | Stored data is logically OR'd with the location's current value. |
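The semantics of the AMO read and write operations listed above resemble the fetch-and-op functions of C11 `<stdatomic.h>`. The following sketch models each operation in ordinary memory with those standard functions. It is a software analogy only: it does not access SHub AMO space, and the names `amo_var_t`, `amo_fetch_inc`, `take_ticket`, and so on are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef _Atomic uint64_t amo_var_t;  /* software stand-in for a 64-bit AMO variable */

/* AMO read operations: each returns the value held before the update. */
static inline uint64_t amo_fetch(amo_var_t *v)       { return atomic_load(v); }
static inline uint64_t amo_fetch_inc(amo_var_t *v)   { return atomic_fetch_add(v, 1); }
static inline uint64_t amo_fetch_dec(amo_var_t *v)   { return atomic_fetch_sub(v, 1); }
static inline uint64_t amo_fetch_clear(amo_var_t *v) { return atomic_exchange(v, 0); }

/* AMO write operations: each updates the location and returns nothing. */
static inline void amo_init(amo_var_t *v, uint64_t x) { atomic_store(v, x); }
static inline void amo_inc(amo_var_t *v)              { (void)atomic_fetch_add(v, 1); }
static inline void amo_dec(amo_var_t *v)              { (void)atomic_fetch_sub(v, 1); }
static inline void amo_and(amo_var_t *v, uint64_t m)  { (void)atomic_fetch_and(v, m); }
static inline void amo_or(amo_var_t *v, uint64_t m)   { (void)atomic_fetch_or(v, m); }

/*
 * Ticket dispenser built on fetch-and-increment: every caller receives
 * a distinct, increasing ticket number.
 */
static inline uint64_t take_ticket(amo_var_t *next_ticket)
{
    assert(next_ticket != NULL);
    return amo_fetch_inc(next_ticket);
}
```

The ticket dispenser shows why fetch-and-op operations suit locks, tickets, and barriers: a single read-modify-write both reports the old value and publishes the new one, so two callers can never be handed the same ticket.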
When the AS bits are set to 11, the reference is to cacheable memory space. A memory request is issued to the SHub that is identified by the number in the node ID. UC, WB, and WC attributes are supported for cacheable memory space. Notice the position of the cacheable memory space in the physical address map shown in Figure 2-8. The node offset bits specify a 36-bit offset within the SHub address space, as follows:
| Bit | Value |
| 49 | 0 |
| 48:38 | Node ID (remember, SHubs are even nodes) |
| 37:36 | 11 (AS bits) |
| 35:0 | Node offset |
| Note: Direct memory access (DMA) addresses reside in cacheable memory space. |
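The bit patterns given in the preceding sections are enough to classify a physical address by the space it targets. The following C sketch does this under the layout described above. The enumeration and the function name `shub_classify` are hypothetical. The AS value 01 is not described in these sections, so the sketch reports it as a separate, undescribed case rather than guessing its meaning.

```c
#include <assert.h>
#include <stdint.h>

enum shub_space {
    SHUB_SPACE_LOCAL_RESOURCE,  /* bits 37:35 = 0b000 */
    SHUB_SPACE_GLOBAL_MMR,      /* bits 37:35 = 0b001 */
    SHUB_SPACE_UNDESCRIBED,     /* AS = 0b01, not covered by the text above */
    SHUB_SPACE_AMO,             /* AS = 0b10 */
    SHUB_SPACE_CACHEABLE        /* AS = 0b11 */
};

/* Classify a 50-bit physical address by its AS bits (and bit 35 for AS = 00). */
static inline enum shub_space shub_classify(uint64_t pa)
{
    assert((pa >> 49) == 0);    /* bit 49 and the reserved bits are always 0 */

    switch ((pa >> 36) & 0x3u) {
    case 0:
        /* AS = 00 is split by bit 35 into two 32-GB regions. */
        return ((pa >> 35) & 1u) ? SHUB_SPACE_GLOBAL_MMR
                                 : SHUB_SPACE_LOCAL_RESOURCE;
    case 1:
        return SHUB_SPACE_UNDESCRIBED;
    case 2:
        return SHUB_SPACE_AMO;
    default:
        return SHUB_SPACE_CACHEABLE;
    }
}
```

With this classification, a debugging check that a DMA target lies in cacheable memory space reduces to comparing the result with `SHUB_SPACE_CACHEABLE`.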
The primary, secondary, and tertiary caches shown in Figure 2-10 are essential to CPU performance. There is an order of magnitude difference in the speed of access between cache memory and main memory. Execution speed remains high only as long as a very high proportion of memory accesses are satisfied from the primary, secondary, or tertiary cache.
The use of caches means that there are often multiple copies of data: a copy in main memory, a copy in the secondary cache (when one is used), and a copy in the primary cache. Moreover, a multiprocessor system has multiple CPU modules like the one shown in Figure 2-10, and there can be copies of the same data in the cache of each CPU.
The problem of cache coherency is to ensure that all cache copies of data are true reflections of the data in main memory. Different SGI systems use different hardware designs to achieve cache coherency.
Multiprocessor systems have more complex cache coherency protection because it is possible to have data in multiple caches. In an SGI Altix multiprocessor system, the hardware ensures that cache coherency is maintained under all conditions, including DMA input and output, without action by the software.
Figure 2-8 shows the SHub physical address map. On SHub, AMO space and global MMR space must be accessed uncached, and GET space must be accessed cached. Cacheable memory space can be accessed cached or uncached, subject to operating system constraints.
| Note: Linux drivers run in virtual mode (TLBs enabled for all addresses) all the time. Therefore, the address space they see depends not only on behavior of the SHub, but also on the TLB mapping conventions of the operating system. |
Figure 2-12 is too simple for some devices that are attached through a bus adapter. A bus adapter connects a bus of a different type to the system bus, as shown in Figure 2-9.
For example, the PCI/PCI-X bus adapter connects a PCI/PCI-X bus to the Xtalk I/O interface of SHub. Multiple PCI/PCI-X devices can be plugged into the PCI/PCI-X bus and use the bus to read and write. The bus adapter translates the PCI/PCI-X bus protocol into the system Xtalk protocol.
Each PCI/PCI-X bus has address lines that carry the address values used by devices on that PCI/PCI-X bus. These bus addresses are not related to the physical addresses used on the system front side bus (FSB). The issue of bus addressing is made complicated by three facts:
Bus-master devices independently generate memory-read and memory-write commands that are intended to access system memory.
The bus adapter can translate between addresses on the bus it manages and different addresses on the system bus it uses.
The translation done by the bus adapter can be programmed dynamically (mapped), and can change from one I/O operation to another.
This subject can be simplified by dividing it into two distinct subjects: PIO addressing, used by the CPU to access a device, and DMA addressing, used by a bus master to access memory. These addressing modes need to be treated differently.
Programmable I/O (PIO) is the term for a load or store instruction executed by the CPU that names an I/O device space as its operand. The CPU places a physical address on the system bus. The bus adapter repeats the read or write command on its bus, but not necessarily using the same address bits as the CPU put on the system bus.
One task of a bus adapter is to translate between the physical addresses used on the system bus and the addressing scheme used on the bus it manages, so the address placed on the target bus is not necessarily the same as the address generated by the CPU. The translation is done differently with different bus adapters and in different system models.
With the more sophisticated PCI and PCI-X buses, the translation is dynamic. Both of these buses support bus address spaces that are as large as or larger than the physical address space of the system bus. It is impossible to hard-wire a translation of the entire bus address space. Furthermore, SGI Altix architecture provides multiple system buses. For more details, see “Address Spaces Supported” in Chapter 3.
The PCI/PCI-X resource addresses in the pci_dev structure are PIO-mapped addresses that the device driver can use as they are.
To use a dynamic PIO address, a device driver can create a software object called a PIO map that represents that portion of bus address space that contains the device registers the driver uses. When the driver wants to use the PIO map, the kernel dynamically sets up a translation from an unused part of physical address space to the needed part of the bus address space. The driver extracts an address from the PIO map and uses it as the base for accessing the device registers. This is an extension that SGI provides.
A bus-master device on the PCI bus can be programmed to perform transfers to or from memory independently and asynchronously. A bus master is programmed using PIOs with a starting bus address and a length. The bus master generates a series of memory-read or memory-write operations to successive addresses. But what bus addresses should it use in order to store into the proper memory addresses?
The bus adapter translates the addresses used on the PCI or PCI-X bus to corresponding addresses on the system bus. As shown in Figure 2-9, the operation of a DMA device is as follows:
The device places a bus address and data on the PCI or PCI-X bus.
The bus adapter translates the address to a meaningful physical address, and places that address and the data on the system Xtalk I/O link.
The memory modules store the data.
The translation of bus virtual addresses to physical addresses is done by the bus adapter and programmed by the kernel. A device driver requests the kernel to set up a dynamic mapping from a designated memory buffer to bus addresses. For more information, see Chapter 9, “PCI-X Direct Memory Access (DMA)”.
Linux device drivers on SGI Altix systems must use the standard Linux pci_dma map routines. For more information, see Chapter 9, “PCI-X Direct Memory Access (DMA)”.
The driver calls kernel functions to establish the range of memory addresses that the bus master device will need to access--typically the address of an I/O buffer. When the driver calls one of the pci_dma map routines, the kernel sets up the bus adapter hardware to translate between some range of bus addresses and the desired range of memory space. The driver uses PIO to program this bus address into the bus master device registers. SGI software supports 64- and 32-bit DMA addresses. For more information on 64- and 32-bit DMA map addresses, see Chapter 9, “PCI-X Direct Memory Access (DMA)”.
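The relationship between a DMA map and the addresses a bus master uses can be illustrated with a small user-space model. The following C sketch is a toy, not the kernel's pci_dma implementation and not the actual bus adapter logic. All names (`struct toy_bus_adapter`, `toy_dma_map`, `toy_bus_to_phys`) and the simple allocation policy are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPS 8

/* One programmed translation: a bus address range and its memory range. */
struct toy_dma_entry {
    uint64_t bus_base;
    uint64_t phys_base;
    uint64_t len;
    int      in_use;
};

/* Toy model of the bus adapter's translation state. */
struct toy_bus_adapter {
    struct toy_dma_entry maps[TOY_MAX_MAPS];
    uint64_t next_bus_addr;     /* next unused bus address */
};

static inline void toy_adapter_init(struct toy_bus_adapter *a,
                                    uint64_t first_bus_addr)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++)
        a->maps[i].in_use = 0;
    a->next_bus_addr = first_bus_addr;
}

/*
 * "Kernel" side: map a memory buffer and return the bus address that the
 * driver would store in the device registers by PIO.
 * Returns UINT64_MAX when no translation entry is free.
 */
static inline uint64_t toy_dma_map(struct toy_bus_adapter *a,
                                   uint64_t phys, uint64_t len)
{
    assert(len > 0);
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (!a->maps[i].in_use) {
            a->maps[i].bus_base  = a->next_bus_addr;
            a->maps[i].phys_base = phys;
            a->maps[i].len       = len;
            a->maps[i].in_use    = 1;
            a->next_bus_addr    += len;
            return a->maps[i].bus_base;
        }
    }
    return UINT64_MAX;
}

/*
 * "Hardware" side: translate a bus address issued by the bus master.
 * Returns 0 and stores the physical address on success, or -1 when the
 * bus address is not covered by any mapping.
 */
static inline int toy_bus_to_phys(const struct toy_bus_adapter *a,
                                  uint64_t bus, uint64_t *phys)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        const struct toy_dma_entry *e = &a->maps[i];
        if (e->in_use && bus >= e->bus_base && bus - e->bus_base < e->len) {
            *phys = e->phys_base + (bus - e->bus_base);
            return 0;
        }
    }
    return -1;
}
```

In this model, the value returned by `toy_dma_map` plays the role of the bus address that the driver programs into the bus master. An address outside every mapped range fails to translate, which corresponds to the case in which real hardware would report an error.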
The following sections describe CPU and device access to memory.
Each SGI computer system has one or more CPU modules and one or more I/O modules. A CPU reads data from memory or a device by placing an address on a system bus and receiving data back from the addressed memory or device. An address can be translated more than once as it passes through multiple layers of I/O chipsets and bus adapters. Access to memory can also pass through multiple levels of cache.
The CPU generates the address of data that it needs--the address of an instruction to fetch, or the address of an operand of an instruction. It requests the data through a mechanism that is depicted in simplified form in Figure 2-10.
The process is as follows:
The address of the needed data is formed in the processor execution or instruction-fetch unit. Most addresses are then mapped from virtual to real through the translation lookaside buffer (TLB). On Itanium 2 processors, all addresses go through the TLBs if TLBs are enabled. With some very small exceptions, TLBs are always enabled.
Most addresses are presented to the L1 cache, a cache in the processor chip. If a copy of the data with that address is found, it is returned immediately. Certain address ranges are never cached; these addresses pass directly to the bus.
If the L1 cache does not contain the data, the address is presented to the L2 cache. If it contains a copy of the data, the data is returned immediately. The size and the architecture of the secondary cache differ from one CPU model to another.
If the L2 cache does not contain the data, the address is presented to the L3 cache. If the L3 cache does not contain the data either, the address is placed on the system bus, and the memory module that recognizes the address places the data on the bus.
The process in Figure 2-10 is correct for an SGI Altix system when the addressed data is in the local node.
| Note: When the address applies to memory in another node, the address passes out through the connection fabric to a memory module in another node, from which the data is returned. |
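The order in which a load is satisfied, as described in the steps above, can be summarized in a short decision function. The following C sketch is a simplified model for illustration only. It ignores the TLB step and cache fills, and the names `struct cache_state` and `resolve_load` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Where the data for a load is finally obtained. */
enum mem_source { SRC_L1, SRC_L2, SRC_L3, SRC_MEMORY };

/* Whether each cache level currently holds a copy of the addressed data. */
struct cache_state {
    int in_l1;
    int in_l2;
    int in_l3;
};

/*
 * Walk the hierarchy in the order given in the text: L1, then L2, then
 * L3, then the system bus. Addresses in ranges that are never cached
 * bypass the caches and go directly to the bus.
 */
static inline enum mem_source resolve_load(const struct cache_state *c,
                                           int cacheable)
{
    assert(c != NULL);
    if (!cacheable)
        return SRC_MEMORY;
    if (c->in_l1)
        return SRC_L1;
    if (c->in_l2)
        return SRC_L2;
    if (c->in_l3)
        return SRC_L3;
    return SRC_MEMORY;
}
```

When the result is `SRC_MEMORY`, the request is served either by a memory module in the local node or, as the note above describes, by a memory module in another node reached through the connection fabric.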
The CPU accesses a device register using programmable I/O (PIO), a process illustrated in Figure 2-11. Access to device registers is always uncached. It is not affected by considerations of memory cache coherency in any system (see “Cache Use”).
The process is as follows:
The address of the device is formed in the execution unit. It is not usually an address that is mapped by the TLB.
A device address, after mapping if necessary, always falls in one of the ranges that is not cached, so it passes directly to the system bus.
The device or system component (such as SHub) recognizes its physical address and responds with data.
The PIO process shown in Figure 2-11 is correct for an SGI Altix system when the addressed device is attached to the same node. When the device is attached to a different node, the address passes through the connection fabric to that node, and the data returns the same way.
Some devices can perform direct memory access (DMA), in which the device itself, not the CPU, reads or writes data into memory. A device that can perform DMA is called a bus master because it independently generates a sequence of bus accesses without help from the CPU.
To read or write a sequence of memory addresses, the bus master has to be told the proper physical address (bus address) range to use. This is done by using PIO to store a bus address and length into the device's registers from the CPU. When the device has the DMA information, it can access memory through the system bus as shown in Figure 2-12.
The process is as follows:
The device makes a request on the PCI/PCI-X bus.
The PCI/PCI-X bus adapter translates the PCI/PCI-X bus request and generates a request to the I/O chipset (SHub).
The local SHub forwards the request to the requested memory controllers (local or remote).
The memory module stores the data.
In an SGI Altix system, the device and the memory module can be in different nodes, with address and data passing through the connection fabric (NUMAlink) between nodes.
When a device is programmed with an invalid physical address, the result is a bus error interrupt. The interrupt occurs on some CPU that is enabled for bus error interrupts. These interrupts are not simple to process for two reasons. First, the CPU that receives the interrupt is not necessarily the CPU from which the DMA operation was programmed. Second, the bus error can occur a long time after the operation was initiated.