Chapter 1. Processor cores
A processor core is a physical Central Processing Unit (CPU) in a computer. Cores are responsible for executing machine code. A socket is the connection between the processor and the motherboard: it is the location on the motherboard into which the processor is placed. A single-core processor occupies one socket and provides one core. A quad-core processor occupies one socket and provides four cores.
When designing realtime applications, take the number of available cores into account. It is also important to note how caches are shared among cores, and how the cores are physically connected.
If multiple cores are available to the application, use threads or processes to take advantage of them. A program written without these constructs runs on only one core at a time. On a multi-core platform, further advantage can be gained by using different cores for different types of operations.
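As a minimal sketch (assuming a hypothetical do_work() routine and four available cores), a POSIX threads program can create one worker thread per core so that the scheduler is able to run them in parallel:

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4    /* assumption: one worker per available core */

/* hypothetical per-thread workload */
static void *do_work(void *arg)
{
    long id = (long)arg;

    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    long i;

    /* create one thread per core; the kernel scheduler can run them in parallel */
    for (i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, do_work, (void *)i);

    /* wait for all workers to finish */
    for (i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);

    return 0;
}

Compile with gcc -pthread.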
Often, the various threads of an application need to synchronize access to a shared resource, such as a data structure. Performance can be improved in this case by knowing the cache layout of the system. The Tuna tool can be used to help determine the cache layout. Try binding interacting threads to cores that share a cache: cache sharing reduces cache misses by ensuring that the mutual exclusion primitive (mutex, condvar, or similar) and the data structure itself use the same cache.
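For example, a thread can bind itself to a specific core with the glibc-specific pthread_setaffinity_np call. The core numbers used here are arbitrary; which cores actually share a cache depends on the machine:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core (glibc-specific call). */
static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* bind the current thread to the chosen core */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Two interacting threads could then call, for example, pin_to_core(2) and pin_to_core(3) if those two cores share a cache on the target machine.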
It is important to examine the interconnects between cores. As the number of cores in a machine rises, it becomes more difficult and expensive to provide uniform memory access to all of them. Many hardware vendors now provide a transparent network of interconnects between cores and memory, known as a NUMA (non-uniform memory access) architecture. On NUMA systems, knowing the interconnect topology allows threads that communicate frequently to be placed on adjacent cores.
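As an illustrative sketch using libnuma (assuming the numactl development package is installed, and choosing node 0 arbitrarily), a thread can be kept on one NUMA node and have its working buffer allocated from the same node:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    void *buf;

    /* numa_available() must succeed before any other libnuma call is made */
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* run this thread on node 0 and allocate its buffer from node 0,
       so that memory accesses stay local to that node */
    numa_run_on_node(0);
    buf = numa_alloc_onnode(4096, 0);
    if (buf == NULL)
        return 1;

    /* ... use buf ... */

    numa_free(buf, 4096);
    return 0;
}

Link with -lnuma.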
Chapter 2. Memory allocation
Linux-based operating systems use a virtual memory system. Any address referenced by a user-space application must be translated into a physical address. This is achieved through a combination of page tables and address translation hardware in the underlying computer system.
One consequence of having the translation mechanism between a program and the actual memory is that the operating system can steal pages when required. This is achieved by marking a previously used page table entry as invalid, so that even under normal memory pressure, the operating system might scavenge pages from one application to give to another. This can have adverse effects on systems that require deterministic behavior: instructions that normally execute in a fixed amount of time can take longer than expected because a page fault has been triggered.
Under Linux, all memory addresses generated by a program get passed through an address translation mechanism in the processor. The addresses are converted from a process-specific virtual address to a physical memory address. This is referred to as virtual memory.
The two main components in the translation mechanism are page tables and translation lookaside buffers (TLBs). Page tables are multi-level tables in physical memory that contain mappings for virtual to physical memory. These mappings are readable by the virtual memory translation hardware in the processor. TLBs are caches for page table translations.
When a page table entry has been assigned a physical address, that page is part of the resident working set. When the operating system needs to free memory for other processes, it can remove pages from the working set. When this happens, any reference to a virtual address within that page creates a page fault, and the page must be reallocated. If the system is extremely low on physical memory, this process starts to thrash, constantly stealing pages from processes and never allowing any process to complete. The virtual memory statistics can be monitored by looking for the pgfault value in the /proc/vmstat file.
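For example, to read the page fault counters on a running system:

$ grep pgfault /proc/vmstat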
TLBs are hardware caches of virtual memory translations. Any processor core with a TLB will check the TLB in parallel with initiating a memory read of a page table entry. If the TLB entry for a virtual address is valid, the memory read is aborted and the value in the TLB is used for the address translation.
TLBs operate on the principle of locality of reference. If code stays in one region of memory for a significant period of time (as it does in loops or frequently-called functions), the TLB lookups avoid reading main memory for address translations. This can significantly speed up processing times. When writing deterministic and fast code, use constructs that maintain locality of reference. This can mean using loops rather than recursion. If recursion cannot be avoided, place the recursive call at the end of the function. This is called tail recursion, which keeps the code working in a relatively small region of memory and avoids fetching translations from main memory.
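As a minimal sketch, the same summation can be written as a loop or, where recursion is unavoidable, with the recursive call as the last operation in the function:

/* iterative form: stays in one small region of code and data */
static long sum_loop(const long *vals, long n)
{
    long total = 0;
    long i;

    for (i = 0; i < n; i++)
        total += vals[i];
    return total;
}

/* tail-recursive form: the recursive call is the final operation,
   which keeps the working set small and allows the compiler to
   reuse the same stack frame */
static long sum_tail(const long *vals, long n, long total)
{
    if (n == 0)
        return total;
    return sum_tail(vals + 1, n - 1, total + vals[0]);
}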
One potential source of memory latency is the minor page fault. A minor page fault occurs when a process attempts to access a portion of memory before it has been initialized. In this case, the system needs to perform some operations to fill the memory maps or other management structures. The severity of a minor page fault can depend on system load and other factors, but minor page faults are usually short and have a negligible impact.
A more severe memory latency is a major page fault. These can occur when the system has to synchronize memory buffers with the disk, swap memory pages belonging to other processes, or undertake any other Input/Output activity to free memory. This occurs when the processor references a virtual memory address that has not had a physical page allocated to it. The reference to an empty page causes the processor to execute a fault, and instructs the kernel code to allocate a page and return, all of which increases latency dramatically.
When writing a multi-threaded application, it is important to consider the machine topology when designing the data decomposition. Topology is the memory hierarchy, and includes CPU caches and NUMA nodes. Sharing data very tightly between CPUs in different cache and NUMA domains can lead to traffic problems and bottlenecks.
Contention can create drastic performance problems. On some hardware, the traffic on the various memory buses is not subject to any fairness rules. Always check the hardware you are using in order to avoid this.
Memory allocation errors cannot always be eliminated through the use of CPU affinity, scheduling policies, and priorities. When an application shows a performance drop, it can be beneficial to check whether it is being affected by page faults. There are a number of ways of doing this, but a simple method is to look at the process information in the /proc directory. For a particular process PID, use the cat command to view the /proc/PID/stat file. The relevant entries in this file are:
Field 2 - the filename of the executable
Field 10 - the number of minor page faults
Field 12 - the number of major page faults
When a process encounters a page fault, all of its threads are frozen until the kernel handles the fault. There are several ways to address this problem, although the best solution is to adjust the source code to avoid page faults.
Example 2.1. Using the /proc file system to check for page faults
This example uses the /proc file system to check for page faults in a running process.
Use the cat command and a pipe to return only the second, tenth, and twelfth fields of the /proc/PID/stat file:
# cat /proc/3366/stat | cut -d\ -f2,10,12
(bash) 5389 0
In the above output, PID 3366 is bash, and it has reported 5389 minor page faults and no major page faults.
2.2. Using mlock to avoid memory faults
The mlock and mlockall system calls tell the system to lock a specified memory range, and to not allow that memory to be paged out. This means that once the physical page has been allocated to the page table entry, references to that page will not fault again.
There are two groups of mlock system calls available. The mlock and munlock calls lock and unlock a specific range of addresses. The mlockall and munlockall calls lock or unlock the entire program space.
Use of the mlock calls should be examined carefully and used with caution. If the application is large, or if it has a large data domain, the mlock calls can cause thrashing if the system cannot allocate memory for other tasks. If the application is about to enter a time-sensitive region of code, an mlockall call prior to entering, followed by munlockall on exit, can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static, or that grows slowly but needs to be accessed without page faulting.
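A minimal sketch of that pattern, assuming a hypothetical time_sensitive_work() routine and that the process has sufficient memory-locking privileges (or RLIMIT_MEMLOCK headroom):

#include <stdio.h>
#include <sys/mman.h>

/* hypothetical time-sensitive workload */
static void time_sensitive_work(void)
{
    /* ... */
}

int main(void)
{
    /* lock everything currently mapped, and anything mapped later,
       before entering the time-sensitive region */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    time_sensitive_work();

    /* release the locks once the critical section is finished */
    munlockall();
    return 0;
}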
Use of mlock does not guarantee that the program will experience no page faults. It ensures that the data stays in memory, but cannot ensure that it stays in the same physical page. Other functions, such as move_pages, and memory compactors can move data around despite the mlock.
Always use mlock with care. Using it excessively can lead to an out of memory (OOM) error. Do not simply place an mlockall call at the start of your application. It is recommended that only the data and text of the realtime portion of the application be locked.
Example 2.2. Using mlock in an application
This example uses the mlock call in a simple application.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>

char *
alloc_workbuf(int size)
{
    char *ptr;

    /* allocate some memory */
    ptr = malloc(size);

    /* return NULL on failure */
    if (ptr == NULL)
        return NULL;

    /* lock this buffer into RAM */
    if (mlock(ptr, size)) {
        free(ptr);
        return NULL;
    }
    return ptr;
}

void
free_workbuf(char *ptr, int size)
{
    /* unlock the address range */
    munlock(ptr, size);

    /* free the memory */
    free(ptr);
}
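A caller might use these helpers as follows; the 4096-byte buffer size is arbitrary:

int main(void)
{
    char *buf = alloc_workbuf(4096);

    if (buf == NULL)
        return 1;

    /* ... use the locked buffer ... */

    free_workbuf(buf, 4096);
    return 0;
}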
For more information, or for further reading, the following sources are related to the information given in this section:
mlock(2)
mlock(3)
mlockall(2)
move_pages(2)
Chapter 3. Hardware interrupts
Hardware interrupts are used by devices to communicate that they require attention from the operating system. Some common examples are a hard disk signaling that it has read a series of data blocks, or a network device signaling that it has processed a buffer containing network packets. Interrupts are also used for asynchronous events, such as the arrival of new data from an external network. Hardware interrupts are delivered directly to the CPU using a small network of interrupt management and routing devices. This chapter describes the different types of interrupt, how they are processed by the hardware and by the operating system, and how the MRG Realtime kernel differs from the standard kernel in handling each type of interrupt.
A standard system receives many millions of interrupts over the course of its operation, including a semi-regular "timer" interrupt that periodically performs maintenance and system scheduling decisions. It may also receive special kinds of interrupts, such as NMI (Non-Maskable Interrupts) and SMI (System Management Interrupts).
Hardware interrupts are referenced by an interrupt number. These numbers are mapped back to the piece of hardware that created the interrupt. This enables the system to monitor which device created the interrupt and when it occurred.
In most computer systems, interrupts are handled as quickly as possible. When an interrupt is received, any current activity is stopped and an interrupt handler is executed. The handler will preempt any other running programs and system activities, which can slow the entire system down, and create latencies. MRG Realtime modifies the way interrupts are handled in order to improve performance, and decrease latency.
Example 3.1. Viewing interrupts on your system
To examine the type and quantity of hardware interrupts received by a Linux system, use the cat command to view /proc/interrupts:
$ cat /proc/interrupts
            CPU0       CPU1
   0:   13072311          0   IO-APIC-edge      timer
   1:      18351          0   IO-APIC-edge      i8042
   8:        190          0   IO-APIC-edge      rtc0
   9:     118508       5415   IO-APIC-fasteoi   acpi
  12:     747529      86120   IO-APIC-edge      i8042
  14:    1163648          0   IO-APIC-edge      ata_piix
  15:          0          0   IO-APIC-edge      ata_piix
  16:   12681226     126932   IO-APIC-fasteoi   ahci, uhci_hcd:usb2, radeon, yenta, eth0
  17:    3717841          0   IO-APIC-fasteoi   uhci_hcd:usb3, HDA, iwl3945
  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
  19:        577         68   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
 NMI:          0          0   Non-maskable interrupts
 LOC:    3755270    9388684   Local timer interrupts
 RES:    1184857    2497600   Rescheduling interrupts
 CAL:      12471       2914   function call interrupts
 TLB:      14555      15567   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 SPU:          0          0   Spurious interrupts
 ERR:          0
 MIS:          0
The output shows the various types of hardware interrupt, how many have been received, which CPU was the target for the interrupt, and the device that generated the interrupt.
3.1. Level-signalled interrupts
Level-signalled interrupts use a dedicated interrupt line to deliver voltage transitions.
The dedicated line can send one of two voltages to represent a binary 1 or binary 0. Once a signal has been sent by the line, it will remain in that state until the CPU specifically resets it. This is achieved by the CPU asking the generating device to stop asserting the line. This allows a number of devices to share a single interrupt line. If the CPU has instructed a device to stop asserting the line, and it remains asserted, there is another interrupt pending.
Although level-signalled interrupts require a high level of hardware logic in both the devices and the CPU, they also provide a number of benefits. Not only can they be used by more than one device, but they are almost completely unable to miss an interrupt.
3.2. Message-signalled interrupts
Many modern systems use message-signalled interrupts, which send the signal as a dedicated message on a packet or message-based electrical bus.
One common example of this type of bus is PCI Express (Peripheral Component Interconnect Express, or PCIe). PCIe devices transmit a message of a type that the PCIe Host Controller interprets as an interrupt message. The host controller then forwards the message to the CPU.
Depending on the hardware, a PCIe system might send the signal using a dedicated interrupt line between the PCIe host controller and the CPU, or by sending the message over (for example) the CPU HyperTransport bus. Many PCIe systems can also operate in legacy mode, where legacy interrupt lines are implemented in order to support older operating systems, or Linux kernels booted with the option pci=nomsi on the kernel command line.
3.3. Non-maskable interrupts
An interrupt is said to be masked when it has been disabled, or when the CPU has been instructed to ignore it. A non-maskable interrupt (NMI) cannot be ignored, and is generally used only for critical hardware errors.
NMIs are normally delivered over a separate interrupt line. When an NMI is received by the CPU, it indicates that a critical error has occurred, and that the system is probably about to crash. The NMI is generally the best indication of what might have caused the problem.
Because NMIs cannot be ignored, they are also used by some systems as a hardware monitor. The device sends a stream of NMIs, which are checked by an NMI handler in the processor. If certain conditions are met - such as an interrupt not being triggered after a specified length of time - the NMI handler can produce a warning and debugging information about the problem. This helps to identify and prevent system lockups.
3.4. System management interrupts
System management interrupts (SMIs) are used to offer extended functionality, such as legacy hardware device emulation. They can also be used for system management tasks. SMIs are similar to NMIs in that they use a special electrical signalling line directly into the CPU, and generally cannot be masked.
When an SMI is received, the CPU enters System Management Mode (SMM). In this mode, a very low-level handler routine is run to handle the SMIs. The SMM code is typically provided by the system management firmware, often the BIOS or the EFI.
SMIs are most often used to provide legacy hardware emulation. A common example is to emulate a floppy disk drive. If there is no floppy disk device attached to the system, a virtualized network-managed emulation can be used instead. When the operating system attempts to access the floppy disk, an SMI is triggered and a handler provides the operating system with an emulated device instead. The operating system then treats the emulation as though it were the legacy device itself.
MRG Realtime can be adversely affected by SMIs because they take place without the direct involvement of the operating system. A poorly written SMI handling routine may consume many milliseconds of CPU time, and the operating system is not able to preempt the handler if it needs to. This situation creates periodic high latencies in an otherwise well-tuned, highly responsive system. Unfortunately, because SMI handlers can be used by a vendor to manage CPU temperature and fan control, it is not possible to disable them. Instead, it is recommended that you notify the vendor of the problem.
You can attempt to isolate SMIs on an MRG Realtime system using the hwlatdetect utility, which is available in the rt-tests package. This utility is designed to measure periods of time during which the CPU has been stolen by an SMI handling routine.
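For example, a run of the detector might look like the following; the --duration option (test length in seconds) is assumed here, and the options available depend on the installed rt-tests version:

# hwlatdetect --duration=60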
3.5. Advanced programmable interrupt controller
The advanced programmable interrupt controller (APIC) was developed by Intel® to provide the ability to handle large numbers of interrupts, to allow each interrupt to be programmatically routed to a specific set of available CPUs (and for that routing to be changed as required), to support inter-CPU communication, and to remove the need for a large number of devices to share a single interrupt line.
APIC represents a series of devices and technologies that work together to generate, route, and handle a large number of hardware interrupts in a scalable and manageable way. It uses a combination of a local APIC built into each system CPU, and a number of Input/Output APICs (IO-APICs) that are connected directly to hardware devices. When a hardware device generates an interrupt, it is detected by the IO-APIC it is connected to, and then routed across the system APIC bus to a particular CPU. The operating system knows which IO-APIC is connected to which device, and to which particular interrupt line within that device, because of a combination of information sources. Firstly, the ACPI DSDT (Advanced Configuration and Power Interface Differentiated System Description Table) includes information about the specific wiring of the host system motherboard and peripheral components. Secondly, a device provides certain information about its available interrupt sources. Together, these two sets of data provide information about the overall interrupt hierarchy.
Complex APIC-based interrupt management strategies are possible, with the system APICs connected in hierarchies, and delivering interrupts to CPUs in a load-balanced fashion rather than targeting a specific CPU or set of CPUs.