This blog forms part of the website which focuses on performance tuning UNIX-like Operating Systems (including Linux).

Wednesday, September 14, 2005


The Hierarchy of Needs

General Theory of Caching
The following sections cover the levels of caching present in modern computer systems. Due to aggressive increases in processing power relative to the performance of other system components, caching has become a critical element of modern computer design. This brief overview of caching should provide enough detail to begin to grasp its impact on performance tuning a modern computer system.

Cache Hierarchies
Caches by their very nature occur in hierarchies. These hierarchies are usually organised around latency and bandwidth, with the most performant (and therefore most expensive) storage at the top; each node in the hierarchy consists of larger but slower storage than its parent node.

The hierarchies allow small amounts of expensive but fast storage to partially mask the latency of the child node below. A cache hit results in the item being retrieved directly from the current node; a cache miss results in a request to the child node, which in turn can cause misses to cascade down the hierarchy.
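This masking effect can be quantified with the standard average memory access time formula. For a single cache level:

average access time = hit time + (miss rate × miss penalty)

For example (figures assumed purely for illustration), a 1 ns cache with a 5% miss rate in front of 50 ns memory yields an average access time of 1 + 0.05 × 50 = 3.5 ns, rather than the 50 ns of the uncached case.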

Temporal & Spatial Locality
Caching is based upon two key principles, both of which sound remarkably like terms from Star Trek: temporal and spatial locality. These are simple concepts to grasp. Spatial locality means that references to objects tend to be close together in address space; temporal locality means that references to the same object tend to be close together in time.

Temporal locality occurs when instructions or data are re-used in quick succession. This behaviour is common in computer programs; for example, a variable is a memory location that is typically subject to multiple read and possibly write accesses. If these accesses occur before the item is flushed from the cache, they will be served from the faster cache rather than faulted in from a slower child node.

Spatial locality is exploited through a common phenomenon: more data is fetched than is actually required. If data or instructions lie in close proximity to one another, it is likely they will be loaded directly from cache rather than faulted in from a slower child node. This type of caching is particularly helpful when accessing densely packed data structures in a serial manner.
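As a concrete (if contrived) sketch of both principles in C, the running total below is re-used on every iteration (temporal locality), while the array is visited in address order so that each cache line fill services several subsequent reads (spatial locality); the function and types are invented for illustration:

#include <stddef.h>

/* Sum an array serially: 'sum' exhibits temporal locality (re-used on
 * every iteration, so it stays in a register or the nearest cache);
 * 'data[i]' exhibits spatial locality (consecutive elements share a
 * cache line, so one line fill serves many subsequent reads). */
long sum_array(const int *data, size_t len)
{
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}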

Cache Widths
All types of cache have the concept of a unit size, which is generally the smallest object that can be addressed in the cache; if the cache supports larger objects, they will be a multiple of this unit size. In CPU caches the unit size is referred to as the cache line size (or cache width), in RAM the unit size is called the page size, and in the storage world the unit size is referred to as the block size.

The concept of unit size is an abstraction that makes cache management easier, as there are fewer discrete objects to manage. It also underpins the concept of spatial locality: if only the exact data that was required were brought into the cache, there would be no opportunity to utilise proximal data. Unit size is therefore an important choice when it comes to tuning spatial locality: too small a width and spatial locality goes unexploited, hurting the hit rate; too large a width and the cache holds fewer objects and churns, with much of each fetch going unused, reducing the overall efficiency of the cache.

Latency & Throughput
Latency plays a key role in caching; indeed the need to hide latency is often the driver for caching in the first place. Cache hierarchies are composed of various storage types with differing latency characteristics, with the lowest latency usually at the top of the hierarchy and the highest at the bottom. For example, CPU register access latency is around 0.25 nanoseconds, L1 cache is around 1 nanosecond, L2 around 5-12 nanoseconds, RAM around 10-60 nanoseconds and disk storage around 3-10 million nanoseconds.

Latency represents the time between the current node making a request to the child node of a caching hierarchy and the first bit of the response arriving back at the current node. Throughput can then be thought of as how long it takes for the remainder of the response to arrive. This is an important distinction from the more usual definitions of throughput and reveals its exact role in the caching hierarchy. Both latency and throughput usually worsen as the levels of a caching hierarchy are traversed downwards. Caches should usually be large enough to contain the working set of data from their child node in the caching hierarchy.
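To make the distinction concrete, the total time to fetch an object can be approximated as:

total fetch time = latency + (object size ÷ bandwidth)

For example (figures assumed purely for illustration), fetching a 64-byte cache line from RAM with 50 ns latency over a path sustaining 3.2 GB/s costs roughly 50 + 64 ÷ 3.2 ≈ 70 ns; latency dominates for small transfers, bandwidth for large ones.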

Cache Placement & Eviction Policies
Cache placement and eviction policies change significantly at the various levels of the cache hierarchy. When latency is the paramount consideration (i.e. at the top of the hierarchy) a concept known as set-associativity is utilised to enhance performance. Associativity divides the cache into a number of sets of slots, and each region of RAM is mapped to exactly one set; this limits the locations at which a given line can be placed in the cache. For example, in a two-way set-associative cache each memory address maps to a single set, and the line may occupy either of the two slots (ways) within that set.

Surely associativity reduces the efficiency of the cache, as it severely limits where cache misses can be placed? Yes, this is true; however the main driver in caches of this type is latency, and associativity is used to reduce the number of slots, and hence the time, that must be searched to determine whether a cache hit has occurred.
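A minimal sketch of the lookup arithmetic in C, assuming a cache with power-of-two line size and set count (all constants invented for illustration):

#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per cache line (assumed)      */
#define NUM_SETS  128u  /* number of sets (assumed)            */
#define NUM_WAYS  2u    /* slots per set: a two-way cache      */

/* Map an address to the only set it may occupy; a lookup then has
 * to compare just NUM_WAYS tags rather than search every slot. */
static uint32_t set_index(uintptr_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* The tag distinguishes the many addresses that alias to one set. */
static uintptr_t tag_of(uintptr_t addr)
{
    return addr / (LINE_SIZE * NUM_SETS);
}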

Assuming that a miss has occurred, the next consideration is which line in the cache to evict. There are three common algorithms for caches at the top of the hierarchy, and again the choice tends to be driven by latency considerations. The three algorithms, in order of speed, are random, FIFO (first in, first out) and LRU (least recently used). Random eviction is usually reserved for caches that must have very low latency: it is very quick to execute but does not exploit temporal locality as well as LRU.

As you traverse down the memory hierarchy the trade-off moves in the other direction, with longer algorithm execution times being accepted in exchange for greater algorithmic sophistication. This is possible because the overall cost of a cache miss lengthens as the hierarchy is descended, so the increased execution time represents a smaller proportion of the cache miss latency. It also becomes more expensive to incur a cache miss towards the bottom of the hierarchy, due to the increased variance of cache performance there.

The final element of caching policy is the cache write policy, which will be either write-through or write-back. Write-through is where the data is written to both the cache and the cache backing store (i.e. the child node in the cache hierarchy). Write-through is simpler to implement and also simplifies cache coherency (see next section). Write-back does not write to the cache backing store immediately and therefore allows write combining, where multiple writes to the same block overwrite each other and only the final result is written back to the child node.
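A toy sketch of the two policies in C, with invented types, and write-back implemented as a dirty flag that defers the store until eviction:

#include <stdbool.h>

struct cache_line {
    unsigned long tag;
    unsigned char data[64];
    bool dirty;                      /* write-back only: needs flushing */
};

/* Write-through: update the line and the backing store together,
 * so the backing store is always consistent. */
void write_through(struct cache_line *l, int off, unsigned char v,
                   void (*store)(unsigned long, int, unsigned char))
{
    l->data[off] = v;
    store(l->tag, off, v);
}

/* Write-back: update only the line and mark it dirty; repeated
 * writes to the block combine in the cache. */
void write_back(struct cache_line *l, int off, unsigned char v)
{
    l->data[off] = v;
    l->dirty = true;
}

/* On eviction a dirty line is flushed once, however many times
 * it was written. */
void evict(struct cache_line *l,
           void (*store_line)(unsigned long, const unsigned char *))
{
    if (l->dirty)
        store_line(l->tag, l->data);
    l->dirty = false;
}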

Cache Coherence
When multiple caches are used with a single back-end data source, such as in an SMP system, these caches must be kept consistent. Various techniques and protocols exist, ranging from cache write-through to bus snooping.

As previously mentioned, cache write-through means that writes are made through the cache to the backing store, which simplifies the cache coherency implementation, as the backing store always holds the latest data.

Another common coherency technique is called snooping, where hardware assistance is used to inspect the addresses crossing the memory bus; local copies are then either updated or invalidated in the local cache. Obviously this is not a huge undertaking in the SMP world, with its shared front-side bus, but it becomes much trickier in NUMA architectures, which may contain crossbar switches or some other type of complex memory switching infrastructure.

Caching In Modern Computer Systems

The following sections explore the various types of storage in modern computer systems and discuss their role in the caching hierarchy.

Non-Persistent Storage
The following sections describe the various forms of non-persistent storage in modern computers. By non-persistent we mean that the contents of the storage medium do not persist across a machine reset.

CPU Registers
Registers are the fastest form of non-persistent storage in a modern computer system; they are designed to run at the same speed as the processing elements of the CPU. Their optimisation and use are in the hands of the compiler. The only opportunity for register optimisation is to buy chips with more of them (or at least to take register counts into account when considering chip technologies). Alternatively it is possible to hand tune machine code, although this is an increasingly challenging exercise, as compiler optimisations have come on in leaps and bounds in recent years.

The only large speed increases are likely to come from re-engineering algorithms rather than from the hand tuning of machine code. It should be noted that the x86 chips from Intel have historically been register starved, containing few general purpose registers compared to their RISC counterparts. This has led to specialist compiler optimisations, such as frame pointer omission, which free up certain registers for re-use as general registers.
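For instance, with gcc the optimisation mentioned above is requested with the -fomit-frame-pointer flag (enabled implicitly at some optimisation levels on some targets):

# gcc -O2 -fomit-frame-pointer -o myprog myprog.c

The trade-off is that stack traces become harder to obtain, so debuggers and profilers may suffer.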

L1, L2, & L3 Cache
The various L1/L2/L3 caches are expensive but relatively fast memory. The initial distinction between L1 and L2 cache was whether they were directly on the chip or on the motherboard. It is now common for both the L1 and L2 to be on the chip itself, which has led to the advent of L3 cache, which now sits on the motherboard.

A driver for the increased levels of cache has been the recent large improvement in CPU performance, largely gained through increased clock speeds. This has accentuated the difference between the latency of the CPU registers and the latency of RAM. Although there have been performance increases and newer types of memory, these have not been large and have been focused more on bandwidth than latency. This has made multi-staged caches, consisting of much smaller amounts of faster (and more expensive) memory, more attractive to chip designers.

Performance tuning the use of L1/L2/L3 caches cannot be achieved by modifying the caches; their size is fixed by the chip designers. However, modern CPUs include counters that can be statistically sampled and that indicate the number of cache misses. This information can then be used to tune the data structures used in programs so that they are effectively aligned with cache line boundaries.
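As a hedged illustration of such a fix in C (assuming a 64-byte line size, which is typical but chip-specific, and gcc's aligned attribute), padding two frequently updated fields onto separate lines stops them contending for the same line:

/* Assume 64-byte cache lines; check your CPU's reference manual. */
#define CACHE_LINE 64

/* Two counters updated by different CPUs: giving each its own cache
 * line avoids false sharing, where one CPU's write repeatedly
 * invalidates the line held in the other CPU's cache. */
struct counters {
    unsigned long reads  __attribute__((aligned(CACHE_LINE)));
    unsigned long writes __attribute__((aligned(CACHE_LINE)));
};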

Main Memory
The latency of main memory is in the region of sixty times slower than L1 cache and two hundred and forty times slower than register access. This is why main memory has had increasing levels of cache placed in front of it. Main memory is divided into page frames; in the case of 32-bit x86 these frames are four kilobytes in size, while some 64-bit platforms (such as SPARC) use a frame size of eight kilobytes. With the exception of kernel caches and anonymous mapped memory, these page frames are used to cache blocks from persistent storage devices. Anonymous mapped memory is backed by the misnamed ‘swap file’, which provides persistent backing store for anonymous memory created as part of the normal execution of user space processes.

Viewed as part of a hierarchy of caches, main memory is a fast cache for persistent storage. Although RAM seems fairly slow when compared to a CPU register, it is extremely fast when compared to persistent storage, which is orders of magnitude slower again. This has led to a massive increase in the size of memory being installed in machines, partially because increases in disk speeds have not kept up with the increases in CPUs and RAM, but also because of the large increases in storage density.

Whilst the latency demands of faster CPUs have led to greater levels of caching, the massive growth in data sets, driven by large increases in storage density, has caused the size of RAM to increase so that the working set of data can be cached more effectively.

Performance tuning main memory is mainly concerned with correctly sizing the amount of main memory and the amount of ‘swap space’. This requires detailed analysis of the machine’s workload and its anonymous memory requirements. Most modern Operating Systems will simply lock anonymous memory into main memory if insufficient ‘swap space’ exists. So, to ensure efficient use of main memory, the number of active file system pages, the kernel requirements and the working set of anonymous memory all need to be calculated.

Persistent Storage
The following sections describe the various forms of persistent storage in modern computers. By persistent we mean that their contents persist across a machine reset.

The most common and well known forms of persistent storage are probably DAS (Directly Attached Storage), NAS (Network Attached Storage) and SAN (Storage Area Network). DAS is usually implemented using a block level protocol, mainly SCSI, and until recently used parallel cables. Parallel cabling schemes suffer from an effect known as skew, which eventually makes serial communications faster. Skew is the problem of synchronising the bits that pass down each individual strand of the parallel cable: the time between the receipt of the first and last bit is the skew of the cable, and this effect ultimately limits the amount of data that can be dispatched over parallel cabling schemes that use synchronous clocking.

Modern DAS drives are beginning to adopt serial SCSI, as the limits of the parallel wiring schemes have been reached. SCSI is a block level protocol and is also used in a more conventional networking sense, over Fibre Channel, to form SANs. The SCSI protocol has been separated from the underlying transport mechanism, allowing it to be used with serial, parallel and networking transports. The most recent addition has been the combination of the SCSI block level protocol with IP to produce the iSCSI storage protocol.

NAS is a different approach to storage which has historically not been as performant as DAS, although modern NAS implementations are starting to address this. In addition, the boundaries between NAS and SANs have become blurred as clustered file systems have been layered on top of SAN-based storage to allow multiple hosts to access the same file system. Both NAS and clustered file systems are complicated by locking and data integrity issues.

The use of DAS as a fast cache for distributed file systems has appeared in a number of file system implementations, most notably AFS and CacheFS, both of which were used to cache data from remote NAS-type services. However, since the speed of networking interfaces has increased and the performance of NAS has begun to rival DAS, these methods are no longer widely used.

Caching has begun to see a resurgence in the world of disk based persistent storage as data volumes have ballooned and near-line storage has been developed to alleviate the pressure of storing large quantities of data. This new trend of caching is referred to as ILM (Information Lifecycle Management).

When tuning DAS, SAN and NAS, much of the effort goes into ensuring that the various I/O layers are aligned, both to make efficient use of the upper level caches and to minimise data retrieval from the devices. This usually manifests itself in choices such as file system block size and in ensuring that the largest possible I/O operations can be passed through the various file system and driver levels.

In the case of DAS and SAN, some tuning effort also goes into the efficient use of track caches, particularly in RAID devices. Each disk has a cache that allows an entire track to be read at once; generally the stripe widths of RAID devices should be aligned with the track cache sizes, and a good value may be the number of columns in the stripe multiplied by the track size.
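As a worked example (figures assumed purely for illustration): a stripe spread over four data columns whose disks each carry a 256 KB track cache suggests a stripe width in the region of 4 × 256 KB = 1 MB, so that a single full-stripe I/O populates each disk's track cache exactly once.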

When considering protocols that are likely to be packet switched, reducing protocol overhead often becomes a key tuning criterion. For SAN and NAS this consideration often guides the choice of file system block sizes and various tuning decisions concerning the networking protocol used to transfer the data.

The advent of SATA (Serial ATA) and the use of less performant disk technologies have led to a new generation of less costly storage arrays. Although these arrays are generally less expensive, they are also less performant, meaning that to get reasonable performance from them it is often necessary to place caching devices in front of them. Due to the recent appearance of these devices and the lack of sophistication of the new ILM products, they are currently viewed more as disk based backup devices than as part of a cache hierarchy.

As ILM products mature this differentiation is expected to lessen and the products are expected to be much more tightly integrated into the caching hierarchy. The relatively poor performance and poor parallelism of these devices can easily be overcome by the reintroduction of technologies such as CacheFS back into the storage stack.

Generally the same techniques and rules used to tune any other type of storage apply to near-line storage. Focus should be given to reducing protocol overhead, optimising I/O block sizes and aligning data.

Cache Observability & Tuning
Various tools and techniques exist for tuning cache performance in both Solaris and Linux. For each Operating System we look at two roughly equivalent tools and then present techniques for using those tools to enhance cache performance.

Linux contains the usual set of UNIX tools that show aggregated system performance such as vmstat, iostat and sar. The kernel tends to have quite ad-hoc instrumentation that provides a basic insight into the behaviour of the kernel and associated processes.

As Linux matures, the observability of the system is increasing and improving. Future improvements are likely to be mainly in the area of dynamic instrumentation. Currently the Linux Trace Toolkit is being revised for acceptance into the kernel, which would provide a method of speculatively tracing the kernel; in addition, work continues on Systemtap, which aims to deliver dtrace-like functionality in the Linux kernel.

Whilst all of these tools are valuable for estimating working set sizes and studying systemic behaviour, the major method for analysing CPU cache performance is to use the CPU counters that are prevalent on modern CPUs.

Oprofile is a statistical profiler that can be used to study the performance of CPU related system activity. It works by registering a handler in the IDT (Interrupt Descriptor Table); NMIs (Non-Maskable Interrupts) are then generated by the system whenever particular CPU counters overflow, at a frequency governed by the specified sample rate. When a counter overflows, the handler is invoked, samples the program counter of the CPU and attributes the sample to the running code.

This type of profiling is referred to as statistical profiling, as the result is statistically valid rather than deterministic. An obvious trade off therefore exists with Oprofile (and all statistical samplers): the more samples you take, the more accurate the result is likely to be, but also the more intrusive the sampling becomes (although it should be noted that even at high sample rates Oprofile is still considered to have a very low overhead). Exceptionally short sample periods are discouraged, as there is a small chance of choosing a period shorter than the NMI handler execution time, which will cause the system to hang, looping in the handler code.

Oprofile can be used to sample any of the available CPU counters. It should be noted that Red Hat have disabled the necessary local APIC support in their uniprocessor kernels, due to issues seen with particular laptop models. Therefore, to use Oprofile on a uniprocessor machine you will either have to boot the SMP kernel or recompile the uniprocessor kernel with the necessary APIC support.

To study cache performance, choose the relevant counter for the chip you are working with; for cache miss events you need to ascertain which counters the chip provides, as separate counters allow the study of both data and instruction cache misses. A list of the available counters can be obtained using the following command:

# opcontrol --list-events

See the following reference for the chip you wish to profile:

Pentium IV
Pentium IV HT
AMD Athlon
AMD 64

To start the Oprofile collection session you must first initiate Oprofile with:

# opcontrol --init

This loads the associated kernel modules if they are not already loaded. Next Oprofile must be given a basic working configuration using:

# opcontrol --setup --separate=library --no-vmlinux --event=BSQ_CACHE_REFERENCE:1000:0x10f:1:1

This configures Oprofile for the Pentium IV and provides sampling of the data cache misses. The --event identifier breaks down as [Counter Name]:[Sample Overflow Value]:[Unit Mask]:[Profile Kernel Mode]:[Profile User Mode]. The --separate=library option simply means that library samples should be attributed to the binary which invoked them.

Once your basic Oprofile configuration is in place then the sampling can be started using:

# opcontrol --start

And once everything of interest has finished then Oprofile sampling can be stopped and flushed using:

# opcontrol --stop; opcontrol --dump

Once you have a valid set of samples these can be analysed using the opreport command. To begin to drill down on the data try:

# opreport --long-filenames

This will give a basic report, categorised by filename, showing which programs caused the cache misses that you sampled above. Once you have identified a file name of interest this can be studied in more depth using:

# opreport -l image:[path name from previous step]

This will then give you a report in which the various symbols in the executable are listed, along with details of how many cache misses they caused. If you are armed with the code and can recompile with the -g flag to gcc, which adds additional debug symbols, then try:

# opreport -l -g image:[path name from previous step]

This will give a report showing the symbols and misses annotated with the original source.

The above is an extremely basic introduction to Oprofile and does not do it anywhere near justice as a tool. To learn how to use Oprofile in anger, try reading the documentation and articles on the Oprofile website.

Oprofile has many uses beyond the narrow use presented here and any time spent learning the tool will reap benefits in a large variety of performance tuning situations. A more in-depth tutorial will be provided at a later date.

Valgrind has a number of modes, one of which can be used to model cache misses from the various CPU caches. Valgrind works by simulating the execution of your code and noting the cost of cache misses. This allows a number of programming errors to be detected and also allows cache behaviour to be modelled. To study the cache usage of a program use:

# valgrind --tool=cachegrind [command to profile]

This will print some summary statistics and create a file called cachegrind.out.[process id]. The output file contains line-by-line statistics and, whilst human readable, is probably best interpreted using the supplied cg_annotate program.
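For example, to annotate the results (using the process ID reported in the previous step):

# cg_annotate cachegrind.out.[process id]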

Again, this is just a brief smattering of the capabilities of valgrind; more information can be obtained from the documentation on the Valgrind website.

Valgrind has a number of modes not even touched on here and any time spent learning the tool will reap benefits in a large variety of performance tuning and debugging scenarios. A more in-depth tutorial will be provided at a later date.

OpenSolaris also contains the usual set of performance analysis tools such as sar, iostat, vmstat, etc. These tools tend to be a little more detailed under OpenSolaris (although the sar version does not have an equivalent to the Linux -n option, which is a shame). Over time OpenSolaris has also acquired a number of tools and APIs that are very useful but do not seem to have gained huge traction outside of Sun, for example TNF tracing, the hardware performance counter APIs, busstat, lockstat and, to an extent, kstat. This is a shame because they are all excellent observability tools and add to the rich set of systemic information that can be acquired under OpenSolaris (TNF even has elements of deterministic profiling, although the lack of aggregation complicates things).

All of these tools have been overshadowed by dtrace, which provides not only systemic information but also allows for deterministic profiling. This enables new tools, such as an I/O based version of top, to be constructed, and allows the gradual enumeration of problems: we can transition from “I have lots of I/O” to “I have lots of I/O because of PID X”. However the real strength of dtrace is that we can then look at process X in a deterministic manner and ascertain why it is generating large amounts of I/O.

Dtrace achieves this by providing production safe probe points in the kernel that are triggered and aggregated by scripts loaded into a VM inside the kernel. The VM provides certain safety features and portability advantages. The safety features and the in-kernel provider approach make dtrace ‘crash proof’, assuming there are no bugs in the VM or the providers, and they have been proven in real production environments where dtrace has seen wide scale use.

Sun has gradually been revealing various types of hardware counters in its hardware architecture. The best sets of counters are available in its SPARC based hardware and it will be interesting to see if Sun extends this into the AMD/x86 based models. Aside from the standard CPU counters showing instruction and data cache misses, Sun has also revealed hardware counters in its bus architectures.

Whilst this section predominantly looks at the CPU counters, it is worth taking a quick look at the HAR tool, available from Sun. This tool gives a great overview of the whole range of hardware counters in your system (assuming it is SPARC based). For general monitoring and for focusing your sights on the appropriate problem domain, it is invaluable.

Cputrack allows for the sampling of two concurrent counters. This means that results from cputrack must be multiplexed across multiple runs if more than two counters are required. This can be challenging if the process under study either varies between runs or is short running. If the process varies greatly between runs, it is debatable what would be achieved by attempting to tune it anyway, as performance tuning makes assumptions about future behaviour based on past behaviour.

The documentation for cputrack indicates that support exists for both the SPARC and Intel architectures. It is not clear which IA chips are supported, but the most common are documented, along with their counters, in the following references:

Pentium IV
Pentium IV HT
AMD Athlon
AMD 64

The SPARC architecture’s counters are documented at:

SPARC IV (Sec. 6)
SPARC III (Sec. 14)

Cputrack can be used to trace processes using the following generic command line:

# cputrack -c [CPU Counter] -t -T 1 -o [Output File] [Command To Analyse]

Additionally, it can be used to attach to a running process using the following generic command line:

# cputrack -c [CPU Counter] -t -T 1 -o [Output File] -p [Process ID]

Both of these invocations place their output into the file specified by [Output File]. The output will be a simple table listing the time, the process ID, the lightweight process ID, the number of cycles and the count for the selected counter during the sample period.
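As a concrete example, on an UltraSPARC III system the data cache might be studied as follows; the counter names are chip-specific assumptions, so check the reference above for your processor:

# cputrack -c pic0=DC_rd,pic1=DC_rd_miss -t -T 1 -o dcache.out [Command To Analyse]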

Whilst cputrack is a useful tool, it is not as configurable or flexible as oprofile; for instance it is not possible to separate out sample elements such as libraries, kernel routines, etc. However this lack of complexity makes cputrack a much simpler tool to use; it is only slightly more complex than using truss.

Further information and articles pertaining to cputrack can be found in the cputrack(1) man page and in Sun's documentation.

This section has been a brief introduction to cputrack only. It is recommended that you read the associated information and experiment with cputrack to fully understand its use and feature set.

Sun Studio
Sun Studio builds upon the hardware counter facilities utilised by cputrack and provides automated performance tuning features. These features first monitor the hardware counters to profile the application, and then use the output of the profiling process to apply major optimisations to the generated binary.

The first technique uses mapfiles to instruct the linker to re-organise binaries in memory, co-locating hot functions in an attempt to exploit spatial and temporal locality. This technique optimises at the function level. The following series of commands uses this optimisation technique with the Sun Studio compiler:

# cc -O -xF -o [Program Name] *.c
# collect [Program Name]
# er_print -mapfile [Program Name] [Program Name].map test.1.er
# cc -O -xF -M [Program Name].map -o [Program Name] *.c

The second technique that can be applied to improve cache performance is to optimise at the instruction level. This technique groups together frequently used instructions, again with the goal of exploiting spatial and temporal locality. The following series of commands can be executed to use this optimisation technique with the Sun Studio compiler:

# cc -O -xprofile=collect:[Program Name].profile -xipo -o [Program Name] *.c
# [Program Name]
# cc -O -xprofile=use:[Program Name].profile -xipo -o [Program Name] *.c

Mapfiles work well at the function level, whilst profile feedback works well at the instruction level, so it is common sense to want to optimise using both techniques. This is referred to as link optimisation, and it goes further than the previous two optimisations: it identifies both hot code and hot instructions within functions and then, with a technique similar to mapfiles, allows the hot instructions to be grouped together for optimum efficiency. This greatly enhances cache utilisation and efficiency. To utilise this technique compile as follows:

# cc -O -xprofile=collect:[Program Name].profile -xipo -o [Program Name] *.c
# [Program Name]
# cc -O -xprofile=use:[Program Name].profile -xipo -o [Program Name] *.c -xlinkopt

This will ensure that hot instruction areas are coalesced for maximum cache performance. The above is an extremely basic introduction to the subject of performance tuning code on OpenSolaris; further information is available in the Sun Studio documentation.

Hopefully at some point in the future I will be able to revisit this topic and provide more in-depth information and content.

In this article I have outlined the concepts of cache hierarchies and how they are fundamental to the performance of computers. I have then focused on performance tuning CPU caches and have explored tools available in both OpenSolaris and Linux for achieving this task. The tools I have focused most heavily on utilise the statistical counters available on an increasing range of hardware elements, as well as two profiling approaches.

These tools are just a selection of those available. In the future I will write up more detailed articles on the tools described here, as well as some of the other tools available such as dtrace, systemtap, vtune and the various linker options available.

For further information concerning performance tuning, please see the rest of this site. An audio version of this article is available.
