performanceGuru.com

This blog forms part of the http://performanceGuru.com website which focuses on performance tuning UNIX-like Operating Systems (including Linux).

Wednesday, September 28, 2005

 

Article: The Future Of Microprocessors...

There is yet another excellent article in ACM Queue outlining the past, current and future state of processors. One of the co-authors of the article is Kunle Olukotun, who founded Afara Websystems, which was purchased by Sun Microsystems. The technology developed by Afara prior to the purchase later became Niagara, and the article focuses closely on CMP (Chip Multi Processing) as well as SMT (Simultaneous Multithreading).

The article covers the triad of modern chip design: clock frequency, superscalar instruction issue and multi-threading. It reveals the delicate trade-offs that are now in place and the likely future directions now that two of the three avenues have been fully exploited. The article also covers the background of Intel, AMD and Sun's current processor design directions and, if you pay close attention, there might be a taste of what is in the forthcoming Rock CMP/SMT play from Sun Microsystems.

All-in-all a recommended read!


Tuesday, September 27, 2005

 

HyperTransport Webcast ...

On September 28, 2005, the HyperTransport Consortium will conduct a live webcast that will compare and contrast traditional serial and parallel bus characteristics and illustrate how blended architectures can combine the best of both to deliver a balance of features and capabilities for their intended applications. Register today for the online event, which will help boost your general knowledge of bus interconnects:

Register



 

Reconfigurable Computing On OpenFPGA...

Recently I have been researching Reconfigurable Computing, centred around the use of FPGAs. With the breakdown of Moore's Law due to cost ("High clock rate is a cost, not a benefit; it drives up costs of everything else in the machine"), as well as heat and power, the use of FPGAs has become more attractive.

It seems in the modern technology landscape that when a technology matures and starts to go mainstream someone adds the moniker "Open", so, true to form, I have been looking at the OpenFPGA site. The most useful part of the site is the presentations section, which contains a lot of useful material.

Current FPGA plays include:

- SGI's RASC
- Cray's Application Acceleration System

It will also be interesting to see if IBM's Cell processor becomes a Reconfigurable Computing play - their coming Linux support with spufs hints at this, particularly with the BladeCenter based prototype system.


Monday, September 26, 2005

 

Linux: Gartner VP Announces Death Of UNIX...

Gartner VP Donald Feinberg has announced the death of UNIX at a recent conference in India. Not sure if Apple or Sun agree with this statement, both of whom have recently claimed to have the world's most advanced Operating System with their respective OS X and Solaris products. I for one had been quite enthused by the emergence of OS X and the re-emergence of Solaris, feeling they were likely to drive innovation in the diverse Desktop and Server markets and give Linux a run for its money...


Sunday, September 25, 2005

 

Grid: Good Article In ACM Queue...

There is a great grid article in ACM Queue by Paul Strong of Sun Microsystems. It is well worth a read and I love the diagram at the bottom of page four, which provides a wonderful overview of where all the various elements of the data centre technology stack sit.


Monday, September 19, 2005

 

OpenSolaris: IP Tunneling Design Document Release...

Following the release of the IPMP design document last week the Clearview Team have released their design document for IP Tunneling.


Saturday, September 17, 2005

 

Netcat Added To My Tools Page...

Added Netcat to the tools page of my website. Netcat is often described as the 'Swiss army penknife' of networking tools - it's very useful for network debugging and testing. The description on my site is as follows:

Netcat is a featured networking utility which reads and writes data across network connections, using the TCP/IP protocol. It is designed to be a reliable "back-end" tool that can be used directly or easily driven by other programs and scripts. At the same time, it is a feature-rich network debugging and exploration tool, since it can create almost any kind of connection you would need and has several interesting built-in capabilities.


 

VMWare Releases New Version Of ESX Server...

VMware has quietly released a new version of ESX Server. The release notes indicate that it's just an update release that supports more software and hardware. A quick look at the supported Operating Systems rather disappointingly reveals that ESX Server still does not officially support Solaris; surely there must be more demand for this than for FreeBSD and Novell NetWare (no offence intended to advocates of these Operating Systems, but Solaris is more common in most Enterprises)...


 

Web Services For Machine Management...

There's an interesting article in ACM Queue about WSDM-MUWS and the prevailing standards for the management of hosts using Web Services, and how these shape up against traditional management technologies such as SNMP. It is quite interesting; one aspect that is not covered is the potential for integrating the various existing management technologies using ESB (Enterprise Service Bus) technologies.

This is an area that I have considered for some time now. I can imagine applications registering callbacks with disparate alerting services. Rather than paging support staff in the middle of the night because an application was running low on disk space, the application could raise an internal PO to the ERP system, have its budget checked, receive an authorisation token to present to the Storage System, which would then provision the required space, and finally acknowledge the alert.

Once systems management begins to integrate with current application architectures the automation opportunities are huge. Anyway, enough dreaming; back to reality and manual provisioning. Read the article, it's worth the effort.


 

PDQ Added To My Tools Page...

Added PDQ to the tools page of my website. PDQ is covered in Neil Gunther's book "The Practical Performance Analyst", which is an excellent text covering Erlang's concept of queuing theory and its practical uses in capacity planning and performance tuning. The description on my site is as follows:

PDQ (Pretty Damn Quick) is a queuing model solver, not a simulator. The queuing theory models discussed in Chapters 2 and 3 of The Practical Performance Analyst are incorporated into the solution methods used by PDQ. This saves you the labour of implementing this code each time you construct a queuing model and allows you to concentrate on the accuracy of your model rather than the accuracy of the solutions.


 

Sorry For The Confusion...

There seems to have been some recent confusion that my comment announcing the availability of a patch against the 2.6.13 kernel constituted a production endorsement of Systemtap. In case anyone else is not familiar with how the Linux development process works, here goes:

- a patch is generated against the most recently released mainline kernel from Linus's tree, such as the one I cited above.
- this is sent to the mailing lists for the areas of interest associated with the patch, such as the Systemtap lists in this case.
- the patch is peer reviewed by the members of that group, which happened in this case and a number of issues were uncovered.
- the patch is then given wider distribution on LKML, where it is reviewed by a much larger group of peers.
- after this process is complete, if there is a general appetite, the patch is submitted to Andrew Morton and enters his -mm tree.
- once Andrew has the patch merged and is happy that it should enter the mainline, it enters Linus's tree, where it will go through maybe up to four release candidates as the release stabilises and is tested by a much wider group of people.
- once the patch is in Linus's tree it may then be selected for inclusion into vendor kernels, or perhaps backported into earlier kernels by the various vendors, depending on its value to them.

Obviously, as the post I made pointed people to what was clearly an initial patch release and a request for comments, I assumed I did not need to point out that this was not 'production ready'; after all, I have recently pointed out in my blog that Systemtap is in early development.

Hope this clears things up - please feel free to post comments if you have any particular questions.


Friday, September 16, 2005

 

Project R Added To Tools...

Started a new Data Tools section in the tools area of my website and added the R Project with the following description:

This project produces a statistical computing and graphing language that can be used to plot graphs. The tool provides a number of statistical methods such as linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering, as well as graphical methods.


 

Systemtap Article In Redhat Magazine...

Will Cohen has written a Systemtap article for Redhat Magazine. The article describes how to get Systemtap running on FC4 and then goes through some scripts - a worthwhile read.


Wednesday, September 14, 2005

 

Sun's New Galaxy Servers To Run Linux..

Now that I have your attention: yes, it's sort of true. The embedded daughterboard, designed by Sun, that provides the ILOM facilities runs embedded Linux. Anyway, this gem was from "Sun Fire X4100 and X4200 Server Architectures - A Technical White Paper". It's a very worthwhile read, giving an excellent background on these new servers.

It focuses quite closely on an idealised Intel architecture, ignoring the X-Architecture chipsets from IBM which deal with the northbridge/southbridge side of things. I'm sure this is good for beating up Dell but it could be somewhat of an over-simplification of modern Intel architectures.

However, the TCO claims brought up in the media by Sun are interesting (twice the performance at half the power, i.e. a quarter of the power per unit of work), although I would like to see some data to back it all up. I will go and trawl for the data this weekend; I am always somewhat cynical of these large claims concerning performance and power until I see the configurations and the numbers (if anyone knows where these are then please leave a little something in the comments and I will update this post).

After looking at the whitepapers I was quite impressed with the basic architecture, seems very solid. The following things impressed:

- use of IPMI 2.0 (assuming it's stable/bug free)
- use of standardised components across the architecture (assuming this continues beyond the first two models)
- availability of IPMI options (although this is fairly standard in most of the current generation of servers)
- the attention to ILOM and the many options available
- support for SMASH/CLP

Things that failed to impress were:

- having to use a dedicated management port (this seemed like a poor decision when we are trying to drive up utilisation - better a VLAN on one of the Intel 82546EB ports)
- placing a card slower than 100MHz in PCI slot zero will cause the whole bus to clock at that speed, affecting the performance of the LSI SAS1064
- forcing the use of SAS drives and not supporting SATA on the LSI SAS1064

All in all it seems like a reasonable first crack at the "Standard Server Market", but I'm not sure if it's really plane-flying material. No doubt I will get the chance to benchmark one of these in the near future, so maybe I will be more impressed after I have done this - I am open to being more impressed!


 

Kprobes Adds Userspace Probing...

Prasanna Panchamukhi announced today that he has produced a patch against the 2.6.13 kernel which introduces the ability to probe user space. This functionality will provide one of the current major missing pieces in Systemtap.

For an introduction to Kprobes see the introductory article on LWN.

(If you are reading this post and think it constitutes a production endorsement of Systemtap then please see my further comments).


 

The Hierarchy of Needs

General Theory of Caching
The following sections cover the levels of caching that are present in modern day computer systems. Due to the aggressive increases in processing power, compared to the relative performance of other system components, caching has become a critical element of modern computer systems. The following sections provide a brief overview of caching and should provide enough detail to begin to grasp their impact on performance tuning a modern computer system.

Cache Hierarchies
Caches by their very nature occur in hierarchies. These hierarchies are usually organised around latency and bandwidth. Generally the hierarchies consist of progressively more performant (and therefore more expensive) storage, with each node in the hierarchy composed of a larger but slower type of storage than its parent node.

The hierarchies allow for smaller amounts of expensive but faster storage to partially mask the performance of its child node. A cache hit results in the item being retrieved directly from the current node; a cache miss will result in a request to the child node which in turn can cause misses to cascade down the hierarchy.

Temporal & Spatial Locality
Caching is based upon two key principles, both of which sound remarkably like terms from Star Trek: temporal and spatial locality. These are simple concepts to grasp: spatial locality means that references to an object are likely to be close together in space, and temporal locality means that references to an object are likely to be close together in time.

Temporal locality occurs when either instructions or data are re-used in short succession. This type of behaviour is not uncommon due to the amount of commonality in computer programming; for example, a variable is a memory location that is subject to multiple read and possibly write accesses. If these accesses occur before the item is flushed from the cache then they will be satisfied from the faster cache rather than be faulted from a slower child node.

Spatial locality occurs due to a common phenomenon which is that more data is fetched than is actually required. If the data or instructions are within close proximity of each other then it is likely they will be loaded directly from cache rather than be faulted from a slower child node. This type of caching is helpful when accessing densely packed data structures in a serial manner.
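
As a rough illustration of these two principles (a hypothetical sketch, not taken from any particular benchmark), the following C fragment sums a two dimensional array twice; the row-order walk touches consecutive memory locations and so benefits from every cache line fetched, whilst the column-order walk strides across memory and tends to miss on almost every access:

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static int grid[ROWS][COLS];

int main(void)
{
    long sum = 0;
    int i, j;

    /* Row-order walk: consecutive elements share cache lines, so spatial
     * locality means most accesses hit in cache. */
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += grid[i][j];

    /* Column-order walk: each access strides COLS * sizeof(int) bytes,
     * so the line fetched for one element is rarely re-used. */
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            sum += grid[i][j];

    printf("%ld\n", sum);
    return 0;
}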

Cache Widths
All types of cache have the concept of a unit size, which is generally the smallest size of object that can be addressed in the cache; if the cache supports larger objects, they will be a multiple of this initial unit size. In CPU caches the unit size is referred to as the cache width, in RAM the unit size is called the page size and in the storage world the unit size is referred to as the block size.

The concept of unit size is an abstraction that makes cache management easier, as there are fewer discrete objects to manage. It also underpins the concept of spatial locality: if only the exact data that was required were brought into the cache, there would be no opportunity to utilise proximal data. Obviously the cache width is an important choice when it comes to tuning spatial locality: too large a width and the cache hit rate will suffer, too small a width and the cache churn rate will be high, reducing the overall efficiency of the cache.

Latency & Throughput
Latency plays a key role in caching; indeed, the need to hide latency is often the driver for caching in the first place. Cache hierarchies are composed of various objects that have differing latency characteristics, with those displaying the lowest latency usually being at the top of the hierarchy and those with the highest latency at the bottom of the hierarchy. For example CPU register access latency is around 0.25 nanoseconds, L1 cache is around 1 nanosecond, L2 around 5-12 nanoseconds, RAM is around 10-60 nanoseconds and disk storage is around 3-10 million nanoseconds.

Latency represents the amount of time it takes between making a request from the current node to the child node of a caching hierarchy and the first bit of the response being received back by the current node. Throughput can be thought of as how long it takes for the remainder of the response to arrive. This is an important distinction from the more usual definitions of throughput and reveals the exact role of throughput in the caching hierarchy. Both latency and throughput usually worsen as the various levels of a caching hierarchy are traversed. Caches should usually be large enough to contain the working set of data from their child node in the caching hierarchy.
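
To make the distinction concrete, a back-of-the-envelope model (the figures below are illustrative assumptions, not measurements from any particular system) is that the time to satisfy a miss is the latency to the first bit plus the time to transfer the remainder of the unit at the link's bandwidth:

#include <stdio.h>

/* Illustrative model only: time to fill one cache line from the child node,
 * taken as first-bit latency plus transfer time for the rest of the line.
 * Note that 1 GB/s is roughly 1 byte per nanosecond. */
static double fill_time_ns(double latency_ns, double bandwidth_gb_s, double line_bytes)
{
    return latency_ns + line_bytes / bandwidth_gb_s;
}

int main(void)
{
    /* Hypothetical figures in the spirit of the latency ranges quoted above. */
    printf("L2 to L1 : %5.1f ns\n", fill_time_ns(8.0, 32.0, 64.0));
    printf("RAM to L2: %5.1f ns\n", fill_time_ns(50.0, 6.4, 64.0));
    return 0;
}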

Cache Placement & Eviction Policies
Cache placement and eviction policies change significantly at the various levels of the cache hierarchy. When latency is the paramount consideration (i.e. at the top of the hierarchy) a concept known as set-associativity is utilised to enhance performance. Associativity divides the cache up into a number of sets and maps each part of RAM into exactly one of those sets, which limits the locations at which a given line can be placed in the cache. For example, in a two-way set-associative cache each line-sized block of RAM maps to a single set, within which it may occupy either of two slots (ways).

Surely associativity reduces the efficiency of the cache, as it severely limits where cache misses can be placed? Yes, this is true; however, the main driver in caches of this type is latency, and associativity is used to reduce the potential number of slots, and hence the time, that must be searched to see if a cache hit has occurred.
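
A minimal sketch of how an address is carved up under set-associativity (the line size, set count and way count below are made-up values for illustration only):

#include <stdio.h>

#define LINE_SIZE 64   /* bytes per cache line (the unit size)            */
#define NUM_SETS  512  /* number of sets in the cache                     */
#define NUM_WAYS  2    /* slots per set, i.e. a two-way associative cache */

int main(void)
{
    unsigned long addr = 0x0040a3c8UL;      /* arbitrary example address */
    unsigned long line = addr / LINE_SIZE;  /* which line-sized unit the byte falls in */
    unsigned long set  = line % NUM_SETS;   /* the single set this line may live in    */
    unsigned long tag  = line / NUM_SETS;   /* identifies the line within that set     */

    /* On a lookup only NUM_WAYS tags (the ways of this one set) need to be
     * compared, which is what keeps the hit latency low. */
    printf("line %lu -> set %lu, tag %lu, ways searched: %d\n", line, set, tag, NUM_WAYS);
    return 0;
}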

Assuming that a miss occurred, the next consideration is which line in the cache to evict. There are three common algorithms when discussing caches at the top of the hierarchy, and again the choice tends to be driven by latency considerations. The three algorithms, in order of speed, are random, FIFO (first in, first out) and LRU (least recently used). Random eviction is usually only used for caches that have to be very low latency: it is very quick to execute, but it does not exploit temporal locality as well as LRU.
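
As a hedged sketch of the trade-off between these policies (not modelled on any real cache controller), LRU can be approximated by stamping a slot on every hit and scanning for the oldest stamp on a miss, whilst random eviction needs neither the per-hit bookkeeping nor the scan:

#include <stdio.h>
#include <stdlib.h>

#define WAYS 8

struct slot {
    unsigned long tag;
    unsigned long last_used;   /* stamped on every hit: the LRU bookkeeping cost */
    int           valid;
};

/* LRU victim selection: scan every way for the oldest stamp (slower, but
 * preserves recently used lines and so exploits temporal locality). */
static int evict_lru(const struct slot set[WAYS])
{
    int i, victim = 0;
    for (i = 1; i < WAYS; i++)
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    return victim;
}

/* Random victim selection: no scan and no per-hit state, so it is very fast,
 * but it may well throw out a line that is about to be re-used. */
static int evict_random(void)
{
    return rand() % WAYS;
}

int main(void)
{
    struct slot set[WAYS] = {{0, 0, 0}};
    set[5].last_used = 42;   /* pretend way 5 was touched most recently */
    printf("LRU victim: way %d, random victim: way %d\n",
           evict_lru(set), evict_random());
    return 0;
}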

As you traverse the memory hierarchy the trade-off moves in the other direction, with algorithmic complexity being traded for longer execution times. This is possible because the overall time of a cache miss lengthens as the cache hierarchy is traversed, so the increased execution time represents a smaller proportion of the cache miss latency. It also becomes more expensive to incur a cache miss as the hierarchy is traversed, due to the increased variance of cache performance towards the bottom of the cache hierarchy.

The final element of caching policy is the cache write policy. This will be either write-through or write-back. Write-through is where the data is written to both the cache and the cache backing store (i.e. the child node in the cache hierarchy). Write-through is simpler to implement and also simplifies cache coherency (see next section). Write-back does not immediately write to the cache backing store and therefore allows write combining, which is where multiple writes to the same block overwrite each other and only the final result is written back to the child node.
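
The difference between the two policies can be sketched as follows (purely illustrative; write_to_child() is a stand-in for whatever the child node in the hierarchy happens to be):

#include <string.h>

struct line {
    unsigned char data[64];
    int dirty;                              /* only meaningful for write-back */
};

/* Stand-in for pushing a line to the child node (RAM, disk, etc.). */
static void write_to_child(const struct line *l) { (void)l; }

/* Write-through: the cache and its backing store are updated together. */
static void write_through(struct line *l, const unsigned char *buf)
{
    memcpy(l->data, buf, sizeof l->data);
    write_to_child(l);
}

/* Write-back: repeated writes combine in the cache and only the final
 * contents reach the child node when the line is eventually evicted. */
static void write_back(struct line *l, const unsigned char *buf)
{
    memcpy(l->data, buf, sizeof l->data);
    l->dirty = 1;
}

static void evict(struct line *l)
{
    if (l->dirty)
        write_to_child(l);
    l->dirty = 0;
}

int main(void)
{
    unsigned char buf[64] = {0};
    struct line l = {{0}, 0};
    write_through(&l, buf);
    write_back(&l, buf);
    evict(&l);
    return 0;
}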

Cache Coherence
When multiple caches are used with a single back-end data source, such as in an SMP system, these caches must be consistent. Various techniques and protocols exist, ranging from cache write-through through to bus snooping.

As previously mentioned, cache write-through means that writes are made through the cache to the backing store, which simplifies the cache coherency implementation, as there are no dirty lines hidden in individual caches that the other caches must track in order to see the writes.

Another common coherency technique is called snooping, where hardware assistance is used to inspect the addresses crossing the memory bus and local copies are either updated or invalidated in the local cache. Obviously this is not a huge undertaking in the SMP world, which has a shared front-side bus, but it becomes much trickier in NUMA architectures, which may contain cross-bar switches or some other type of complex memory switching infrastructure.

Caching In Modern Computer Systems

The following sections explore the various types of storage in modern computer systems and discuss their role in the caching hierarchy.

Non-Persistent Storage
The following sections describe the various forms of non-persistent storage in modern computers. By non-persistent we mean that the contents of the storage medium do not persist across a machine reset.

CPU Registers
Registers are the fastest form of non-persistent storage in a modern computer system. They are designed to run at the same speed as the processing elements of the CPU. Their optimisation and use is in the hands of the compiler. The only opportunity for register optimisation is to buy chips with more of them (or at least take them into account when considering chip technologies); alternatively it is possible to hand tune machine code, although this is an increasingly challenging exercise, as compiler optimisations have come on in leaps and bounds in recent years.

The only large speed increases are likely to come from re-engineering algorithms rather than from the hand tuning of machine code. It should be noted that the x86 chips from Intel have historically been register starved and contain few general purpose registers compared to their RISC counterparts. This has led to specialist compiler optimisations, such as omitting frame pointers (gcc's -fomit-frame-pointer), which free up certain registers to be re-used as general registers.

L1, L2, & L3 Cache
The various L1/L2/L3 caches in chips are expensive but relatively fast memory. The initial distinction between L1 and L2 cache was whether they were directly on the chip or on the motherboard. It is now common for both the L1 and L2 to be on the chip itself, which has led to the advent of L3 cache, which is now on the motherboard.

A driver for the increased levels of cache has been the recent large improvement in CPU performance, which has largely been gained through increased clock speeds. This has only accentuated the difference between the latency of the CPU registers and the latency of RAM. Although there have been performance increases and newer types of memory, these have not been large and have also been more focused on bandwidth than latency. This has made the use of multi-staged caches, consisting of much smaller amounts of faster (and more expensive) memory, more attractive to chip designers.

Performance tuning the use of the L1/L2/L3 caches cannot be achieved by modifying the caches; their size is fixed by the chip designers. Modern CPUs do, however, include counters that can be statistically sampled, which indicate the number of cache misses. This information can then be used to tune the data structures used in programs so that they are effectively aligned with the cache line boundaries.
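
As a hedged example of the kind of change such an analysis can drive (64 bytes is an assumed line width and the attribute syntax shown is gcc-specific), two counters that are updated by different threads can be kept on separate lines so that the line does not ping-pong between CPUs:

#include <stdio.h>

/* Both counters packed into one cache line: updates from different CPUs
 * cause the line to bounce between their caches (false sharing). */
struct counters_packed {
    unsigned long reader_hits;
    unsigned long writer_hits;
};

/* Each hot field aligned to an assumed 64-byte line: more space is used,
 * but each CPU now misses only on its own counter's line. */
struct counters_aligned {
    unsigned long reader_hits __attribute__((aligned(64)));
    unsigned long writer_hits __attribute__((aligned(64)));
};

int main(void)
{
    printf("packed: %lu bytes, aligned: %lu bytes\n",
           (unsigned long)sizeof(struct counters_packed),
           (unsigned long)sizeof(struct counters_aligned));
    return 0;
}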

Main Memory
The latency of main memory is in the region of sixty times slower than L1 cache and two hundred and forty times slower than register access. This is why main memory has had increasing levels of cache placed in front of it. Main memory is divided into page frames; in the case of 32-bit x86 these frames are four kilobytes in size and most sixty-four-bit platforms use a frame size of eight kilobytes. With the exception of kernel caches and anonymously mapped memory, these page frames are used to cache blocks from persistent storage devices. Anonymously mapped memory is backed by the misnamed 'swap file', which provides persistent backing store for anonymous memory created as part of the normal execution of user space processes.
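
The frame size in use on a given machine can be confirmed rather than assumed; a minimal sketch using the standard POSIX call:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Reports the page frame size the kernel is using on this machine,
     * for example 4096 bytes on 32-bit x86. */
    long page = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page);
    return 0;
}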

In the context of a hierarchy of caches this allows us to view main memory as a fast cache for persistent storage. Although RAM seems fairly slow when compared to a CPU register, it is extremely fast when compared to persistent storage, which tends to be fifty to a hundred times slower than RAM. This has led to a massive increase in the size of memory being installed in machines. This is partially due to increases in disk speeds not keeping up with the increases in CPUs and RAM, but also due to the large increases in storage density.

Whilst latency demands driven by faster CPUs have led to greater levels of caching, the massive increases in data sets driven by the large increases in storage density have, to an extent, also caused the size of RAM to increase, so that the working set of data can be cached more effectively.

Performance tuning main memory is mainly concerned with correctly sizing the amount of main memory and the amount of 'swap space'. This requires detailed analysis of the machine's workload and the requirements of anonymous memory. Most modern day Operating Systems will simply lock anonymous memory into main memory if insufficient 'swap space' exists. So, to ensure efficient use of main memory, the amount of active file system pages, the kernel requirements and the working set of anonymous memory need to be calculated.

Persistent Storage
The following sections describe the various forms of persistent storage in modern computers. By persistent we mean that their contents persist across a machine reset.

DAS, NAS & SAN
The most common and well known forms of persistent storage are probably DAS (Directly Attached Storage), NAS (Network Attached Storage) & SAN (Storage Area Network). DAS is usually implemented using a block level protocol, mainly SCSI and until recently using parallel cables. Parallel cable schemes suffer from an effect known as skew, which eventually makes serial communications faster. Skew is the problem of synchronising the bits that pass down each individual strand of the parallel cable. The time between the receipt of the first and last bit is referred to as the skew of the cable and this effect eventually limits the amount of data that can be dispatched using parallel cabling schemes that use synchronous clocking.

Modern DAS drives are beginning to adopt serial SCSI, as the limits of the parallel wiring schemes have been exceeded. SCSI is a block level protocol and is used in a more conventional networking sense with fibre channel to form SANs. The SCSI protocol has been separated from the underlying transport mechanism to allow it to be used with a serial protocol, a parallel protocol and networking protocols. The most recent addition to the SCSI implementation has been the use of the SCSI block level protocol with the IP protocol to provide the iSCSI storage protocol.

NAS is a different approach to storage which has historically not been as performant as DAS; however, modern day NAS implementations are starting to address this. In addition, the boundaries between NAS and SANs have become blurred as clustered file systems have been placed on top of SAN based storage to allow multiple hosts to access the same file system. Both NAS and clustered file systems are complicated by locking and data integrity issues.

The use of DAS as a fast cache for distributed file systems has appeared in a number of file system implementations, most notably AFS and CacheFS, in which local disk was used to cache data from remote NAS type services. However, since the speed of networking interfaces increased and the performance of NAS started to rival DAS, these methods are not widely used.

Caching has begun to see a resurgence in the world of disk based persistent storage since data volumes have ballooned and near-line storage has been developed to alleviate the pressures for storing large quantities of data. This new trend of caching is referred to as ILM (Information Lifecycle Management).

When tuning DAS, SAN and NAS much of the effort goes into ensuring that the various I/O layers are aligned, both to make efficient use of the upper level caches and to minimise data retrieval from the devices. This usually manifests itself in choices such as the file system block size and in ensuring that the largest possible I/O operations can be passed through the various file system and driver levels.

In the case of DAS and SAN some tuning effort also goes into the efficient use of track caches, particularly in RAID devices. Each disk has a cache that allows an entire track to be read from the disk; generally the stripe widths of RAID devices should be aligned with the track cache sizes, and a good value may be the number of columns in the stripe multiplied by the track size (for example, four data columns with a 256KB track would suggest a 1MB stripe width).

When considering protocols that are likely to be packet switched, reducing the amount of protocol overhead often becomes a key tuning criterion. In terms of SAN and NAS this is often a consideration that guides the choice of file system block sizes and the various tuning decisions made concerning the networking protocol used to transfer the data.

Near-Line
The advent of SATA (Serial ATA) and the use of less performant disk technology have led to a new generation of less costly storage arrays. Although the arrays are generally less expensive, they are also less performant, meaning that to get reasonable performance from these devices it is often necessary to place caching devices in front of them. Due to the recent appearance of these devices and the lack of sophistication of the new ILM products, they are currently viewed more as disk based backup devices rather than as part of a cache hierarchy.

As ILM products mature this differentiation is expected to lessen and the products are expected to be much more tightly integrated into the caching hierarchy. The relatively poor performance and poor parallelism of these devices can easily be overcome by the reintroduction of technologies such as CacheFS back into the storage stack.

Generally the same techniques and rules can be used to tune near line storage as any other type of storage. Focus should be given to reducing protocol overhead, optimising I/O block sizes and aligning data.

Cache Observability & Tuning
Various tools and techniques exist for tuning cache performance in both Solaris and Linux. For each Operating System we look at two tools which are roughly equivalent and then present techniques for using these tools to enhance cache performance.

Linux
Linux contains the usual set of UNIX tools that show aggregated system performance such as vmstat, iostat and sar. The kernel tends to have quite ad-hoc instrumentation that provides a basic insight into the behaviour of the kernel and associated processes.

As Linux matures the observability of the system is increasing and improving. The likely improvements in the future will mainly be in the area of dynamic instrumentation. Currently the Linux Trace Toolkit is being revised for acceptance into the kernel, which would provide a method of speculatively tracing the kernel, and in addition work continues on Systemtap, which aims to deliver dtrace-like functionality in the Linux kernel.

Whilst all of these tools are valuable for estimating working set sizes and studying systemic behaviour the major method for analysing CPU cache performance is to use the CPU counters that are prevalent on modern CPUs.

Oprofile
Oprofile is a statistical profiler that can be used to study the performance of CPU related system activity. It works by registering a handler in the IDT (Interrupt Descriptor Table) and then, according to the specified sample rate, causing NMIs (Non-Maskable Interrupts) to be generated by the system when particular CPU counters overflow. When they overflow the handler is invoked, which samples the program counter of the CPU and attributes the sample to the running code.

This type of profiling is referred to as statistical profiling, as the result is statistically valid rather than deterministic. This means that an obvious trade-off exists with Oprofile (and all statistical samplers): the more samples you take the more accurate the result is likely to be, but also the more intrusive the sampling method is (it should be noted, however, that even at high sample rates Oprofile is still considered to have a very low overhead). Taking exceptionally short sample periods is discouraged, as there is a small chance that you will choose a sample period that is shorter than the NMI handler execution time, which will cause the system to hang, looping in the handler code.

Oprofile can be used to sample any of the available CPU counters. It should be noted that RedHat have disabled the necessary local APIC support in their uniprocessor kernels due to issues seen with particular laptop models. Therefore to use Oprofile on a uniprocessor machine you will either have to boot the SMP kernel or recompile the uniprocessor kernel with the necessary local APIC support.

To study cache performance, choose the relevant counter for the chip you are working with; if we are looking at cache miss events then we need to ascertain which counters expose them. These counters allow the study of both data and instruction cache misses. A list of the available counters can be obtained using the following command:

# opcontrol --list-events

See the following reference for the chip you wish to profile:

Pentium IV http://oprofile.sourceforge.net/docs/intel-P4-events.php3
Pentium IV HT http://oprofile.sourceforge.net/docs/intel-P4-ht2-events.php3
AMD Athlon http://oprofile.sourceforge.net/docs/amd-athlon-events.php3
AMD 64 http://oprofile.sourceforge.net/docs/amd-hammer-events.php3

To start the Oprofile collection session you must first initiate Oprofile with:

# opcontrol --init

This loads the associated kernel modules if they are not already loaded. Next Oprofile must be given a basic working configuration using:

# opcontrol --setup --separate=library --no-vmlinux --event=BSQ_CACHE_REFERENCE:1000:0x10f:1:1

This configures Oprofile for the Pentium IV and provides sampling of data cache misses. The --event identifier breaks down as [Counter Name]:[Sample Overflow Value]:[Unit Mask]:[Profile Kernel Mode]:[Profile User Mode]. The --separate=library option simply means that library samples should be attributed to the binary which invoked them.

Once your basic Oprofile configuration is in place then the sampling can be started using:

# opcontrol --start

And once everything of interest has finished then Oprofile sampling can be stopped and flushed using:

# opcontrol --stop ; opcontrol --dump

Once you have a valid set of samples these can be analysed using the opreport command. To begin to drill down on the data try:

# opreport --long-filenames

This will give a basic report, categorised by filename, showing which programs caused the cache misses that you sampled above. Once you have identified a file name of interest this can be studied in more depth using:

# opreport -l image:[path name from previous step]

This will then give you a report in which the various symbols in the executable are listed, along with details of how many cache misses they caused. If you are armed with the code and can recompile with the -g flag to gcc, which adds debug symbols, then try:

# opreport -l -g image:[path name from previous step]

This will give a report showing the symbols and misses annotated with the original source.

The above is an extremely basic introduction to Oprofile and does not do it anywhere near justice as a tool. To learn how to use Oprofile in anger try reading some of the links available at:

http://del.icio.us/KenRobson/oprofile

Oprofile has many uses beyond the narrow use presented here and any time spent learning the tool will reap benefits in a large variety of performance tuning situations. A more in-depth tutorial will be provided at a later date.

Valgrind
Valgrind has a number of modes, one of which can be used to model cache misses from the various CPU caches. Valgrind works by simulating the execution of your code and noting the cost of cache misses. This allows a number of programming errors to be detected and also allows the cache behaviour to be modelled. To study the cache usage of a program use:

# valgrind --tool=cachegrind [command to profile]

This will print some summary statistics and create a file called cachegrind.out.[process id]. The output file will contain line by line statistics and whilst human readable, it is probably best interpreted using the supplied cg_annotate program.

Again this is just a brief smattering of the capabilities of valgrind and more information can be obtained by reading some of the links available at:
http://del.icio.us/KenRobson/valgrind

Valgrind has a number of modes not even touched on here and any time spent learning the tool will reap benefits in a large variety of performance tuning and debugging scenarios. A more in-depth tutorial will be provided at a later date.

OpenSolaris
OpenSolaris also contains the usual set of performance analysis tools such as sar, iostat, vmstat, etc. However, these tools tend to be a little more detailed under OpenSolaris (although the sar version does not have an equivalent to the Linux -n option, which is a shame). Over time OpenSolaris has also acquired a number of tools and APIs that are very useful but do not seem to have gained huge traction outside of Sun, for example TNF tracing, the hardware performance counter APIs, busstat, lockstat and, to an extent, kstat. This is a shame because they are all excellent observability tools and add to the rich set of systemic information that can be acquired under OpenSolaris (TNF even has elements of deterministic profiling, although the lack of aggregation complicates things).

All of these tools were overshadowed by dtrace, which provides not only systemic information but also allows for deterministic profiling. This allows new tools, such as an I/O based version of top, to be constructed. This allows for the gradual enumeration of problems, so we can transition from "I have lots of I/O" to "I have lots of I/O because of PID X". However, the real strength of dtrace is that we can look at process X in a deterministic manner and ascertain why it is generating large amounts of I/O.

Dtrace achieves this by providing production safe probe points in the kernel that are triggered and aggregated by scripts which are loaded into a VM inside the kernel. The VM provides certain safety features and portability advantages. The safety features and in-kernel provider approach allow dtrace to be 'crash proof', assuming there are no bugs in the VM or the providers. The safety features have been proven in real Production environments where dtrace has seen wide scale use.

Cputrack
Sun has gradually been revealing various types of hardware counters in its hardware architectures. The best sets of counters are available in its SPARC based hardware and it will be interesting to see if Sun extends this into the AMD/x86 based models. Aside from the standard CPU counters showing instruction and data cache misses, Sun has also revealed hardware counters in its bus architectures.

Whilst this section will predominantly look at the CPU counters it is worth taking a quick look at the HAR tool which is available from http://solarisinternals.com. This tool gives a great overview of the whole range of hardware counters in your system (assuming it is SPARC based). For general monitoring and focusing your sights on the appropriate problem domain, this is an invaluable tool.

Cputrack allows for the sampling of two concurrent counters. This means that results from cputrack must be multiplexed from multiple runs if more than two counters are required. This can be challenging if the process under study either varies between runs or is short running. If the process under study varies largely between runs it is debatable what will be achieved by attempting to tune it, as performance tuning makes assumptions about future behaviour based on past behaviour.

The documentation for cputrack indicates that support exists for both the SPARC and Intel architectures. It is not clear which IA chips are supported, but the most common are documented along with their counters at:

Pentium IV http://oprofile.sourceforge.net/docs/intel-P4-events.php3
Pentium IV HT http://oprofile.sourceforge.net/docs/intel-P4-ht2-events.php3
AMD Athlon http://oprofile.sourceforge.net/docs/amd-athlon-events.php3
AMD 64 http://oprofile.sourceforge.net/docs/amd-hammer-events.php3

The SPARC architecture’s counters are documented at:

SPARC IV http://www.sun.com/processors/manuals/USIV_v1.0.pdf (Sec. 6)
SPARC III http://www.sun.com/processors/manuals/USIIIv2.pdf (Sec. 14)

Cputrack can be used to trace processes using the following generic command line:

# cputrack -c [CPU Counter] -t -T 1 -o [Output File] [Command To Analyse]

Additionally, it can be used to attach to a running process using the following generic command line:

# cputrack -c [CPU Counter] -t -T 1 -o [Output File] -p [Process ID]

Both of these invocations place their output into the file specified by [Output File]. The output will be a simple table listing the time, the process ID, the lightweight process ID, the number of cycles and the count for the selected counter during the sample period.

Whilst cputrack is a useful tool it is not as configurable or flexible as oprofile. It is not possible to separate out the various sample elements such as libraries, kernel routines, etc. However this lack of complexity makes cputrack a much simpler tool to use and it is only slightly more complex than using truss.

Further information and articles pertaining to cputrack can be found at:

http://del.icio.us/KenRobson/cputrack

This section has been a brief introduction to cputrack only. It is recommended that you read the associated information and experiment with cputrack to fully understand its use and feature set.

Sun Studio
Sun Studio builds upon the hardware counter facilities utilised by cputrack and provides automated performance tuning features. These features first monitor the hardware counters to profile the application and then use the output of the profiling process to apply major optimisations to the executed binary.

The first technique uses mapfiles to instruct the linker to re-organise binaries in memory to co-locate hot functions in an attempt to exploit spatial and temporal locality. This optimisation technique is useful for optimising at the function level. The following series of commands can be executed to use this optimisation technique with the Sun Studio compiler:

# cc -O -xF -o [Program Name] *.c
# collect [Program Name]
# er_print -mapfile [Program Name] [Program Name].map [Program Name].err
# cc -O -xF -M [Program Name].map -o [Program Name] *.c

The second technique that can be applied to improve cache performance is to optimise at the instruction level. This technique groups together frequently used instructions, again with the goal of exploiting spatial and temporal locality. The following series of commands can be executed to use this optimisation technique with the Sun Studio compiler:

# cc -O -xprofile=collect:[Program Name].profile -xipo -o [Program Name] *.c
# [Program Name]
# cc -O -xprofile=use:[Program Name].profile -xipo -o [Program Name] *.c

Mapfiles work well with functions, whilst profile feedback works well at the instruction level, so it seems common sense to want to optimise using both techniques. This is referred to as link optimisation and this technique goes further than the previous two optimisations. The technique identifies both hot code and hot instructions within functions but then allows, with a technique similar to mapfiles, for the hot instructions to be grouped together for optimum efficiency. This greatly enhances cache utilisation and efficiency. To utilise this technique compile as follows:

# cc -O -xprofile=collect:[Program Name].profile -xipo -o [Program Name] *.c
# [Program Name]
# cc -O -xprofile=use:[Program Name].profile -xipo -o [Program Name] *.c -xlinkopt

This will ensure that hot instruction areas are coalesced for maximum cache performance. The above is an extremely basic introduction to the subject of performance tuning code on OpenSolaris. Further information on this subject area is available at:

http://del.icio.us/KenRobson/opensolaris+performance+compiler

Hopefully at some point in the future I will be able to revisit this topic and provide more in-depth information and content.

Conclusions
In this article I have outlined the concepts of cache hierarchies and how they are fundamental to the performance of computers. I have then focused on performance tuning CPU caches and have explored tools available in both OpenSolaris and Linux for achieving this task. The tools I have focused most heavily on utilise the statistical counters available on an increasing range of hardware elements, as well as two profiling approaches.

These tools are just a selection of those available. In the future I will write up more detailed articles on the tools described here, as well as some of the other tools available such as dtrace, systemtap, vtune and the various linker options available.

For further information concerning performance tuning please see www.performanceguru.com. An audio version of this article is available.


Saturday, September 10, 2005

 

Dtrace Exposes Perl Execution...

After its recent success with other dynamic languages, such as Ruby and Java, it seems dtrace can now profile Perl code. The work done by Alan Burlison is embryonic and misses a small number of possible function entry and exit points, but nonetheless it is a very valuable piece of work. I would imagine it will not take long for people to build on Alan's work and solve the issues he raises in his post.

Alan's post about how he went about instrumenting Perl is located at:

Alan's Blog Entry

Hopefully the various dynamic language observability efforts made in relation to dtrace will ensure that in future the interpreters used for these languages will be designed with observability in mind and will therefore be simpler and more efficient to instrument.


Thursday, September 08, 2005

 

Added Cacti To Tools Page...

Added Cacti to the tools page of my website:

performanceguru.com

The description I gave it went as follows:

"Cacti provides a PHP front-end to the rrdtool plotting package. This package also runs on Solaris but I have listed it under Linux as all of the dependencies are already satisfied in Linux (although this will probably even out with the advent of OpenSolaris). In functionality terms Cacti is very similar Orca except it is not limited to the SE Toolkit as a data source. It can be viewed as a template driven data graphing system."


 

Set Of Systemtap Install Instructions...

Someone beat me to it. Kevin Stafford produced a summary of the steps required to get Systemtap running on Fedora Core 4.

System Tap Install Summary

The project seems to be progressing well, with a lot of the basic building blocks beginning to shape up. Attention is starting to turn to developing the Tapsets (sets of primitives that will be blessed into Guru Mode so they can contain embedded C if required), which will form the basis of the scripting environment and will play a similar role for script writers as dtrace providers (NB - this is in the sense of data providers rather than implying any architectural similarity).


 

Xen Article In Linux Journal...

The Linux Journal has a very good introductory article covering Xen. Well worth a read:

Article Source


Tuesday, September 06, 2005

 

Systemic Versus Application Profiling

Introduction
One of the most common issues seen in the performance field is the attempted use of system wide aggregated statistics to diagnose application performance issues. When a production system performance issue is reported, resist the temptation to study system wide statistics and extrapolate, particularly if they are historic smoothed averages with no indication of standard deviation (e.g. historic sar data taken as five minute samples from cron). It is usually far more fruitful to identify the thread of execution which is the basis of the complaint (e.g. the report is running slowly, my screen is updating slowly, etc.). Once the performance complaint has been translated into a thread of execution then a deterministic profiling method should be used to produce a performance profile of the problematic thread of execution.

Truss & Strace
The simplest tools for doing this are truss (Solaris) and strace (Linux). Later versions of both tools can also provide information about user space library calls. Caution should be exercised when tracing user libraries, as the information can be misleading: these tools do not separate calls between library routines (i.e. if one library calls another, the time will be allotted only to the first library call). To a greater or lesser extent these tools can also introduce the Heisenberg effect (affecting what you are observing); to get an indication of the effect of the tool, try timing an application run with and without the tool. Most of these tools use interposing, breakpoints or watchpoints, all of which can have a noticeable effect on execution flow, as they either single step code, introduce significant additional code in the interposing path or require complex memory checks (this may change in the future as most modern processors have specialised debug registers for breakpoints and watchpoints).

Dtrace & Systemtap

Whilst these tools are useful, far more useful tools have been delivered, or are in development, which help with this task. Sun Microsystems have delivered the excellent and proven dtrace tool in their Solaris operating environment. Dtrace provides a scripting language that activates dynamic probes in both user space and kernel space, allowing the exact flow of execution to be traced from user space into kernel space. This is exactly what is required for Production performance issues. Currently dtrace is being developed in many interesting areas, including the ability to trace various dynamic languages (Java bytecode, Python, Perl, etc.) and interesting futures such as the integration of hardware counters into the dtrace framework. Various vendors are collaborating on a similar tool for Linux called Systemtap, but the project is in its infancy; although it is evolving quickly, it has yet to be proven Production safe and this is likely to be its biggest challenge. Dtrace and Systemtap are referred to as deterministic profilers, meaning they instrument each function call; the probes are extremely lightweight when enabled, allowing almost the exact execution path to be profiled.

Oprofile

Linux has a more mature tool known as Oprofile that can be used safely on Production systems. Oprofile is a statistical profiler, meaning it profiles by taking samples. Various hardware counters can be set, each with a sample rate. When the counter reaches the set sample point an NMI (Non Maskable Interrupt) is generated and the handler samples the program counter. The use of NMIs means that code with interrupts disabled can be accurately profiled. These samples are then collected in user space and can be used to generate useful reports. To profile a thread of execution a time source is generally sampled, which provides a statistical sample of what each thread of execution on the system was doing; this data can then be manipulated to isolate the thread of execution of interest using the reporting tools. Currently Oprofile is being extended to begin to understand dynamic code such as Java. The only issue with Oprofile is that it does not easily identify sleep states such as waiting for I/O or locks (unless they are spinning). In spite of this drawback Oprofile is by far the best application profiling tool available in Linux, particularly if your application is CPU bound.

Conclusions

Whilst Linux has a mature tool in Oprofile and a promising tool in Systemtap, Solaris probably wins the observability stakes currently, due to dtrace and its proven stability and track record, as well as its future roadmap. Finally, to conclude, when are systemic statistics (i.e. sar and friends) of most use? Mainly for pre-production tuning, during the benchmarking of performance simulations where multiple system elements must be optimised together, and for capacity planning exercises.


Monday, September 05, 2005

 

Tools Page Update...

Have created a new Network Tools section on the tools page and started to collect useful network tools. For more details please see:

performanceguru.com

Suggestions of omissions and additions are most welcome.


 

Systemtap Review

I noticed the following review of Systemtap:

Link To Review

I did not think it was entirely fair. I lurk on the Systemtap mailing list, as I feel it is an important performance project for Linux. I will quickly go through some of the major issues James raises:

Distribution War

I have not noticed a distribution war and I am not sure what this comment was based on. It is true that most of the Developers are using Fedora Core 4 with certain Rawhide elements included in their environment. I assume this is a reference to the following email sent to James by Ulrich Drepper:

Systemtap Mail Archive

This is hardly a distribution war, rather a defence of the elfutils package that Ulrich wrote and now supports. All Ulrich pointed out was that the distribution James was using had a very old copy of binutils that needed updating; he was not told to move to a RedHat release or discouraged from using Debian.

If It's Not RPM - Don't Deal With It

I believe this comment revolves around an email exchange with Roland McGrath and Ulrich Drepper:

Systemtap Mail Archive

In response to further problems experienced by James, it transpired he had used a source tar ball supplied with the Systemtap CVS drop incorrectly. Roland McGrath helped James out, and Ulrich Drepper stated that people should not distribute source tar balls (as the Systemtap project was doing in this case with elfutils) but should distribute source RPMs, so that developers and maintainers do not waste their time dealing with build logic.

Lack of Net Install

The use of Kickstart is clearly covered in the base documentation for RedHat releases, in much the same way that you can derive how to use Jumpstart for OpenSolaris from the official Solaris documentation. This documentation is located at:

RHEL Manuals

Or if you prefer (or don't like RedHat, as James seems not to) you can get help from the how-to section of the Linux Documentation Project:

TLDP HOW-TO's

After Many Tries I Got A CVS Drop To Build

The CVS repository is the head of the Development tree. It is in a state of flux, as it is a Development tree, and depending on what hour of the day you grab your tree, its stability will vary. This is normal, and if you wish for a stable version of Systemtap then there is a version in the Rawhide tree that runs against the original FC4 kernel, so you can take a base FC4 install and apply this Systemtap RPM (if I remember correctly you will need to move to version 1.11 of elfutils, and it must be that exact version).

Conclusions

I sympathise with James, but Systemtap is currently under Development and if you wish to download the CVS repository then it should run (with varying bugs) against any kernel that includes RelayFS, kprobes and kretprobes. In addition you will need the latest version of elfutils, which is usually available from the Rawhide repository; if not, email the list and Ulrich Drepper will usually spin the latest version to the repository for the external project participants.

The suggestion for a HOW-TO is a fair one. Once I conclude my current whitepaper work I will see if I have time to pull one together.


Sunday, September 04, 2005

 

Tools Page Update...

Have started to pull together a list of reasonable performance tools for Solaris and Linux. Updated the tools page on my website with these - please see:

http://www.performanceguru.com

Have almost completed the caching whitepaper and hope to have that up in the next few days.
