Performance Bound Modeling
Overview
Our approach to performance modeling involves combining application signatures (characterizing the memory and computational pattern of an application) with machine signatures (characterizing the fundamental aspects of a machine independent of the applications executing on it) to establish performance bounds, namely, the upper limit in performance that can be expected for a given application on a given system. Performance bounds are useful for several reasons. They provide information about the efficiency of an implementation, where efficiency is defined to be the ratio of achieved performance to the performance bound. This provides guidance on how best to improve performance. In the case where efficiency is low, performance-engineering tools can be used to improve the implementation. In the case where efficiency is high but performance is unacceptable, a change of algorithms or architectures can increase the bounds. For example, in the case of an iterative linear solver limited by memory bandwidth, performance could be improved through the use of a less bandwidth-intensive algorithm, such as a block Krylov method. Alternatively, the memory in the nodes of a Linux cluster might be upgraded in order to increase the effective bandwidth.
Estimating Performance Bounds
PBound is a tool being developed to automatically insert performance estimation code into a given C input. It can thus be used by developers of numerical applications to determine the upper bound on the performance of their code as well as identify potential bottlenecks that may be preventing the application from achieving that bound.
PBound is available for trial use. The developers envision that it will be provided as an online service in the near future.
Obtaining the source code
To anonymously check out the source code release (v0.1.0) from the Subversion repository (recommended):
svn co https://svn.mcs.anl.gov/repos/performance/pbound/branches/pbound_0_1_0
To anonymously check out the development source code from the Subversion repository:
svn co https://svn.mcs.anl.gov/repos/performance/pbound
If you are interested in contributing to the development, please contact Sri Hari Krishna Narayanan.
Configuration and build
- Development Version Prerequisites: Boost, the ROSE compiler infrastructure, OpenAnalysis, Rose2OA, OmegaLibrary, and Python with the SymPy Library. See OmegaLibrary Installation Notes and Rose Installation Notes as well.
- The build system uses GNU autotools. Here are example configuration and build steps for the release version:
cd /path/to/pbound/source/dir
./autogen.sh
./configure --with-rose=/path/to/rose --with-boost=/path/to/boost --prefix=/path/to/install/dir
make
make install
- Here are example configuration and build steps for the development version:
hg clone http://mercurial.mcs.anl.gov/ad/omega-project/
cd omega-project
# Edit line 82 of Makefile.config to set DESTDIR
make depend
make libomega.a
make libcode_gen.a
make install

hg clone http://mercurial.mcs.anl.gov/ad/RoseFE_OpenAnalysis/
cd RoseFE_OpenAnalysis
./configure --with-boost=/path/to/boost CXX=/path/to/same/gcc/as/used/for/rose \
    CXXFLAGS="-g -O0" --prefix=/chose-one/RoseFE_OpenAnalysis
make
make install

hg clone http://mercurial.mcs.anl.gov/ad/Rose2OA/
cd Rose2OA/
./configure --with-rose=/path/to/rose --with-boost=/path/to/boost --with-openanalysis=/path/to/openanalysis \
    --enable-targetfortran CXX=/path/to/same/g++/as/used/for/rose CC=/path/to/same/gcc/as/used/for/rose \
    CCFLAGS=-g CXXFLAGS=-g --prefix=/chose-one/Rose2OA/
make
make install

cd /path/to/pbound/dir
./autogen.sh
./configure --with-rose=/path/to/rose --with-boost=/path/to/boost --with-openanalysis=/path/to/openanalysis \
    --with-rose2oa=/path/to/rose2oa/ --with-omegalibrary=/path/to/omegalibrary \
    CXX=/path/to/same/g++/as/used/for/rose CC=/path/to/same/gcc/as/used/for/rose \
    CCFLAGS=-g CXXFLAGS=-g --prefix=/path/to/install/dir
make
make install
- If successful, you should now have an executable /path/to/install/dir/bin/pbound available.
Usage Instructions
Command Line Options
PBound requires an input source file and an init file to operate. If no init file is specified, the default file /path/to/install/dir/bin/pbound/share/pbound.init is used. PBound is invoked as follows:
/path/to/install/dir/bin/pbound [-pbound:(m|minmax_switch) value] [-pbound:(d|debugMsgLevel) value] [-pbound:(w|warningMsgLevel) value] [-pbound:(i|infoMsgLevel) value] [-pbound:(s|sourceCommentLevel) value] [-pbound:(l|loopPerfMacroLevel) value] [-pbound:(c|configFile) value] input_filename
The simplest way to get started is to use the default options:
/path/to/install/dir/bin/pbound input_filename
Generating Bounds
Assuming that the input is source.c, PBound will create an output file called source_perf.c with bounds information inserted into it. Consider the file below to be source.c.
#include <stdlib.h>
#define N 2000
void stream( double * a, double *b, double *c, double scalar) {
  int j;
  for (j=0; j<N; j++) {
    a[j] = b[j]+scalar*c[j];
  }
}
int main() {
  double *a, *b, *c, scalar;
  a = (double * ) malloc(N * sizeof(double));
  b = (double * ) malloc(N * sizeof(double));
  c = (double * ) malloc(N * sizeof(double));
  stream(a,b,c,scalar);
  return 0;
}
Assume that PBound is invoked using the command below. This invocation uses the default init file and, through the -pbound:s option, specifies that slicing should not be performed.
/path/to/install/dir/bin/pbound -pbound:s 0 source.c
PBound creates two files: source_perf.c and source.c.pbout. The file source.c.pbout is currently not expected to be meaningful for all execution scenarios. The file source_perf.c is shown below. In addition to the original code, it contains a statement that includes the header file pbound_list.h and a call to the function pboundLogInsert enclosed in preprocessor directives. The arguments of pboundLogInsert contain the bounds expressions generated by PBound.
#include "pbound_list.h"
#include <stdlib.h>
#define N 2000
void stream(double *a,double *b,double *c,double scalar)
{
  int j;
  for (j = 0; j < 2000; j++) {
    a[j] = ((b[j]) + (scalar * (c[j])));
  }
#ifdef pbound_log
  pboundLogInsert("[email protected]@3",1,0,2 * (2000 / 2) + 1,2000 / 2,2 * 2000 + 1,2000 / 2);
#endif
}
int main()
{
  double *a;
  double *b;
  double *c;
  double scalar;
  a = ((double *)(malloc((((2000) * (sizeof(double )))))));
  b = ((double *)(malloc((((2000) * (sizeof(double )))))));
  c = ((double *)(malloc((((2000) * (sizeof(double )))))));
  stream(a,b,c,scalar);
  return 0;
}
Modifying Generated Code
In order to extract the information from the function call(s) to pboundLogInsert, the following snippet of code should be inserted manually into the generated code just before the return statement(s) in the function int main(). It is anticipated that this manual operation will be removed in future versions of PBound.
#ifdef pbound_log
pbound_list_print_list();
#endif
The modified generated code is given below.
#include "pbound_list.h"
#include <stdlib.h>
#define N 2000
void stream(double *a,double *b,double *c,double scalar)
{
  int j;
  for (j = 0; j < 2000; j++) {
    a[j] = ((b[j]) + (scalar * (c[j])));
  }
#ifdef pbound_log
  pboundLogInsert("[email protected]@3",1,0,2 * (2000 / 2) + 1,2000 / 2,2 * 2000 + 1,2000 / 2);
#endif
}
int main()
{
  double *a;
  double *b;
  double *c;
  double scalar;
  a = ((double *)(malloc((((2000) * (sizeof(double )))))));
  b = ((double *)(malloc((((2000) * (sizeof(double )))))));
  c = ((double *)(malloc((((2000) * (sizeof(double )))))));
  stream(a,b,c,scalar);
#ifdef pbound_log
  pbound_list_print_list();
#endif
  return 0;
}
Compiling Generated Code
- Add /path/to/install/dir/lib/ to LD_LIBRARY_PATH using:
setenv LD_LIBRARY_PATH /path/to/install/dir/lib/:$LD_LIBRARY_PATH
or
export LD_LIBRARY_PATH=/path/to/install/dir/lib/:$LD_LIBRARY_PATH
- The output code needs the libraries and header files provided in /path/to/install/dir/lib/ and /path/to/install/dir/include/ respectively. Compile using:
gcc source_perf.c -o source_perf.exe -I/path/to/install/dir/include/ -Dpbound_log -lpbound -L/path/to/install/dir/lib/
Obtaining Bounds
The bounds are output by running the executable as shown below.
./source_perf.exe
The output lists the number of loads, stores, and operations that should be performed by the application. This output format was chosen so that it can be easily integrated with other software. Other formats may be available in the future.
Serial No., Location, Integer Loads (bytes), Integer Stores (bytes), Floating Point Loads (bytes), Floating Point Stores (bytes), IOPs, FLOPs
0,/Users/snarayan/Sources/pbound/examples/[email protected]@3,1,0,2001,1000,4001,1000
See the paper in the Documents portion of the page on how to use these values.
Advanced Usage
The behavior of PBound can be modified by the contents of the init file and command line options.
Init File
The following is the default init file, '/path/to/install/dir/bin/pbound/share/pbound.init'. An alternate init file can be specified using the command-line option [-pbound:(c|configFile) value]. The init file contains options that describe the architecture on which source.c will be executed, as well as lists that prevent PBound from analyzing certain files and functions. Not all architectural values are used by the current version of PBound; they reflect a subset of the features PBound is expected to have.
[SYSTEM]
python = "/usr/bin/python"

[GENERAL]
include_dirs = /usr/include/:/usr/local/include/:/usr/lib/:/usr/local/lib/

[ARCHITECTURAL_PARAMETERS]
clock_rate = 2.8 ; in GHz (float)
l1_read_latency = 4 ; in cycles (int)
l1_write_latency = 4 ; in cycles (int)
l2_read_latency = 12 ; in cycles (int)
l2_write_latency = 12 ; in cycles (int)
l3_read_latency = 28 ; in cycles (int)
l3_write_latency = 28 ; in cycles (int)
mem_read_latency = 86 ; in cycles (int)
mem_write_latency = 86 ; in cycles (int)
mem_read_bandwidth = 3585 ; in cycles (int)
mem_write_bandwidth = 2715 ; in cycles (int)
vector_ops = true ; boolean default is true
vector_length = 2 ; integer default is 2
vector_register_length = 8 ; integer default is 8
fused_loads = 2 ; integer default is 2 - means that 2 loads can be fused
fused_stores = 2 ; integer default is 2 - means that 2 stores can be fused
; add things like cache size, line details etc, write back, write through
; number of cores perhaps
; later perhaps I/O information?

[INTRINSIC_LATENCY]
; Every entry in this section must have a corresponding entry in INTRINSIC_ISSUE_RATE
fp_add = 1 ; in cycles (int)
fp_sub = 1 ; in cycles (int)
fp_mul = 1 ; in cycles (int)
fp_div = 1 ; in cycles (int)
fp_mul_add = 1 ; in cycles (int)
fp_mul_sub = 1 ; in cycles (int)
fp_div_add = 1 ; in cycles (int)
fp_div_sub = 1 ; in cycles (int)

[INTRINSIC_ISSUE_RATE]
; Every entry in this section must have a corresponding entry in INTRINSIC_LATENCY
fp_add = 1 ; in issues per cycle (int)
fp_sub = 1 ; in issues per cycle (int)
fp_mul = 1 ; in issues per cycle (int)
fp_div = 1 ; in issues per cycle (int)
fp_mul_add = 1 ; in issues per cycle (int)
fp_mul_sub = 1 ; in issues per cycle (int)
fp_div_add = 1 ; in issues per cycle (int)
fp_div_sub = 1 ; in issues per cycle (int)

[FUSABLE_OPERATIONS]
; Every entry here must have an entry in INTRINSIC_ISSUE_RATE and INTRINSIC_LATENCY
; Entries are case insensitive
; The '_' is necessary to separate the ops that are fused.
fp_mul_add
fp_mul_sub
fp_div_add
fp_div_sub

[INACTIVE_FUNCTIONS]
exit
creat
open
close
write
printf
fprintf
vfprintf
vsprintf
scanf
sscanf
fscanf
fclose
fflush
va_start
va_arg
va_end
gnu_dev_major
gnu_dev_minor
gnu_dev_makedev
Modifying Architectural Parameters
The following settings modify the architecture presented to PBound, which in turn affects the bounds that PBound generates.
Modifying Read and Write instruction counts
This is performed by modifying the values for fused_loads and fused_stores. A value of 1 means that the architecture does not support fused loads and stores.
Modifying IOP/FLOP counts
SIMD is supported if the value of vector_ops is true. The vector_length value adjusts how many simultaneous executions are performed for each SIMD instruction.
The fusable/intrinsic operations supported by the architecture are listed in the section FUSABLE_OPERATIONS. Fusing instructions reduces the IOP/FLOP count by 1 for each pair of fused instructions.
Command line options
The command-line option -pbound:(slice|s), if set to true, causes the input code to be sliced so that only the calls to pboundLogInsert() remain in the generated output. This allows the generated code to be executed without the overhead of the computation being analyzed.
Documents
- H. K. Narayanan, B. Norris, and P. D. Hovland, "Generating Performance Bounds from Source Code," Preprint ANL/MCS-P1685-1009, October 2009.
- Doxygen Documentation (doxygen:index.html)