Performance Bound Modeling

Overview

Our approach to performance modeling involves combining application signatures (characterizing the memory and computational pattern of an application) with machine signatures (characterizing the fundamental aspects of a machine independent of the applications executing on it) to establish performance bounds, namely, the upper limit in performance that can be expected for a given application on a given system. Performance bounds are useful for several reasons. They provide information about the efficiency of an implementation, where efficiency is defined to be the ratio of achieved performance to the performance bound. This provides guidance on how best to improve performance. In the case where efficiency is low, performance-engineering tools can be used to improve the implementation. In the case where efficiency is high but performance is unacceptable, a change of algorithms or architectures can increase the bounds. For example, in the case of an iterative linear solver limited by memory bandwidth, performance could be improved through the use of a less bandwidth-intensive algorithm, such as a block Krylov method. Alternatively, the memory in the nodes of a Linux cluster might be upgraded in order to increase the effective bandwidth.
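The efficiency ratio defined above can be written as a one-line C function. This is only a minimal sketch: the name efficiency and its signature are illustrative assumptions and are not part of PBound.

```c
#include <assert.h>

/* Ratio of achieved performance to the performance bound.
 * Both arguments must use the same unit (for example, GFLOP/s).
 * Hypothetical helper for illustration; not part of PBound. */
double efficiency(double achieved, double bound)
{
    assert(bound > 0.0);  /* the ratio is undefined for a zero bound */
    return achieved / bound;
}
```

For instance, efficiency(1.0, 4.0) evaluates to 0.25. An implementation that reaches only a quarter of its bound is a candidate for tuning before any change of algorithm or hardware is considered.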

Estimating Performance Bounds

PBound is a tool being developed to automatically insert performance estimation code into a given C input. It can thus be used by developers of numerical applications to determine the upper bound on the performance of their code as well as identify potential bottlenecks that may be preventing the application from achieving that bound.

PBound is available for trial use. The developers envision that it will be provided as an online service in the near future.

Obtaining the source code

To anonymously check out the source code release (v0.1.0) from the Subversion repository (recommended):

svn co https://svn.mcs.anl.gov/repos/performance/pbound/branches/pbound_0_1_0

To anonymously check out the development source code from the Subversion repository:

svn co https://svn.mcs.anl.gov/repos/performance/pbound

If you are interested in contributing to the development, please contact Sri Hari Krishna Narayanan.

Configuration and build

  • Release version (pbound_0_1_0) prerequisites: Boost, the ROSE compiler infrastructure, and Python.
  • The build system uses GNU autotools. Here are example configuration and build steps for the release version:
cd /path/to/pbound/source/dir
./autogen.sh
./configure --with-rose=/path/to/rose --with-boost=/path/to/boost --prefix=/path/to/install/dir
make
make install
  • Here are example configuration and build steps for the development version:
hg clone http://mercurial.mcs.anl.gov/ad/omega-project/
cd omega-project
# Edit line 82 of Makefile.config to set DESTDIR
make depend
make libomega.a
make libcode_gen.a
make install

hg clone http://mercurial.mcs.anl.gov/ad/RoseFE_OpenAnalysis/
cd RoseFE_OpenAnalysis
./configure --with-boost=/path/to/boost CXX=/path/to/same/gcc/as/used/for/rose \
            CXXFLAGS="-g -O0"  --prefix=/choose-one/RoseFE_OpenAnalysis
make 
make install

hg clone http://mercurial.mcs.anl.gov/ad/Rose2OA/
cd Rose2OA/
./configure --with-rose=/path/to/rose --with-boost=/path/to/boost --with-openanalysis=/path/to/openanalysis \
            --enable-targetfortran CXX=/path/to/same/g++/as/used/for/rose CC=/path/to/same/gcc/as/used/for/rose \
            CCFLAGS=-g  CXXFLAGS=-g --prefix=/choose-one/Rose2OA/
make 
make install


cd /path/to/pbound/dir
./autogen.sh
./configure --with-rose=/path/to/rose --with-boost=/path/to/boost --with-openanalysis=/path/to/openanalysis \
            --with-rose2oa=/path/to/rose2oa/ --with-omegalibrary=/path/to/omegalibrary \
             CXX=/path/to/same/g++/as/used/for/rose CC=/path/to/same/gcc/as/used/for/rose \
             CCFLAGS=-g  CXXFLAGS=-g --prefix=/path/to/install/dir
make
make install
  • If successful, you should now have an executable /path/to/install/dir/bin/pbound available.

Usage Instructions

Command Line Options

PBound requires an input source file and an init file to operate. If no init file is specified, the default file /path/to/install/dir/bin/pbound/share/pbound.init is used. The command-line usage is:

/path/to/install/dir/bin/pbound  [-pbound:(m|minmax_switch) value]  [-pbound:(d|debugMsgLevel) value]  
           [-pbound:(w|warningMsgLevel) value]  [-pbound:(i|infoMsgLevel) value]  
           [-pbound:(s|sourceCommentLevel) value]  [-pbound:(l|loopPerfMacroLevel) value]
           [-pbound:(c|configFile) value] input_filename

The simplest way to get started is to use the default options:

/path/to/install/dir/bin/pbound  input_filename

Generating Bounds

Assuming that the input is source.c, PBound will create an output file called source_perf.c with bounds information inserted into it. Consider the file below to be source.c.

#include <stdlib.h>
#define N 2000
void stream( double * a, double *b, double *c, double scalar)
{
  int j;
  for (j=0; j<N; j++) {
    a[j] = b[j]+scalar*c[j];
  }
}

int main()
{
  double *a, *b, *c, scalar;
  a = (double * ) malloc(N * sizeof(double));
  b = (double * ) malloc(N * sizeof(double));
  c = (double * ) malloc(N * sizeof(double));
  stream(a,b,c,scalar);
  return 0;
}

Assume that PBound is invoked using the command below. This invocation uses the default init file and passes -pbound:s 0 to specify that slicing should not be performed.

/path/to/install/dir/bin/pbound -pbound:s 0 source.c 

PBound creates two files: source_perf.c and source.c.pbout. The file source.c.pbout is currently not expected to be meaningful for all execution scenarios. The file source_perf.c is shown below. In addition to the original code, it contains an include of the header file pbound_list.h and a call to the function pboundLogInsert, guarded by preprocessor directives. The arguments of pboundLogInsert contain the bounds expressions generated by PBound.

#include "pbound_list.h"
#include <stdlib.h>
#define N 2000

void stream(double *a,double *b,double *c,double scalar)
{
  int j;
  for (j = 0; j < 2000; j++) {
    a[j] = ((b[j]) + (scalar * (c[j])));
  }
#ifdef pbound_log
  pboundLogInsert("source.c@6@3",1,0,2 * (2000 / 2) + 1,2000 / 2,2 * 2000 + 1,2000 / 2);
#endif
}

int main()
{
  double *a;
  double *b;
  double *c;
  double scalar;
  a = ((double *)(malloc((((2000) * (sizeof(double )))))));
  b = ((double *)(malloc((((2000) * (sizeof(double )))))));
  c = ((double *)(malloc((((2000) * (sizeof(double )))))));
  stream(a,b,c,scalar);
  return 0;
}

Modifying Generated Code

In order to extract the information from the function call(s) to pboundLogInsert, the following snippet of code should be inserted manually into the generated code just before the return statement(s) in the function int main(). It is anticipated that this manual operation will be removed in future versions of PBound.

#ifdef pbound_log
pbound_list_print_list();
#endif

The modified generated code is given below.

#include "pbound_list.h"
#include <stdlib.h>
#define N 2000

void stream(double *a,double *b,double *c,double scalar)
{
  int j;
  for (j = 0; j < 2000; j++) {
    a[j] = ((b[j]) + (scalar * (c[j])));
  }
#ifdef pbound_log
  pboundLogInsert("source.c@6@3",1,0,2 * (2000 / 2) + 1,2000 / 2,2 * 2000 + 1,2000 / 2);
#endif
}

int main()
{
  double *a;
  double *b;
  double *c;
  double scalar;
  a = ((double *)(malloc((((2000) * (sizeof(double )))))));
  b = ((double *)(malloc((((2000) * (sizeof(double )))))));
  c = ((double *)(malloc((((2000) * (sizeof(double )))))));
  stream(a,b,c,scalar);
#ifdef pbound_log
  pbound_list_print_list();
#endif
  return 0;
}

Compiling Generated Code

  • Add /path/to/install/dir/lib/ to LD_LIBRARY_PATH. In csh or tcsh:
setenv LD_LIBRARY_PATH /path/to/install/dir/lib/:$LD_LIBRARY_PATH

In sh or bash:

export LD_LIBRARY_PATH=/path/to/install/dir/lib/:$LD_LIBRARY_PATH
  • The output code needs the libraries and header files provided in /path/to/install/dir/lib/ and /path/to/install/dir/include/, respectively. Compile using:
gcc source_perf.c -o source_perf.exe -I/path/to/install/dir/include/ -Dpbound_log -lpbound -L/path/to/install/dir/lib/

Obtaining Bounds

The bounds are output by running the executable as shown below.

./source_perf.exe

The output lists the number of loads, stores, and operations that should be performed by the application. This output format was chosen so that it can be easily integrated with other software. Other formats may be available in the future.

Serial No., Location, Integer Loads (bytes), Integer Stores (bytes), Floating Point Loads (bytes), Floating Point Stores (bytes), IOPs, FLOPs
0,/Users/snarayan/Sources/pbound/examples/source.c@6@3,1,0,2001,1000,4001,1000

See the paper listed in the Documents portion of this page for how to use these values.
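The numbers in the output row are simply the values of the integer expressions passed to pboundLogInsert in source_perf.c. The sketch below evaluates those same expressions with the trip count as a parameter. The names struct bounds and stream_bounds are hypothetical, and the mapping of arguments to columns is inferred from the order of the example call and the example header.

```c
#include <assert.h>

/* The six counts carried by one output row, in column order. */
struct bounds {
    long int_loads, int_stores;
    long fp_loads, fp_stores;
    long iops, flops;
};

/* Evaluate the expressions from the generated pboundLogInsert call for the
 * stream loop, with the trip count n in place of the literal 2000.
 * Hypothetical helper for illustration; not part of PBound. */
struct bounds stream_bounds(long n)
{
    struct bounds b;
    assert(n >= 0);
    b.int_loads  = 1;
    b.int_stores = 0;
    b.fp_loads   = 2 * (n / 2) + 1;  /* integer division, as in the generated code */
    b.fp_stores  = n / 2;
    b.iops       = 2 * n + 1;
    b.flops      = n / 2;
    return b;
}
```

With n = 2000 the fields are 1, 0, 2001, 1000, 4001, and 1000, which are the six values in the example row.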

Advanced Usage

The behavior of PBound can be modified by the contents of the init file and command line options.

Init File

The following is the default init file '/path/to/install/dir/bin/pbound/share/pbound.init'. An alternate init file can be specified using the command line option [-pbound:(c|configFile) value]. The init file contains options that describe the architecture that source.c will be executed on, as well as entries that prevent PBound from analyzing certain files or functions. Not all architectural values are used in the current version of PBound; they reflect a subset of the features PBound may eventually support.

[SYSTEM]
	python = "/usr/bin/python"

[GENERAL]
	include_dirs = /usr/include/:/usr/local/include/:/usr/lib/:/usr/local/lib/

[ARCHITECTURAL_PARAMETERS]
        clock_rate =    2.8     ; in GHz (float)
        l1_read_latency =  4    ; in cycles (int)
        l1_write_latency = 4    ; in cycles (int)
        l2_read_latency =  12   ; in cycles (int)
        l2_write_latency = 12   ; in cycles (int)
        l3_read_latency =  28   ; in cycles (int)
        l3_write_latency = 28   ; in cycles (int)
        mem_read_latency = 86   ; in cycles (int)
        mem_write_latency =86   ; in cycles (int)
        mem_read_bandwidth =  3585 ; in cycles (int)
        mem_write_bandwidth = 2715 ; in cycles (int)
        vector_ops =   true     ; boolean default is true
        vector_length   =  2    ; integer default is 2
        vector_register_length = 8 ; integer default is 8
        fused_loads     =  2 ;   integer default is 2 - means that 2 loads can be fused         
        fused_stores    =  2 ;   integer default is 2 - means that 2 stores can be fused                                                          
; add things like cache size, line details etc, write back, write through
;number of cores perhaps
; later perhaps I/O information?
        

[INTRINSIC_LATENCY]
; Every entry in this section must have a corresponding entry in INTRINSIC_ISSUE_RATE                       
      fp_add =  1 ;  in cycles (int)
      fp_sub =  1 ;  in cycles (int)
      fp_mul =  1 ;  in cycles (int)
      fp_div =  1 ;  in cycles (int)    
      fp_mul_add =  1 ;  in cycles (int)
      fp_mul_sub =  1 ;  in cycles (int)      
      fp_div_add =  1 ;  in cycles (int)
      fp_div_sub =  1 ;  in cycles (int)      
      	
[INTRINSIC_ISSUE_RATE]
; Every entry in this section must have a corresponding entry in INTRINSIC_LATENCY 
      fp_add =  1 ;  in issues per cycle (int)
      fp_sub =  1 ;  in issues per cycle (int)
      fp_mul =  1 ;  in issues per cycle (int)
      fp_div =  1 ;  in issues per cycle (int)                   
      fp_mul_add =  1 ;  in issues per cycle (int)
      fp_mul_sub =  1 ;  in issues per cycle (int)      
      fp_div_add =  1 ;  in issues per cycle (int)
      fp_div_sub =  1 ;  in issues per cycle (int)

[FUSABLE_OPERATIONS]
;Every entry here must have an entry in INTRINSIC_ISSUE_RATE and INTRINSIC_LATENCY
;Entries are case insensitive
;The '_' is necessary to separate the ops that are fused.

      fp_mul_add
      fp_mul_sub
      fp_div_add
      fp_div_sub
            
[INACTIVE_FUNCTIONS]
	exit
	creat
	open
	close
	write
	printf
	fprintf
	vfprintf
	vsprintf
	scanf
	sscanf
	fscanf
	fclose
	fflush
	va_start
	va_arg
	va_end
	gnu_dev_major
	gnu_dev_minor
	gnu_dev_makedev

Modifying Architectural Parameters

The following rules modify the architecture that is presented to PBound, which in turn affects the bounds that PBound generates.

Modifying Read and Write instruction counts

This is performed by modifying the values for fused_loads and fused_stores. A value of 1 means that the architecture does not support fused loads and stores.
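As a rough illustration of this parameter, the sketch below models the floating-point load count of the stream example as a function of fused_loads. It is one reading that reproduces the generated expression 2 * (2000 / 2) + 1 for the default fused_loads = 2, and the actual PBound model may differ. The function names are hypothetical, and attributing the trailing + 1 to the load of scalar is an assumption.

```c
#include <assert.h>

/* Number of load (or store) instructions needed for `accesses` memory
 * accesses when `fuse_width` of them can be fused into one instruction.
 * Integer division mirrors generated expressions such as 2000 / 2.
 * Hypothetical helper for illustration; not part of PBound. */
long fused_access_count(long accesses, long fuse_width)
{
    assert(fuse_width >= 1);  /* 1 means no fusing is available */
    return accesses / fuse_width;
}

/* Floating-point loads for the stream loop: two arrays (b and c) are read
 * over n iterations, plus one extra load assumed to be for `scalar`. */
long stream_fp_loads(long n, long fused_loads)
{
    return 2 * fused_access_count(n, fused_loads) + 1;
}
```

Under this reading, stream_fp_loads(2000, 2) gives 2001, matching the example output, while setting fused_loads to 1 would raise the count to 4001.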

Modifying IOP/FLOP counts

SIMD is supported if the value of vector_ops is true. The vector_length value sets how many operations each SIMD instruction performs simultaneously.

The fusable/intrinsic operations supported by the architecture are listed in the section FUSABLE_OPERATIONS. Fusing instructions reduces the IOP/FLOP count by 1 for each pair of fused instructions.
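The combined effect of fusing and SIMD on the FLOP count can be sketched as follows. The function flop_count is hypothetical, and the order in which fusing and vectorization are applied (as well as the rounding) is an assumption chosen because it reproduces the 2000 / 2 FLOP expression from the stream example. PBound's internal model may differ.

```c
#include <assert.h>

/* Hypothetical FLOP model for a loop of n iterations:
 *   ops_per_iter          floating-point operations in the loop body
 *   fused_pairs_per_iter  pairs that match an entry in FUSABLE_OPERATIONS
 *                         (each fused pair lowers the count by 1)
 *   vector_length         operations performed by one SIMD instruction
 *                         (use 1 when vector_ops is false) */
long flop_count(long n, long ops_per_iter, long fused_pairs_per_iter,
                long vector_length)
{
    long per_iter;

    assert(vector_length >= 1);
    assert(2 * fused_pairs_per_iter <= ops_per_iter);

    per_iter = ops_per_iter - fused_pairs_per_iter;
    return (n * per_iter) / vector_length;
}
```

The stream loop body a[j] = b[j] + scalar * c[j] has one multiply and one add per iteration. If they fuse as fp_mul_add and vector_length is 2, flop_count(2000, 2, 1, 2) gives 1000, the FLOPs value in the example output. Without fusing or SIMD the same loop would count 4000.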

Command line options

If the command-line option -pbound:(slice|s) is set to true, PBound slices the input code so that only the calls to pboundLogInsert() remain in the generated output. This allows the generated code to be executed without the overhead of the computation being analyzed.

Documents

S. H. K. Narayanan, B. Norris, and P. D. Hovland, "Generating Performance Bounds from Source Code,"  Preprint ANL/MCS-P1685-1009, October 2009.

Doxygen Documentation