Version 16 (modified by hartono, 15 years ago)

Performance Tuning Specifications of Orio

In addition to the quick start guide to Orio's performance tuner presented on Orio's main webpage, this documentation explains tuning specifications in more detail so that users can fully benefit from Orio's automatic tuning feature.

Example

Below is a concrete illustration of what Orio's tuning specifications look like.

# total number of registers
let NUM_REGS = 128;               

# 64MB L1 cache
let L1_CACHE_SIZE = 64*(2**20);   

def build {                                                             
  arg command = 'icc';                                                                                 
  arg options = '-fast -parallel';                                                                                 
}    

def performance_params {
  param TileSize1[] = [1,32,64,128,256,512];
  param TileSize2[] = [1,32,64,128,256,512];
  param UnrollFactor1[] = range(1,33);
  param UnrollFactor2[] = range(1,33);
  constraint RegisterCapacity = UnrollFactor1 * UnrollFactor2 * 9 <= NUM_REGS;
  constraint L1Tiling = TileSize1 * TileSize2 <= L1_CACHE_SIZE;
}

def input_params {
  let SIZES = [100,1000,2000,4000,8000];
  param M[] = SIZES;
  param N[] = SIZES;
  constraint SquareShape = M == N;
}

def input_vars {
  decl dynamic double X[M] = 0;
  decl dynamic double Y[N] = random;
  decl static double A[M][N] = random;
  decl double C = random;
}

Structure of Tuning Specifications

The tuning specifications of Orio consist of a sequence of definition statements. Each definition statement contains a series of auxiliary statements, which fall into the following five types.

  1. A let statement stores a value in a variable that later statements can reuse. A let statement need not appear inside the body of a definition statement, as the example above shows.
  2. An argument statement collects specific information from the Orio user about the relevant tuning components. Examples shown above are the command and options arguments (in the build definition), whose role is to tell Orio how to compile and execute the optimized code.
  3. A parameter statement assigns a range of values to a tuning parameter, which can be either a performance parameter or an input problem parameter. The symbol [] must be placed after the parameter name to indicate that the parameter has multiple values to be considered.
  4. A constraint statement primarily prunes uninteresting portions of the space of parameter values so that the search concentrates on the region most likely to yield high-quality solutions. Examples are the RegisterCapacity and L1Tiling constraints. A constraint statement also lets users define the shape of the input arrays, as the SquareShape constraint in the earlier example does.
  5. A declaration statement informs the performance testing driver about all input scalars and arrays that must be declared and initialized. The static and dynamic keywords tell the driver how to allocate memory for the declared arrays.

Declarations and Initializations of Input Variables

As mentioned above, all input variables involved in the core computation must be specified in the input_vars definition statement so that the performance testing driver can generate code for both the declarations and the initializations of the input variables. However, declaring and initializing input variables can become complicated, especially for multidimensional arrays with special properties such as upper/lower triangular matrices and anti-symmetric matrices. Orio therefore offers its users three alternatives for declaring and initializing input variables.

  1. Both declarations and initializations are generated by the driver.
    def input_vars {
      decl static double X[N][N] = 0;
    }
    
  2. Declarations are generated by the driver, whereas initializations are written by the user. Note that in this case none of the declaration statements may assign an initial value.
    def input_vars {
      decl static double X[N][N];
      arg init_file = 'init_code.c';
    }
    
  3. Both declarations and initializations are written by the user.
    def input_vars {
      arg decl_file = 'decl_code.h';
      arg init_file = 'init_code.c';
    }
    

The following is the content of the decl_code.h file.

double X[N][N];

And the code of the init_code.c file is displayed below.

void init_input_vars() {
  int i,j;
  for (i=0; i<=N-1; i++)
    for (j=0; j<=N-1; j++)
      if (i < j)
        X[i][j] = (i+j)%10 + 1;
      else if (i == j)
        X[i][j] = 1;
      else
        X[i][j] = 0;
}

One prerequisite of a user-provided initialization program is that the input variables’ initializations must be enclosed inside a function named init_input_vars; otherwise, Orio will report an error message.

Overriding the Performance-Testing Skeleton Code

Orio lets its users modify exactly how the performance testing process is carried out. First, let us look at the basic C skeleton code that Orio uses to measure the performance of the optimized code.

#include <stdio.h> 
#include <stdlib.h> 
#include <sys/time.h> 

/*@ global @*/ 

double rtclock() 
{ 
  struct timezone tzp; 
  struct timeval tp; 
  int stat; 
  gettimeofday(&tp, &tzp); 
  return (tp.tv_sec + tp.tv_usec*1.0e-6); 
} 

int main() 
{ 
  /*@ prologue @*/ 

  double orio_t_start=0, orio_t_end=0, orio_t_total=0; 
  int orio_i; 

  for (orio_i=0; orio_i<REPS; orio_i++) 
  { 
    orio_t_start = rtclock(); 
     
    /*@ tested code @*/ 

    orio_t_end = rtclock(); 
    orio_t_total += orio_t_end - orio_t_start; 
  } 
   
  orio_t_total = orio_t_total / REPS; 
  printf("%f\n", orio_t_total); 
   
  /*@ epilogue @*/ 

  return 0; 
}

Four important tags exist in the skeleton code, each with its own purpose. The '/*@ global @*/' tag is where the driver places global declarations, such as the input variable declarations and the initialization function of the input arrays. The '/*@ prologue @*/' tag is where memory for the input arrays is dynamically allocated and where the function that initializes all input variables is called. The '/*@ tested code @*/' tag designates the location of the code whose performance is tested. The last tag, '/*@ epilogue @*/', is currently unused but is kept for potential future development.

To change the performance testing process, users can override the performance-testing skeleton code with their own. This is done by specifying the user-written skeleton code in the tuning specifications as shown below. The new skeleton code must also contain the four tags mentioned above; otherwise the performance testing driver will complain.

def performance_test_code { 
  arg skeleton_code_file = 'my_skeleton_code.c';  
} 

One advantage of overriding the performance-testing skeleton code is finer-grained performance measurement, for example counting processor clock cycles instead of measuring elapsed time (through the POSIX gettimeofday function). Doing this requires installing a performance counter tool such as TAU or PAPI. Orio users then need to revise the skeleton code by inserting code that initializes the performance counter library and sets up the event counters.