Version 54 (modified by hartono, 15 years ago) (diff)



An annotation-based performance optimization tool


Orio is an extensible annotation system, implemented in Python, that aims at improving both performance and productivity by enabling software developers to insert annotations into their source code (in C/C++) that trigger a number of low-level performance optimizations on a specified code fragment. The tool generates many tuned versions of the same operation using different optimization parameters, and performs an empirical search for selecting the best among multiple optimized code variants.


Orio 0.0.1 (alpha)

Software Requirements

The requirement of instaling and using Orio is Python, which is widely available in any Linux/Unix? distribution. Orio has been tested successfully with Python 2.5.1 (or any later version).

Quick Install

The Orio installation follows the standard Python Module Distribution Utilities, or Disutils for short.

For users who want to quickly install Orio to the standard locations of third-party Python modules (requiring superuser privileges in a Unix system), the installation is straightforward as shown below.

% tar -xvzf orio.tar.gz
% cd orio
% python install

On a Unix platform, the above install command will normally put an orcc script in the /usr/bin location, and also create an orio module directory in the /usr/lib/python2.X/site-packages location.

To test whether Orio has been properly installed in your system, try to execute orcc command as given below as an example.

% orcc --help

description: compile shell for Orio

usage: orcc [options] <ifile> 
  <ifile>   input file containing the annotated code

  -h, --help                     display this message
  -o <file>, --output=<file>     place the output to <file>
  -v, --verbose                  verbosely show details of the results of the running program

In order to install Orio to an alternate location, users need to supply a base directory for the installation. For instance, the following command will install an orcc script under /home/username/bin, and also put an orio module under /home/username/lib/python/site-packages.

% tar -xvzf orio.tar.gz
% cd orio
% python install --prefix=/home/username

It is also important to ensure that the installed Orio module location is included in the PYTHONPATH environment variable. Similarly, users can optionally include the installed orcc script location in the PATH shell variable. To do this for the above example, the following two lines can be added in the .bashrc configuration file (assuming the user uses Bash shell, of course).

export PYTHONPATH=$PYTHONPATH:/home/username/lib/python/site-packages
export PATH=$PATH:/home/username/bin

Structure of Orio Framework

The picture shown below depicts at a high level the structure and the optimization process of the Orio framework.

As the simplest case, Orio can be used to speed up code performance by performing a source-to-source transformation such as loop unrolling and memory alignment optimizations. First, Orio takes a C/C++ code as input, which contains syntactically structured comments utilized to express various performance-tuning directives. Orio scans the annotated input code and extracts all annotation regions. Each annotation region is then passed to the code transformation modules for potential optimizations. Next, the code generator produces the final code with various optimizations being applied.

Furthermore, Orio can also be used as an automatic performance tuning tool. The system uses its code transformation modules and code generator to generate an optimized code version for each distinct combination of performance parameters. And then, the optimized code version is empirically executed and measured for its performance, which is subsequently compared to the performances of other previously tested code variants. After iteratively evaluating all code variants under consideration, the best-performing code is generated as the final output of Orio.

Because the space of all possible optimized code versions can be exponentially large, an exhaustive exploration of the search space becomes impractical. Therefore, several search heuristics are implemented in the search engine component to find a code variant with near-optimal performance.

The tuning specifications, written by users in the form of annotations, are parsed and used by Orio to guide the search and tuning process. These specifications include important information such as the used compilers, the search strategy, the program transformation parameters, the input data sizes, and so on.

User Guide

As previously discussed , Orio has two main functions: a source-to-source transformation tool and an automatic performance tuning tool. In the following subsections, simple examples are provided to offer users the quickest way to begin using Orio. But first, a brief introduction to the annotation language syntax is presented next.

Annotation Language Syntax

Orio annotation is denoted as a stylized C/C++ comment that starts with '/*@' and ends with @*/. For instance, the annotation /*@ end @*/ is used to indicate the end of an annotated code region. The following simple grammar illustrates the fundamental structure of Orio annotations.

<annotation-region> ::= <leader-annotation> <annotation-body> <trailer-annotation>
<leader-annotation> ::= /*@ begin <module-name> ( <module-body> ) @*/
<trailer-annotation> ::= /*@ end @*/

An annotation region consists of three main parts: leader annotation, annotation body, and trailer annotation. The annotation body can either be empty or contain C/C++ code that may include other nested annotation regions. A leader annotation contains the module name of the code transformation component that is loaded dynamically by Orio. A high level abstraction of the computation and the performance hints are coded in the module body inside the leader annotation and are used as input by the transformation module during the transformation and code generation phases. A trailer annotation, which has a fixed form (i.e. /*@ end @*/), closes an annotation region.

A concrete example of an annotated application code can be seen in the next subsection.

Using Orio as a Source-to-Source Code Transformation Tool

Orio has several code transformation module that have already been implemented and are ready to use. One of the transformation modules is loop unrolling, which is a loop optimization that aims to increase register reuse and to reduce branching instructions by combining instructions that are executed in multiple loop iterations into a single iteration. The below sample code demonstrates how to annotate an application code with a simple portable loop unrolling optimization, where the unroll factor used in this example is four. The original code to be optimized in this example is commonly known as AXPY-4, which is an extended version of the AXPY Basic Liner Algebra Subprogram.

/*@ begin Loop ( 
    transform Unroll(ufactor=4) 
    for (i=0; i<=N-1; i++)
      y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
) @*/
for (i=0; i<=N-1; i++)
   y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
/*@ end @*/

In order to apply loop unrolling to the above code, run the following Orio command (assuming that the annotated code is stored in the file axpy4.c).

% orcc axpy4.c

By default, the transformed output code is written to the file _axpy4.c. Optionally, users can specify the name of the output file using the command option -o <file>. Below is how the output code looks like.

/*@ begin Loop ( 
    transform Unroll(ufactor=4) 
    for (i=0; i<=N-1; i++)
      y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
) @*/
  for (i=0; i<=N-1; i++)
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
  for (i=0; i<=N-4; i=i+4) {
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
    y[i+1] = y[i+1] + a1*x1[i+1] + a2*x2[i+1] + a3*x3[i+1] + a4*x4[i+1];
    y[i+2] = y[i+2] + a1*x1[i+2] + a2*x2[i+2] + a3*x3[i+2] + a4*x4[i+2];
    y[i+3] = y[i+3] + a1*x1[i+3] + a2*x2[i+3] + a3*x3[i+3] + a4*x4[i+3];
  for (; i<=N-1; i=i+1) 
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
/*@ end @*/

In this AXPY-4 example, the name of the code transformation module used to perform loop unrolling is Loop. The AXPY-4 computation is rewritten in the module body along with the loop unrolling performance hints (i.e. an unroll factor of four). The resulting unrolled code comprises two loops: one loop with the fully unrolled body, and another loop for any remaining iterations that are not executed in the unrolled loop. Additionally, the generated code include the original code (initially written in the annotation body area) that can be executed through setting the ORIGCODE preprocessor variable accordingly.

Using Orio as an Automatic Performance Tool

To enhance the performance of a program on target architecture, most compilers select the optimal values of program transformation parameters using analytical models. In contrast, Orio adaptively generates a large number of code candidates with different parameter values for a given computation, followed by empirical executions of these code variants on the target machine. Then the code that yields the best performance is chosen. Orio automates such empirical performance tuning process using annotations, as exemplified in the following simple program.

/*@ begin PerfTuning (                                                                                 
 def build {                                                                                           
   arg command = 'gcc';                                                                                 
   arg options = '-O3';                                                                                 
 def performance_params {                                                                              
   param UF[] = range(1,33);                                                                      
 def input_params {                                                                                    
   param N[] = [1000,10000000];                                                                         
 def input_vars {                                                                                      
   decl static double y[N] = 0;                                                                         
   decl double a1 = random;                                                                             
   decl double a2 = random;                                                                             
   decl double a3 = random;                                                                             
   decl double a4 = random;                                                                             
   decl static double x1[N] = random;                                                                   
   decl static double x2[N] = random;                                                                   
   decl static double x3[N] = random;                                                                   
   decl static double x4[N] = random;                                                                   
) @*/
int i;
/*@ begin Loop (                                                                                       
    transform Unroll(ufactor=UF)                                                                       
    for (i=0; i<=N-1; i++)                                                                             
      y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];                                         
) @*/
for (i=0; i<=N-1; i++)
  y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
/*@ end @*/
/*@ end @*/

The tuned application in the given example is the same AXPY-4 used in the earlier subsection. The goal of the tuning process is to determine the most optimal value of the unroll factor parameter for different problem sizes. The code located in the PerfTuning module body section defines the tuning specifications that include the following four definitions:

  • build: to specify all information needed for compiling and executing the optimized code
  • performance_params: to specify values of parameters used in the program transformations
  • input_params: to specify sizes of the input problem
  • input_vars: to specify both the declarations and the initializations of the input variables

So in this example, the transformed AXPY-4 code is compiled using GCC compiler with the -O3 option to activate all its optimizations. The unroll factor values under consideration extends over integers from 1 to 32, inclusively. The AXPY-4 computation is tuned for two distinct problem sizes: N=1K and N=10M. Also, all scalars and arrays involved in the computation are declared and initialized in the tuning specifications to enable the performance testing driver to empirically execute the optimized code. It is to be noted that the static and dynamic keywords provide guidance to the performance testing driver on how it should allocate memory space for the declared arrays.

As discussed before, Orio performance tuning is performed for each different problem size. The number of generated programs is therefore equivalent to the number of distinct combinations of input problem sizes. So, there are two generated program outputs in the AXPY-4 example. Using the default file naming convention, _axpy_N_1000.c and _axpy_N_10000000.c output files represent the outcomes of Orio optimization process for input sizes N=1K and N=10M, respectively.

Selecting Parameter Space Exploration Strategy

A conceptually straightforward approach to exploring the space of the parameter values is via an exhaustive search procedure. However, this exhaustive approach often becomes infeasible because the size of the search space can be exponentially large. Hence, a proper search heuristic becomes a critical component of an empirical tuning system. In addition to an exhaustive search and a random search, two effective and practical search heuristic strategies have been developed and integrated into the Orio’s search engine. These heuristics include the Nelder-Mead Simplex method and Simulated Annealing method. The exhaustive approach is selected as the default space exploration method of Orio. However, Orio user can indicate his preferred search strategy in the tuning specifications, as exemplified in the search definition below.

def search {
 arg algorithm = 'Simplex';  
 arg time_limit = 10;
 arg total_runs = 10;
 arg simplex_local_distance = 2;
 arg simplex_reflection_coef = 1.0;
 arg simplex_expansion_coef = 1.5;
 arg simplex_contraction_coef = 0.5;
 arg simplex_shrinkage_coef = 0.7;

Orio users can also specify the terminating criteria of the search strategies by providing values to the arguments time_limit and total_runs. If the search time exceeds the specified time limit, the search is suspended and then Orio returns the best optimized code so far. The total number of runs enforces the search to finish in a specific quantity of full search moves. So, the example above indicates that the Simplex search method must terminate within ten-minute time constraint and within ten search convergences.

A search technique sometimes has several parameters that need to be specified. For instance, the Nelder-Mead Simplex algorithm necessitates four kinds of coefficients: reflection, expansion, contraction, and shrinkage; and all of these coefficients have default values already defined in the Orio implementation. To alter the values of these algorithm-specific parameters, users can optionally specify them in the tuning specifications. In the example presented above, all arguments with names that start with simplex_ are called search parameters specifically designed to steer the Simplex algorithm.

To further improve the quality of the search result, each search heuristic is enhanced by applying a local search after the search completes. The local search compares the best performance with neighboring coordinates. If a better coordinate is discovered, the local search continues recursively until no further improvement is possible. In the previous example, users can adjust the distance of the local search by modifying the value of the argument simplex_local_distance. A local distance of two implies that the local search examines the performances of all neighbors within a distance of two. It is important to note that the local search is turned off by default for all search heuristics. Thus to activate the local search, Orio users must explicitly assign a positive integer value to the local_distance algorithm-specific argument.

The following table lists information about the search techniques implemented in the Orio's search engine.

Search technique Keyword Algorithm-specific argument Default value Argument description
Exhaustive 'Exhaustive' - - -
Random 'Random' local_distance 0 the maximum distance of neighboring coordinates considered by the local search
Nelder-Mead simplex 'Simplex' local_distance
the maximum distance of neighboring coordinates considered by the local search
the amplitude/intensity of the reflection move
the amplitude/intensity of the expansion move
the amplitude/intensity of the contraction move
the amplitude/intensity of the shrinkage move
Simulated annealing 'Annealing' local_distance
the maximum distance of neighboring coordinates considered by the local search
the temperature reduction factor
the percentage of the termination temperature
the maximum limit of numbers of search trials at each temperature
the maximum limit of numbers of successful search moves at each temperature

Developer Guide

Under construction


  1. Tuning Specifications


  • Boyana Norris, Albert Hartono, and William Gropp. "Annotations for Productivity and Performance Portability," in Petascale Computing: Algorithms and Applications. Computational Science. Chapman & Hall / CRC Press, Taylor and Francis Group, 2007. Preprint ANL/MCS-P1392-0107. (bib, pdf)

Old Documentation (To be removed soon)

  1. Performance Study