wiki:Orio

Version 42 (modified by hartono, 15 years ago) (diff)

--

ORIO

An annotation-based performance optimization tool


Overview

Orio is an extensible annotation system, implemented in Python, that aims at improving both performance and productivity by enabling software developers to insert annotations into their source code (in C/C++) that trigger a number of low-level performance optimizations on a specified code fragment. The tool generates many tuned versions of the same operation using different optimization parameters, and performs an empirical search for selecting the best among multiple optimized code variants.

Download

Orio 0.0.1 (alpha)

Software Requirements

The requirement of instaling and using Orio is Python, which is widely available in any Linux/Unix? distribution. Orio has been tested successfully with Python 2.5.1 (or any later version).

Quick Install

The Orio installation follows the standard Python Module Distribution Utilities, or Disutils for short.

For users who want to quickly install Orio to the standard locations of third-party Python modules (requiring superuser privileges in a Unix system), the installation is straightforward as shown below.

% tar -xvzf orio.tar.gz
% cd orio
% python setup.py install

On a Unix platform, the above install command will normally put an orcc script in the /usr/bin location, and also create an orio module directory in the /usr/lib/python2.X/site-packages location.

To test whether Orio has been properly installed in your system, try to execute orcc command as given below as an example.

% orcc --help

description: compile shell for Orio

usage: orcc [options] <ifile> 
  <ifile>   input file containing the annotated code

options:
  -h, --help                     display this message
  -o <file>, --output=<file>     place the output to <file>
  -v, --verbose                  verbosely show details of the results of the running program

In order to install Orio to an alternate location, users need to supply a base directory for the installation. For instance, the following command will install an orcc script under /home/username/bin, and also put an orio module under /home/username/lib/python/site-packages.

% tar -xvzf orio.tar.gz
% cd orio
% python setup.py install --prefix=/home/username

It is also important to ensure that the installed Orio module location is included in the PYTHONPATH environment variable. Similarly, users can optionally include the installed orcc script location in the PATH shell variable. To do this for the above example, the following two lines can be added in the .bashrc configuration file (assuming the user uses Bash shell, of course).

export PYTHONPATH=$PYTHONPATH:/home/username/lib/python/site-packages
export PATH=$PATH:/home/username/bin

Structure of Orio Framework

The picture shown below depicts at a high level the structure and the optimization process of the Orio framework.





As the simplest case, Orio can be used to speed up code performance by performing a source-to-source transformation such as loop unrolling and memory alignment optimizations. First, Orio takes a C/C++ code as input, which contains syntactically structured comments utilized to express various performance-tuning directives. Orio scans the annotated input code and extracts all annotation regions. Each annotation region is then passed to the code transformation modules for potential optimizations. Next, the code generator produces the final code with various optimizations being applied.

Furthermore, Orio can also be used as an automatic performance tuning tool. The system uses its code transformation modules and code generator to generate an optimized code version for each distinct combination of performance parameters. And then, the optimized code version is empirically executed and measured for its performance, which is subsequently compared to the performances of other previously tested code variants. After iteratively evaluating all code variants under consideration, the best-performing code is generated as the final output of Orio.

Because the space of all possible optimized code versions can be exponentially large, an exhaustive exploration of the search space becomes impractical. Therefore, several search heuristics are implemented in the search engine component to find a code variant with near-optimal performance.

The tuning specifications, written by users in the form of annotations, are parsed and used by Orio to guide the search and tuning process. These specifications include important information such as the underlying machine characteristics, the used compilers, the search strategy, the transformation parameters, the input data size, and so on.

User Guide

As previously discussed , Orio has two main functions: a source-to-source transformation tool and an automatic performance tuning tool. In the following subsections, simple examples are provided to offer users the quickest way to begin using Orio. But first, a brief introduction to the annotation language syntax is presented next.

Annotation Language Syntax

Orio annotation is denoted as a stylized C/C++ comment that starts with '/*@' and ends with @*/. For instance, the annotation /*@ end @*/ is used to indicate the end of an annotated code region. The following simple grammar illustrates the fundamental structure of Orio annotations.

<annotation-region> ::= <leader-annotation> <annotation-body> <trailer-annotation>
<leader-annotation> ::= /*@ begin <module-name> ( <module-body> ) @*/
<trailer-annotation> ::= /*@ end @*/

An annotation region consists of three main parts: leader annotation, annotation body, and trailer annotation. The annotation body can either be empty or contain C/C++ code that may include other nested annotation regions. A leader annotation contains the module name of the code transformation component that is loaded dynamically by Orio. A high level abstraction of the computation and the performance hints are coded in the module body inside the leader annotation and are used as input by the transformation module during the transformation and code generation phases. A trailer annotation, which has a fixed form (i.e. /*@ end @*/), closes an annotation region.

A concrete example of an annotated application code can be seen in the next subsection.

Using Orio as a Source-to-Source Code Transformation Tool

Orio has several code transformation module that have already been implemented and are ready to use. One of the transformation modules is loop unrolling, which is a loop optimization that aims to increase register reuse and to reduce branching instructions by combining instructions that are executed in multiple loop iterations into a single iteration. The below sample code demonstrates how to annotate an application code with a simple portable loop unrolling optimization, where the unroll factor used in this example is four. The original code to be optimized in this example is commonly known as AXPY-4, which is an extended version of the AXPY Basic Liner Algebra Subprogram.

/*@ begin Loop ( 
    transform Unroll(ufactor=4) 
    for (i=0; i<=N-1; i++)
      y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
) @*/
for (i=0; i<=N-1; i++)
   y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
/*@ end @*/

In order to apply loop unrolling to the above code, run the following Orio command (assuming that the annotated code is stored in the file axpy4.c).

% orcc axpy4.c

By default, the transformed output code is written to the file _axpy4.c. Optionally, users can specify the name of the output file using the command option -o <file>. Below is how the output code looks like.

/*@ begin Loop ( 
    transform Unroll(ufactor=4) 
    for (i=0; i<=N-1; i++)
      y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
) @*/
#if ORIGCODE
  for (i=0; i<=N-1; i++)
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
#else
  for (i=0; i<=N-4; i=i+4) {
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
    y[i+1] = y[i+1] + a1*x1[i+1] + a2*x2[i+1] + a3*x3[i+1] + a4*x4[i+1];
    y[i+2] = y[i+2] + a1*x1[i+2] + a2*x2[i+2] + a3*x3[i+2] + a4*x4[i+2];
    y[i+3] = y[i+3] + a1*x1[i+3] + a2*x2[i+3] + a3*x3[i+3] + a4*x4[i+3];
  }
  for (; i<=N-1; i=i+1) 
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
#endif
/*@ end @*/

In this AXPY-4 example, the name of the code transformation module used to perform loop unrolling is Loop. The AXPY-4 computation is rewritten in the module body along with the loop unrolling performance hints (i.e. an unroll factor of four). The resulting unrolled code comprises two loops: one loop with the fully unrolled body, and another loop for any remaining iterations that are not executed in the unrolled loop. Additionally, the generated code include the original code (initially written in the annotation body area) that can be executed through setting the ORIGCODE preprocessor variable accordingly.

Using Orio as an Automatic Performance Tool

To enhance the performance of a program on target architecture, most compilers select the optimal values of program transformation parameters using analytical models. In contrast, Orio adaptively generates a large number of code candidates with different parameter values for a given computation, followed by empirical executions of these code variants on the target machine. Then the code that yields the best performance is chosen. Orio automates such empirical performance tuning process using annotations, exemplified in the following simple program.

/*@ begin PerfTuning (                                                                                 
 def build {                                                                                           
   arg command = 'gcc';                                                                                 
   arg options = '-O3';                                                                                 
 }                                                                                                     
 def performance_params {                                                                              
   param UF[] = range(1,33);                                                                      
 }                                                                                                     
 def input_params {                                                                                    
   param N[] = [1000,10000000];                                                                         
 }                                                                                                     
 def input_vars {                                                                                      
   decl static double y[N] = 0;                                                                         
   decl double a1 = random;                                                                             
   decl double a2 = random;                                                                             
   decl double a3 = random;                                                                             
   decl double a4 = random;                                                                             
   decl static double x1[N] = random;                                                                   
   decl static double x2[N] = random;                                                                   
   decl static double x3[N] = random;                                                                   
   decl static double x4[N] = random;                                                                   
 }                                                                                                     
) @*/
int i;
/*@ begin Loop (                                                                                       
    transform Unroll(ufactor=UF)                                                                       
    for (i=0; i<=N-1; i++)                                                                             
      y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];                                         
) @*/
for (i=0; i<=N-1; i++)
  y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
/*@ end @*/
/*@ end @*/

The tuned application in the given example is the same AXPY-4 used in the earlier subsection. The goal of the tuning process is to determine the most optimal value of the unroll factor parameter for different problem sizes. The code located in the PerfTuning module body section defines the tuning specifications that include the following four definitions:

  • build: to specify all information needed for compiling and executing the optimized code
  • performance_params: to specify values of parameters used in the program transformations
  • input_params: to specify sizes of the input problem
  • input_vars: to specify both the declarations and the initializations of the input variables

So in this example, the transformed AXPY-4 code is compiled using GCC compiler with the -O3 option to activate all its optimizations. The unroll factor values under consideration extends over integers from 1 to 32, inclusively. The AXPY-4 computation is tuned for two distinct problem sizes: N=1K and N=10M. Also, all scalars and arrays involved in the computation must be declared and initialized in the tuning specifications to enable the performance testing driver to empirically execute the optimized code. It is to be noted that the static and dynamic keywords provide guidance to the performance testing driver as it allocates memory space for the declared arrays.

-- output files???

Using Different Search Space Exploration Strategies

Due to the huge search space of the parameter values, a proper search heuristic becomes a critical component of an empirical tuning system. Hence, in addition to an exhaustive search and a random search, two effective and practical search heuristic strategies have been developed and integrated into the Orio’s search engine. These heuristics include the Nelder-Mead Simplex method and Simulated Annealing method.

Under construction

Developer Guide

Under construction

Papers

  • Boyana Norris, Albert Hartono, and William Gropp. "Annotations for Productivity and Performance Portability," in Petascale Computing: Algorithms and Applications. Computational Science. Chapman & Hall / CRC Press, Taylor and Francis Group, 2007. Preprint ANL/MCS-P1392-0107. (bib, pdf)

Old Documentations (To be Removed)

  1. Performance Study?

Attachments