Background
From the beginning, parallel-netcdf has taken an MPI_Info object in its create and open routines. Until now this object has been used only to pass hints down to the MPI-IO layer, but nothing precluded using those hints at the pnetcdf layer. Version 1.1.0 adds a feature that uses an MPI info hint to align the starting location of non-record variables.
Usage
Set the MPI Info hint "striping_unit" to the desired alignment.
MPI_Info_set(info, "striping_unit", "4194304");
ncmpi_create(MPI_COMM_WORLD, "filename.nc", mode, info, &ncid);
Why use this?
Two scenarios come to mind.
First, if you are writing to a block-based parallel file system, such as IBM's GPFS, then an application write becomes a block write at the file system layer. If a write straddles two blocks, then locks must be acquired for both blocks. Aligning the start of a variable to a block boundary, combined with collective I/O optimizations in the MPI-IO library, can often eliminate all unaligned file system accesses.
Second, since we achieve alignment of variables by padding out to the next block boundary, we end up with extra space between the end of the header describing the entire dataset and the first variable of that dataset. If you have an application that periodically adds more variables to a dataset, each addition typically results in an expensive move of the entire data file to make room for the definition of the new variable. Even if you do not align to block boundaries for the file system, setting the alignment value to something big enough to accommodate any additional variables means you can leave your application code as-is and still see tremendous performance improvements.
Example scenarios
Alignment for file system: Argonne's BlueGene system has a GPFS file system with a 4 MB block size. By setting the "striping_unit" hint to 4 MB (4194304 bytes), pnetcdf will round up the starting offset of all non-record variables to the next 4 MB boundary.
Padding header for future growth: An application creates a checkpoint file, but has structured the code so that each component writes its own information to the checkpoint file. Since we don't know 100% of the information at the initial define mode time, this application has to call ncmpi_redef() for each component. If this call results in a bigger header on disk, then pnetcdf initiates a very expensive data movement step. However, the application could make use of this non-record variable alignment hint to ensure there is enough room for the header to grow and accommodate additional variables. The hint need not necessarily be the size of the file system block, though some rational fraction of the file system blocksize is probably a good idea. For example, on Argonne's GPFS (with an fs blocksize of 4MB), setting the "striping_unit" hint to 128k will leave room for an enormous header while still making it possible to avoid a few unaligned file system accesses.
Limitations
This hint will have no impact on the alignment of record variables. I (robl) tried but could not get the correctness tests to pass. Patches welcome!