Miscellaneous Machine Notes
Blue Gene/P at Argonne
- ALCF pages
- Simple timers
- PAPI 3.0.9 was ported and installed by IBM in the usual location: /soft/apps/papi-c-3.0.9
- BG/P has true 64-bit counters, and all PAPI counters always start from a true zero (writes to the counters are not allowed)
- 455 total native counters, max multiplexing: 256
- Implementation details: The BG/P hardware performance counters are the values of special CPU registers. Each 64-bit counter is split into a 12-bit low-order portion and a 52-bit high-order portion. The high-order portions are implemented as an SRAM array: the high-order logic sequentially polls the low-order counters and increments the corresponding SRAM word whenever a low-order counter's overflow bit is set. Each counter can be configured to generate an interrupt when a specified threshold value is reached.
- PAPI counter tests:
- available PAPI counters (not including native)
- native BGP counters
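The split-counter scheme described in the implementation details above can be illustrated with a toy model. This is only a sketch of the idea (a 12-bit low-order counter with an overflow bit, plus a 52-bit high-order word updated by polling); the class name `SplitCounter`, its methods, and the way the threshold interrupt is modelled are assumptions made for illustration, not the actual hardware logic or any PAPI API.

```python
LOW_BITS = 12
HIGH_BITS = 52
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1


class SplitCounter:
    """Toy model of one 64-bit counter split into 12 low and 52 high bits."""

    def __init__(self, threshold=None):
        self.low = 0            # fast 12-bit low-order counter
        self.overflow = False   # set when the low-order counter wraps
        self.high = 0           # 52-bit high-order word (the "SRAM" part)
        self.threshold = threshold
        self.interrupts = 0     # number of threshold interrupts raised

    def count(self, events=1):
        """Record events in the low-order counter only."""
        for _ in range(events):
            self.low = (self.low + 1) & LOW_MASK
            if self.low == 0:
                # Assumption: the poller always runs before a second wrap;
                # otherwise 4096 events would be lost in this model.
                self.overflow = True
            if self.threshold is not None and self.value() == self.threshold:
                self.interrupts += 1

    def poll(self):
        """High-order logic: fold a pending overflow into the high word."""
        if self.overflow:
            self.high = (self.high + 1) & HIGH_MASK
            self.overflow = False

    def value(self):
        """Full 64-bit value, including an overflow not yet polled."""
        pending = 1 if self.overflow else 0
        return ((self.high + pending) << LOW_BITS) | self.low


if __name__ == "__main__":
    counter = SplitCounter(threshold=4096)
    for _ in range(5):
        counter.count(1000)   # 1000 events between polls
        counter.poll()
    print(counter.value(), counter.high, counter.low, counter.interrupts)
```

The model makes one property of the design visible: the low-order counter must be polled at least once per 4096 events, which is why the polling is sequential and continuous.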
- Dynamic loading of libraries on BGP for IBM and GCC compiler
- Creating the shared library:
- bgxlc -qpic -c <source files>
- bgxlc -G -o lib<name>.so <list of object files> -ldl
- gcc -fPIC -c <source files>
- gcc -shared -Wl,-soname,lib<name>.so -o lib<name>.so <list of object files> -ldl
- Compile the application with dynamic linking enabled:
- bgxlc -qnostaticlink -c <source files>
- bgxlc -qnostaticlink -o <executable name> <list of object files>
- gcc -fPIC -c <source files>
- gcc -dynamic -o <executable name> <list of object files>
- Set LD_PRELOAD when submitting job
- qsub -t # -n # --env LD_PRELOAD=$LD_PRELOAD:<link to shared object> <executable name>
- Sample test case
- Test program: http://www.mcs.anl.gov/~vbui/cpi.c
- Source for interposer (build shared object): http://www.mcs.anl.gov/~vbui/interposer.c
- Source and Makefiles for example: http://www.mcs.anl.gov/~vbui/dynamic-load.tar.gz
- References
- Porting Open-Source software to BGP: http://www.fz-juelich.de/jsc/datapool/jugene/OSS_on_BGP.pdf
- Building library interposers: http://developers.sun.com/solaris/articles/lib_interposers.html
Various hardware performance parameters (looked up from IBM docs, Bob Walkup's whitepaper, this BG/L paper and its corresponding presentation).
- Clock rate: 850 MHz, so one cycle takes approximately 1/(850 × 10^6) s = 1.17647 ns
- Processing Units
- Single integer unit (FXU)
- Single load/store unit (LSU)
- Special double floating-point unit (DFPU) : 32 primary floating-point registers, 32 secondary floating-point registers; supports both standard PowerPC and SIMD instructions

| Instruction | Latency | Throughput/cycle |
| --- | --- | --- |
| fadd | 5 cycles | 1 |
| fmadd | 5 cycles | 1 |
| fpmadd | 5 cycles | 1 |
| fdiv | 30 cycles | 1/30 |

- Example cycle counts for other operations:
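
| Function | libm.a | libmass.a | libmassv.a |
| --- | --- | --- | --- |
| exp | 185 | 64 | 22 |
| log | 320 | 80 | 25 |
| pow | 460 | 176 | 29-48 |
| sqrt | 106 | 46 | 8-10 |
| rsqrt | 136 | … | 6-7 |
| 1/x | 30 (fdiv) | … | 4-5 |
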
- Theoretical flop limit = 1 fpmadd per cycle => 4 floating-point ops per cycle.
- per core: 4 ops/cycle * 0.850 GHz = 3.4 GFLOPS
- per node: 4 cores * 3.4 = 13.6 GFLOPS
- L1 latency: 4 cycles ~ 4.7ns
- L2 latency: 12 cycles ~ 14ns (or 11 cycles ~ 13ns, depending on where we look it up)
- L3 latency: 28/36/40 cycles ~ 33/42.3/47 ns (or from Bob Walkup's whitepaper, ~50 cycles ~ 59 ns)
- Main memory latency: 86 cycles ~ 101.2 ns (or from Bob Walkup's whitepaper, ~104 cycles ~ 122.3 ns)
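The cycle-to-nanosecond conversions and the peak FLOPS figures in the list above follow from the 850 MHz clock. A minimal sketch of that arithmetic is shown here; the function and constant names are made up for illustration, and the breakdown of "4 floating-point ops per cycle" into a 2-wide SIMD fused multiply-add is stated as an assumption in the comments.

```python
CLOCK_HZ = 850e6        # BG/P core clock rate from the notes
CORES_PER_NODE = 4

# Assumption: one fpmadd is a 2-wide SIMD fused multiply-add,
# i.e. 2 lanes * (1 multiply + 1 add) = 4 floating-point ops.
SIMD_WIDTH = 2
FLOPS_PER_FMA = 2
FLOPS_PER_CYCLE = SIMD_WIDTH * FLOPS_PER_FMA


def cycle_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def cycles_to_ns(cycles, clock_hz):
    """Convert a latency in cycles to nanoseconds."""
    return cycles * cycle_ns(clock_hz)


def peak_gflops(clock_hz, flops_per_cycle, cores=1):
    """Theoretical peak in GFLOPS for the given number of cores."""
    return flops_per_cycle * clock_hz * cores / 1e9


if __name__ == "__main__":
    print("cycle (ns):", cycle_ns(CLOCK_HZ))
    print("per core  :", peak_gflops(CLOCK_HZ, FLOPS_PER_CYCLE))
    print("per node  :", peak_gflops(CLOCK_HZ, FLOPS_PER_CYCLE, CORES_PER_NODE))
    for name, cycles in [("L1", 4), ("L2", 12), ("main memory", 86)]:
        print(name, "latency (ns):", cycles_to_ns(cycles, CLOCK_HZ))
```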
From another Bob Walkup paper:
- L1 Data cache : 32 KB total size, 32-Byte line size, 64-way associative, round-robin replacement, write-through for cache coherency, 4-cycle load to use
- L2 Data cache : prefetch buffer, holds 15 128-byte lines, can prefetch up to 7 streams
- L3 Data cache : 2x4 MB, ~50 cycles latency
- on-chip Memory : 2048 MB DDR2 at 425 MHz, ~104 cycles latency, ~16 GB/sec bandwidth limit
- derived peak memory bandwidth: ~0.053 cycles per byte (~18.8 bytes per cycle, from ~16 GB/sec at 850 MHz)
Cookie (Xeon MCS workstation)
Cookie has dual quad-core E5462 Xeon processors (8 cores total) running at 2.8 GHz (1600 MHz FSB) with 32 KB L1 cache, 12 MB of L2 cache (6 MB shared per core pair), and 2 GB of DDR2 FBDIMM RAM, running Linux kernel version 2.6.25 (x86-64).
From the Intel docs on the 5400 series:
The Quad-Core Intel® Xeon® Processor 5400 Series maintains the tradition of compatibility with IA-32 software. Some key features include on-die, primary 32-kB instruction cache and 32-kB write-back data cache in each core and 12 MB (2 x 6 MB) Level 2 cache with Intel® Advanced Smart Cache architecture. The processors’ Data Prefetch Logic speculatively fetches data to the L2 cache before an L1 cache request occurs, resulting in reduced effective bus latency and improved performance. The 1600 MHz Front Side Bus (FSB) is a quad-pumped bus running off a 400 MHz system clock making 12.80 GBytes per second data transfer rates possible.
Various performance parameters (measured with LMBench):
- Clock rate: lmbench measurement is 2799 MHz, 0.3573 nanosec clock (the official rate is 2.8 GHz)
- Peak processor performance
- per core: 4 ops/cycle * 2.8 GHz = 11.2 GFLOPS
- per processor: 4 cores * 11.2 = 44.8 GFLOPS
- whole machine: 2 processors * 44.8 = 89.6 GFLOPS
- L1 (read) latency: 2.8 cycles ~ 1 ns
- L2 (read) latency: 15.1 cycles ~ 5.4 ns
- Main memory (read) latency:
- LMBench: worst case 272 cycles ~ 97.1 ns; average 188 cycles ~ 67 ns;
- Paper: 134.4 cycles ~ 48 ns
- Main memory bandwidth:
- LMBench:
- L1 (32KB data + 32KB instruction/core, lmbench doubles the amount specified, thus we give it 16K/core):
- bw_mem -N 10 -P 4 16384 rd (read, 4 cores): <size,MB/s> = <0.016384, 178574>
- bw_mem -N 10 -P 4 16384 rdwr (read/write, 4 cores): <size,MB/s> = <0.016384, 68088>
- bw_mem -N 10 -P 4 16384 wr (write, 4 cores): <size,MB/s> = <0.016384, 173340>
- L2 (12MB shared among 4 cores, lmbench doubles the amount specified, thus we give it 1536K/core):
- bw_mem -N 10 -P 4 1536k rd (read, 4 cores): <size,MB/s> = <1.57, 88297>
- bw_mem -N 10 -P 4 1536k rdwr (read/write, 4 cores): <size,MB/s> = <1.57, 48510>
- bw_mem -N 10 -P 4 1536k wr (write, 4 cores): <size,MB/s> = <1.57, 68455>
- Main memory:
- bw_mem -P 4 256m rd (read, 4 cores): <size,MB/s> = <268.44, 7668>
- bw_mem -P 1 256m rd (read, 1 core): <size,MB/s> = <268.44, 3585>
- bw_mem -P 4 256m rdwr (read/write, 4 cores): <size,MB/s> = <268.44, 6044>
- bw_mem -P 1 256m rdwr (read/write, 1 core): <size,MB/s> = <268.44, 2754>
- bw_mem -P 4 256m wr (write, 4 cores): <size,MB/s> = <268.44, 6034>
- bw_mem -P 1 256m wr (write, 1 core): <size,MB/s> = <268.44, 2715>
- Notes: on a single core, the read bandwidth is 3585 MB/s, i.e. 3585/2800 = 1.28 bytes per cycle. The theoretical peak is 4 FLOPs per cycle, so any computation that needs more than 1.28 / 4 = 0.32 bytes per FLOP from main memory is memory-bound.
- LMBench stream benchmark:

| STREAM kernel | Latency (ns) | Bandwidth (MB/sec) |
| --- | --- | --- |
| copy | 4.75 | 3371.97 |
| scale | 4.75 | 3367.71 |
| add | 2.31 | 10401.62 |
| triad | 6.77 | 3547.15 |

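The memory-bound rule of thumb from the notes above (0.32 bytes per FLOP on one core) can be written down as a small calculation. This is a rough sketch only: the function names are hypothetical, it uses the single-core lmbench read bandwidth for all traffic, and the two example kernels (a daxpy-style update and a made-up cache-blocked kernel) are illustrative values, not measurements.

```python
READ_BW_MB_S = 3585.0          # lmbench bw_mem rd, 1 core, 256m
PEAK_GFLOPS_PER_CORE = 11.2    # 4 ops/cycle * 2.8 GHz


def machine_balance(bandwidth_mb_s, peak_gflops):
    """Bytes the memory system can deliver per peak FLOP."""
    return (bandwidth_mb_s * 1e6) / (peak_gflops * 1e9)


def is_memory_bound(bytes_per_flop, balance):
    """A kernel needing more bytes per FLOP than the balance is memory-bound."""
    return bytes_per_flop > balance


def attainable_gflops(bytes_per_flop, bandwidth_mb_s, peak_gflops):
    """Upper bound on GFLOPS: the lower of the compute and bandwidth limits."""
    bandwidth_limit = (bandwidth_mb_s * 1e6) / bytes_per_flop / 1e9
    return min(peak_gflops, bandwidth_limit)


if __name__ == "__main__":
    balance = machine_balance(READ_BW_MB_S, PEAK_GFLOPS_PER_CORE)
    print("balance (bytes/FLOP):", balance)

    # Illustrative kernels (assumed traffic, double precision):
    #   daxpy y = a*x + y: 2 FLOPs, 24 bytes moved (load x, load y, store y)
    #   hypothetical blocked kernel: 0.1 bytes per FLOP
    for name, bpf in [("daxpy", 24.0 / 2.0), ("blocked", 0.1)]:
        print(name,
              "memory-bound" if is_memory_bound(bpf, balance) else "compute-bound",
              attainable_gflops(bpf, READ_BW_MB_S, PEAK_GFLOPS_PER_CORE))
```

Under these assumptions a streaming kernel such as daxpy is limited to a small fraction of the 11.2 GFLOPS peak, while a kernel that reuses data in cache can in principle reach it.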
- Operation parallelism ./par_ops

| Operation | Parallelism |
| --- | --- |
| integer bit | 2.96 |
| integer add | 1.99 |
| integer mul | 3.20 |
| integer div | 1.86 |
| integer mod | 2.11 |
| int64 bit | 2.60 |
| int64 add | 1.84 |
| int64 mul | 3.97 |
| int64 div | 1.43 |
| int64 mod | 1.27 |
| float add | 3.00 |
| float mul | 4.15 |
| float div | 1.28 |
| double add | 3.00 |
| double mul | 5.15 |
| double div | 1.17 |

- Results from Intel's version of Linpack (sequential)
Intel(R) LINPACK data

- Current date/time: Mon Jul 27 20:25:19 2009
- CPU frequency: 2.800 GHz
- Number of CPUs: 8
- Number of threads: 8

Parameters are set to:

- Number of tests: 15
- Number of equations to solve (problem size): 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
- Leading dimension of array: 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
- Number of trials to run: 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
- Data alignment value (in Kbytes): 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
- Maximum memory requested that can be used = 16200901024, at the size = 45000

Timing linear equation system solver:

| Size | LDA | Align. | Time(s) | GFlops | Residual | Residual(norm) |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 | 1000 | 4 | 0.065 | 10.3362 | 1.290190e-12 | 4.399880e-02 |
| 1000 | 1000 | 4 | 0.017 | 38.3079 | 1.290190e-12 | 4.399880e-02 |
| 1000 | 1000 | 4 | 0.017 | 38.2627 | 1.290190e-12 | 4.399880e-02 |
| 1000 | 1000 | 4 | 0.017 | 38.3319 | 1.290190e-12 | 4.399880e-02 |
| 2000 | 2000 | 4 | 0.110 | 48.4755 | 5.256129e-12 | 4.572188e-02 |
| 2000 | 2000 | 4 | 0.108 | 49.4378 | 5.256129e-12 | 4.572188e-02 |
| 5000 | 5008 | 4 | 1.363 | 61.1828 | 2.465550e-11 | 3.438011e-02 |
| 5000 | 5008 | 4 | 1.397 | 59.7063 | 2.465550e-11 | 3.438011e-02 |
| 10000 | 10000 | 4 | 9.803 | 68.0264 | 9.107376e-11 | 3.211353e-02 |
| 10000 | 10000 | 4 | 9.669 | 68.9730 | 9.107376e-11 | 3.211353e-02 |
| 15000 | 15000 | 4 | 30.093 | 74.7829 | 2.125512e-10 | 3.347717e-02 |
| 15000 | 15000 | 4 | 30.099 | 74.7676 | 2.125512e-10 | 3.347717e-02 |
| 18000 | 18008 | 4 | 50.985 | 76.2700 | 3.157435e-10 | 3.457779e-02 |
| 18000 | 18008 | 4 | 50.983 | 76.2729 | 3.157435e-10 | 3.457779e-02 |
| 20000 | 20016 | 4 | 69.163 | 77.1242 | 3.972194e-10 | 3.516262e-02 |
| 20000 | 20016 | 4 | 69.187 | 77.0970 | 3.972194e-10 | 3.516262e-02 |
| 22000 | 22008 | 4 | 92.068 | 77.1130 | 4.231772e-10 | 3.099607e-02 |
| 22000 | 22008 | 4 | 92.126 | 77.0642 | 4.231772e-10 | 3.099607e-02 |
| 25000 | 25000 | 4 | 133.120 | 78.2597 | 5.196674e-10 | 2.955162e-02 |
| 25000 | 25000 | 4 | 133.156 | 78.2387 | 5.196674e-10 | 2.955162e-02 |
| 26000 | 26000 | 4 | 149.381 | 78.4485 | 7.240961e-10 | 3.807515e-02 |
| 26000 | 26000 | 4 | 149.490 | 78.3909 | 7.240961e-10 | 3.807515e-02 |
| 27000 | 27000 | 4 | 167.023 | 78.5726 | 6.872751e-10 | 3.351503e-02 |
| 30000 | 30000 | 1 | 227.862 | 79.0032 | 7.597450e-10 | 2.994923e-02 |
| 35000 | 35000 | 1 | 359.141 | 79.5948 | 1.068930e-09 | 3.102943e-02 |
| 40000 | 40000 | 1 | 533.574 | 79.9699 | 1.392360e-09 | 3.096655e-02 |
| 45000 | 45000 | 1 | 758.254 | 80.1236 | 1.692018e-09 | 2.976927e-02 |

Performance Summary (GFlops):

| Size | LDA | Align. | Average | Maximal |
| --- | --- | --- | --- | --- |
| 1000 | 1000 | 4 | 31.3097 | 38.3319 |
| 2000 | 2000 | 4 | 48.9566 | 49.4378 |
| 5000 | 5008 | 4 | 60.4446 | 61.1828 |
| 10000 | 10000 | 4 | 68.4997 | 68.9730 |
| 15000 | 15000 | 4 | 74.7752 | 74.7829 |
| 18000 | 18008 | 4 | 76.2715 | 76.2729 |
| 20000 | 20016 | 4 | 77.1106 | 77.1242 |
| 22000 | 22008 | 4 | 77.0886 | 77.1130 |
| 25000 | 25000 | 4 | 78.2492 | 78.2597 |
| 26000 | 26000 | 4 | 78.4197 | 78.4485 |
| 27000 | 27000 | 4 | 78.5726 | 78.5726 |
| 30000 | 30000 | 1 | 79.0032 | 79.0032 |
| 35000 | 35000 | 1 | 79.5948 | 79.5948 |
| 40000 | 40000 | 1 | 79.9699 | 79.9699 |
| 45000 | 45000 | 1 | 80.1236 | 80.1236 |

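To put the Linpack numbers in context, the reported GFlops can be compared with the 89.6 GFLOPS whole-machine peak computed earlier. The sketch below assumes the standard LINPACK operation count of 2/3·n³ + 2·n² for solving a dense system of size n; the function names are hypothetical, and only a few of the timings reported above are used.

```python
PEAK_GFLOPS = 89.6   # whole machine: 2 processors * 44.8 GFLOPS


def linpack_flops(n):
    """Standard LINPACK operation count for a dense n x n solve."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def linpack_gflops(n, seconds):
    """GFlops rate implied by a problem size and a wall-clock time."""
    return linpack_flops(n) / seconds / 1e9


def efficiency(measured_gflops, peak_gflops=PEAK_GFLOPS):
    """Fraction of theoretical peak that was achieved."""
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    # (size, time in seconds, reported GFlops) taken from the output above
    runs = [(10000, 9.669, 68.9730), (45000, 758.254, 80.1236)]
    for n, seconds, reported in runs:
        print(n, linpack_gflops(n, seconds), reported, efficiency(reported))
```

For the largest problem the recomputed rate agrees with the reported 80.1236 GFlops, which corresponds to roughly 89% of peak. For the smallest sizes the times are rounded too coarsely for this check to be meaningful.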
- VTune is installed in /opt/intel/vtune on cookie. The GUI may crash when run remotely; some relevant docs on using the command-line interface are here.
Miscellaneous
To compute cycles/byte from MB/s: cycles per byte = 1 / (MB/s × 10^6 × period_processor), where period_processor is the clock period of the processor in seconds, i.e. the inverse of the processor speed in Hz.
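As a quick check of the conversion above, here is a minimal sketch applied to the single-core read bandwidth measured on Cookie (3585 MB/s at 2.8 GHz). The function name `cycles_per_byte` is made up for illustration, and MB is taken to mean 10^6 bytes, as in the formula.

```python
def cycles_per_byte(mb_per_s, clock_hz):
    """Cycles spent per byte transferred at the given bandwidth."""
    period_s = 1.0 / clock_hz                 # processor clock period
    return 1.0 / (mb_per_s * 1e6 * period_s)


if __name__ == "__main__":
    cpb = cycles_per_byte(3585.0, 2.8e9)
    print("cycles per byte:", cpb)
    print("bytes per cycle:", 1.0 / cpb)
```

The inverse of the result is the bytes-per-cycle figure (about 1.28) used in the Cookie notes.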