Last modified on 06/03/10 16:46:00

Blue Gene/P at Argonne

  • ALCF pages
  • Simple timers
  • PAPI 3.0.9 was ported and installed by IBM in the usual location: /soft/apps/papi-c-3.0.9
    • BG/P provides true 64-bit counters; under PAPI, all counters always start from a true zero (writes are not allowed)
    • 455 total native counters, max multiplexing: 256
    • Implementation details: The BG/P hardware performance counters are the values of special CPU registers. Each 64-bit counter is split into a 12-bit low-order portion and a 52-bit high-order portion. The high-order portions are implemented as an SRAM array: the high-order logic sequentially polls the low-order counters and increments the corresponding SRAM word if the overflow bit of a low-order counter is set. Each counter can be configured to generate an interrupt when a specified threshold value is reached.
  • PAPI counter tests:
  • Dynamic loading of libraries on BGP for IBM and GCC compiler
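
The split-counter scheme described under "Implementation details" above can be illustrated with a small simulation. This is only a sketch: the class name `SplitCounter`, its `count`/`poll`/`read` methods, and the single sticky overflow flag are assumptions made for illustration. It does not describe the actual hardware logic or any PAPI interface.

```python
LOW_BITS = 12                      # low-order portion of each counter
HIGH_BITS = 52                     # high-order portion, kept in the SRAM array
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1


class SplitCounter:
    """Toy model of one 64-bit counter split into low and high parts."""

    def __init__(self):
        self.low = 0           # 12-bit fast counter
        self.overflow = False  # set when the low part wraps around
        self.high = 0          # 52-bit word held in the SRAM array

    def count(self, events=1):
        """Record events in the low-order part only."""
        for _ in range(events):
            self.low = (self.low + 1) & LOW_MASK
            if self.low == 0:
                # In this model a second wrap before the next poll would be lost.
                assert not self.overflow, "low-order counter polled too rarely"
                self.overflow = True

    def poll(self):
        """High-order logic: fold a pending overflow into the SRAM word."""
        if self.overflow:
            self.high = (self.high + 1) & HIGH_MASK
            self.overflow = False

    def read(self):
        """Return the full 64-bit value; poll first so a pending wrap counts."""
        self.poll()
        return (self.high << LOW_BITS) | self.low


counter = SplitCounter()
for _ in range(10):
    counter.count(1000)   # 1000 events between polls, fewer than 2**12
    counter.poll()
print(counter.read())     # 10000
```

The point of the model is that the count stays exact as long as every low-order counter is polled at least once per 4096 events.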

Various hardware performance parameters (looked up from IBM docs, Bob Walkup's whitepaper, this BG/L paper and its corresponding presentation).

  • Clock rate: 850 MHz, i.e. a cycle time of 1/(850 × 10^6 Hz) ≈ 1.17647 ns
  • Processing Units
    • Single integer unit (FXU)
    • Single load/store unit (LSU)
    • Special double floating-point unit (DFPU) : 32 primary floating-point registers, 32 secondary floating-point registers; supports both standard PowerPC and SIMD instructions
      Instruction  latency   throughput/cycle 
      fadd         5 cycles    1  
      fmadd        5 cycles    1 
      fpmadd       5 cycles    1 
      fdiv        30 cycles   1/30
      
  • Example cycle counts for other operations:
              libm.a     libmass.a  libmassv.a
      exp     185        64         22
      log     320        80         25
      pow     460        176        29-48
      sqrt    106        46         8-10
      rsqrt   136        …          6-7
      1/x     30 (fdiv)  …          4-5
    
  • Theoretical flop limit = 1 fpmadd per cycle => 4 floating-point ops per cycle.
    • per core: 4 ops/cycle * 0.850 GHz = 3.4 GFLOPS
    • per node: 4 cores * 3.4 = 13.6 GFLOPS
  • L1 latency: 4 cycles ~ 4.7ns
  • L2 latency: 12 cycles ~ 14ns (or 11 cycles ~ 13ns, depending on the source)
  • L3 latency: 28/36/40 cycles ~ 33/42.3/47 ns (or from Bob Walkup's whitepaper, ~50 cycles ~ 59 ns)
  • Main memory latency: 86 cycles ~ 101.2 ns (or from Bob Walkup's whitepaper, ~104 cycles ~ 122.3 ns)
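
The arithmetic behind the cycle time, the peak flop rates, and the cycle-to-nanosecond conversions in the list above can be reproduced with a short script. It is a minimal sketch: the variable and function names are illustrative, and only the first of each set of alternative latency figures is used.

```python
CLOCK_HZ = 850e6             # 850 MHz core clock
CYCLE_NS = 1e9 / CLOCK_HZ    # about 1.17647 ns per cycle

FLOPS_PER_CYCLE = 4          # 1 fpmadd per cycle = 4 floating-point ops
CORES_PER_NODE = 4


def cycles_to_ns(cycles):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles * CYCLE_NS


core_gflops = FLOPS_PER_CYCLE * CLOCK_HZ / 1e9
node_gflops = CORES_PER_NODE * core_gflops

print(f"cycle time: {CYCLE_NS:.5f} ns")
print(f"peak per core: {core_gflops:.1f} GFLOPS")   # 3.4
print(f"peak per node: {node_gflops:.1f} GFLOPS")   # 13.6

# First-listed latency for each memory level, in cycles
latencies = {"L1": 4, "L2": 12, "L3": 28, "main memory": 86}
for level, cycles in latencies.items():
    print(f"{level}: {cycles} cycles ~ {cycles_to_ns(cycles):.1f} ns")
```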

From another Bob Walkup paper:

  • L1 Data cache : 32 KB total size, 32-Byte line size, 64-way associative, round-robin replacement, write-through for cache coherency, 4-cycle load to use
  • L2 Data cache : prefetch buffer, holds 15 128-byte lines, can prefetch up to 7 streams
  • L3 Data cache : 2x4 MB, ~50 cycles latency
  • on-chip Memory : 2048 MB DDR2 at 425 MHz, ~104 cycles latency, ~16 GB/sec bandwidth limit
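  • derived peak memory bandwidth: ~0.0495 cycles per byte (~50.66 cycles per KiB, or ~20.2 bytes per cycle), taking 16 GB/sec as 16 × 2^30 bytes/sec at 850 MHz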
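
The derived bandwidth figure can be checked by dividing the clock rate by the bandwidth limit. The sketch below assumes that "16 GB/sec" means 16 × 2^30 bytes per second; under that assumption, 50.66 is the number of cycles per KiB rather than per byte. The variable names are illustrative.

```python
CLOCK_HZ = 850e6                      # core clock
BANDWIDTH_BYTES_PER_S = 16 * 2**30    # assumption: 16 GB/sec read as 16 GiB/s
L2_LINE_BYTES = 128                   # line size of the L2 prefetch buffer

cycles_per_byte = CLOCK_HZ / BANDWIDTH_BYTES_PER_S
bytes_per_cycle = BANDWIDTH_BYTES_PER_S / CLOCK_HZ
cycles_per_kib = cycles_per_byte * 1024
cycles_per_line = cycles_per_byte * L2_LINE_BYTES

print(f"{cycles_per_byte:.4f} cycles per byte")   # 0.0495
print(f"{bytes_per_cycle:.2f} bytes per cycle")   # 20.21
print(f"{cycles_per_kib:.2f} cycles per KiB")     # 50.66
print(f"{cycles_per_line:.2f} cycles to stream one 128-byte line")
```

With a decimal reading (16 × 10^9 bytes per second) the result would instead be about 18.8 bytes per cycle.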