
Cookie (Xeon MCS workstation)

Cookie has two quad-core Xeon E5462 processors (8 cores total) running at 2.8 GHz (1600 MHz FSB), with 32 KB L1 instruction and 32 KB L1 data cache per core, 12 MB of L2 cache per processor (6 MB shared per core pair), and 2 GB of DDR2 FB-DIMM RAM. It runs Linux kernel 2.6.25 (x86-64).

From the Intel docs on the 5400 series:

The Quad-Core Intel® Xeon® Processor 5400 Series maintains the tradition of compatibility with IA-32 software. Some key features include on-die, primary 32-kB instruction cache and 32-kB write-back data cache in each core and 12 MB (2 x 6 MB) Level 2 cache with Intel® Advanced Smart Cache architecture. The processors' Data Prefetch Logic speculatively fetches data to the L2 cache before an L1 cache request occurs, resulting in reduced effective bus latency and improved performance. The 1600 MHz Front Side Bus (FSB) is a quad-pumped bus running off a 400 MHz system clock, making 12.80 GBytes per second data transfer rates possible.

Various performance parameters (measured with LMBench unless noted otherwise):

  • Clock rate: LMBench measures 2799 MHz (0.3573 ns clock period); the nominal rate is 2.8 GHz
  • Peak processor performance
    • per core: 4 FLOPs/cycle (one packed SSE add and one packed SSE multiply per cycle, two doubles each) * 2.8 GHz = 11.2 GFLOPS
    • per processor: 4 cores * 11.2 = 44.8 GFLOPS
    • whole machine: 2 processors * 44.8 = 89.6 GFLOPS
  • L1 (read) latency: 2.8 cycles ~ 1 ns
  • L2 (read) latency: 15.1 cycles ~ 5.4 ns
  • Main memory (read) latency:
    • LMBench: worst case 272 cycles ~ 97.1 ns; average 188 cycles ~ 67 ns
    • Paper: 134.4 cycles ~ 48 ns
  • Main memory bandwidth:
    • LMBench:
      • L1 (32 KB data + 32 KB instruction per core; lmbench doubles the specified size, so we request 16 KB per core):
        • bw_mem -N 10 -P 4 16384 rd (read, 4 cores): <size,MB/s> = <0.016384, 178574>
        • bw_mem -N 10 -P 4 16384 rdwr (read/write, 4 cores): <size,MB/s> = <0.016384, 68088>
        • bw_mem -N 10 -P 4 16384 wr (write, 4 cores): <size,MB/s> = <0.016384, 173340>
      • L2 (12 MB per processor, 6 MB shared per core pair; lmbench doubles the specified size, so we request 1536 KB per core):
        • bw_mem -N 10 -P 4 1536k rd (read, 4 cores): <size,MB/s> = <1.57, 88297>
        • bw_mem -N 10 -P 4 1536k rdwr (read/write, 4 cores): <size,MB/s> = <1.57, 48510>
        • bw_mem -N 10 -P 4 1536k wr (write, 4 cores): <size,MB/s> = <1.57, 68455>
      • Main memory:
        • bw_mem -P 4 256m rd (read, 4 cores): <size,MB/s> = <268.44, 7668>
        • bw_mem -P 1 256m rd (read, 1 core): <size,MB/s> = <268.44, 3585>
        • bw_mem -P 4 256m rdwr (read/write, 4 cores): <size,MB/s> = <268.44, 6044>
        • bw_mem -P 1 256m rdwr (read/write, 1 core): <size,MB/s> = <268.44, 2754>
        • bw_mem -P 4 256m wr (write, 4 cores): <size,MB/s> = <268.44, 6034>
        • bw_mem -P 1 256m wr (write, 1 core): <size,MB/s> = <268.44, 2715>
      Notes: on a single core the read bandwidth is 3585 MB/s, i.e. 3585/2800 ≈ 1.28 bytes per cycle. The theoretical peak is 4 FLOPs per cycle, so any computation that needs more than 1.28 / 4 = 0.32 bytes per FLOP is memory-bound (see the triad sketch after the STREAM numbers below).
    • LMBench stream benchmark:
      STREAM copy latency: 4.75 nanoseconds
      STREAM copy bandwidth: 3371.97 MB/sec
      STREAM scale latency: 4.75 nanoseconds
      STREAM scale bandwidth: 3367.71 MB/sec
      STREAM add latency: 2.31 nanoseconds
      STREAM add bandwidth: 10401.62 MB/sec
      STREAM triad latency: 6.77 nanoseconds
      STREAM triad bandwidth: 3547.15 MB/sec
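
    The STREAM triad kernel timed above is essentially a[i] = b[i] + scalar * c[i]. A hypothetical C sketch of such a kernel (not the actual STREAM or LMBench source) follows: each iteration performs 2 FLOPs and moves 24 bytes (read b, read c, write a), i.e. 12 bytes per FLOP, far above the 0.32 bytes/FLOP threshold derived above, so loops of this kind are limited by memory bandwidth rather than by the GFLOPS peak.

    /* Sketch of a STREAM-style triad kernel; array sizes are chosen to be
     * well beyond the 12 MB L2 so that main memory is what gets measured. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)                     /* ~16M doubles per array */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        const double scalar = 3.0;

        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];    /* triad: 2 FLOPs, 24 bytes */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("triad bandwidth: %.1f MB/s\n", 24.0 * N / sec / 1e6);

        free(a); free(b); free(c);
        return 0;
    }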
      
  • Operation parallelism, measured with LMBench's ./par_ops (a sketch of the measurement idea follows the numbers below)
    integer bit parallelism: 2.96
    integer add parallelism: 1.99
    integer mul parallelism: 3.20
    integer div parallelism: 1.86
    integer mod parallelism: 2.11
    int64 bit parallelism: 2.60
    int64 add parallelism: 1.84
    int64 mul parallelism: 3.97
    int64 div parallelism: 1.43
    int64 mod parallelism: 1.27
    float add parallelism: 3.00
    float mul parallelism: 4.15
    float div parallelism: 1.28
    double add parallelism: 3.00
    double mul parallelism: 5.15
    double div parallelism: 1.17
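
    The par_ops numbers above estimate how many independent operations of each type the core can keep in flight at once. A hypothetical C sketch of the basic idea (not LMBench's actual code): time one long dependent chain, then several interleaved independent chains, and take the ratio of the per-operation times.

    /* Sketch of an ILP measurement in the spirit of par_ops.  With only four
     * independent chains the estimate saturates at 4; more chains may be
     * needed to expose the full parallelism reported above. */
    #include <stdio.h>
    #include <time.h>

    #define ITERS 10000000L

    static double now(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + 1e-9 * t.tv_nsec;
    }

    int main(void)
    {
        volatile double sink;
        const double a = 1.000001, b = 1.000002, c = 1.000003, d = 1.000004;

        /* dependent chain: each multiply waits for the previous result */
        double t0 = now(), x = 1.0;
        for (long i = 0; i < ITERS; i++)
            x *= a;
        double dep = (now() - t0) / ITERS;
        sink = x;

        /* four independent chains: the core can overlap them */
        t0 = now();
        double x1 = 1.0, x2 = 1.0, x3 = 1.0, x4 = 1.0;
        for (long i = 0; i < ITERS; i++) {
            x1 *= a; x2 *= b; x3 *= c; x4 *= d;
        }
        double indep = (now() - t0) / (4.0 * ITERS);
        sink = x1 + x2 + x3 + x4;

        printf("double mul parallelism ~ %.2f\n", dep / indep);
        return 0;
    }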
    
  • Results from Intel's version of LINPACK (the "sequential", i.e. non-MPI, shared-memory binary; the run below used 8 threads; the conventional GFlops conversion is sketched after the performance summary)
    Intel(R) LINPACK data
    
    Current date/time: Mon Jul 27 20:25:19 2009
    
    CPU frequency:    2.800 GHz
    Number of CPUs: 8
    Number of threads: 8
    Parameters are set to:
    
    Number of tests                             : 15
    Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
    Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
    Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
    Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    
    
    Maximum memory requested that can be used = 16200901024, at the size = 45000
    ============= Timing linear equation system solver =================
    
    Size   LDA    Align. Time(s)    GFlops   Residual      Residual(norm)
    1000   1000   4      0.065      10.3362  1.290190e-12 4.399880e-02
    1000   1000   4      0.017      38.3079  1.290190e-12 4.399880e-02
    1000   1000   4      0.017      38.2627  1.290190e-12 4.399880e-02
    1000   1000   4      0.017      38.3319  1.290190e-12 4.399880e-02
    2000   2000   4      0.110      48.4755  5.256129e-12 4.572188e-02
    2000   2000   4      0.108      49.4378  5.256129e-12 4.572188e-02
    5000   5008   4      1.363      61.1828  2.465550e-11 3.438011e-02
    5000   5008   4      1.397      59.7063  2.465550e-11 3.438011e-02
    10000  10000  4      9.803      68.0264  9.107376e-11 3.211353e-02
    10000  10000  4      9.669      68.9730  9.107376e-11 3.211353e-02
    15000  15000  4      30.093     74.7829  2.125512e-10 3.347717e-02
    15000  15000  4      30.099     74.7676  2.125512e-10 3.347717e-02
    18000  18008  4      50.985     76.2700  3.157435e-10 3.457779e-02
    18000  18008  4      50.983     76.2729  3.157435e-10 3.457779e-02
    20000  20016  4      69.163     77.1242  3.972194e-10 3.516262e-02
    20000  20016  4      69.187     77.0970  3.972194e-10 3.516262e-02
    22000  22008  4      92.068     77.1130  4.231772e-10 3.099607e-02
    22000  22008  4      92.126     77.0642  4.231772e-10 3.099607e-02
    25000  25000  4      133.120    78.2597  5.196674e-10 2.955162e-02
    25000  25000  4      133.156    78.2387  5.196674e-10 2.955162e-02
    26000  26000  4      149.381    78.4485  7.240961e-10 3.807515e-02
    26000  26000  4      149.490    78.3909  7.240961e-10 3.807515e-02
    27000  27000  4      167.023    78.5726  6.872751e-10 3.351503e-02
    30000  30000  1      227.862    79.0032  7.597450e-10 2.994923e-02
    35000  35000  1      359.141    79.5948  1.068930e-09 3.102943e-02
    40000  40000  1      533.574    79.9699  1.392360e-09 3.096655e-02
    45000  45000  1      758.254    80.1236  1.692018e-09 2.976927e-02
    
    Performance Summary (GFlops)
    
    Size   LDA    Align.  Average  Maximal
    1000   1000   4       31.3097  38.3319 
    2000   2000   4       48.9566  49.4378 
    5000   5008   4       60.4446  61.1828 
    10000  10000  4       68.4997  68.9730 
    15000  15000  4       74.7752  74.7829 
    18000  18008  4       76.2715  76.2729 
    20000  20016  4       77.1106  77.1242 
    22000  22008  4       77.0886  77.1130 
    25000  25000  4       78.2492  78.2597 
    26000  26000  4       78.4197  78.4485 
    27000  27000  4       78.5726  78.5726 
    30000  30000  1       79.0032  79.0032 
    35000  35000  1       79.5948  79.5948 
    40000  40000  1       79.9699  79.9699 
    45000  45000  1       80.1236  80.1236
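
    The GFlops column follows the conventional LINPACK accounting: an LU solve of an n-by-n system is charged 2/3*n^3 + 2*n^2 floating-point operations, divided by the measured solve time. A minimal sketch of that conversion (hypothetical helper, not Intel's code):

    #include <stdio.h>

    /* Conventional LINPACK operation count: 2/3*n^3 + 2*n^2 FLOPs. */
    static double linpack_gflops(double n, double seconds)
    {
        return ((2.0 / 3.0) * n * n * n + 2.0 * n * n) / seconds / 1e9;
    }

    int main(void)
    {
        /* e.g. the n = 45000 run above took 758.254 s -> ~80.12 GFlops,
         * matching the reported 80.1236. */
        printf("%.4f GFlops\n", linpack_gflops(45000.0, 758.254));
        return 0;
    }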
    
  • Results from SKaMPI (a sketch of the collective timing pattern follows the collective results below)
    SKaMPI Collective Configuration: 
    set_min_repetitions(32)
    set_max_repetitions(64)
    set_max_relative_standard_error(0.03)
    set_skampi_buffer(4MB)
    end result "MPI_Bcast-nodes-short"
    # duration = 1.43 sec
    end result "MPI_Bcast-nodes-long"
    # duration = 0.22 sec
    end result "MPI_Bcast-length"
    # duration = 1.32 sec
    end result "Bcast_using_Send_Recv-nodes-short"
    # duration = 0.01 sec
    end result "Bcast_using_Send_Recv-nodes-long"
    # duration = 0.30 sec
    end result "Bcast_using_Send_Recv-length"
    # duration = 1.23 sec
    end result "MPI_Barrier-nodes"
    # duration = 0.01 sec
    end result "MPI_Reduce-nodes"
    # duration = 0.01 sec
    end result "MPI_Reduce-length"
    # duration = 0.80 sec
    end result "MPI_Allreduce-nodes"
    # duration = 0.01 sec
    end result "MPI_Allreduce-length"
    # duration = 1.29 sec
    end result "Allreduce_using_Reduce_Bcast-nodes"
    # duration = 0.01 sec
    end result "Allreduce_using_Reduce_Bcast-length"
    # duration = 1.29 sec
    end result "MPI_Reduce_scatter-nodes"
    # duration = 0.01 sec
    end result "MPI_Reduce-scatter-length"
    # duration = 0.72 sec
    end result "MPI_Alltoall-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Alltoall-nodes-long"
    # duration = 0.13 sec
    end result "MPI_Alltoall-length"
    # duration = 0.70 sec
    end result "Alltoall_using_Isend_Irecv-nodes-short"
    # duration = 0.01 sec
    end result "Alltoall_using_Isend_Irecv-nodes-long"
    # duration = 0.10 sec
    end result "Alltoall_using_Isend_Irecv-length"
    # duration = 0.33 sec
    end result "MPI_Gather-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Gather-nodes-long"
    # duration = 0.22 sec
    end result "MPI_Gather-length"
    # duration = 0.24 sec
    end result "Gather_using_Send_Recv-nodes-short"
    # duration = 0.01 sec
    end result "Gather_using_Send_Recv-nodes-long"
    # duration = 0.16 sec
    end result "Gather_using_Send_Recv-length"
    # duration = 0.17 sec
    end result "Gather_using_Isend_Irecv_Waitall-nodes-short"
    # duration = 0.01 sec
    end result "Gather_using_Isend_Irecv_Waitall-nodes-long"
    # duration = 0.16 sec
    end result "Gather_using_Isend_Irecv_Waitall-length"
    # duration = 0.20 sec
    end result "MPI_Allgather-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Allgather-nodes-long"
    # duration = 0.59 sec
    end result "MPI_Allgather-length"
    # duration = 0.76 sec
    end result "MPI_Scatter-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Scatter-nodes-long"
    # duration = 0.21 sec
    end result "MPI_Scatter-length"
    # duration = 0.21 sec
    end result "Reduce_Scatterv-nodes"
    # duration = 0.01 sec
    end result "Reduce_Scatterv-length"
    # duration = 0.79 sec
    end result "MPI_Alltoallv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Alltoallv-nodes-long"
    # duration = 0.13 sec
    end result "MPI_Alltoallv-length"
    # duration = 0.75 sec
    end result "Alltoallv_using_Isend_Irecv-nodes-short"
    # duration = 0.01 sec
    end result "Alltoallv_using_Isend_Irecv-nodes-long"
    # duration = 0.13 sec
    end result "Alltoallv_using_Isend_Irecv-length"
    # duration = 0.73 sec
    end result "MPI_Gatherv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Gatherv-nodes-long"
    # duration = 0.20 sec
    end result "MPI_Gatherv-length"
    # duration = 0.24 sec
    end result "MPI_Allgatherv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Allgatherv-nodes-long"
    # duration = 0.59 sec
    end result "MPI_Allgatherv-length"
    # duration = 0.67 sec
    end result "MPI_Scatterv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Scatterv-nodes-long"
    # duration = 0.19 sec
    end result "MPI_Scatterv-length"
    # duration = 0.21 sec
    end result "MPI_Scan-nodes"
    # duration = 0.01 sec
    end result "MPI_Scan-length"
    # duration = 2.16 sec
    end result "MPI_Comm_split-nodes"
    # duration = 0.01 sec
    end result "MPI_Comm_dup-nodes"
    # duration = 0.01 sec
    
    dt1 = MPI_FLOAT
    
    end result "MPI_Bcast-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Bcast-nodes-long"
    # duration = 0.28 sec
    end result "MPI_Bcast-length"
    # duration = 1.37 sec
    end result "Bcast_using_Send_Recv-nodes-short"
    # duration = 0.01 sec
    end result "Bcast_using_Send_Recv-nodes-long"
    # duration = 0.26 sec
    end result "Bcast_using_Send_Recv-length"
    # duration = 1.42 sec
    end result "MPI_Barrier-nodes"
    # duration = 0.01 sec
    end result "MPI_Reduce-nodes"
    # duration = 0.01 sec
    end result "MPI_Reduce-length"
    # duration = 0.97 sec
    end result "MPI_Allreduce-nodes"
    # duration = 0.01 sec
    end result "MPI_Allreduce-length"
    # duration = 1.37 sec
    end result "Allreduce_using_Reduce_Bcast-nodes"
    # duration = 0.01 sec
    end result "Allreduce_using_Reduce_Bcast-length"
    # duration = 1.37 sec
    end result "MPI_Reduce_scatter-nodes"
    # duration = 0.01 sec
    end result "MPI_Reduce-scatter-length"
    # duration = 0.74 sec
    end result "MPI_Alltoall-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Alltoall-nodes-long"
    # duration = 0.14 sec
    end result "MPI_Alltoall-length"
    # duration = 0.72 sec
    end result "Alltoall_using_Isend_Irecv-nodes-short"
    # duration = 0.01 sec
    end result "Alltoall_using_Isend_Irecv-nodes-long"
    # duration = 0.11 sec
    end result "Alltoall_using_Isend_Irecv-length"
    # duration = 0.33 sec
    end result "MPI_Gather-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Gather-nodes-long"
    # duration = 0.21 sec
    end result "MPI_Gather-length"
    # duration = 0.23 sec
    end result "Gather_using_Send_Recv-nodes-short"
    # duration = 0.01 sec
    end result "Gather_using_Send_Recv-nodes-long"
    # duration = 0.18 sec
    end result "Gather_using_Send_Recv-length"
    # duration = 0.17 sec
    end result "Gather_using_Isend_Irecv_Waitall-nodes-short"
    # duration = 0.01 sec
    end result "Gather_using_Isend_Irecv_Waitall-nodes-long"
    # duration = 0.16 sec
    end result "Gather_using_Isend_Irecv_Waitall-length"
    # duration = 0.19 sec
    end result "MPI_Allgather-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Allgather-nodes-long"
    # duration = 0.68 sec
    end result "MPI_Allgather-length"
    # duration = 0.80 sec
    end result "MPI_Scatter-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Scatter-nodes-long"
    # duration = 0.16 sec
    end result "MPI_Scatter-length"
    # duration = 0.19 sec
    end result "Reduce_Scatterv-nodes"
    # duration = 0.01 sec
    end result "Reduce_Scatterv-length"
    # duration = 0.84 sec
    end result "MPI_Alltoallv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Alltoallv-nodes-long"
    # duration = 0.13 sec
    end result "MPI_Alltoallv-length"
    # duration = 0.76 sec
    end result "Alltoallv_using_Isend_Irecv-nodes-short"
    # duration = 0.01 sec
    end result "Alltoallv_using_Isend_Irecv-nodes-long"
    # duration = 0.13 sec
    end result "Alltoallv_using_Isend_Irecv-length"
    # duration = 0.64 sec
    end result "MPI_Gatherv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Gatherv-nodes-long"
    # duration = 0.18 sec
    end result "MPI_Gatherv-length"
    # duration = 0.28 sec
    end result "MPI_Allgatherv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Allgatherv-nodes-long"
    # duration = 0.65 sec
    end result "MPI_Allgatherv-length"
    # duration = 0.77 sec
    end result "MPI_Scatterv-nodes-short"
    # duration = 0.01 sec
    end result "MPI_Scatterv-nodes-long"
    # duration = 0.16 sec
    end result "MPI_Scatterv-length"
    # duration = 0.18 sec
    end result "MPI_Scan-nodes"
    # duration = 0.01 sec
    end result "MPI_Scan-length"
    # duration = 2.30 sec
    end result "MPI_Comm_split-nodes"
    # duration = 0.01 sec
    end result "MPI_Comm_dup-nodes"
    # duration = 0.01 sec
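
    The collective timings above are SKaMPI measurements; roughly, the pattern is to synchronize the ranks, time the call on every rank, and report the slowest rank. A hypothetical MPI/C sketch of that pattern for MPI_Bcast (not SKaMPI's actual measurement code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, count = 1 << 18;             /* 1 MB of MPI_INT */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int *buf = malloc(count * sizeof(int));
        if (rank == 0)
            for (int i = 0; i < count; i++) buf[i] = i;

        const int reps = 64;                   /* cf. set_max_repetitions(64) */
        double local = 0.0, worst;
        for (int r = 0; r < reps; r++) {
            MPI_Barrier(MPI_COMM_WORLD);       /* synchronize before timing */
            double t0 = MPI_Wtime();
            MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
            local += MPI_Wtime() - t0;
        }
        MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("MPI_Bcast, %zu bytes: %.2f us per call (max over ranks)\n",
                   count * sizeof(int), 1e6 * worst / reps);

        free(buf);
        MPI_Finalize();
        return 0;
    }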
    
    
Onesided Configuration (a sketch of the fence/Put pattern follows these results):

set_min_repetitions(32)
set_max_repetitions(64)
set_max_relative_standard_error(0.03)
set_skampi_buffer_mpi_alloc_mem(4mb,MPI_INFO_NULL)
dt = MPI_INT
np = get_comm_size(MPI_COMM_WORLD)
comm_pt2pt_slow = comm2_max_latency_with_root()
no_assert = 0

end result "MPI_Win_fence_open"
# duration = 0.34 sec
end result "MPI_Win_fence_close"
# duration = 0.07 sec
end result "MPI_Win_fence_close-noput"
# duration = 0.06 sec
end result "MPI_Win_fence_close-nodes-short"
# duration = 0.01 sec
end result "MPI_Win_fence_close-nodes-long"
# duration = 0.01 sec
end result "MPI_Win_fence_openclose"
# duration = 0.01 sec
end result "MPI_Win_start"
# duration = 0.01 sec
end result "MPI_Win_start with MPI_MODE_NOCHECK"
# duration = 0.01 sec
end result "MPI_Win_start with delayed MPI_Win_post"
# duration = 0.84 sec
end result "MPI_Win_post"
# duration = 0.01 sec
end result "MPI_Win_post with MPI_MODE_NOCHECK"
# duration = 0.01 sec
end result "MPI_Win_post with MPI_MODE_NOSTORE"
# duration = 0.01 sec
end result "MPI_Win_complete"
# duration = 0.81 sec
end result "MPI_Win_complete without MPI_Put"
# duration = 0.01 sec
end result "MPI_Win_complete delayed"
# duration = 15.65 sec
end result "MPI_Win_complete delayed, small"
# duration = 19.14 sec
end result "MPI_Win_complete delayed without MPI_Put"
# duration = 21.11 sec
end result "MPI_Win_wait delayed"
# duration = 2.21 sec
end result "MPI_Win_wait delayed,small"
# duration = 15.20 sec
end result "MPI_Win_wait delayed without MPI_Put"
# duration = 2.17 sec
end result "MPI_Win_test"
# duration = 0.10 sec
end result "MPI_Win_test without MPI_Put"
# duration = 0.00 sec
end result "MPI_Win_test delayed"
# duration = 0.89 sec
end result "MPI_Win_test delayed without MPI_Put"
# duration = 0.83 sec
end result "MPI_Win_create"
# duration = 0.01 sec
end result "MPI_Win_free"
# duration = 0.01 sec
end result "MPI_Win_free_noput"
# duration = 0.01 sec
end result "Pingpong_Put_Put_fence"
# duration = 0.13 sec
end result "MPI_Put callduration"
# duration = 0.24 sec
end result "MPI_Put_dedicated"
# duration = 0.10 sec
end result "MPI_Put_passive"
# duration = 0.10 sec
end result "MPI_Put_fence"
# duration = 0.10 sec
end result "MPI_Put_maxlatency"
# duration = 0.29 sec
end result "MPI_Accumulate MPI_REPLACE"
# duration = 0.21 sec
end result "MPI_Accumulate MPI_MAX"
# duration = 0.18 sec
end result "MPI_Accumulate MPI_SUM"
# duration = 0.17 sec
end result "Combining_short_Puts_fence"
# duration = 1.02 sec
end result "Combining_measure_large_Put"
# duration = 1.59 sec

dt1 = MPI_FLOAT

end result "MPI_Win_fence_open"
# duration = 0.46 sec
end result "MPI_Win_fence_close"
# duration = 0.06 sec
end result "MPI_Win_fence_close-noput"
# duration = 0.07 sec
end result "MPI_Win_fence_close-nodes-short"
# duration = 0.01 sec
end result "MPI_Win_fence_close-nodes-long"
# duration = 0.01 sec
end result "MPI_Win_fence_openclose"
# duration = 0.01 sec
end result "MPI_Win_start"
# duration = 0.01 sec
end result "MPI_Win_start with MPI_MODE_NOCHECK"
# duration = 0.01 sec
end result "MPI_Win_start with delayed MPI_Win_post"
# duration = 0.44 sec
end result "MPI_Win_post"
# duration = 0.01 sec
end result "MPI_Win_post with MPI_MODE_NOCHECK"
# duration = 0.01 sec
end result "MPI_Win_post with MPI_MODE_NOSTORE"
# duration = 0.01 sec
end result "MPI_Win_complete"
# duration = 0.82 sec
end result "MPI_Win_complete without MPI_Put"
# duration = 0.01 sec
end result "MPI_Win_complete delayed"
# duration = 17.74 sec
end result "MPI_Win_complete delayed, small"
# duration = 17.17 sec
end result "MPI_Win_complete delayed without MPI_Put"
# duration = 17.17 sec
end result "MPI_Win_wait delayed"
# duration = 2.22 sec
end result "MPI_Win_wait delayed,small"
# duration = 15.20 sec
end result "MPI_Win_wait delayed without MPI_Put"
# duration = 2.17 sec
end result "MPI_Win_test"
# duration = 0.03 sec
end result "MPI_Win_test without MPI_Put"
# duration = 0.00 sec
end result "MPI_Win_test delayed"
# duration = 0.46 sec
end result "MPI_Win_test delayed without MPI_Put"
# duration = 0.43 sec
end result "MPI_Win_create"
# duration = 0.01 sec
end result "MPI_Win_free"
# duration = 0.01 sec
end result "MPI_Win_free_noput"
# duration = 0.01 sec
end result "Pingpong_Put_Put_fence"
# duration = 0.11 sec
end result "MPI_Put callduration"
# duration = 0.13 sec
end result "MPI_Put_dedicated"
# duration = 0.07 sec
end result "MPI_Put_passive"
# duration = 0.07 sec
end result "MPI_Put_fence"
# duration = 0.07 sec
end result "MPI_Put_maxlatency"
# duration = 0.15 sec
end result "MPI_Accumulate MPI_REPLACE"
# duration = 0.16 sec
end result "MPI_Accumulate MPI_MAX"
# duration = 0.14 sec
end result "MPI_Accumulate MPI_SUM"
# duration = 0.15 sec
end result "Combining_short_Puts_fence"
# duration = 0.90 sec
end result "Combining_measure_large_Put"
# duration = 1.01 sec
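
The one-sided numbers above exercise MPI-2 RMA epochs. A hypothetical MPI/C sketch of the fence/Put/fence access pattern behind measurements such as "MPI_Put_fence" (not SKaMPI's code; run with at least 2 ranks):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, count = 1 << 16;                 /* 256 KB of MPI_INT */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* expose a window backed by MPI_Alloc_mem, as in the SKaMPI config */
    int *base;
    MPI_Aint bytes = count * sizeof(int);
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &base);
    MPI_Win win;
    MPI_Win_create(base, bytes, sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    int *src = malloc(bytes);
    for (int i = 0; i < count; i++) src[i] = rank;

    double t0 = MPI_Wtime();
    MPI_Win_fence(0, win);                     /* open the epoch */
    if (rank == 0)
        MPI_Put(src, count, MPI_INT, 1, 0, count, MPI_INT, win);
    MPI_Win_fence(0, win);                     /* close the epoch */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("fence/Put/fence epoch: %.2f us\n", 1e6 * (t1 - t0));

    free(src);
    MPI_Win_free(&win);
    MPI_Free_mem(base);
    MPI_Finalize();
    return 0;
}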

Point-to-Point Configuration (a ping-pong sketch follows these results):

set_min_repetitions(64)
set_max_repetitions(128)
set_max_relative_standard_error(0.03)
set_skampi_buffer(64kb)
dt1 = MPI_INT

end result "Pingpong_Send_Recv"
# duration = 0.02 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.02 sec
end result "Pingpong_Send_Irecv"
# duration = 0.01 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.01 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.02 sec
end result "Pingpong_Isend_Recv"
# duration = 0.02 sec
end result "Pingpong_Issend_Recv"
# duration = 0.02 sec
end result "Pingpong_Sendrecv"
# duration = 0.01 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.01 sec

dt1 = MPI_FLOAT

end result "Pingpong_Send_Recv"
# duration = 0.02 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.02 sec
end result "Pingpong_Send_Irecv"
# duration = 0.01 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.01 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.02 sec
end result "Pingpong_Isend_Recv"
# duration = 0.01 sec
end result "Pingpong_Issend_Recv"
# duration = 0.02 sec
end result "Pingpong_Sendrecv"
# duration = 0.01 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.01 sec

dt1 = MPI_DOUBLE
buffer size = 64kb

end result "Pingpong_Send_Recv"
# duration = 0.01 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.01 sec
end result "Pingpong_Send_Irecv"
# duration = 0.01 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.01 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.01 sec
end result "Pingpong_Isend_Recv"
# duration = 0.01 sec
end result "Pingpong_Issend_Recv"
# duration = 0.01 sec
end result "Pingpong_Sendrecv"
# duration = 0.00 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.01 sec

dt1 = MPI_DOUBLE
buffer size = 128kb

end result "Pingpong_Send_Recv"
# duration = 0.04 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.03 sec
end result "Pingpong_Send_Irecv"
# duration = 0.04 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.02 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.02 sec
end result "Pingpong_Isend_Recv"
# duration = 0.01 sec
end result "Pingpong_Issend_Recv"
# duration = 0.02 sec
end result "Pingpong_Sendrecv"
# duration = 0.01 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.02 sec

dt1 = MPI_DOUBLE
buffer size = 256kb

end result "Pingpong_Send_Recv"
# duration = 0.08 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.09 sec
end result "Pingpong_Send_Irecv"
# duration = 0.08 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.04 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.04 sec
end result "Pingpong_Isend_Recv"
# duration = 0.04 sec
end result "Pingpong_Issend_Recv"
# duration = 0.06 sec
end result "Pingpong_Sendrecv"
# duration = 0.02 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.04 sec

dt1 = MPI_DOUBLE
buffer size = 512kb

end result "Pingpong_Send_Recv"
# duration = 0.07 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.07 sec
end result "Pingpong_Send_Irecv"
# duration = 0.07 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.07 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.07 sec
end result "Pingpong_Isend_Recv"
# duration = 0.07 sec
end result "Pingpong_Issend_Recv"
# duration = 0.07 sec
end result "Pingpong_Sendrecv"
# duration = 0.04 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.08 sec

dt1 = MPI_DOUBLE
buffer size = 1024kb

end result "Pingpong_Send_Recv"
# duration = 0.16 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.04 sec
end result "Pingpong_Send_Irecv"
# duration = 0.03 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.03 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.04 sec
end result "Pingpong_Isend_Recv"
# duration = 0.03 sec
end result "Pingpong_Issend_Recv"
# duration = 0.04 sec
end result "Pingpong_Sendrecv"
# duration = 0.02 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.05 sec

dt1 = MPI_DOUBLE
buffer size = 2MB

end result "Pingpong_Send_Recv"
# duration = 0.25 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.25 sec
end result "Pingpong_Send_Irecv"
# duration = 0.24 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.24 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.25 sec
end result "Pingpong_Isend_Recv"
# duration = 0.24 sec
end result "Pingpong_Issend_Recv"
# duration = 0.25 sec
end result "Pingpong_Sendrecv"
# duration = 0.16 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.34 sec

dt1 = MPI_DOUBLE
buffer size = 4MB

end result "Pingpong_Send_Recv"
# duration = 0.35 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.34 sec
end result "Pingpong_Send_Irecv"
# duration = 0.32 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.32 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.33 sec
end result "Pingpong_Isend_Recv"
# duration = 0.32 sec
end result "Pingpong_Issend_Recv"
# duration = 0.33 sec
end result "Pingpong_Sendrecv"
# duration = 0.23 sec
end result "Pingpong_Sendrecv_replace"
# duration = 0.58 sec

dt1 = MPI_DOUBLE
buffer size = 6MB

end result "Pingpong_Send_Recv"
# duration = 0.75 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.75 sec
end result "Pingpong_Send_Irecv"
# duration = 0.74 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.74 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.76 sec
end result "Pingpong_Isend_Recv"
# duration = 0.74 sec
end result "Pingpong_Issend_Recv"
# duration = 0.75 sec
end result "Pingpong_Sendrecv"
# duration = 0.51 sec
end result "Pingpong_Sendrecv_replace"
# duration = 1.19 sec

dt1 = MPI_DOUBLE
buffer size = 8MB

end result "Pingpong_Send_Recv"
# duration = 0.87 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 0.76 sec
end result "Pingpong_Send_Irecv"
# duration = 0.72 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 0.72 sec
end result "Pingpong_Ssend_Recv"
# duration = 0.72 sec
end result "Pingpong_Isend_Recv"
# duration = 0.72 sec
end result "Pingpong_Issend_Recv"
# duration = 0.72 sec
end result "Pingpong_Sendrecv"
# duration = 0.56 sec
end result "Pingpong_Sendrecv_replace"
# duration = 1.59 sec

dt1 = MPI_DOUBLE
buffer size = 16MB

end result "Pingpong_Send_Recv"
# duration = 2.10 sec
end result "Pingpong_Send_Iprobe_Recv"
# duration = 2.02 sec
end result "Pingpong_Send_Irecv"
# duration = 2.02 sec
end result "Pingpong_Send_Recv_AnyTag"
# duration = 2.01 sec
end result "Pingpong_Ssend_Recv"
# duration = 2.02 sec
end result "Pingpong_Isend_Recv"
# duration = 2.01 sec
end result "Pingpong_Issend_Recv"
# duration = 2.02 sec
end result "Pingpong_Sendrecv"
# duration = 1.48 sec
end result "Pingpong_Sendrecv_replace"
# duration = 3.70 sec
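
The point-to-point results above are ping-pong measurements. A hypothetical MPI/C sketch of the Pingpong_Send_Recv pattern (not SKaMPI's code; run with exactly 2 ranks):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, count = 8192;                    /* 64 KB of MPI_DOUBLE */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *buf = calloc(count, sizeof(double));

    const int reps = 128;                      /* cf. set_max_repetitions(128) */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++) {
        if (rank == 0) {                       /* send, then wait for the echo */
            MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {                /* receive, then echo back */
            MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / reps;    /* average round-trip time */
    if (rank == 0)
        printf("64 KB one-way time: %.2f us\n", 1e6 * rtt / 2.0);

    free(buf);
    MPI_Finalize();
    return 0;
}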

  • VTune is installed in /opt/intel/vtune on cookie. The GUI may crash when run remotely; some relevant docs on using the command-line interface are here.