wiki:IntrepidScaling

Set-up

Input files (with explanations and run scripts for intrepid): http://www.mcs.anl.gov/~acaceres/couptest/

Note: On intrepid, I was not able to pass mbcoupler_test's "read options" through the command line; the queue system seems to eat the semicolons. The workaround has been to hardcode the read options in mbcoupler_test.cpp. Using:

readOpts = "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARTITION_DISTRIBUTE;PARALLEL_RESOLVE_SHARED_ENTS;CPUTIME";
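A minimal sketch of that workaround (hypothetical code, not the actual mbcoupler_test source; the function name get_read_opts is made up), assuming the test collects its read options into a std::string before loading:

    // Hypothetical workaround: return a hardcoded option string instead of
    // taking it from argv, since the queue system mangles the semicolons.
    #include <string>

    std::string get_read_opts(int argc, char* argv[]) {
      (void)argc; (void)argv;  // normally these would come from the command line
      return "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;"
             "PARTITION_DISTRIBUTE;PARALLEL_RESOLVE_SHARED_ENTS;CPUTIME";
    }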

Executables and data files are in my home directory (/gpfs/home/acaceres).

Building with:

./configure --host=powerpc64-unknown-linux-gnu --enable-optimize --without-netcdf --with-hdf5=/soft/apps/hdf5-1.8.0 --enable-mbcoupler --with-mpi CXX=mpixlcxx_r CC=mpixlc_r FC=mpixlf77_r

Earlier builds used /soft/apps/hdf5-1.6.6/, which might have been causing errors.

Note: part of the build process is:

  • Check out from svn on intrepid
  • tar up and scp to login.mcs.anl.gov
  • untar and run autoreconf on login.mcs
  • retar and scp back to intrepid

Variables being tested:

  • debug vs. optimized build
  • hdf5 1.6 vs. 1.8
  • smp, dual, or vn mode
  • MOAB version

Still to add to this page:

  • table: numprocs vs. runtime/failure
  • table: same as above, to be edited
  • for each input file, how to run it

Issues seen:

  • timeout with low procs
  • hdf5 error
  • core dump

Mesh sizes

File ptet_0.h5m:
   type   count    total                        minimum                        average                            rms                        maximum                       std.dev.
------- ------- -------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
   Edge   91177  4.4e+04                        0.28873                        0.47854                        0.47931                        0.85193                       0.027167
    Tri   60998    6e+03                       0.058684                       0.098831                       0.099175                        0.22284                      0.0082616
    Tet  767277  1.1e+04                      0.0029522                       0.014395                       0.015298                       0.080395                      0.0051784
1D Side 4786656  2.5e+06                         0.2118                        0.52121                        0.52999                          1.134                        0.09608
 Vertex  134592

File pvf-hex_0.h5m:
   type   count    total                        minimum                        average                            rms                        maximum                       std.dev.
------- ------- -------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
   Edge   55332  2.6e+04                          0.246                        0.46534                        0.46628                        0.90091                       0.029516
   Quad   27830    6e+03                       0.081641                        0.21658                         0.2177                        0.48467                       0.021977
    Hex  110209  1.1e+04                       0.037972                        0.10022                        0.10295                        0.22543                       0.023546
1D Side 1433828  6.7e+05                          0.246                        0.46439                        0.46748                        0.90091                       0.053629
 Vertex  117216

TODO

  • Add a diagnostic to coupler.cpp to print out the points being queried -- but why? We know the points are the vertices of the target mesh. (A sketch follows this list.)
  • Write an independent program that loads a mesh in parallel and does point queries (from a file, or on random points).
  • Write debug output before and after communication; print out the MPI communications done in mbcoupler_test.cpp (sanity check).
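A minimal sketch of the point-dump diagnostic (illustrative code, not the actual coupler.cpp; assumes a moab::Interface* that already holds this rank's piece of the target mesh):

    // Hypothetical diagnostic: print every target-mesh vertex this rank
    // would hand to the coupler as a query point.
    #include <cstdio>
    #include <vector>
    #include "moab/Core.hpp"
    #include "moab/Range.hpp"
    #include "mpi.h"

    void dump_query_points(moab::Interface* mb) {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      moab::Range verts;
      mb->get_entities_by_type(0, moab::MBVERTEX, verts);  // all local vertices
      if (verts.empty()) return;

      std::vector<double> coords(3 * verts.size());
      mb->get_coords(verts, &coords[0]);

      for (size_t i = 0; i < verts.size(); ++i)
        std::printf("[%d] query point %lu: %g %g %g\n", rank, (unsigned long)i,
                    coords[3*i], coords[3*i + 1], coords[3*i + 2]);
    }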


Runs that worked

Timings of successful runs of mbcoupler_test, organized by executable. The first number in each row is the instantiation time; the other timings don't matter for now. The last column is "-n <number of nodes> -m <mode>", with smp = 1 core per node, dual = 2, vn = 4 (so, e.g., 4 nodes in vn mode is 16 MPI processes).

mbcoupler_test.14-jason.readpart small mesh pair

inst. time   (other)     (other)    nodes - mode
115.536      5.09885     0.196412       4 - vn
115.542      5.1021      0.193208       4 - vn
56.4102      1.92667     0.0981715      8 - vn
56.4118      1.92766     0.0979277      8 - vn
28.0366      0.979371    0.0722445     16 - vn
28.0389      0.98086     0.0722172     16 - vn
15.0396      0.692857    0.0420998     32 - vn
15.0408      0.691669    0.0415129     32 - vn
7.01743      0.434839    0.035589      64 - vn
7.01789      0.434329    0.036351      64 - vn
3.32072      0.33302     0.0305914    128 - vn
3.32112      0.336864    0.0339049    128 - vn
1.60273      0.240725    0.0424628    256 - vn
1.60459      0.23797     0.0436944    256 - vn
0.782535     0.139929    0.0354579    512 - vn
0.782575     0.146109    0.0426446    512 - vn
0.336923     0.12673     0.0494954   1024 - vn
0.338331     0.121506    0.0447295   1024 - vn
0.166049     0.0920526   0.0376375   2048 - vn
0.166185     0.0969077   0.0370007   2048 - vn
0.0892244    0.0964578   0.0353203   4096 - vn
0.0895257    0.100212    0.0439926   4096 - vn

mbcoupler_test.14-jason.readpart large mesh

inst. time   (other)     (other)    nodes
1132.06      40.3868     1.7539        32
1132.06      40.4338     1.75465       32
1132.13      40.3718     1.75576       32
557.936      15.2882     1.04389       64
284.601      9.32169     0.609045     128
284.604      9.26964     0.611008     128
137.771      4.79621     0.339907     256
137.778      4.79567     0.339039     256
66.4713      2.31155     0.199686     512
66.4915      2.32137     0.202376     512
33.5435      1.42498     0.128231    1024
33.552       1.41945     0.122791    1024
15.5663      1.25793     0.0854221   2048
7.38878      0.713247    0.0632454   4096
7.39885      0.715164    0.0725397   4096
3.55494      0.671379    0.057236    8192
3.56014      0.669051    0.0538699   8192
1.96593      0.687914    0.0490007  16384

Runs that died

Unsuccessful runs, grouped by executable. It is not yet clear whether "maxtime exceeded" always means the program was still working, or whether there was a deadlock. The smallest maxtime used was 10 minutes. One way to tell the two apart is sketched below.
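Since the maxtime runs leave no other output, a standard trick is to interpose on MPI through the PMPI profiling interface and timestamp entry/exit of a suspect call; ranks that enter but never leave point at a deadlock. A sketch (MPI_Allreduce is just an example choice here, not known to be the call that hangs):

    // Hypothetical PMPI interposer: logs entry/exit of MPI_Allreduce with
    // wall-clock timestamps. Compile it into the executable; the PMPI_
    // entry points forward to the real MPI implementation.
    #include <cstdio>
    #include "mpi.h"

    extern "C" int MPI_Allreduce(void* sendbuf, void* recvbuf, int count,
                                 MPI_Datatype datatype, MPI_Op op,
                                 MPI_Comm comm) {
      int rank;
      PMPI_Comm_rank(comm, &rank);
      std::fprintf(stderr, "[%d] %.3f enter MPI_Allreduce\n", rank, MPI_Wtime());
      int err = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
      std::fprintf(stderr, "[%d] %.3f exit  MPI_Allreduce\n", rank, MPI_Wtime());
      return err;
    }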

mbcoupler_test.7-opt

maxtime exceeded - read 0 mesh files -n 4 -m smp

mbcoupler_test.13-opt-1.8-readpart

maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
romio bug -n 1024 -m dual
romio bug -n 1024 -m smp
romio bug -n 1024 -m smp
romio bug -n 128 -m dual
romio bug -n 256 -m dual
romio bug -n 256 -m dual
romio bug -n 512 -m dual
romio bug -n 512 -m dual
romio bug -n 512 -m smp
romio bug -n 512 -m smp
romio bug -n 512 -m smp

mbcoupler_test.13-opt-1.8-bcastdel

171-pvf-hex_0-ptet_0 -n 128 -m smp
180-pvf-hex_0-ptet_0 -n 128 -m smp
181-pvf-hex_0-ptet_0 -n 128 -m smp
maxtime exceeded - read 1 mesh files -n 128 -m smp
maxtime exceeded - read 1 mesh files -n 128 -m smp
maxtime exceeded - read 1 mesh files -n 2 -m smp
maxtime exceeded - read 1 mesh files -n 2 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 4 -m smp
maxtime exceeded - read 2 mesh files -n 4 -m smp

mbcoupler_test.13-dbg-readpart

maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 2 -m smp

mbcoupler_test.13-dbg

killed by user -n 4 -m smp
killed by user -n 8 -m smp
maxtime exceeded - read 1 mesh files -n 2 -m smp
maxtime exceeded - read 1 mesh files -n 4 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp

mbcoupler_test.13.hdf1.8.readpart

maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m smp
maxtime exceeded - read 2 mesh files -n 1 -m vn
maxtime exceeded - read 2 mesh files -n 2 -m smp
maxtime exceeded - read 2 mesh files -n 2 -m smp
maxtime exceeded - read 2 mesh files -n 2 -m smp
maxtime exceeded - read 2 mesh files -n 2 -m smp
maxtime exceeded - read 2 mesh files -n 2 -m vn
maxtime exceeded - read 2 mesh files -n 4 -m smp
maxtime exceeded - read 2 mesh files -n 4 -m smp
maxtime exceeded - read 2 mesh files -n 4 -m smp
maxtime exceeded - read 2 mesh files -n 4 -m smp
maxtime exceeded - read 2 mesh files -n 8 -m smp
romio bug -n 128 -m vn
romio bug -n 16 -m vn
romio bug -n 32 -m vn
romio bug -n 512 -m smp
romio bug -n 64 -m vn
romio bug -n 8 -m vn

mbcoupler_test.14-jason.readpart, using big mesh pair. ("code 139" is the shell exit code for signal 11, i.e. presumably a segfault.)

code 139 8 - vn
code 139 8 - vn
code 139 8 - vn
code 139 8 - vn
queue system error 32 - vn
queue system error 64 - vn
queue system error 2048 - vn
queue system error 16384 - vn
romio 2 16 - vn
romio 2 16 - vn
romio 2 16 - vn
romio 2 16 - vn
romio3 4 - vn
romio3 4 - vn
romio3 4 - vn
romio3 4 - vn
romio3 64 - vn

Romio bug

Stack trace

------------------------------------------------------------------------
+++ID TGID 706, Core 0, Thread 1 State 40000000, Sched: 48000000 

0x016a7070
clone
??:0

------------------------------------------------------------------------
+++ID TGID 706, Core 0, Thread 5 State 00000000, Sched: 48000000 Running

0x016b56f4
raise
../nptl/sysdeps/unix/sysv/linux/raise.c:67

0x0167483c
abort
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdlib/abort.c:73

0x014e3af4
MPID_Abort
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/misc/mpid_abort.c:81

0x014b1edc
PMPI_Abort
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/init/abort.c:118

0x014ba168
ADIOI_Malloc_fn
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/common/malloc.c:55

0x014cb534
ADIOI_R_Exchange_data_alltoallv
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_rdcoll.c:1071

0x014ccc78
ADIOI_Read_and_exch
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_rdcoll.c:658

0x015137b4
MPIOI_File_read_all
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/mpi-io/read_all.c:107

0x014b67c4
PMPI_File_read_at_all
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/mpi-io/read_atall.c:56

0x012b8700
H5FD_mpio_read
H5FDmpio.c:0

0x012ae45c
H5FD_read
??:0

0x012a3620
H5F_block_read
??:0

0x0128d92c
H5D_mpio_select_read
??:0

0x0129071c
H5D_final_collective_io
H5Dmpio.c:0

0x0128e26c
H5D_inter_collective_io
H5Dmpio.c:0

0x0128e11c
H5D_contig_collective_io
??:0

0x0127f27c
H5D_contig_read
H5Dio.c:0

0x0127d5f4
H5D_read
H5Dio.c:0

0x0127cd24
H5Dread
??:0

0x01181d30
mhdf_readwrite
src/util.c:0

0x01182c9c
mhdf_read_data
??:0

0x0117d82c
mhdf_readSetDataWithOpt
??:0

0x0115474c
_ZN4moab8ReadHDF520find_sets_containingEiiiilRNS_5RangeE
??:0

0x0114f120
_ZN4moab8ReadHDF520find_sets_containingERNS_5RangeE
??:0

0x01147478
_ZN4moab8ReadHDF517load_file_partialEPKNS_11ReaderIface5IDTagEiiiRKNS_11FileOptionsE
??:0

0x01146728
_ZN4moab8ReadHDF59load_fileEPKcPKjRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE
??:0

0x010495f0
_ZN4moab4Core16serial_load_fileEPKcPKjRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE
??:0

0x011dfa44
_ZN4moab12ReadParallel9load_fileEPPKciPKjiRSsRSt6vectorIiSaIiEEbbSA_RKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoEibiiiiii
??:0

0x011de6b8
_ZN4moab12ReadParallel9load_fileEPPKciPKjRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE
??:0

0x01001a40
main
??:0

0x0166cdb8
generic_start_main
../csu/libc-start.c:231

0x0166d02c
__libc_start_main
../sysdeps/unix/sysv/linux/powerpc/libc-start.c:137

0xfffffffc
??
??:0

------------------------------------------------------------------------
+++ID TGID 706, Core 1, Thread 2 State 40000000, Sched: 08000000 

0x016a7070
clone
??:0

Bug #2

When reading with PARALLEL=BCAST_DELETE;PARTITION=PARALLEL_PARTITION;PARTITION_DISTRIBUTE;PARALLEL_RESOLVE_SHARED_ENTS;CPUTIME

one of the processors gets a copy of the entire target mesh (it could also be keeping the entire source mesh). Example with 4 procs (see the last line):

point location: wanted 31316 got 30885 locally, 431 remote, missing 0
point location: wanted 28794 got 27109 locally, 1685 remote, missing 0
point location: wanted 31722 got 31715 locally, 7 remote, missing 0
point location: wanted 134592 got 49339 locally, 85253 remote, missing 0

The 134592 points wanted by the last processor match the total vertex count of ptet_0.h5m in the Mesh sizes section above, i.e. that rank is querying for every vertex of the target mesh. Pretty sure BCAST_DELETE was not doing this before.
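A quick way to confirm which rank kept everything (a sketch, not part of mbcoupler_test; assumes a moab::Interface* holding the freshly read mesh) is to print per-rank entity counts right after the load:

    // Hypothetical check: per-rank entity counts after a BCAST_DELETE read.
    #include <cstdio>
    #include "moab/Core.hpp"
    #include "mpi.h"

    void print_local_counts(moab::Interface* mb) {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int nverts = 0, nelems = 0;
      mb->get_number_entities_by_dimension(0, 0, nverts);  // vertices
      mb->get_number_entities_by_dimension(0, 3, nelems);  // 3D elements

      std::printf("[%d] local vertices: %d, local 3D elements: %d\n",
                  rank, nverts, nelems);
    }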

Bug #3

Occurs with BCAST_DELETE and 128 procs, after the first file is read.

------------------------------------------------------------------------
+++ID TGID 115, Core 0, Thread 1 State 40000000, Sched: 48000000 

0x016a7070
H5HF_get_obj_len
??:0

------------------------------------------------------------------------
+++ID TGID 115, Core 0, Thread 5 State 00000000, Sched: 48000000 Running

0x010863f4
_ZN4moab14AdaptiveKDTree10build_treeERKNS_5RangeERjPKNS0_8SettingsE
/intrepid-fs0/users/acaceres/persistent/src/PM13-hdf1.8/src/AdaptiveKDTree.cpp:1367

0x01050424
_ZSt22__uninitialized_copy_aIN9__gnu_cxx17__normal_iteratorIPPN4moab7TagInfoESt6vectorIS4_SaIS4_EEEES9_S4_ET0_T_SB_SA_SaIT1_E
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_uninitialized.h:250

0x011a8bfc
_ZSt6fill_nIPdiiET_S1_T0_RKT1_
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:636

0x011a5c7c
_ZN9__gnu_cxx13new_allocatorISt13_Rb_tree_nodeISt4pairIKN4moab19abaqus_element_typeEjEEE8allocateEjPKv
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/ext/new_allocator.h:88

0x0119e4a0
_ZN4moab9WriteUtil24get_entity_list_pointersENS_5Range14const_iteratorES2_PPKjNS_14WriteUtilIface14EntityListTypeEPiPh
/intrepid-fs0/users/acaceres/persistent/src/PM13-hdf1.8/src/WriteUtil.cpp:917

0x0119be7c
_ZNSaIPN4moab7TagInfoEEC1ERKS2_
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/allocator.h:100

0x0119afd8
_ZNSaIiEC1ERKS_
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/allocator.h:99

0x0119a9c4
_ZSt8_DestroyIN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEEiEvT_S7_SaIT0_E
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_construct.h:182

0x011dfed0
_ZNSt6vectorIdSaIdEEC1IiEET_S3_RKS0_
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:256

0x011de6b8
_ZN4moab10ReadABAQUS17read_element_listEjj
/intrepid-fs0/users/acaceres/persistent/src/PM13-hdf1.8/src/io/ReadABAQUS.cpp:1303

0x01001a40
_ZNSaIPN4moab7TagInfoEED2Ev
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/gnu-linux/lib/gcc/powerpc-bgp-linux/4.1.2/../../../../include/c++/4.1.2/bits/allocator.h:105

0x0166cdb8
H5T_vlen_disk_setnull
H5Tvlen.c:0

0x0166d02c
H5T_vlen_reclaim_recurse
H5Tvlen.c:0

0xfffffffc
??
??:0

------------------------------------------------------------------------
+++ID TGID 115, Core 1, Thread 2 State 40000000, Sched: 08000000 

0x016a7070
H5HF_get_obj_len
??:0

------------------------------------------------------------------------
+++ID TGID 115, Core 2, Thread 3 State 40000000, Sched: 08000000 

0x016a7070
H5HF_get_obj_len
??:0

------------------------------------------------------------------------
+++ID TGID 115, Core 3, Thread 4 State 40000000, Sched: 08000000 

0x016a7070
H5HF_get_obj_len
??:0