Ticket #711 (accepted bug)
Parallel builds in MPICH2
| Reported by: | Joe Ratterman <jratt0@…> | Owned by: | goodell |
|---|---|---|---|
| Priority: | minor | Milestone: | mpich2-1.3 |
| Component: | mpich2 | Keywords: | |
| Cc: | smithbr@…, archerc@…, jratt@… |
Description (last modified by goodell)
I'm looking into building more of our system in parallel, but I've seen
lib*.a corruption in MPICH2 before. When reviewing build output with -j2, I
often see things like this:
$ pwd
/bgusr/jratt/bgp.01/bgp/comm/build/mpich2/dcmf-8aint/src/mpi/pt2pt
$ touch /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
$ make -s # NORMAL:
CC /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
AR cr ../../../lib/libmpich.cnk.a send.o
RANLIB libmpich.cnk.a
CC /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
AR cr ../../../lib/libmpich.cnk.la send.lo
$ touch /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
$ make -j -s # ODD:
CC /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
* AR cr ../../../lib/libmpich.cnk.a send.o*
* AR cr ../../../lib/libmpich.cnk.a send.o*
* RANLIB libmpich.cnk.a*
* RANLIB libmpich.cnk.a*
CC /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
AR cr ../../../lib/libmpich.cnk.la send.lo
In parallel, it appears to be running AR and RANLIB twice. Adding $$$$ to
the echo commands in the Makefile (shell $$, the pid), I see this:
$ touch /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
$ make -j -s
CC /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
* AR cr ../../../lib/libmpich.cnk.a send.o 12867*
* AR cr ../../../lib/libmpich.cnk.a send.o 12868*
* RANLIB libmpich.cnk.a 12871*
* RANLIB libmpich.cnk.a 12873*
CC /bgusr/jratt/bgp.01/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/send.c
AR cr ../../../lib/libmpich.cnk.la send.lo
It appears to be running the commands simultaneously in two processes. That
cannot be a good thing, since both processes write the same archive at once,
which would explain the lib*.a corruption. I think it comes from this sort
of section in the Makefile.in:
*all-redirect: .libstamp0 ../../../lib/lib${MPILIBNAME}.a*
*.libstamp0 ../../../lib/lib${MPILIBNAME}.a*: bsend.o bsend_init.o bufattach.o \
	buffree.o cancel.o greq_start.o greq_complete.o ibsend.o iprobe.o \
	irecv.o irsend.o isend.o issend.o probe.o recv.o recv_init.o \
	request_free.o request_get_status.o rsend.o rsend_init.o send.o \
	send_init.o sendrecv.o sendrecv_rep.o status_set_cancelled.o ssend.o \
	ssend_init.o start.o startall.o test.o test_cancelled.o testall.o \
	testany.o testsome.o wait.o waitall.o waitany.o waitsome.o \
	bsendutil.o mpir_request.o
	@if [ "x$(VERBOSE)" != "x1" ] ; then \
	    echo " AR cr ../../../lib/lib${MPILIBNAME}.a $?" ; \
	else \
	    echo ${AR} cr ../../../lib/lib${MPILIBNAME}.a $? ; \
	fi
	@${AR} cr ../../../lib/lib${MPILIBNAME}.a $?
	@if [ "x$(VERBOSE)" != "x1" ] ; then \
	    echo " RANLIB lib${MPILIBNAME}.a" ; \
	else \
	    echo ${RANLIB} ../../../lib/lib${MPILIBNAME}.a ; \
	fi
	@${RANLIB} ../../../lib/lib${MPILIBNAME}.a
	date > .libstamp0
all-redirect is the default rule in use. It depends on two targets,
.libstamp0 and the library, which don't depend on each other, so make
appears to build them in parallel. However, the two targets aren't
independent: they share one rule, make treats that as a separate rule per
target with the same commands, and each run of the commands updates both
files. Removing the library as a dependency throughout the Makefile removes
the double AR, and the build still works with dependency checking.
I think this simple Makefile illustrates the problem well:
all-redirect: stamp lib
stamp lib: source
	sleep 1
	echo $@ > stamp
	echo $@ > lib
Drop that in an (empty) dir and run:
touch source ; make -j ; head stamp lib
You can see that it updates stamp and lib twice simultaneously after
sleeping 1 sec:
sleep 1
sleep 1
echo stamp > stamp
echo lib > stamp
echo stamp > lib
echo lib > lib
That will clearly result in non-deterministic data in lib and stamp.
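A minimal sketch of the workaround described above (dropping the library as a target), applied to the toy Makefile. Only stamp is a target here, and its recipe updates lib as a side effect, so make -j has a single job to schedule. The file contents ("built") and the make.log name are placeholders for illustration, not anything from MPICH2, and printf is used so the recipe lines start with real tabs.

```shell
# Work in a scratch directory so nothing existing is touched.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Single-target variant of the toy Makefile: "lib" is no longer a target,
# it is only written as a side effect of the "stamp" recipe.
printf 'all-redirect: stamp\nstamp: source\n\tsleep 1\n\techo built > lib\n\tdate > stamp\n' > Makefile

touch source
# Capture the build output so it can be inspected afterwards.
make -j > make.log 2>&1
cat make.log
head stamp lib
```

With only one target carrying the recipe, each command should appear once in make.log. The trade-off is that make no longer tracks lib itself: if lib is deleted while stamp remains, nothing is rebuilt until stamp is removed or source changes.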
To fix parallel builds, I believe that steps need to be taken to remove
this sort of double-target rule. I understand that the Makefile.in files
are generated by "simplemake", but that is the end of my understanding. Is
there someone who still knows how it works and can help get parallel builds
working?
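As a starting point for finding the affected rules, here is a hedged sketch. The helper below (a hypothetical name, not part of MPICH2 or simplemake) lists lines in Makefile.in files where a .libstamp target shares a rule with a second target. It assumes the stamp files are all named .libstamp followed by digits, as in the fragment above, which may not hold everywhere in the tree.

```shell
# Hypothetical helper: print file:line:text for every rule in a Makefile.in
# under the given directory (default: current directory) whose target list
# starts with a .libstampN file and names at least one more target.
list_double_target_rules() {
    # /dev/null is passed as an extra file so grep always prints file names.
    find "${1:-.}" -name Makefile.in -exec \
        grep -nE '^\.libstamp[0-9]+[[:space:]]+[^:[:space:]][^:]*:' /dev/null {} +
}

# Example invocation from the top of a build tree:
#   list_double_target_rules .
```

A rule such as ".libstamp0: a.o" with a single target is not reported, so the output should shrink as rules are converted.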
Thanks,
Joe Ratterman
jratt@us.ibm.com
