Ticket #363 (closed bug: fixed)

Opened 10 months ago

Last modified 4 months ago

Re: MPI_IN_PLACE bug in Allgatherv in MPE's collchk

Reported by: kakollu@…
Owned by: chan
Priority: blocker
Milestone: mpich2-1.1.1
Component: mpich2
Keywords:
Cc:

Description (last modified by chan)


----- "Satyanarayana Kakollu" <kakollu@gmail.com> wrote:

> Hi Anthony,
> Is it safe to use MPI_ALLGATHERV with MPI_IN_PLACE in fortran?
>
> Should we just use the recv buffer as send buffer instead of
> MPI_IN_PLACE?
>
> Thanks,
> Satya
>
>
>
> On Tue, Jan 6, 2009 at 4:45 PM, Anthony Chan <chan@mcs.anl.gov>
> wrote:
>
> >
> > Hi Satyanarayana,
> >
> > The support of MPI_IN_PLACE for Allgatherv in CollChk library
> > is definitely in 1.0.6p1.  My simple test program didn't reveal
> > any problem.  If your program is small, could you send it to
> > me so I can check if the collchk library contains any bug ?
> >
> > Thanks,
> > A.Chan
> >
> > ----- "Anthony Chan" <chan@mcs.anl.gov> wrote:
> >
> > > ----- "Rajeev Thakur" <thakur@mcs.anl.gov> wrote:
> > >
> > > > > That might be a bug in the collchk library. If sendbuf is
> > > > > MPI_IN_PLACE in Allgatherv, the sendcount argument should be
> > > > > ignored.
> > > >
> > > > Rajeev
> > > >
> > > >
> > > >
> > > >   _____
> > > >
> > > > From: Satyanarayana Kakollu [mailto:kakollu@gmail.com]
> > > > Sent: Friday, December 19, 2008 9:53 AM
> > > > To: Anthony Chan
> > > > Cc: Rajeev Thakur
> > > > Subject: Re: Trouble with MPI_BCAST
> > > >
> > > >
> > > > Thank you Rajeev and Anthony,
> > > >
> > > > > -mpe=mpicheck gives the following message at an MPI_ALLGATHERV
> > > > > call in our code.
> > > > >
> > > > > ALLGATHERV (Rank 0) --> Inconsistent datatype signatures detected between local rank 0
> > > > >
> > > > > I am using the MPI_IN_PLACE option with send count set as '0'.
> > > > > Can this be the problem?
> > > >
> > > > Satya
> > > >
> > > > > On Wed, Dec 17, 2008 at 10:02 PM, Anthony Chan <chan@mcs.anl.gov>
> > > > > wrote:
> > > > >
> > > > > Or use "mpicc -mpe=mpicheck" or "mpif90 -mpe=mpicheck" as the linker.
> > > >
> > > > A.Chan
> > > >
> > > >
> > > > ----- "Rajeev Thakur" <thakur@mcs.anl.gov> wrote:
> > > >
> > > > > Satya,
> > > > >            Try linking with -lmpe_collchk. It will run MPE's
> > > > > collective call
> > > > > checker to see if there is any discrepancy in the parameters
> > > passed
> > > > > to
> > > > > MPI_Bcast. If that doesn't show any errors, try running a
> simple
> > > > test
> > > > > program that contains only the broadcast.
> > > > >
> > > > > Rajeev
> > > > >
> > > > >
> > > > >
> > > > >   _____
> > > > >
> > > > > From: Satyanarayana Kakollu [mailto:kakollu@gmail.com]
> > > > > Sent: Tuesday, December 16, 2008 5:31 PM
> > > > > To: Rajeev Thakur
> > > > > Subject: Trouble with MPI_BCAST
> > > > >
> > > > >
> > > > > Rajeev,
> > > > >
> > > > > > We are seeing that our code is getting stuck at MPI_BCAST on a
> > > > > > customer machine. The call is simple: all ranks use the same
> > > > > > buffer size and count, and we verified that the root is the
> > > > > > same on all ranks.
> > > > > >
> > > > > > The code works on our clusters, but not on the user's machine.
> > > > > > Here are the differences between our clusters and the user's
> > > > > > machine.
> > > > >
> > > > >
> > > > > > Our clusters          User's machine
> > > > > >
> > > > > > Multi-proc nodes      Single SMP node with 8 cores on two sockets.
> > > > > > CentOS 4, RHEL 4      RHEL 5 client version
> > > > > > mpich2 1.0.6p1        mpich2 1.0.6p1 (same)
> > > > >
> > > > > > We used gdb to localize the bug to MPI_BCAST: two of the 8
> > > > > > ranks do not get past the BCAST. If we replace the BCAST with
> > > > > > PT2PT communication, it runs well for 1000s of iterations.
> > > > >
> > > > > We linked our applications statically, on the RHEL 4 machine.
> > > > >
> > > > > Can you share your first thoughts about the issue.
> > > > >
> > > > > Thanks,
> > > > > Satya
> >

Attachments

Change History

Changed 10 months ago by Anthony Chan

  • id set to 363

Changed 10 months ago by chan

  • reporter changed from Anthony Chan <chan@…> to kakollu@…

Changed 10 months ago by chan

  • owner set to chan
  • description modified

Yes. It is safe to use MPI_IN_PLACE in Fortran with MPI_Allgatherv.

The bug is actually in MPE's Fortran-to-C wrapper interface, and I am currently working on a bugfix. In the meantime, you can modify <mpich2-1.0.8-install-dir>/etc/mpe_f77env.conf and comment out the line that references -lmpe_f2cmpi, as follows:

# MPE_F2CMPI_LIBS="-lmpe_f2cmpi"
MPE_F2CMPI_LIBS=${MPE_F2CMPI_LIBS:="-lfmpich"}

Then you can use the MPE collchk library, i.e. -mpe=mpicheck.
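
The workaround above depends on how the shell's `${VAR:=default}` expansion behaves: the default is assigned only when the variable is unset or empty. The following is a minimal sketch of that behavior, assuming mpe_f77env.conf is read by a POSIX shell (its syntax suggests so, but the ticket does not say). The helper function `resolve_f2cmpi_libs` is hypothetical and exists only for this illustration; it is not part of MPE.

```shell
#!/bin/sh
# Illustration only: reproduces the two states of the config fragment
# in a plain POSIX shell. Nothing here touches a real MPICH2 install.

# Hypothetical helper: applies the default-assignment line quoted in the
# ticket and prints the resulting value.
resolve_f2cmpi_libs() {
    MPE_F2CMPI_LIBS=${MPE_F2CMPI_LIBS:="-lfmpich"}
    echo "$MPE_F2CMPI_LIBS"
}

# Case 1: the first line is still active, so the variable is already set
# and the default on the second line is not used.
MPE_F2CMPI_LIBS="-lmpe_f2cmpi"
original=$(resolve_f2cmpi_libs)

# Case 2: the first line is commented out, so the variable is unset and
# the second line falls back to -lfmpich.
unset MPE_F2CMPI_LIBS
workaround=$(resolve_f2cmpi_libs)

echo "original:   $original"
echo "workaround: $workaround"
```

With the first line commented out, the wrapper library -lmpe_f2cmpi is no longer selected and -lfmpich is used instead, which is what lets -mpe=mpicheck avoid the buggy Fortran-to-C wrapper until the fix lands.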

Changed 8 months ago by chan

  • milestone set to mpich2-1.1b2

Added milestone to make sure the fix will be done by then.

Changed 6 months ago by chan

  • priority changed from major to blocker

Changed 6 months ago by chan

  • milestone changed from mpich2-1.1rc1 to mpich2-1.1

This isn't likely to happen for 1.1rc1. Move it to 1.1.

Changed 5 months ago by chan

  • milestone changed from mpich2-1.1 to mpich2-1.1.1

Changed 4 months ago by chan

  • status changed from new to closed
  • resolution set to fixed

The libmpe_f2cmpi.a now works with inplacef.f and statuses.f, as well as FLASH3 and NPB-MZ (tested on P). With this fix, mpe's configure no longer needs a cross-spec file for cross compilers: all AC_TRY_RUNs are replaced with runtime tests at initialization time. Resolving...
