Ticket #355 (new bug)

Opened 10 months ago

Last modified 3 weeks ago

RE: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers

Reported by: "Jayesh Krishna" <jayesh@…> Owned by: jayesh
Priority: major Milestone: mpich2-1.3
Component: mpich2 Keywords:
Cc: gucigu@…

Description (last modified by jayesh) (diff)

Hi,

 From the error codes in the hostname tests it looks like Computer1 (Where
the shared network folder resides) is unable to handle the number of
connections to it.

############ Error code desc from MS ############

ERROR_REQ_NOT_ACCEP (71 0x47) : No more connections can be made to this
remote computer at this time because there are already as many connections
as the computer can accept.

############ Error code desc from MS ############

 We should retry (but we do not) in this case.
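
A minimal sketch of what such a retry could look like, written in Python for illustration only. It is not MPICH2's code: `map_drive` is a hypothetical callable standing in for whatever performs the mapping, assumed to return a Win32 error code (0 on success). Only error 71 is retried, since other codes are unlikely to clear up by waiting.

```python
import time

# Win32 code quoted above: the remote computer has no free connections.
ERROR_REQ_NOT_ACCEP = 71


def map_with_retry(map_drive, retries=5, delay_s=0.5, sleep=time.sleep):
    """Call map_drive() until it succeeds or the retry budget is used up.

    map_drive is a hypothetical callable returning a Win32 error code
    (0 on success). Any code other than 71 is returned immediately.
    """
    err = map_drive()
    attempt = 0
    while err == ERROR_REQ_NOT_ACCEP and attempt < retries:
        # Exponential backoff: 0.5 s, 1 s, 2 s, ... with the defaults.
        sleep(delay_s * (2 ** attempt))
        attempt += 1
        err = map_drive()
    return err
```

Retrying can only help if connections on the server are released in the meantime; it does not raise the server's connection limit.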

 Can you verify that the existing network mapped drive connections are
cleaned up on all the machines (type "net use" in a command prompt on each
machine to view the existing network mapped connections)?

Regards,

Jayesh


  _____

From: Tina Tina [mailto:gucigu@gmail.com]
Sent: Tuesday, January 13, 2009 3:21 PM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers



Dear Community!

I started testing with the example cpi.exe program (so the problem is not
in my program). I ran the following command for each computer, X=(1..8), and
everything worked OK:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Computer1\MPI$ -wdir
X:\CPI\ -hosts 1 ComputerX -machinefile "C:\Program
Files\MPICH2\bin\machines.txt" X:\CPI\cpi.exe

Then I ran the following command:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Computer1\MPI$ -wdir
X:\CPI\ -n X -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
X:\CPI\cpi.exe

Note: I also changed the machines.txt file as you suggested (adding :1).

The result was the following: for X up to 5 it worked OK (I did only one
test run). But when I tested with X=6 (i.e., on 6 computers), I got the
following error:

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer2' failed, error
3 - The system cannot find the path specified.

On next run with X=6:

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer2' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer6' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer3' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer5' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer4' failed, error
3 - The system cannot find the path specified.

On next run with X=6:

I got the same error as on the first run.

These errors kept repeating. Most of the time the error involved only one
computer, and in most cases it was the second computer in the machinefile
list, but not necessarily. When there was more than one launch failed error
(as in the second case), the order could also be different. In 20 tries not
one was successful.

Then, just for kicks, I tried with X=8. I got the same errors, with a random
number of launch failed errors and a more or less random ComputerX
reporting them.

But every now and then I got one of the following errors (after the list of
launch failed errors):
1)
unable to post a write for the next command,
sock error: generic socket failure, error stack:
MPIDU_Sock_post_writev(1768): An established connection was aborted by the
software in your host machine. (errno 10053)
unable to post a write of the close command to tear down the job tree as
part of the abort process.
unable to post an abort command.
2)
unable to post a read for the next command header,
sock error: generic socket failure, error stack:
MPIDU_Sock_post_readv(1656): An existing connection was forcibly closed by
the remote host. (errno 10054)
unable to post a read for the next command on left context.
3)
unable to read the cmd header on the left context, socket connection
closed.


Hope this info helps

Regards

P.S.: I tried a couple of runs with X=5 and got mixed results, on some
runs it worked ok on some it did not. Basically the same as with my
program. So I would still say, as the number of computers increases, the
problem gets worse.

P.P.S.: Almost forgot to test the hostname. Here are the results of two
runs.

"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI$ -wdir
X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
hostname
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)

*********** Warning ************
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)

*********** Warning ************
computer4
computer1
computer8
computer2
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)

*********** Warning ************
computer7
computer5
computer3
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)

*********** Warning ************
computer6

"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI$ -wdir
X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
hostname
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)

*********** Warning ************
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)

*********** Warning ************
computer3
computer7
computer5
computer1
computer4
computer8
computer2
computer6



2009/1/13 Jayesh Krishna <jayesh@mcs.anl.gov>


Hi,
# Do you get any error message related to mapping network drives when you
ran your job ?
 Please provide us with the command+output of your MPI job (Copy-paste
your complete mpiexec command and its output in your email).

# Can you run a command like (Note that I have removed "-noprompt"
option),

        mpiexec -map x:\\computer1\MPI -wdir x:\ -n 8 -machinefile
testallnamesmf.txt hostname

  with the following contents in the machinefile (testallnamesmf.txt -
contains all the computer/host names - Note that I specify that only 1 MPI
process be launched on each host using "hostname:1" syntax),

computer1:1 -ifhn 192.168.1.1
computer2:1 -ifhn 192.168.1.2
...
computer8:1 -ifhn 192.168.1.8

# Does your program fail consistently for certain computers ? Try running
a simple job (mpiexec -map x:\\computer1\MPI -wdir x:\ -n 1 -machinefile
testmf.txt hostname) with only specifying 1 computer/host at a time.

# Try removing "-noprompt" from the mpiexec command and see if mpiexec
prompts you for anything (password, inputs etc).
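
A machinefile in the "hostname:1" form shown above can be generated rather than typed by hand. The sketch below is only an illustration: it assumes hosts are named computer1..computerN and that host N uses interface address 192.168.1.N, as in the examples in this thread. The function name and parameters are hypothetical.

```python
def machinefile_lines(count, host_prefix="computer", subnet="192.168.1.",
                      procs_per_host=1):
    """Build machinefile lines such as 'computer1:1 -ifhn 192.168.1.1'.

    Assumes host i is named <host_prefix>i and uses address <subnet>i,
    which matches the examples in this thread but not every network.
    """
    return [
        f"{host_prefix}{i}:{procs_per_host} -ifhn {subnet}{i}"
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    # Print the eight-host file used in the tests above.
    print("\n".join(machinefile_lines(8)))
```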

Regards,
Jayesh

  _____

From: mpich-discuss-bounces@mcs.anl.gov
[mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Tina Tina
Sent: Tuesday, January 13, 2009 12:01 PM
To: mpich-discuss@mcs.anl.gov
Subject: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers


Dear Community!

I am using the latest version of MPICH2 for Windows (the problem also
occurs on 1.0.8). I have 8 computers connected over a gigabit switch. I have
written a program that uses MPI for parallelization. When I run the program
on one or two computers, everything works OK (let's say most of the time).
When I run it on 4 computers, sometimes it works and sometimes it does
not. The error that I get is:
launch failed: CreateProcess(X:\mpi_program.exe) on 'computerX' failed,
error 3 - The system cannot find the path specified.

Most times I get this error for one computer in machine list, but it can
also happen for 2 or more computers etc.

If I increase the number of computers above 4, I get this error almost every
time. With 6 or more it happens every time. It looks like the higher the
number, the worse it gets. I would really like to make this work. Has
anybody had a similar experience, and what was the solution?

It looks like the computer tries to start the program before the mapped
drive is operational. Is there any way to increase this delay?
Or are there any other settings that need to be set?

There are some other errors that I occasionally get, but this is the most
important one (for now).

Systems:
Windows XP SP3 (on all computers)
Installed latest MPICH2
Connection giga-bit NICs (local network) over switch

Example of run command: "C:\Program Files\MPICH2\bin\mpiexec.exe" -map
X:\\computer1\MPI -wdir X:\ -n 4 -machinefile "C:\Program
Files\MPICH2\bin\machines.txt" -noprompt X:\mpi_program.exe

\\computer1\MPI is a shared folder on computer1 from which the command is
run

machines.txt consists of following lines:
computer1 -ifhn 192.168.1.1
computer2 -ifhn 192.168.1.2
...
computer8 -ifhn 192.168.1.8

These are the NICs I would like MPI to use for communication. The
order of computers in machines.txt is irrelevant (the error happens with
every combination).

Regards



Attachments

part0001.html (12.9 KB) - added by Jayesh Krishna 10 months ago.
Added by email2trac
part0001.2.html (15.7 KB) - added by Jayesh Krishna 10 months ago.
Added by email2trac
part0001.3.html (21.8 KB) - added by Jayesh Krishna 10 months ago.
Added by email2trac

Change History

Changed 10 months ago by Jayesh Krishna

Added by email2trac

Changed 10 months ago by Jayesh Krishna

  • id set to 355

This message has 1 attachment(s)

Changed 10 months ago by jayesh

  • owner set to jayesh

Changed 10 months ago by Jayesh Krishna

Added by email2trac

Changed 10 months ago by Jayesh Krishna

Hi,
 As per the Microsoft support article below, there is a limit of 5 incoming
connections for Win XP Home and a limit of 10 incoming connections for Win
XP Pro.

http://support.microsoft.com/kb/314882

 Are you using Win XP Home ?

Regards,
Jayesh

  _____

From: Tina Tina [mailto:gucigu@gmail.com]
Sent: Tuesday, January 13, 2009 3:46 PM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers


Hi,

Yes, all connections were cleaned up. But I already have one share,
\\computer1\share1$, that is shared among the rest of the computers. Is it
possible that Windows XP SP3 has some limit on the number of
active shared connections? If so, do you know how to raise it?

In any case, if this is true then it would be a big limitation on using
MPICH2 on Windows (if there is no way to raise this limit). Tomorrow I
will do some testing regarding this and let you know what I find out. Of
course, if you have a solution to this problem, do not hesitate to
tell me. ;-)

In any case thanks for your support.

Regards



Changed 10 months ago by Jayesh Krishna

Hi,
 Try installing Windows Server OSes on your machines. Windows XP * is only
meant for client machines.
 Your program could run slower on 2 machines than on 4 depending on the
nature of your code and the configuration of the machines (How much memory
does your code use? How much RAM and hard disk space is available on each
machine? Try a program like "Process Explorer",
http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx, to get
more information).
 Are you using threads in your MPI program ? Can you provide us with a
test case (the smaller the better) that shows the problem mentioned below
?

Regards,
Jayesh

  _____

From: Tina Tina [mailto:gucigu@gmail.com]
Sent: Thursday, January 15, 2009 8:53 AM
To: Jayesh Krishna
Subject: Re: [mpich2-maint] #355: RE: [mpich-discuss] MPICH2 1.1a2 -
problems with more than 4 computers


Hi!

I did some testing and have found the following results.

If I try to run on more than 5 hosts (let's say 6), this does not work,
because IPC$ takes 6 connections and there are only 4 left for the share
itself. I do not know how, or even whether, it is possible to turn off IPC$.
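
The arithmetic behind this observation can be sketched as follows. The limits of 10 (XP Pro) and 5 (XP Home) come from the Microsoft article cited in this thread. The assumption that every host holds exactly one IPC$ connection plus one connection for the share is inferred from the numbers reported here and may not match how Windows actually counts connections. The function names are hypothetical.

```python
XP_PRO_INBOUND_LIMIT = 10   # per the Microsoft article cited in this thread
XP_HOME_INBOUND_LIMIT = 5


def max_hosts(limit, connections_per_host=2):
    """Largest host count that fits if each host needs IPC$ + the share."""
    return limit // connections_per_host


def share_slots_left(limit, hosts):
    """Connections left for the share after each host takes one for IPC$."""
    return max(limit - hosts, 0)


if __name__ == "__main__":
    print(max_hosts(XP_PRO_INBOUND_LIMIT))            # hosts that fit on XP Pro
    print(share_slots_left(XP_PRO_INBOUND_LIMIT, 6))  # share slots with 6 hosts
```

Under these assumptions 6 hosts leave only 4 slots for 6 share mappings, which is consistent with runs on 5 hosts sometimes working and runs on 6 hosts always failing.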

I have dual-processor machines, so I decided to try computer1:2
configurations. This way I would be able to run 8 instances on 4 machines.
But I have the following problem. Very often I get this error (or something
very similar):
[01:2644].....ERROR:result command received but the wait_list is empty.
[01:2644]...ERROR:unable to handle the command: "cmd=result src=1 dest=1
tag=12 cmd_tag=3 cmd_orig=dbput ctx_key=0 result=DBS_SUCCESS "
[01:3424].....ERROR:result command received but the wait_list is empty.
[01:3424]...ERROR:unable to handle the command: "cmd=result src=1 dest=1
tag=13 cmd_tag=3 cmd_orig=dbput ctx_key=1 result=DBS_SUCCESS "
[01:3424]..ERROR:sock_op_close returned while unknown context is in state:
SMPD_IDLE
[01:2644]..ERROR:sock_op_close returned while unknown context is in state:
SMPD_IDLE

or

[01:2872].....ERROR:result command received but the wait_list is empty.
[01:2872]...ERROR:unable to handle the command: "cmd=result src=1 dest=1
tag=30 cmd_tag=8 cmd_orig=dbget ctx_key=0 value="port=1049
description=192.168.1.11 if name=192.168.1.11 " result=DBS_SUCCESS "
[01:2872]..ERROR:sock_op_close returned while unknown context is in state:
SMPD_IDLE

or

...

In every case I have to terminate it by hand (Ctrl + c)!

And when the run is successful, the time needed is much greater than if I
ran it on separate machines.
For example:
if I run n=4 on 4 machines the time is about 20 seconds, but if I run n=4
on 2 machines (with 2 processes per node) the time is about 30 seconds.
So about 50% or more slower. I would have assumed that this would be faster,
not slower.

Do you have any explanation for this and for the errors above?

Regards.


2009/1/13 Tina Tina <gucigu@gmail.com>


Hi!

I use Windows XP Pro (on all machines). This would mean that I would need
to reduce/kill all connections on the main computer to test this. Is there
a way to remove this limit? I noticed a patch for tcpip.sys, but I am not
sure if this is the same connection limit (probably not).
Is there a possibility in the MPICH config to set a "mapping" delay for
such cases? I would not mind waiting a little longer for the mappings to be
established (so that the 10 connection limit would not be
broken). For example, every x ms a new map would be initiated. Would this
solve the problem?


Regards



Changed 10 months ago by Jayesh Krishna

Added by email2trac

Changed 8 months ago by balaji

  • cc gucigu@… added
  • milestone set to mpich2-1.1rc1

Changed 7 months ago by jayesh

  • milestone changed from mpich2-1.1rc1 to mpich2-1.1.1

Moving this ticket to MPICH2 1.1.1 since we are too close to the rc1 release.

Regards, Jayesh

Changed 5 months ago by jayesh

  • milestone changed from mpich2-1.1.1 to mpich2-1.1.2

We need to add code to retry if mapping a shared drive fails. Since we are too close to 1.1.1, moving this ticket to 1.1.2

-Jayesh

Changed 3 months ago by balaji

  • milestone changed from mpich2-1.1.2 to mpich2-1.2

Milestone mpich2-1.1.2 deleted

Changed 3 weeks ago by jayesh

  • description modified (diff)
  • milestone changed from mpich2-1.2.1 to mpich2-1.3
