Problems with wrf.exe while trying to execute on multiple nodes

This post was from a previous version of the WRF&MPAS-A Support Forum. Please do not add new replies here and if you would like the thread moved out of the Historical / Archive section then contact us, making sure to include the link of the thread to be moved.

j_nava

New member
I'm trying to run ./wrf.exe on an HPC cluster using 13 nodes to reduce my computation time. To do that, I'm using the following command:

Code:
mpirun -machinefile hostfile.txt -np 150 ./wrf.exe

The hostfile.txt file simply contains the names of the 12 nodes.
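For reference, here is a minimal sketch of such a host file. It assumes MPICH's Hydra launcher, since the error messages come from Hydra's `mpiexec` and `pmiserv`. Hydra accepts one `hostname:count` entry per line. Open MPI uses `hostname slots=N` instead. The node names `node01` to `node12` and the 12 cores per node are hypothetical placeholders. Summing the counts is a quick way to compare the available slots against the `-np` value:

```shell
#!/bin/sh
# Sketch: build an MPICH (Hydra) host file with one "hostname:count" line per node.
# ASSUMPTION: node names node01..node12 and 12 cores per node are hypothetical
# placeholders; replace them with your cluster's real hostnames and core counts.
CORES_PER_NODE=12
NODES=12

: > hostfile.txt                       # start from an empty file
i=1
while [ "$i" -le "$NODES" ]; do
    printf 'node%02d:%d\n' "$i" "$CORES_PER_NODE" >> hostfile.txt
    i=$((i + 1))
done

# Total MPI slots listed in the file; compare this with the -np value.
awk -F: '{ total += $2 } END { print "total slots:", total }' hostfile.txt
```

With these placeholder values the script reports 144 slots, so `-np 150` would request more ranks than the file lists.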

When I try to run the previous command, I get the following error:

Code:
control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@bright90] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@bright90] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@bright90] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

It's really weird, since another team is using a similar command to run the RegCM model on the same HPC cluster.


I would really appreciate any insights into solving this problem!
 

kwerner

Administrator
Staff member
Hi,
Unfortunately, this issue is probably related to your particular environment and not to the WRF model, since all of the failure messages come from MPI. I suggest asking a systems administrator at your institution for support; hopefully they can help you resolve the problem.
 

j_nava

New member
Hi kwerner, I solved my first problem by reinstalling mpirun and restarting the nodes, and now the command works.

However, when I run wrf.exe now, it apparently uses the cluster's master node alongside the other nodes. Is there a way to prevent this, so that the processes run only on the other nodes?

I would really appreciate your help.
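One common approach is to make sure the host file passed to `mpirun` lists only the compute nodes. This is sketched here under some assumptions. With Hydra, ranks are started on the hosts in that file, although the lightweight `mpiexec` launcher itself still runs on the node where you type the command. The file name `all_nodes.txt` and the node names are hypothetical. Treating `bright90` (the host that ran `mpiexec` in the error log) as the master node is also an assumption:

```shell
#!/bin/sh
# Sketch: write a host file that excludes the master (head) node so that no
# wrf.exe ranks are placed on it.
# ASSUMPTIONS: all_nodes.txt is a hypothetical list of every node in
# "hostname:count" form, and the master node is "bright90"; adjust both.
MASTER=${MASTER:-bright90}

# Hypothetical example list; on a real cluster, use your own node list.
cat > all_nodes.txt <<'EOF'
bright90:12
node01:12
node02:12
EOF

# Keep only the entries whose hostname (text before ":") is not the master.
awk -F: -v master="$MASTER" '$1 != master' all_nodes.txt > hostfile.txt
cat hostfile.txt
```

Then launch as before, for example `mpirun -machinefile hostfile.txt -np 24 ./wrf.exe`. Keep `-np` no larger than the total count in the file, which is 24 for this example list.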
 

Ming Chen

Moderator
Staff member
Could you try the following command?

Code:
mpirun -np 12 ./wrf.exe

If that doesn't work, you will probably need to add a machine file.

If neither works, I suspect this is a machine-related issue.
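Either way, you can check where the ranks actually land by launching a trivial program such as `hostname` with the same `mpirun` options and counting the results per host. The helper name `ranks_per_host` is hypothetical, and running the commented command requires a working MPI installation:

```shell
#!/bin/sh
# Sketch: summarize how many MPI ranks ran on each host.
# Reads one hostname per line on stdin and prints "hostname count" lines.
# ranks_per_host is a hypothetical helper name, not part of MPI or WRF.
ranks_per_host() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (needs a working MPI installation and your real host file):
#   mpirun -machinefile hostfile.txt -np 12 hostname | ranks_per_host
# If the master node's name appears in the output, ranks were placed on it.
```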
 