Problems with wrf.exe while trying to execute on multiple nodes

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

j_nava

New member
I'm trying to run wrf.exe on an HPC cluster using 13 nodes in order to reduce my computation time. To do that, I'm using the following command:

Code:
mpirun -machinefile hostfile.txt -np 150 ./wrf.exe

The hostfile.txt file simply contains the names of the 12 nodes, one per line.

When I try to run the previous command, I get the following error:

Code:
control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@bright90] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@bright90] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@bright90] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

It's really weird, since a similar command is being used by another team that is running the RegCM model on the same HPC cluster.

I would really appreciate any insights to solve this problem!
 
Hi,
Unfortunately, this issue is probably related to your particular environment and does not have anything to do with the WRF model, as all of the failure messages are specific to MPI. I suggest getting support from a systems administrator at your institution; hopefully they can help you resolve the problem.
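
One way to separate an MPI problem from a WRF problem is to check the machine file and then launch a trivial program (such as hostname) across the nodes before trying wrf.exe. The snippet below is only a sketch under some assumptions: the node names are made up, the helper count_nodes is hypothetical, and it assumes the machine file has one host per line with blank lines and # comments ignored. It prints the launch command rather than running it.

```shell
#!/bin/sh
# Count usable entries in an MPI machine file.
# Assumption: one host per line; blank lines and lines starting with # are skipped.
count_nodes() {
    grep -cv -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Placeholder machine file for the demo; node01..node03 are made-up names.
demo_hostfile=$(mktemp)
printf 'node01\nnode02\n\n# spare\nnode03\n' > "$demo_hostfile"

nodes=$(count_nodes "$demo_hostfile")
echo "nodes listed: $nodes"

# Command to run by hand: one trivial process per node, no WRF involved.
echo "mpirun -machinefile $demo_hostfile -np $nodes hostname"
```

If the printed mpirun command fails with the same process manager errors when you run it yourself, the problem is in the MPI installation or the node setup rather than in WRF.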
 
Hi kwerner, I was able to solve my first problem by reinstalling mpirun and restarting the nodes, and now I can get the command to work.

However, now when I run wrf.exe, it apparently uses the master node of the cluster alongside the other nodes. I was wondering if there's a way to prevent this, so that the processes run only on the other nodes.

Would really appreciate your help.
 
I wonder whether you can try the command:

mpirun -np 12 ./wrf.exe

If that doesn't work, then you will probably need to add a machine file.

If neither approach works, I suspect this is a machine-related issue.
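
If you do go the machine file route, one possible way to keep ranks off the master node is to pass mpirun a machine file that does not list it. This is just a sketch, not something tested on your cluster: the helper exclude_host and the host names (master, node01, node02) are hypothetical, and the "host:slots" line form is an assumption based on the MPICH/Hydra machine file format. It prints the launch command instead of running it.

```shell
#!/bin/sh
# Write a copy of a machine file with one host (e.g. the master node) removed.
# Usage: exclude_host <host> <input-file> <output-file>
exclude_host() {
    # Compare the text before any ":" so both "host" and "host:slots" lines match.
    awk -F: -v h="$1" '$1 != h' "$2" > "$3"
}

# Demo with made-up host names.
all_nodes=$(mktemp)
compute_nodes=$(mktemp)
printf 'master\nnode01\nnode02:12\n' > "$all_nodes"

exclude_host master "$all_nodes" "$compute_nodes"
cat "$compute_nodes"

# Command to run by hand with the filtered file (adjust -np to your core count).
echo "mpirun -machinefile $compute_nodes -np 2 ./wrf.exe"
```

Note that the mpirun launcher itself still runs on the node where you type the command; the filtered file only controls where the wrf.exe processes are placed. Your scheduler or systems administrator may also have a preferred way of doing this.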