mpirun is not working on the cluster

Hachi-ait

New member
Dear all,

I have run into a problem with mpirun. I compiled WRF v4.2 successfully on my cluster and tested wrf.exe; it runs, but the speed is too slow, about 1:3 (it takes 2 hours to simulate 6 hours).

My cluster has 2 nodes, 8 cores each, and 6 GB of RAM per node. I have tried
> mpirun -np 8 ./wrf.exe
or
> mpirun -np 16 ./wrf.exe
In both cases the run shows only 1 processor being used (as in the image below)
[attached screenshot of the wrf.exe output]

And an error comes up with the message:
mpirun noticed that process rank 0 with PID 0 on node hpc exited on signal 9 (Killed)

I understand this error means the system does not have enough memory, but when I run wrf.exe without mpirun, it uses only 2 GB out of 6 GB.
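If it helps, I believe an out-of-memory kill can be confirmed from the kernel log and the current memory usage, e.g.:

> dmesg | grep -i "out of memory"
> free -h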
Do you have any suggestions for me to overcome this problem?

Thanks
HC
 
Hi HC,
(1) I suppose you built WRF in dmpar mode; please let me know if I am wrong.
(2) Please make sure the mpirun command is the correct one for the MPI installed on your machine.
(3) I also suspect that this is possibly a communication issue between the nodes/processors of your machine.

Can you try to run a simple MPI test job? I hope that will give you some clues.
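For example (the hostnames and slot counts below are only placeholders for your two nodes), launching a plain command such as hostname through mpirun with a hostfile should show whether both nodes are reached:

> cat hosts
node1 slots=8
node2 slots=8
> mpirun --hostfile hosts -np 16 hostname

If all 16 lines report only the frontend's name, mpirun is not launching anything on the second node.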
 
Dear Ming Chen,
Thank you very much for your reply.
1) Yes, I built WRF in dmpar mode, option 34 (dmpar with the gfortran compiler).
2) My system has both OpenMPI and MPICH3; mpirun links to OpenMPI by default, and I think the command is correct (see the checks below).
3) Yes, I think you are right. I'm trying to find a solution for it.
When I tried mpirun -np 4 ./wrf.exe, it can run, but it consumes almost all of the RAM on the frontend node.
I think my command is not launching processes on the other nodes, which is why the run with 8 cores exits with mpirun on signal 9.
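To double-check which MPI the mpirun command and wrf.exe actually come from, I believe something like the following should show whether they belong to the same installation (assuming wrf.exe is in the current run directory):

> which mpirun
> mpirun --version
> ldd ./wrf.exe | grep -i mpi

If mpirun reports Open MPI while wrf.exe links against the MPICH libraries (or the other way around), that would explain the mismatch.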

May I ask another question: why, when I set -np 4, does it still print "starting wrf task 0 of 1", as in the picture below? I would expect it to print starting wrf task 0 of 4.
[attached screenshot: "starting wrf task 0 of 1"]

Thanks
HC
 
HC,
This is what I said in my previous answer: this is an issue related to MPI. It seems that MPI is not correctly activated. Possible reasons are:

(1) the code was built with an MPI that is not the one being used by the mpirun command.
(2) something wrong with the MPI installation.
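One quick way to test reason (1), assuming WRF was compiled with the mpif90 that is currently in your PATH, is to launch wrf.exe with the mpirun that sits in the same directory as that mpif90:

> MPIDIR=$(dirname $(which mpif90))
> $MPIDIR/mpirun -np 4 ./wrf.exe

If this run prints "starting wrf task 0 of 4", the mpirun used before belongs to the other MPI installation.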
 
Thank you Ming Chen,
I finally fixed the problem; it required mounting the frontend's /home directory onto the other nodes' /home directories.
Now it can run on three nodes and divides the work among the processors, e.g. task 0 of 12, task 1 of 12.
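Roughly, the fix amounts to exporting the frontend's /home over NFS and mounting it on each compute node; the hostnames and export options below are only an example of such a setup:

> cat /etc/exports                     # on the frontend (NFS server)
/home  node1(rw,sync,no_subtree_check)  node2(rw,sync,no_subtree_check)
> exportfs -ra                         # reload the export table
> mount -t nfs frontend:/home /home    # run on each compute node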

However, the simulation speed does not seem to improve much compared with the case running on 1 processor only (task 0 of 1). When I check rsl.out.0000 and the rsl.out.000n files of the other tasks, the work is only divided for the beginning and ending steps (reading input, writing output, etc.); the main time-step work (e.g. Timing for main: time 2017-12-25_00:00:10 on domain 3: 8.82681 elapsed seconds) is not divided, and these lines are recorded only in rsl.out.0000.
Is there any way to divide the main computation among the other processors?
I compiled WRF on the frontend node only; do I need to do so on the other slave nodes as well?
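(A related check, while the job is running, is whether wrf.exe processes actually show up on every node; the node name below is only a placeholder:)

> ssh node2 'ps -ef | grep [w]rf.exe'    # the [w] keeps grep from matching itself

If only the frontend shows wrf.exe processes, the ranks are still all running on one machine.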

Thank you very much
 
HC,
I suppose you are talking about computing efficiency, is that right?
Note that communication between processors takes time, and the improvement is usually more significant for cases with a large number of grid points.
 