
"starting wrf task 0 of 1" instead of "starting wrf task 0 of 4"


mcanonic

New member
Hi all,
I'm new to this field and I'm helping a colleague run this model in the cloud. I have created an Ubuntu virtual machine (VM) and followed the steps described here to install and configure all the software.

In a previous VM everything worked fine. When we ran:
mpirun -np 4 wrf.exe
the output was:
starting wrf task 0 of 4
starting wrf task 1 of 4
starting wrf task 3 of 4
starting wrf task 2 of 4

but in the new VM the same command produces:
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1

As I mentioned before, I don't have much experience. I've looked at the output files such as rsl.error.0000, but I did not find any hints.

Could you suggest where I am going wrong?

The VM that I used has 16 vCPUs.

Thanks,
Massimo
 
Massimo,
It seems that only one processor is activated to run your case. My questions are:
How did you compile WRF (in serial, smpar, or dmpar mode)?
Did the case run to the end?
Are there any other error messages in your rsl files or log file?
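
A quick way to check how an existing executable was built is to look at the saved build configuration and at the libraries it links. This is a sketch assuming a Linux shell in the top-level WRF directory; the DMPARALLEL entry reflects a typical configure.wrf and may differ between versions:

Code:
grep -i dmparallel configure.wrf   # a dmpar build usually records DMPARALLEL = 1
ldd main/wrf.exe | grep -i mpi     # a dmpar wrf.exe should link an MPI library

If the ldd line prints nothing, the executable was most likely built in serial mode, and mpirun -np 4 will just launch four independent copies of it.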
 

Hi Ming,
I followed the instructions available here:
https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compilation_tutorial.php#STEP2
and when I configured WRF, I selected option 34:
32. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc)

I re-ran ./configure (where I selected 34 and then 1),
and then ran the command ./compile em_real >& log.compile,
but now for some reason I get this error:
---> Problems building executables, look for errors in the build log <---

I'm attaching the log.compile file; maybe you can help me.

Thanks again,
M
View attachment log.compile
 
Massimo,
I guess you didn't type ./clean -a before recompiling the code. Please let me know if I am wrong.
./clean -a removes all previously compiled code. Without it, the old and new settings will be mixed and cause the compilation to fail.
Please type ./clean -a, then recompile and save the log file for me to take a look.
By the way, are you working on the Amazon cloud?
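
Putting these steps together, the rebuild sequence from the top-level WRF directory looks like this (option numbers as in the GNU menu quoted above):

Code:
./clean -a                        # remove all previously compiled code and old settings
./configure                       # select 34 (GNU dmpar), then nesting option 1
./compile em_real >& log.compile  # recompile; inspect log.compile if it fails
ls -l main/*.exe                  # on success, wrf.exe and real.exe appear here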
 
Thanks! After running ./clean -a, the executables are back.
We use the Chameleon project, which uses OpenStack as its cloud platform. If you have any questions about cloud computing, we can discuss them privately.
With the executables in place, my colleague runs this:

Code:
$:~/WRF/test/em_real$ mpirun -np 4 real.exe
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1


By running this command, I got some errors:
Code:
$:~/WRF/test/em_real$ mpirun -np 4 wrf.exe&
[1] 19219
$:~/WRF/test/em_real$  
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1

Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51791,1],2]
  Exit code:    1
--------------------------------------------------------------------------
In the error file, what we got is this:

Code:
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  module_date_time.G  LINE:     910
WRFU_TimeSet() in wrf_atotime() FAILED   Routine returned error code =           -1
-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

Any suggestions?

Thanks,
M
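
One possible sanity check at this point, assuming the netCDF utilities are installed: the four "task 0 of 1" real.exe processes ran as independent serial jobs, so they may all have written the same wrfinput/wrfbdy files at once, and a WRFU_TimeSet() failure often points to a date string that could not be parsed. The headers of the input files can be inspected like this:

Code:
ncdump -h wrfinput_d01 | grep -i date   # START_DATE attributes should be well-formed
ncdump -h wrfbdy_d01 | head             # the header should dump without errors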
 
(1) With mpirun -np 4, you should have four rsl files --- is this what you have?
(2) By saying "Primary job terminated normally", do you mean the case ran to the end? If not, how long did it integrate before it crashed?
(3) Which version of WRF are you using? Please send me your namelist.input to take a look.
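
For reference, a 4-process dmpar run normally writes one rsl.out/rsl.error pair per MPI rank, so a quick check from the run directory is:

Code:
ls rsl.out.* rsl.error.*   # for -np 4, expect rsl.*.0000 through rsl.*.0003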
 

1) It creates just one file.
2) That message is the output I get from executing the command "mpirun -np 4 wrf.exe &".
3) WRF Model Version 4.2.
View attachment namelist.output.txt
I'm attaching the namelist.input file.

I can provide access to the VM to you (or to whoever wants to take a look); I just need the public key.

Thanks for your help!
M
 
I don't think this problem is related to the model. It is a machine issue. I believe that either the machine libraries or the environment settings are wrong in this case, which leads to the failed MPI run.
Please consult your system administrator or colleagues regarding the machine issue.
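
A common culprit behind the repeated "starting wrf task 0 of 1" lines is a mismatch between the MPI installation wrf.exe was linked against and the mpirun used to launch it: each process then initializes its own single-rank MPI world. A minimal consistency check, sketched for a Linux shell in the run directory:

Code:
which mpirun                  # which launcher is on the PATH
mpirun --version              # e.g. Open MPI vs MPICH
ldd ./wrf.exe | grep -i mpi   # which MPI library the executable links
mpirun -np 4 hostname         # the launcher alone should start four processes

If ldd shows libraries from a different MPI than the mpirun on the PATH, pointing both at the same installation (or rebuilding WRF with it) usually fixes this symptom.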
 