
How can I determine the number of CPUs or processors for running WRF?

ZHAO

New member
Hi, WRF forum!

I ran into a problem when trying to run a WRF model. The run stops with "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." in the rsl.error file (attached). I found a similar case on the forum (SIGSEGV: Segmentation fault - invalid memory reference), which made me realise that my run may also be using an incorrect number of processors. I also read the FAQ "How many processors should I use to run WRF?". However, I am still wondering how I should determine the number of CPUs or processors for running WRF. Are there any relevant parameters? My WPS and WRF were installed with Spack.

Thanks a lot.
 

Attachments

  • rsl.error.0000
    12.8 KB · Views: 12
Hi,
It looks like you're only running with a single processor, so you do need more than that. Are you not able to understand the FAQ to determine a reasonable number of processors to use for your domain size?
 
Thanks a lot for your reply. Yes, I can understand the FAQ you mentioned. As it says:

For your largest-sized domain: ((e_we)/100) * ((e_sn)/100) = least amount of processors you should use.
For your smallest-sized domain: ((e_we)/25) * ((e_sn)/25) = most amount of processors you should use.


In my namelist, e_we = 225 and e_sn = 125, and I have only one domain. So according to these rules I should use at least 3 and at most 45 processors, am I right?
But I do not know exactly how to control the number of processors when running the WRF model. For example, are there any parameters in 'namelist.input' that I should change?
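
The FAQ rule of thumb quoted above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and rounding the lower bound up and the upper bound down is an assumption, since the FAQ formulas do not say how to round.

```python
import math


def min_processors(e_we: int, e_sn: int) -> int:
    """Lower bound, from the largest domain: (e_we/100) * (e_sn/100).

    Rounding up is an assumption; the FAQ does not specify it.
    """
    return max(1, math.ceil((e_we / 100) * (e_sn / 100)))


def max_processors(e_we: int, e_sn: int) -> int:
    """Upper bound, from the smallest domain: (e_we/25) * (e_sn/25).

    Rounding down is an assumption; the FAQ does not specify it.
    """
    return max(1, math.floor((e_we / 25) * (e_sn / 25)))


# Single-domain case from this post: the same domain is both the
# largest and the smallest one.
e_we, e_sn = 225, 125
low = min_processors(e_we, e_sn)
high = max_processors(e_we, e_sn)
print(f"use between {low} and {high} processors")
```

For e_we = 225 and e_sn = 125 this gives a range of 3 to 45. With nested domains, the lower bound would be computed from the largest domain and the upper bound from the smallest one.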
 
Hi,
While waiting for your reply, I slightly changed the time step from 150km to 125km, and real.exe finished successfully, but last night wrf.exe crashed about 6 hours after it started. The error is the same as I mentioned previously: Program received signal SIGSEGV: Segmentation fault - invalid memory reference. I have now uploaded the updated namelist file and rsl.error file. I look forward to your suggestions.
Best
 

Attachments

  • rsl.error.0000
    397.2 KB · Views: 5
  • namelist.input
    3.6 KB · Views: 8
Yesterday I read the WRF tutorials again and found a namelist group named '&namelist_quilt' and parameters in it such as numtiles, nproc_x and nproc_y. Are they relevant to running on multiple processors?
 
Hi,
To run with multiple processors, you will need to make sure to compile the code with the dmpar (distributed memory) option. When you configure, you are given a list of compilers to choose from, and for each compiler there should be an option for serial, smpar, dmpar, and dm+sm, so you'll choose the dmpar option. To compile correctly, you'll need to have an MPI library (e.g., Open MPI) installed. Then when you run WRF, you will use a command such as:
mpiexec -np 16 ./wrf.exe

Here, you're using MPI to execute the model and telling it to use 16 as the number of processors (np). You do not need to modify anything in the namelist and we don't recommend using quilting unless you have a lot of experience with computing. If you have questions about this for your specific environment, you should reach out to a systems administrator at your institution for help.

I assume the segmentation fault error you're getting is still related to not using the appropriate number of processors.
 
Hi, kwerner.

Unfortunately, the WRF model crashed again 12 hours after it started. This time I ran mpirun -np 32 ./wrf.exe in the shell. The error output is below:

At line 2744 of file module_cu_kfeta.f90 (unit = 98)
Fortran runtime error: Cannot open file 'fort.98': Permission denied

Error termination. Backtrace:
#0 0x7fb6d292b2ed in ???
#1 0x7fb6d292bed5 in ???
#2 0x7fb6d292c69d in ???
#3 0x7fb6d2a9eecd in ???
#4 0x7fb6d2aa5bdc in ???
#5 0x557a74162af5 in ???
#6 0x557a7416587f in ???
#7 0x557a7417e435 in ???
#8 0x557a73a26439 in ???
#9 0x557a731fc73f in ???
#10 0x557a72cacf33 in ???
#11 0x557a72b57eff in ???
#12 0x557a71dd2f64 in ???
#13 0x557a71d64eb7 in ???
#14 0x557a71d6481e in ???
#15 0x7fb6d1d6bc86 in ???
#16 0x557a71d64859 in ???
#17 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[7736,1],10]
Exit code: 2




I've uploaded the relevant files again as attachments. Note that rsl.error.0000 is the most recently written file, while the other rsl files are unchanged from the previous experiments, so I chose to upload rsl.error.0000 out of the 32 rsl files. However, the fatal error is not recorded in this file.
I would appreciate your help.
 

Attachments

  • rsl.error.0000
    77.8 KB · Views: 5
  • namelist.wps
    910 bytes · Views: 0
  • namelist.input
    3.5 KB · Views: 5
  • rsl.out.0000
    131.2 KB · Views: 4
  • 1668071034198.png
    83.8 KB · Views: 11
When I tried to run without MPI, it still returned a segmentation fault error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
 

Attachments

  • rsl.error.0000
    2.7 KB · Views: 3
In the post (2 posts above) in which you sent your namelist files and rsl files, you stated that you ran with 32 processors, but the rsl.* files are showing that you're only using a single processor. This line at the top indicates that:
Code:
 Ntasks in X            1 , ntasks in Y            1
Did you compile WRF with the dmpar option? If so, and if you're still experiencing issues, I think you should talk to a systems administrator at your institution to understand how to correctly run the model utilizing multiple processors.
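
If it helps, the same check can be scripted. The sketch below pulls the task counts out of the "Ntasks in X ... , ntasks in Y ..." line; the function name is hypothetical, and the pattern only assumes the line format shown above.

```python
import re

# Matches the header line shown above, e.g.
# " Ntasks in X            1 , ntasks in Y            1"
NTASKS_RE = re.compile(
    r"ntasks in x\s+(\d+)\s*,\s*ntasks in y\s+(\d+)", re.IGNORECASE
)


def mpi_task_count(rsl_text: str):
    """Return (ntasks_x, ntasks_y, total) from the first Ntasks line, or None."""
    match = NTASKS_RE.search(rsl_text)
    if match is None:
        return None
    ntasks_x, ntasks_y = int(match.group(1)), int(match.group(2))
    return ntasks_x, ntasks_y, ntasks_x * ntasks_y


sample = " Ntasks in X            1 , ntasks in Y            1\n"
print(mpi_task_count(sample))  # (1, 1, 1): only one MPI task is running
```

To use it on a real run, pass in the contents of rsl.error.0000 or rsl.out.0000. A total of 1 after launching with -np 32 means the executable is not actually running in parallel.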
 
Thanks. I reported the problem to the administrator, and now I am able to run with multiple processors. This time I again used 32 processors, but the run failed again. The good news is that all 32 processors are used. See the line at the top of the new rsl file:
Ntasks in X 4 , ntasks in Y 8
This time, multiple tasks can be found in the rsl file, and the CPUs can be seen running with the 'top' command. However, the run still crashed, and it still seems to be a segmentation fault.
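
The 4 x 8 split for 32 tasks can be reproduced with a small sketch. It assumes the default decomposition picks the factor pair of the task count that is closest to square, with the smaller factor in X. That is an assumption, not something stated in this thread, although it matches the Ntasks line above. The function name is hypothetical.

```python
import math


def closest_factor_pair(ntasks: int):
    """Return (nx, ny) with nx * ny == ntasks, nx <= ny, and ny - nx minimal.

    Assumed model of the default X-by-Y task layout, for illustration only.
    """
    if ntasks < 1:
        raise ValueError("ntasks must be a positive integer")
    best = (1, ntasks)
    # Divisors up to sqrt(ntasks) are tried in increasing order, so the
    # last one found gives the most nearly square pair.
    for nx in range(1, math.isqrt(ntasks) + 1):
        if ntasks % nx == 0:
            best = (nx, ntasks // nx)
    return best


print(closest_factor_pair(32))  # (4, 8)
print(closest_factor_pair(16))  # (4, 4)
```

Under this assumption, a prime task count such as 7 would give a 1 x 7 layout, which is one reason to prefer task counts with factors close to each other.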
 

Attachments

  • rsl.error.0019.txt
    7.6 KB · Views: 6
Hi,
I'm so glad you were finally able to use multiple processors. Now, in the rsl file you sent, I see several CFL errors. For example,

Code:
d01 1980-06-05_00:37:00           50  points exceeded cfl=2 in domain d01 at time 1980-06-05_00:37:00 hours
d01 1980-06-05_00:37:00  MAX AT i,j,k:          258         137          25  vert_cfl,w,d(eta)=   2.65275121       11.9074059       3.52381468E-02
d01 1980-06-05_00:37:00           59  points exceeded cfl=2 in domain d01 at time 1980-06-05_00:37:00 hours
d01 1980-06-05_00:37:00  MAX AT i,j,k:          258         137          25  vert_cfl,w,d(eta)=   2.68560123       11.7858677       3.52381468E-02
d01 1980-06-05_00:37:00           64  points exceeded cfl=2 in domain d01 at time 1980-06-05_00:37:00 hours
d01 1980-06-05_00:37:00  MAX AT i,j,k:          258         137          25  vert_cfl,w,d(eta)=   2.70503855       12.4193897       3.52381468E-02
d01 1980-06-05_00:38:50           73  points exceeded cfl=2 in domain d01 at time 1980-06-05_00:38:50 hours

This indicates the model has become unstable - typically due to complex terrain. Take a look at this FAQ that addresses CFL errors and options you can try.
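
To see how often and where this happens in a long rsl file, the CFL lines can be summarized with a short script. This is a sketch that only assumes the two line formats shown above; the function name and the returned tuple layout are made up for illustration.

```python
import re

# "... 50  points exceeded cfl=2 in domain d01 at time ..."
EXCEEDED_RE = re.compile(r"(\d+)\s+points exceeded cfl=")
# "... MAX AT i,j,k: 258 137 25  vert_cfl,w,d(eta)= 2.65275121 ..."
MAX_RE = re.compile(
    r"MAX AT i,j,k:\s+(\d+)\s+(\d+)\s+(\d+)\s+vert_cfl,w,d\(eta\)=\s+(\S+)"
)


def summarize_cfl(rsl_text: str):
    """Return (number of CFL warnings, (largest vert_cfl, (i, j, k)) or None)."""
    warnings = 0
    worst = None
    for line in rsl_text.splitlines():
        if EXCEEDED_RE.search(line):
            warnings += 1
        match = MAX_RE.search(line)
        if match:
            i, j, k = (int(match.group(n)) for n in (1, 2, 3))
            vert_cfl = float(match.group(4))
            if worst is None or vert_cfl > worst[0]:
                worst = (vert_cfl, (i, j, k))
    return warnings, worst


sample = """\
d01 1980-06-05_00:37:00           50  points exceeded cfl=2 in domain d01 at time 1980-06-05_00:37:00 hours
d01 1980-06-05_00:37:00  MAX AT i,j,k:          258         137          25  vert_cfl,w,d(eta)=   2.65275121       11.9074059       3.52381468E-02
d01 1980-06-05_00:37:00           64  points exceeded cfl=2 in domain d01 at time 1980-06-05_00:37:00 hours
d01 1980-06-05_00:37:00  MAX AT i,j,k:          258         137          25  vert_cfl,w,d(eta)=   2.70503855       12.4193897       3.52381468E-02
"""
print(summarize_cfl(sample))
```

If the worst values keep showing up at the same i,j,k, as they do at (258, 137, 25) in the excerpt above, that grid point is a good place to start looking at the terrain.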
 
Hello, may I ask how you got WRF to run on multiple processors? I've also run it as mpirun -np 16 ./wrf.exe, but the output after running still shows "Ntasks in X 1, ntasks in Y 1."
 
Since your question is not the same as the initial question on this thread, please post as a new thread. Thanks.
 