
SIGSEGV on SBM microphysics

htan2013

Dear all experts:

I am trying to run WRF on Derecho with the SBM microphysics scheme (option 30 in mp_physics).
However, the model repeatedly crashes with a SIGSEGV fault when it tries to write the second wrfout file for domain 2.

Things that I have tried:
1. Increasing the number of nodes: 10, 20, 40, and 60.
2. Reducing the output variables so that only XLON, XLAT, and Times are written.
3. Debug mode. The model runs in debug mode, but it takes more than 12 hours to reach the point where it currently crashes, so I have not been able to see exactly what causes the SIGSEGV error.

I would really appreciate any advice on this. I have attached the namelist and rsl out file here. The experiment path on Derecho is /glade/derecho/scratch/htan2013/WRF4.6.0_SBM/test/em_real. The job script is WRF.sh.

Best regards,
HT
 

Attachments

  • namelist.input
Hi,
Apologies for the delay. We do not work on the weekends and it can often take us a few work days to be able to respond, due to other obligations. Thank you for your patience.

I notice that the resolution of the parent domain is 2 km. What is the resolution of your input data (the data you processed with ungrib)? The ratio between the input-data resolution and the resolution of your first domain should be no more than about 5:1.
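
As a rough illustration of that ratio check, here is a minimal shell sketch. The helper names `resolution_ratio` and `check_ratio` are hypothetical, the 25 km input spacing is an assumed example value (the actual input resolution is not stated in this thread), and the cutoff of 5 stands in for the approximate 5:1 guideline.

```shell
# Hypothetical helper: ratio of input-data grid spacing to
# parent-domain grid spacing, both given in km.
resolution_ratio() {
  awk -v input_km="$1" -v domain_km="$2" \
    'BEGIN { printf "%.1f\n", input_km / domain_km }'
}

# Hypothetical helper: prints "ok" when the ratio is at most 5
# (the approximate guideline), otherwise "too coarse".
check_ratio() {
  awk -v input_km="$1" -v domain_km="$2" \
    'BEGIN { if (input_km / domain_km <= 5) print "ok"; else print "too coarse" }'
}

# 25 km is an assumed input resolution, not a value from this thread.
resolution_ratio 25 2
check_ratio 25 2
```

With the assumed 25 km input and the 2 km parent domain, this prints 12.5 and "too coarse", which would suggest adding a coarser outer domain or using higher-resolution input data.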

I also looked at the WRF.pbs script. It looks like you're trying to use threading for shared-memory processing, but the WRF code you have is compiled for distributed memory processing. Can you try this with commands for distributed-memory processing, following this syntax:

#PBS -l select=1:ncpus=128:mpiprocs=128

That line requests a single node, with 128 processors per node. For a domain of your size, up to about 1500 total processors is still a reasonable number.
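
To scale the same syntax to more than one node, the total number of MPI ranks is simply nodes times mpiprocs. The sketch below prints a matching resource line and launch command. The helper name `pbs_layout` is hypothetical, 128 cores per node is taken from the line above, and the launch line follows the `mpiexec -n ... -ppn ...` form used elsewhere in this thread.

```shell
# Hypothetical helper: print the PBS resource request and the launch
# command for a distributed-memory (MPI-only) run on 128-core nodes.
pbs_layout() {
  nodes=$1
  cores_per_node=128                        # from the select line above
  total_ranks=$((nodes * cores_per_node))   # one MPI rank per core
  echo "#PBS -l select=${nodes}:ncpus=${cores_per_node}:mpiprocs=${cores_per_node}"
  echo "mpiexec -n ${total_ranks} -ppn ${cores_per_node} ./wrf.exe"
}

# Example: 4 nodes gives 512 ranks.
pbs_layout 4
```

By the same arithmetic, 11 nodes gives 1408 ranks, the largest multiple of 128 that stays under roughly 1500.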
 
Thanks for your reply! I tried "#PBS -l select=10:ncpus=128:mpiprocs=128" with "mpiexec -n 1280 -ppn 128 ./wrf.exe", but I am still seeing the SIGSEGV.
 
I tested your case on Derecho and by only making two very small changes, I was able to run the simulation to completion. First, I increased the time_step to 10, and then I set radt = 2, 2. I ran with 512 processors. If you'd like to take a look at anything I did, you can find it in
/glade/derecho/scratch/kkeene/htan2013/wrfv4.6.1/test/em_real

I used your wrfbdy_d01 and wrfinput* files. My wrf.exe batch script is called runwrf.sh.
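
To confirm which values a given run directory is actually using, a quick check like the following can help. This is only a sketch: the helper name `show_setting` is hypothetical, and it assumes each setting sits on its own line in namelist.input in the usual `name = value,` form.

```shell
# Hypothetical helper: print the line(s) in a namelist file that set
# exactly the named variable (time_step will not match time_step_fract_num).
show_setting() {
  grep -E "^[[:space:]]*$1[[:space:]]*=" "$2"
}

# Run from the directory that holds namelist.input.
if [ -f namelist.input ]; then
  show_setting time_step namelist.input
  show_setting radt namelist.input
fi
```

Comparing the output in both run directories shows whether the two changes described above (time_step = 10 and radt = 2, 2) are the only differences in those settings.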
 