Segmentation fault with spectral nudging

ejanzon

Hi!

I am running a real case with WRF v4.6 using forcing from ERA5. Until now I have been running the model at 12 km horizontal resolution on a 200x150 grid-point domain for 48 days, and for the most part WPS and WRF have been fine. I would now like to run WRF with otherwise the same configuration, but with spectral nudging on T, uv, and ph. When I enable spectral nudging (setting the coefficients guv, gt, etc. to non-zero values), wrf.exe crashes with a segmentation fault in the first timestep. Are there best practices for memory/processing when using spectral nudging? I am using 2 nodes with 32 CPUs in total on our HPC machines.

I have attached the namelists for my WPS and WRF runs and the rsl.error.0000 file, as well as the batch script for mpirun.

Thanks!

Erik
 

Attachments

  • namelist.input
    4.1 KB · Views: 11
  • namelist.wps
    1.2 KB · Views: 1
  • rsl.error.0000
    4.5 KB · Views: 2
Erik,

Your namelist.input looks fine. I am not sure yet what is wrong in your case. The model crashed immediately, which looks more like a data issue or a memory issue.

Can you run this case with 4 nodes and see how it works? This will tell us whether memory is the problem.

If it still fails, I would tend to think it is a data issue. Please recompile WRF in debug mode, i.e., ./configure -D, and run this case again. This way you will know when and where something first goes wrong, which will help to figure out the issue.
 
Hi! Thank you for your response. I tried with more nodes, more processors, and more memory, but got the same segmentation fault. After recompiling in debug mode, the model crashes in a different place and on only a few processors (see, for example, rsl.error.0034). I now get a floating-point error when atan is called in start_domain_em (apologies if I am interpreting this incorrectly; I am a bit new to WRF, so I am wondering whether I compiled the model in debug mode properly). My guess is that the culprit is at line 1561 or 1562 of the start_domain_em subroutine, where the slope radiation constant arrays are set up.

Attached is a zip file with all the rsl errors when using 48 processors.

Thanks again,

Erik
 

Attachments

  • rslerrors.zip
    53.2 KB · Views: 1
I saw errors like:
Code:
[c1409:2945488:0:2945488] Caught signal 8 (Floating point exception: floating-point divide by zero)
==== backtrace (tid:2945488) ====
 0  /util/opt/ucx/1.12.1/gcc/8.5.0/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x1503d4697c94]
 1  /util/opt/ucx/1.12.1/gcc/8.5.0/lib/libucs.so.0(+0x2ee94) [0x1503d4697e94]
..

Error messages like this do not help to find what is wrong.

Please follow the steps below to recompile WRF:
./clean -a
./configure -D (and choose the option)
./compile em_real

Then you can rerun this case, and the RSL files will tell you when and where something first goes wrong.
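
With many MPI ranks, it can help to search the per-rank RSL files for common crash signatures after the debug rebuild, to see which ranks actually failed. The following is only a sketch and not part of WRF: the function name `failing_rsl_files` and the list of patterns are assumptions, and the patterns may need adjusting for your compiler and MPI stack.

```shell
# Hypothetical helper (not part of WRF): list the rsl.error.* files in a
# directory that contain a typical crash signature. The pattern list is an
# assumption; extend it for your compiler and MPI library.
failing_rsl_files() {
    dir="${1:-.}"
    grep -l -i -E \
        'caught signal|segmentation fault|floating point exception|forrtl: (severe|error)|fatal' \
        "$dir"/rsl.error.* 2>/dev/null | sort
}
```

For example, `failing_rsl_files .` run in the working directory prints the matching file names, and `tail -n 40` on one of them shows the traceback for that rank.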
 
Thanks for your response, Ming Chen.

I recompiled in debug mode again, made sure all of the new executables are in my working directory, and I get the same output. I assume I compiled in debug mode correctly because the behavior differs: without debug mode, wrf.exe crashes with a segmentation fault on all processors, and the last call reported is SFCLAY. With debug mode, wrf.exe crashes on one processor with the attached error. On line 94, it mentions start_domain_em, but it gives no information about the line of code where the crash occurs.

I am a bit stumped: I tried a few things, like turning smooth_cg_topo on, making sure there is no spectral nudging in the boundary layer, and using the RDA ERA5 forcing data instead of from Copernicus, but I get the same behavior. One more thing I will try is to contact our computing center and see if my settings are correct in my job submission.
 

Attachments

  • rsl.error.0127.txt
    5.5 KB · Views: 3
Okay, I have a more useful update! Using advice from our computing center, I recompiled in debug mode using an Intel compiler. Now I have more useful information about the crash. It shows problems in module_sf_noahmplsm.f90.

I think I am having the same problems that were described here:


Is there a final fix for this?
 

Attachments

  • namelist.input (2).txt
    4.2 KB · Views: 9
  • rsl.error.0075.txt
    8.7 KB · Views: 2
I found the problem. I had not set the correct analysis interval in the fdda namelist, so I *think* something in the code was defaulting to zero, which caused the divide by zero. It appears to be fixed now: the simulation is being nudged toward the reanalysis.
 
Thank you for the update.

Please confirm that after you specified gfdda_interval_m and gfdda_end_h, the previous failed case is able to run.

Thanks.
 
I did specify gfdda_interval_m, but kept gfdda_end_h as default, which I assume is 6 hours. If I want my whole simulation nudged, for example, I can just set gfdda_end_h to the number of hours of my simulation, right?
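
As a rough illustration of the two settings discussed above, the sketch below prints candidate &fdda timing entries for a given run length. It rests on assumptions that should be checked against the WRF documentation for your version: that gfdda_interval_m is the spacing of the analysis times in minutes, that gfdda_end_h is counted in hours from the start of the run, and that the ERA5 input is 6-hourly (the thread does not state the forcing interval). The function name `fdda_timing` is hypothetical, and the output covers a single domain only.

```shell
# Hypothetical helper (not part of WRF): print the two &fdda timing entries
# for a run of RUN_DAYS days forced by analyses every INTERVAL_H hours.
fdda_timing() {
    run_days=$1
    interval_h=$2
    # Assumed units: gfdda_interval_m in minutes, gfdda_end_h in hours
    # measured from the start of the simulation.
    printf ' gfdda_interval_m = %d,\n' "$((interval_h * 60))"
    printf ' gfdda_end_h      = %d,\n' "$((run_days * 24))"
}

# Example: the 48-day run from this thread, ASSUMING 6-hourly ERA5 input.
fdda_timing 48 6
```

The printed lines are meant to be copied into the &fdda section of namelist.input by hand, after confirming that the interval matches the interval of the forcing data actually used.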
 