Segmentation Fault in WRF 4.3.3 Simulations Without CFL Errors

zhanyx

Dear WRF Community,

I am encountering persistent segmentation fault errors in two WRF 4.3.3 simulation cases (case1 and case2) and would greatly appreciate your assistance.

Problem Description

Both simulations start at 00:00 UTC on July 18th.
  • Case1 crashes immediately after writing the output file for 05:00 UTC on July 20th
  • Case2 crashes immediately after writing the output file for 07:00 UTC on July 18th
Importantly, no CFL error messages appear in the rsl.out/rsl.error files for either case.
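One way to double-check that no CFL messages are hiding in any of the per-process log files is to scan all of them at once. The following is only a minimal sketch: it assumes the logs sit in the run directory under the usual `rsl.*` names and that WRF's CFL diagnostics contain the substring "cfl" (in any case). The function name `find_cfl_lines` is made up for illustration.

```python
import re
from pathlib import Path

# Assumption: any CFL diagnostic written by WRF contains "cfl" (case-insensitive).
CFL_PATTERN = re.compile(r"cfl", re.IGNORECASE)


def find_cfl_lines(run_dir="."):
    """Return (file name, line number, text) for every rsl.* line mentioning CFL."""
    hits = []
    for path in sorted(Path(run_dir).glob("rsl.*")):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if CFL_PATTERN.search(line):
                    hits.append((path.name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    matches = find_cfl_lines(".")
    if not matches:
        print("No CFL messages found in any rsl.* file")
    for name, lineno, text in matches:
        print(f"{name}:{lineno}: {text}")
```

An empty result only shows that nothing was logged. It does not rule out a time-step instability.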

Background Information

The only differences between case1 and case2 are:
  1. Land use data
  2. d01 domain configuration
Initially, case2 was crashing at 16:00 UTC on July 18th with clear CFL errors in the rsl files. To resolve this, I shifted the d01 domain northward by several dozen grid points to increase the distance between d02 and the northern boundary of d01. This adjustment successfully eliminated the CFL error messages, but the segmentation fault persists.

Solutions Already Attempted

  • Confirmed that ulimit -s unlimited is properly set and effective
  • Increased debug_level values, but this only produced excessive irrelevant output without revealing the root cause
  • Considered recompiling in debug mode, but this would significantly increase runtime, and my computational resources are limited
If anyone has encountered similar issues or has suggestions on how to diagnose and resolve this segmentation fault, please share your insights.

Thank you very much for your time and help!

Best regards,

zhanyx
 

Zhanyx,
Are you running this on a shared HPC where you have to submit a batch script? If so, is it possible that you're exceeding your wallclock limit on that machine? For case 1, it looks like you should have a restart file at 2018-07-20_00:00:00. If so, can you try running a restart from that time and see if it gets you past the time when WRF is failing?

Another issue could be that you're running out of disk space. It's unlikely, but check on that to make sure.
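A quick way to check the free space on the file system holding the run directory, as a rough sketch. The helper name `free_gib` and the default path are assumptions for illustration; on a shared HPC the user quota may be the real limit and would need the site's own quota tool.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the file system that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024**3


if __name__ == "__main__":
    print(f"Free space in run directory: {free_gib('.'):.1f} GiB")
```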

I also notice you're not using a lot of processors, given the size of your domains. I know you mentioned your computational resources are limited, but if it's possible, trying to use more processors should at least speed up the computation time.
 
Thank you very much for your reply. I have successfully resolved the issue by reducing the time step from 6×dx to 4×dx (time step in seconds, with dx being the grid spacing in km).

I did not initially try decreasing the time step because I was not receiving any CFL error messages. However, it now appears that time step-related instabilities do not necessarily result in CFL violation errors, or alternatively, the CFL errors were occurring but not being logged in the rsl files.
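For anyone who wants to see what that change means in numbers, here is a small sketch of the rule of thumb. The 9 km grid spacing is purely hypothetical (my actual dx is in the attached namelist, not repeated here), and `time_step_seconds` is just an illustrative helper. It rounds down because `time_step` in namelist.input is an integer number of seconds.

```python
def time_step_seconds(dx_m, factor=6.0):
    """Time step in whole seconds from the factor*dx rule, with dx given in metres."""
    dx_km = dx_m / 1000.0
    return int(factor * dx_km)  # round down to stay on the conservative side


if __name__ == "__main__":
    dx_m = 9000.0  # hypothetical grid spacing, not taken from this thread
    for factor in (6.0, 4.0):
        print(f"{factor:g}*dx -> time_step = {time_step_seconds(dx_m, factor)} s")
```

With the hypothetical 9 km grid this gives 54 s for 6×dx and 36 s for 4×dx.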

I apologize for any inconvenience this may have caused you and for taking up your valuable time. Thank you again sincerely for your kind assistance.
 