
forrtl: error (78) and Timestep

Henry18

New member
Hi,

I'm running WRF and repeatedly encounter model crashes (before the walltime expires). The rsl.error.* files show the same error, whether it's the initial run or a restart:

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 000014557F256900 Unknown Unknown Unknown
wrf.exe 00000000025DAAF8 Unknown Unknown Unknown
wrf.exe 00000000025D52CD Unknown Unknown Unknown
wrf.exe 0000000001D8FF5A Unknown Unknown Unknown
wrf.exe 0000000001F790C9 Unknown Unknown Unknown
wrf.exe 00000000017292FB Unknown Unknown Unknown
wrf.exe 00000000014FBAE8 Unknown Unknown Unknown
wrf.exe 00000000005B97B3 Unknown Unknown Unknown
wrf.exe 00000000004174B1 Unknown Unknown Unknown
wrf.exe 0000000000417471 Unknown Unknown Unknown
wrf.exe 000000000041740D Unknown Unknown Unknown
libc.so.6 000014557F23FE6C Unknown Unknown Unknown
libc.so.6 000014557F23FF35 __libc_start_main Unknown Unknown
wrf.exe 000000000041733A Unknown Unknown Unknown

I found a forum thread suggesting that reducing the timestep can resolve this error (forrtl: error (78): process killed (SIGTERM)). Although I haven't seen any CFL warnings, I tried reducing the timestep anyway: for a 7-day run I started at 18 s and then reduced it on a restart at day 4, testing 12 s, 9 s, and finally 6 s.
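
For reference, this is roughly how I make the change before resubmitting a restart (just a sketch; the sed pattern assumes the usual one-entry-per-line namelist.input layout):

Bash:
sed -i 's/^ *time_step *=.*/ time_step = 6,/' namelist.input   # set the new value
grep -n "time_step " namelist.input                            # confirm the change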

My questions:

1. Is it common practice in WRF to gradually reduce the timestep during a long simulation or across restarts?
2. For long-term simulations with multiple restarts, should I expect to need very small timesteps eventually (e.g., 3 s or 1 s)?

For reference, I’ve attached two rsl.error.* files. My working directory is: /glade/derecho/scratch/hhou/Test_ERA5/WRF/test/em_real

Thank you for your help!
Henry
 

Attachments

  • rsl.error.00001.txt (5.4 KB)
  • rsl.error.00002.txt (977.1 KB)
Bash:
grep -i FATAL rsl.*

grep -i error rsl.*

grep -i SIGSEGV rsl.*

grep -i cfl rsl.*

Run these commands in the run directory that contains all of the rsl.out.* and rsl.error.* files and see if they come back with anything.

Then upload those files here in a zip file.
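
If it's easier, you can capture the output into files and zip them up with something like this (the output file names here are just examples):

Bash:
grep -i FATAL rsl.* > grep_fatal.txt
grep -i error rsl.* > grep_error.txt
grep -i SIGSEGV rsl.* > grep_sigsegv.txt
grep -i cfl rsl.* > grep_cfl.txt
zip rsl_grep_output.zip grep_fatal.txt grep_error.txt grep_sigsegv.txt grep_cfl.txt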
 
Hi William,

Thank you for the reply! I ran the diagnostic commands in my WRF run directory. These two returned lots of information:

grep -i FATAL rsl.*

grep -i error rsl.*

I have attached the output of the two commands in .txt files (named according to the respective command).

For context: I submitted a new simulation last night with the following settings, and it has now been running successfully for more than 4 hours:

#PBS -l select=16:ncpus=36:mpiprocs=36:mem=64GB

And time_step = 6 in the namelist.
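
For completeness, the job script is essentially a standard PBS wrapper around wrf.exe, roughly like this (a rough sketch: only the select line is from my actual setup; the job name, account, queue, walltime, and launcher line are placeholders that will differ by system):

Bash:
#!/bin/bash
#PBS -N wrf_era5_test                              # placeholder job name
#PBS -A <project_code>                             # placeholder account
#PBS -q main                                       # placeholder queue
#PBS -l select=16:ncpus=36:mpiprocs=36:mem=64GB
#PBS -l walltime=12:00:00                          # placeholder walltime
#PBS -j oe                                         # merge stdout and stderr

cd $PBS_O_WORKDIR      # start in the directory the job was submitted from
mpiexec ./wrf.exe      # the MPI launcher command depends on the local environment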

Because this new run has been progressing well so far, I'm not certain whether the log files I checked (and attached) correspond exactly to the previous crash, or whether the issue was inadvertently resolved by restarting with a clean environment.

Thank you!
Henry
 

Attachments

  • error message.zip (45.5 KB)
@Henry18
When your run completes, can you let me know if this issue is resolved? Thanks!
Hi Kwerner,

Thanks for your help. After setting the time_step to 6, the previous run was able to finish with multiple restarts.

I’ve now adjusted the domain size so that the area ratio between d01 and d02 is close to 3:1, and I started a new run using a time_step of 15 (since the d01 resolution is 3 km) along with updated &physics parameters. However, I’m still encountering model crashes. After a successful 15-minute walltime trial run yesterday, I increased the walltime to 10 hours and submitted the job — but the model only ran for about 10 minutes before crashing again (even shorter than the trial run). I checked the rsl.error.* files, but they don’t indicate anything specific (I’ve attached one here):

Timing for main: time 2023-08-20_10:53:00 on domain 1: 0.48155 elapsed seconds
Timing for main: time 2023-08-20_10:53:05 on domain 2: 0.11420 elapsed seconds
Timing for main: time 2023-08-20_10:53:10 on domain 2: 0.11512 elapsed seconds
Timing for main: time 2023-08-20_10:53:15 on domain 2: 0.11698 elapsed seconds
Timing for main: time 2023-08-20_10:53:15 on domain 1: 0.48213 elapsed seconds
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 000014A099856900 Unknown Unknown Unknown
libfabric.so.1.18 000014A0962828D1 Unknown Unknown Unknown
libfabric.so.1.18 000014A09625891D Unknown Unknown Unknown
libfabric.so.1.18 000014A09625E0A9 Unknown Unknown Unknown
libfabric.so.1.18 000014A096239A41 Unknown Unknown Unknown
libmpi_intel.so.1 000014A09ACF1BF2 Unknown Unknown Unknown
libmpi_intel.so.1 000014A09AD25AD6 Unknown Unknown Unknown
libmpi_intel.so.1 000014A09AD26896 MPI_Wait Unknown Unknown
wrf.exe 00000000035A5E74 Unknown Unknown Unknown
wrf.exe 000000000180E780 Unknown Unknown Unknown
wrf.exe 0000000001705E09 Unknown Unknown Unknown
wrf.exe 00000000014FBAE8 Unknown Unknown Unknown
wrf.exe 00000000005B97B3 Unknown Unknown Unknown
wrf.exe 00000000004174B1 Unknown Unknown Unknown
wrf.exe 0000000000417471 Unknown Unknown Unknown
wrf.exe 000000000041740D Unknown Unknown Unknown
libc.so.6 000014A09983FE6C Unknown Unknown Unknown
libc.so.6 000014A09983FF35 __libc_start_main Unknown Unknown
wrf.exe 000000000041733A Unknown Unknown Unknown
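
In case it helps, the last model time each rank reached can be pulled out of the logs with something like this (a rough sketch, assuming the standard rsl.error.* naming):

Bash:
# print the last "Timing for main" line from each rank's log
for f in rsl.error.*; do
    printf '%s: ' "$f"
    grep "Timing for main" "$f" | tail -n 1
done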

I used the following setting for the new run:

#PBS -l select=2:ncpus=128:mpiprocs=128

And my working directory is still: /glade/derecho/scratch/hhou/Test_ERA5/WRF/test/em_real

I’m now trying time_step = 12. This repeated crashing has been troubling me for two weeks now. Could you take a look and help me figure out what might be causing it?

Thanks!
Haoran
 

Attachments

  • rsl.error.0000.txt (323.6 KB)