
forrtl: error (78) and Timestep

Henry18

New member
Hi,

I'm running WRF and repeatedly encounter model crashes (before the walltime expires). The rsl.error.* files show the same error, whether it's the initial run or a restart:

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 000014557F256900 Unknown Unknown Unknown
wrf.exe 00000000025DAAF8 Unknown Unknown Unknown
wrf.exe 00000000025D52CD Unknown Unknown Unknown
wrf.exe 0000000001D8FF5A Unknown Unknown Unknown
wrf.exe 0000000001F790C9 Unknown Unknown Unknown
wrf.exe 00000000017292FB Unknown Unknown Unknown
wrf.exe 00000000014FBAE8 Unknown Unknown Unknown
wrf.exe 00000000005B97B3 Unknown Unknown Unknown
wrf.exe 00000000004174B1 Unknown Unknown Unknown
wrf.exe 0000000000417471 Unknown Unknown Unknown
wrf.exe 000000000041740D Unknown Unknown Unknown
libc.so.6 000014557F23FE6C Unknown Unknown Unknown
libc.so.6 000014557F23FF35 __libc_start_main Unknown Unknown
wrf.exe 000000000041733A Unknown Unknown Unknown

I found a forum thread suggesting that reducing the timestep can resolve this error (forrtl: error (78): process killed (SIGTERM)). Although I haven't seen any CFL warnings, I tried reducing the timestep anyway: for a 7-day run I started at 18 s and then reduced it on a restart at day 4, testing 12 s, 9 s, and finally 6 s.
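
For reference, this is roughly how I make the change before resubmitting a restart (just a sketch; the sed pattern assumes the usual one-entry-per-line namelist.input layout):

Bash:
sed -i 's/^ *time_step *=.*/ time_step = 6,/' namelist.input   # set the new value
grep -n "time_step " namelist.input                            # confirm the change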

My questions:

1. Is it common practice in WRF to gradually reduce the timestep during a long simulation or across restarts?
2. For long-term simulations with multiple restarts, should I expect to need very small timesteps eventually (e.g., 3 s or 1 s)?

For reference, I’ve attached two rsl.error.* files. My working directory is: /glade/derecho/scratch/hhou/Test_ERA5/WRF/test/em_real

Thank you for your help!
Henry
 

Attachments

  • rsl.error.00001.txt (5.4 KB)
  • rsl.error.00002.txt (977.1 KB)
Bash:
grep -i FATAL rsl.*

grep -i error rsl.*

grep -i SIGSEGV rsl.*

grep -i cfl rsl.*

Run these commands in the run directory that contains all of the rsl.out.* and rsl.error.* files and see if they come back with anything.

Then upload those files here in a zip file.
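
If it's easier, you can capture the output into files and zip them up with something like this (the output file names here are just examples):

Bash:
grep -i FATAL rsl.* > grep_fatal.txt
grep -i error rsl.* > grep_error.txt
grep -i SIGSEGV rsl.* > grep_sigsegv.txt
grep -i cfl rsl.* > grep_cfl.txt
zip rsl_grep_output.zip grep_fatal.txt grep_error.txt grep_sigsegv.txt grep_cfl.txt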
 
Hi William,

Thank you for the reply! I ran the diagnostic commands in my WRF run directory. These two returned lots of information:

grep -i FATAL rsl.*

grep -i error rsl.*

I have attached the output of the two commands in .txt files (named according to the respective command).

For context: I submitted a new simulation last night with the following settings, and it has now been running successfully for more than 4 hours:

#PBS -l select=16:ncpus=36:mpiprocs=36:mem=64GB

And time_step = 6 in the namelist.
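
For completeness, the job script is essentially a standard PBS wrapper around wrf.exe, roughly like this (a rough sketch: only the select line is from my actual setup; the job name, account, queue, walltime, and launcher line are placeholders that will differ by system):

Bash:
#!/bin/bash
#PBS -N wrf_era5_test                              # placeholder job name
#PBS -A <project_code>                             # placeholder account
#PBS -q main                                       # placeholder queue
#PBS -l select=16:ncpus=36:mpiprocs=36:mem=64GB
#PBS -l walltime=12:00:00                          # placeholder walltime
#PBS -j oe                                         # merge stdout and stderr

cd $PBS_O_WORKDIR      # start in the directory the job was submitted from
mpiexec ./wrf.exe      # the MPI launcher command depends on the local environment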

Because this new run has been progressing well so far, I'm not certain whether the log files I checked (and attached) correspond exactly to the previous crash, or whether the issue was inadvertently resolved by restarting with a clean environment.

Thank you!
Henry
 

Attachments

  • error message.zip (45.5 KB)
@Henry18
When your run completes, can you let me know if this issue is resolved? Thanks!
Hi Kwerner,

Thanks for your help. After setting the time_step to 6, the previous run was able to finish with multiple restarts.

I’ve now adjusted the domain size so that the area ratio between d01 and d02 is close to 3:1, and I started a new run using a time_step of 15 (since the d01 resolution is 3 km) along with updated &physics parameters. However, I’m still encountering model crashes. After a successful 15-minute walltime trial run yesterday, I increased the walltime to 10 hours and submitted the job — but the model only ran for about 10 minutes before crashing again (even shorter than the trial run). I checked the rsl.error.* files, but they don’t indicate anything specific (I’ve attached one here):

Timing for main: time 2023-08-20_10:53:00 on domain 1: 0.48155 elapsed seconds
Timing for main: time 2023-08-20_10:53:05 on domain 2: 0.11420 elapsed seconds
Timing for main: time 2023-08-20_10:53:10 on domain 2: 0.11512 elapsed seconds
Timing for main: time 2023-08-20_10:53:15 on domain 2: 0.11698 elapsed seconds
Timing for main: time 2023-08-20_10:53:15 on domain 1: 0.48213 elapsed seconds
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 000014A099856900 Unknown Unknown Unknown
libfabric.so.1.18 000014A0962828D1 Unknown Unknown Unknown
libfabric.so.1.18 000014A09625891D Unknown Unknown Unknown
libfabric.so.1.18 000014A09625E0A9 Unknown Unknown Unknown
libfabric.so.1.18 000014A096239A41 Unknown Unknown Unknown
libmpi_intel.so.1 000014A09ACF1BF2 Unknown Unknown Unknown
libmpi_intel.so.1 000014A09AD25AD6 Unknown Unknown Unknown
libmpi_intel.so.1 000014A09AD26896 MPI_Wait Unknown Unknown
wrf.exe 00000000035A5E74 Unknown Unknown Unknown
wrf.exe 000000000180E780 Unknown Unknown Unknown
wrf.exe 0000000001705E09 Unknown Unknown Unknown
wrf.exe 00000000014FBAE8 Unknown Unknown Unknown
wrf.exe 00000000005B97B3 Unknown Unknown Unknown
wrf.exe 00000000004174B1 Unknown Unknown Unknown
wrf.exe 0000000000417471 Unknown Unknown Unknown
wrf.exe 000000000041740D Unknown Unknown Unknown
libc.so.6 000014A09983FE6C Unknown Unknown Unknown
libc.so.6 000014A09983FF35 __libc_start_main Unknown Unknown
wrf.exe 000000000041733A Unknown Unknown Unknown
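
In case it helps, the last model time each rank reached can be pulled out of the logs with something like this (a rough sketch, assuming the standard rsl.error.* naming):

Bash:
# print the last "Timing for main" line from each rank's log
for f in rsl.error.*; do
    printf '%s: ' "$f"
    grep "Timing for main" "$f" | tail -n 1
done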

I used the following setting for the new run:

#PBS -l select=2:ncpus=128:mpiprocs=128

And my working directory is still: /glade/derecho/scratch/hhou/Test_ERA5/WRF/test/em_real

I’m now trying time_step = 12. This repeated crashing has been troubling me for two weeks now. Could you take a look and help me figure out what might be causing it?

Thanks!
Haoran
 

Attachments

  • rsl.error.0000.txt (323.6 KB)