
Segmentation Faults just prior to integration

MichaelIgb

New member
Hello, I'm currently running WRF V4.5.2 built with Classic Intel compilers (dm+sm) on Princeton's research computing clusters. I'm using 0.25-degree GFS data as input, and when I run the model it always segfaults right before integration begins. The rsl.out/rsl.error files don't show the fault itself, but I hope they can provide some insight, along with my namelist.input.

Thank you so much in advance.
 

Attachments

  • namelist.input (4.9 KB)
  • rsl.error.0000 (8.3 KB)
Good morning @MichaelIgb ,

In your WRFV4.5.2/run folder try entering these commands and see if any of the rsl files show an error:


Bash:
grep -i FATAL rsl.*
grep -i error rsl.*
grep -i SIGSEGV rsl.*
grep -i cfl rsl.*

Sometimes the error is buried deep in the rsl files. Also, do you have the WRF input files created by real.exe? If not, then the problem is with real.exe.
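A quick sketch for checking that real.exe produced its outputs, run from the run directory (file names are the WRF defaults for a single domain):

```shell
# Verify real.exe produced non-empty input and boundary files.
for f in wrfinput_d01 wrfbdy_d01; do
    if [ -s "$f" ]; then
        echo "$f: present ($(du -h "$f" | cut -f1))"
    else
        echo "$f: MISSING - rerun real.exe"
    fi
done
```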
 
Good morning, thank you so much.
I got this when I entered grep -i error rsl.*:

rsl.error.0000:forrtl: error (69): process interrupted (SIGINT)

The other grep commands didn't output anything.

For the input files, yes, I have the input file (wrfinput_d01) and the boundary file as well (wrfbdy_d01).
 
@MichaelIgb

Do you have an input file for d02?

I see in your namelist that you are telling WRF you have two domains.
 
gotcha, hmm let me take a more detailed look
Actually, I may know the cause of the fault. I'm getting this error with em_tropical_cyclone in both dm+sm and dmpar builds:

[tiger-h26c1n10:36488] *** An error occurred in MPI_Comm_size
[tiger-h26c1n10:36488] *** reported by process [47598991376384,1]
[tiger-h26c1n10:36488] *** on communicator MPI_COMM_WORLD
[tiger-h26c1n10:36488] *** MPI_ERR_COMM: invalid communicator
[tiger-h26c1n10:36488] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tiger-h26c1n10:36488] *** and potentially your MPI job)



and it does match some descriptions online of what can cause segmentation faults. I have no idea how to correct this, though.
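One frequent cause of MPI_ERR_COMM (invalid communicator) at startup is a mismatch between the MPI library wrf.exe was linked against and the mpirun used to launch it, which is easy to hit on clusters with several MPI modules loaded. A sketch of how one might check, assuming a dynamically linked wrf.exe in the current directory:

```shell
# Compare the MPI launcher on PATH with the MPI library wrf.exe links to.
# A mismatch (e.g. built with Intel MPI, launched with Open MPI's mpirun)
# can trigger "MPI_ERR_COMM: invalid communicator" at startup.
echo "launcher: $(command -v mpirun || echo 'mpirun not on PATH')"
if [ -x ./wrf.exe ]; then
    ldd ./wrf.exe | grep -i mpi || echo "no dynamically linked MPI found"
else
    echo "wrf.exe not found here; run this from the WRF run directory"
fi
```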
 
@MichaelIgb

Would you please recompile WRF in dmpar mode, then rerun this case? If it still fails, please post the error message along with your namelist.input and namelist.wps for me to take a look. Thanks.
 
I recompiled in dmpar mode and got the same error. Here are the files, thank you.

Oddly enough, nothing showed up with the grep commands this time.
 

Attachments

  • namelist.wps (1.1 KB)
  • namelist.input (4.9 KB)
  • rsl.error.0000 (8.3 KB)
  • rsl.out.0000 (8.3 KB)
Also, if it helps: I went through the same steps for an em_tropical_cyclone case (with the default, unedited namelist) and it still segfaulted.
 
GFS data is widely used to provide initial and boundary conditions for WRF runs, so I would like to believe the input data is correct.

Please clarify how you run the model. What command did you issue? How many processors did you use to run this case?

Your namelist.input looks fine, but let's run a test case with the following options, which will help narrow down the possible causes:

use_adaptive_time_step = .false.
sst_skin = 0
sf_ocean_physics = 0
lightning_option = 0, 0, 0

Let me know whether the case can run with the above options.
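For reference, these options live in different sections of namelist.input; a sketch of where each one goes (section names are the standard WRF ones):

```
&domains
 use_adaptive_time_step = .false.,
/

&physics
 sst_skin               = 0,
 sf_ocean_physics       = 0,
 lightning_option       = 0, 0, 0,
/
```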
 
It does the same thing, even when I use ulimit -c unlimited. This time, however, it got about 8 hours in before the fault occurred.

Here are my files. I used a Slurm job to execute with one node (40 processors and 160 GB RAM), which I'm also attaching.
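For anyone hitting similar issues later: a minimal sketch of what such a Slurm script typically looks like for a 40-core dmpar run. The module names, time limit, and paths here are illustrative assumptions, not taken from the attached wrf.slurm.txt.

```shell
#!/bin/bash
#SBATCH --job-name=wrf
#SBATCH --nodes=1
#SBATCH --ntasks=40           # one MPI rank per core for a dmpar build
#SBATCH --mem=160G
#SBATCH --time=24:00:00

module purge
module load intel intel-mpi   # illustrative; load the modules WRF was built with

ulimit -s unlimited           # WRF often needs an unlimited stack size
cd "$SLURM_SUBMIT_DIR"
srun ./wrf.exe
```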
 

Attachments

  • wrf.slurm.txt (1.5 KB)
  • rsl.error.0000 (567.4 KB)
Now that the model ran for about 8 hours before it crashed, I suppose the input data and the command you used to run the case are correct. The crash might be attributable to other causes.
The rsl file you attached doesn't include any helpful information. Can you look through all of your rsl files for possible error messages? Note that critical error messages don't necessarily appear in rsl.error.0000; they can show up in any rsl file, depending on when and where the problem first popped up.
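The scan above can be sketched as a single pass over every rsl file:

```shell
# Scan all rsl files, not just rsl.error.0000: the rank that hit the
# problem first can be any of them. grep -l lists only matching files.
grep -il 'forrtl\|sigsegv\|fatal\|cfl' rsl.* 2>/dev/null || echo "no matches"
# The most recently modified rsl file is often the rank that crashed:
ls -t rsl.* 2>/dev/null | head -4
```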
 
By the way, can you rerun this case but using WRFV4.2.1? Please keep me updated whether you get the same error.
 