
Segmentation fault for unknown reason

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

apattant

New member
I ran a number of cases some months ago and realized that I had not been using the right LSM option. I went to rerun the simulation with the correct option (2 instead of 1), and now the model will not run to completion. It stops at around the same point in every run. I have included all the rsl files and my namelist for review. The error message does not reveal much about why it stopped, and my namelist works with other data and namelist settings.
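
For context, the change amounts to a couple of &physics entries in namelist.input. A minimal sketch (the domain count and values here are illustrative, not the attached namelist):

    &physics
     sf_surface_physics = 2, 2,    ! was 1 (thermal diffusion); 2 = Noah, 4 = Noah-MP
     num_soil_layers    = 4,       ! Noah expects 4 soil layers, while option 1 uses 5
    /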

I had been running with 8 nodes (32 procs each), but for this example I ran with 4 nodes, and the only difference was that it ran two minutes longer than before.

I see no reason why it would fail after 7+ hours of simulation time if there were a namelist issue, but I must be missing something.

Thank you.
 

Attachments

  • output.tar
    7.5 MB
This is a restart run and the case failed immediately after the restart. How long did the model integrate before the restart time?
Have you looked at the wrfrst file to make sure everything is fine?
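
For example, a quick look at the restart file header (the file name here is a placeholder):

    ncdump -h wrfrst_d01_YYYY-MM-DD_HH:00:00 | head -40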

You say your namelist works with other data and settings. Do you mean that the same case worked when WRF was driven by other input data? Please clarify.

I found a segmentation fault in rsl.error.0043.wrf, which sometimes indicates a memory issue. Can you try with more nodes (and thus more memory) to see whether it works?
 
Hi Ming Chen,

Thank you for your insight.

The model ran for 6 hours before the restart, so it failed around 7 hours and 39-41 minutes into the simulation.

I just ran with 8 nodes (256 procs) and it failed in about the same place. I have also tried with 4 nodes, with the same result. The machine has plenty of memory, though, so I am not convinced that is the issue.

I run with the same met data (GFS) but different land surface data, changing only the LSM option. It runs with option 1 (thermal diffusion) and option 4 (Noah-MP) but fails with option 2 (Noah). I have run another time period as well, and it runs all the way through.
 
By saying "I have run another time period as well and it runs all the way through", do you mean that you ran the case with the same namelist options (i.e., with the Noah LSM) and GFS input data, and the case completed successfully?

By "different land surface data ", do you mean you choose different static data or you make some other changes?

Which version of WRF are you running? Can you post your namelist.wps and namelist.input for me to take a look?
 
Hi Ming Chen,

I ran simulations for two different time periods, May 10-12 and May 14-16, using WRF v3.8.1. May 10-12 finished with no problems. May 14-16 ran successfully with a different LSM option (4).

In both cases mentioned above, I ran experiments with the default CORINE landuse dataset and then with a manually edited CORINE landuse dataset. As above, all experiments for May 10-12 finished successfully, and the experiments for May 14-16 finished successfully only with LSM option 4.

When I mentioned to a colleague that memory may be an issue, he suggested I run with fewer processors on the same number of nodes to free up more memory per task. This had no impact; in fact, when I went to 4 nodes with only 8 processors each (instead of 32), the job crashed much sooner.
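
For illustration, with Open MPI the "fewer tasks per node" experiment would look something like this (the task counts and binary path are assumptions, not the exact command used):

    # 4 nodes, 8 MPI tasks per node instead of 32
    mpirun -n 32 --map-by ppr:8:node ./wrf.exe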

I posted my namelist.input in the tar file attached to my first post. When I run diff on the namelist.input files between experiments, the only differences are the LSM options and the start/end days, as expected.
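
For example (the directory names and line numbers here are placeholders):

    $ diff noah_run/namelist.input thermal_run/namelist.input
    14c14
    <  sf_surface_physics = 2, 2,
    ---
    >  sf_surface_physics = 1, 1,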

Andre
 
Andre,
One option is to compile WRF in debug mode (./configure -d) and then rerun the failed case. This will show exactly when and where something goes wrong, which is helpful for debugging the problem.
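
A minimal sketch of that rebuild (assuming the standard em_real target; the log file name is arbitrary):

    ./clean -a                              # remove the previous build
    ./configure -d                          # same compiler choice, debug flags on
    ./compile em_real >& compile_debug.log  # rebuild wrf.exe and real.exe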
Due to limited human power here at NCAR, please understand that we are no longer able to help debug individual users' cases.
 
Unfortunately, the rsl file you uploaded doesn't help. It says SIGTERM, which is "kill -15". That is a consequence of the SIGSEGV (segmentation fault): in an MPI job, when the program crashes on one processor, the wrapper task that runs the MPI job sends SIGTERM to all the processors.

Find one that has the SIGSEGV in it and post that.
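
For example, a quick way to locate those files (assuming the default rsl naming):

    grep -l SIGSEGV rsl.error.* rsl.out.*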
 
Hi Kevin,

There was no SIGSEGV in any rsl file for that simulation, but several other simulations do contain a SIGSEGV. The problem is that those were not run in debug mode. When I run in debug mode, the job does not seg fault until it hits the end of its walltime, which is odd. I am starting to think the issue may be that the job integrates too rapidly on the cluster and runs out of memory when using the Noah LSM; the Noah-MP LSM takes more compute time and does not experience this issue, and in debug mode the integration is much slower as well.

Andre
 