Several similar runs, most successful, one crashed with SEGV

philipdumont · Oct 23, 2018

All,

We had a WRF crash recently -- SEGV. Attached is a zip of the relevant files: the stdout/stderr (combined) of the MPI command (wrf.log), the rsl.error.* files, and the namelist.input file.

I'm afraid there's not much to go on. There's no usable stack trace (program counter values only, no file name, line number, or function name).

And here's the strangest part...

This happened in the middle of a year-long run -- all of 2005 -- broken into one-day chunks. For each day, the namelist.input file is exactly the same as the namelist.input for the previous day, except that start_day and end_day are incremented. The failure happened while processing 14 July. The WRF runs for all the prior days succeeded. It is very curious that this one should fail when all the others, so similar, succeeded.
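To make the chunking concrete, here is a minimal sketch of how the date fields for one daily chunk could be generated. It is not the script used for these runs. The function names `chunk_fields` and `render`, the default of two domains, and the choice that each chunk ends at the start of the following day are all assumptions for illustration; only the `start_*` and `end_*` variable names come from the WRF `&time_control` namelist.

```python
from datetime import date, timedelta


def chunk_fields(day, n_domains=2):
    """Return &time_control date fields for a one-day chunk starting on `day`.

    Hypothetical helper: n_domains=2 is an assumed default, not taken
    from the namelist.input attached to this thread.
    """
    # Assumption: each chunk ends at the start of the following day.
    end = day + timedelta(days=1)
    fields = {
        "start_year": day.year,
        "start_month": day.month,
        "start_day": day.day,
        "end_year": end.year,
        "end_month": end.month,
        "end_day": end.day,
    }
    # WRF namelists carry one value per domain, so repeat each value.
    return {name: [value] * n_domains for name, value in fields.items()}


def render(fields):
    """Format the fields as namelist-style lines (hypothetical layout)."""
    lines = []
    for name, values in fields.items():
        joined = ", ".join(f"{v:02d}" for v in values)
        lines.append(f" {name:<12}= {joined},")
    return "\n".join(lines)


if __name__ == "__main__":
    # The chunk that failed in this thread: 14 July 2005.
    print(render(chunk_fields(date(2005, 7, 14))))
```

Using `datetime` for the arithmetic means month and year boundaries roll over correctly, so a chunk starting on 31 July ends on 1 August rather than on day 32.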

I don't know if you'll be able to make anything of it, but if you can, you would have my gratitude (not to mention a healthy dose of admiration).

Please let me know if there's any further information I can provide.

kwerner · Oct 23, 2018

Hi,
1) Is this a WRFDA run?
2) Is it possible that your disk was full at that point? Can you check on this?
Thanks!

philipdumont · Oct 24, 2018

1) No.

2) Probably not. Alas, no historical disk usage metrics are being kept on the system, so it's hard to be definitive. But more than one person was keeping half an eye on disk availability during the day, and no one saw it alarmingly low when they looked.
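For future runs, the gap in historical disk metrics could be closed with something as small as the sketch below, called before each daily chunk. The function names `disk_snapshot` and `append_snapshot` and the CSV layout are hypothetical; `shutil.disk_usage` is the standard-library call doing the work. It reports the filesystem holding `path`, which may not reflect per-user quotas on a shared system.

```python
import shutil
import time


def disk_snapshot(path="."):
    """Return a timestamped record of disk usage for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    # Guard against filesystems that report a total size of zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "total": usage.total,
        "free": usage.free,
        "free_fraction": free_fraction,
    }


def append_snapshot(log_path, path="."):
    """Append one CSV line (time, total bytes, free bytes, free fraction)."""
    snap = disk_snapshot(path)
    with open(log_path, "a") as fh:
        fh.write(
            f'{snap["time"]},{snap["total"]},{snap["free"]},'
            f'{snap["free_fraction"]:.4f}\n'
        )
    return snap


if __name__ == "__main__":
    # Hypothetical log file name; point `path` at the WRF output directory.
    print(append_snapshot("disk_usage.csv", path="."))
```

A log like this would not explain the crash already seen, but it would let a later failure be checked against the free space recorded just before that chunk started.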

philipdumont · Oct 24, 2018

Is a lack of disk space likely to cause a SEGV? It could cause a write(2) or close(2) system call failure, sure. But SEGV? That's a memory problem, not a disk problem.

kwerner · Oct 24, 2018

Hi,
Yes, lack of disk space can sometimes cause a segmentation fault, although the error message can vary by system and compiler. We have seen this happen many times. If this is a shared space that you are writing to, and others are not having trouble writing files to the disk, then that likely isn't the problem.

Can you send me your wrfbdy_d01 file, wrfinput_d0* files, wrflowinp_d0* files, and your my_iofields_d0* files so that I can attempt to run this on my end? These files will likely be too large to attach to this forum, so take a look at the home page of this forum for instructions to upload the files to our cloud server, and give me a heads-up when I should be looking out for them. Thanks!

philipdumont · Oct 24, 2018

My attempt to attach the zip file to this reply seems to have succeeded. The zip contains all the files you asked for, with the exception of wrflowinp_d0*, which our model does not use.

Thanks for your attention.

philipdumont · Oct 24, 2018

And, though you didn't ask for them, I'm pretty sure you'll need the met_em* files in the zip attached to this post. I don't know how we tell WRF to use them, but I'm pretty sure our WRF run does use them. So I reckon you'll need them.

kwerner · Oct 24, 2018

I am able to run wrf.exe (which is where your failure is happening) from the input (wrfinput_d01) and boundary (wrfbdy_d01) files that are created during real.exe. The real.exe program uses the met_em* files to run, but thank you for sending them!

I ran this using your namelist.input file and your input/boundary files with version 3.6.1 and did not have any problems; it ran to completion. The only difference is that I used 18 processors instead of 16, but I don't think that should matter, given the small size of your domains.

Do you know whether any of the 3.6.1 WRF files were modified, or is this straight "out-of-the-box" code? If there were modifications, I would suggest trying the default, unmodified code to verify whether that works, and if it does, tracking down the modification(s) causing the problem. I also suggest checking with the systems administrator at your institution to verify that there isn't a disk space issue, as this seems to be related to your particular environment. You could also do some debugging to determine the line that is causing the problem. See this FAQ, which describes some ways to do this: http://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=316&p=852#p852

philipdumont · Oct 24, 2018

Thanks, kwerner. I think you've helped as much as we can expect you to.

I suppose the most sensible next step for us is to rerun it ourselves and see if it fails again (in which case maybe we can debug it better, per the link you provided), or runs to completion (which would seem to indicate it was just a fluke -- perhaps caused by some resource shortage (disk? something else?) that we just didn't see).

Anyway, thanks again.
