Memory value access issue? Running WRF on Cheyenne

DeanCH

Hi all,

Very new to both WRF and submitting jobs on Cheyenne, so I greatly appreciate any guidance you can give. I have successfully run WRF before for cases with fewer grid points. This is the first run I'm trying with around 1 million grid points, and I am running it with the precompiled WRF 4.1 on Cheyenne (GLADE path = /glade/work/henzede/WRF_runs/summer_dwnrmp_20160723/, which contains the WPS and WRF directories). Everything up to and including real.exe has run successfully.

When I try to run wrf.exe I get the same issue every time: the WRF output files for both of my domains (parent + nested) are created for the '0th' timestep, but the job dies shortly after. Attached (files_for_wrf_forum.tar.gz) are the namelist files, the wrf log file, a couple of example rsl.out and rsl.error files (all of them seem to be identical), and my Cheyenne job request script. The tails of the rsl files have no error messages, but the wrf.log file shows:
-------------------------------------------------------------
"""MPT ERROR: MPI_COMM_WORLD rank 99 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11"""
-------------------------------------------------------------

I have been in contact with someone from CISL help, who responded:
-------------------------------------------------------------
"the system logs for job 8705730 say it is segfaulting (SIGSEGV, signal 11), showing this error:

2019-10-03T10:17:41-06:00 r4i0n29 MPT[40186]: user=henzede:app=wrf.exe:MPT ERROR: Rank 111(g:111) received signal SIGSEGV(11)

You can look up SIGSEGV on the web, and there are several different causes. One of the major ones is that the WRF program is trying to access some value outside of the memory allocated for it, in Cheyenne.

Often, this is caused by incorrect namelist files. If you review the namelist files that you changed for this run, perhaps you will see an incorrect setting. If you don't see anything wrong, it is time to take the problem higher, to the WRF experts [WRF forum]."
-------------------------------------------------------------

So here I am!

One thing I noticed was that some of the RSL files have some lines like:
-------------------------------------------------------------
"""
Nesting domain
ids,ide,jds,jde 1 1001 1 901
ims,ime,jms,jme 75 180 -4 60
ips,ipe,jps,jpe 85 168 1 50
"""
-------------------------------------------------------------

and jms = -4 seems suspicious to me, especially given the "WRF program is trying to access some value outside of the memory allocated for it" comment from CISL help above. I don't know if it's a coincidence, but in namelist.input I also see "relax_zone = 4" in the &bdy_control section.

Any guidance is greatly appreciated. I also think this is a different problem from the one in a post from 5 months ago: https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=40&t=5362&p=10005&hilit=MPT%3A+Received+signal+11#p10005, since that user was able to run WRF simply by resubmitting the job.

Update: I tried running the same namelist options except with only the first domain. wrf.exe got further into the simulation before crashing - about an hour and five minutes model time, rather than 5 min model time with the nested domain. The same error message appears as for the nested run.

Best,

Dean
 
Hi Dean,

If you issue the following:
grep cfl rsl.*

you will see many CFL errors. This means that the model has become unstable, typically due to complex terrain. Take a look at this FAQ that addresses the CFL problem, and see if any of the modifications will help. This may not be the only problem you will have, but this will absolutely cause segmentation faults, so let's clear this one up first. Let me know how things go after modifying a bit. Thanks!
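With hundreds of rsl files, the raw grep output can be hard to read. As a rough sketch (the function name count_cfl and its directory argument are made up for illustration, and it assumes the CFL messages contain the string "cfl" as in the grep above), the following counts matching lines per file and lists the worst offenders first:

```shell
# Hypothetical helper: count lines mentioning "cfl" in each rsl file
# of a run directory and print "count filename", highest count first.
count_cfl() {
    run_dir="${1:-.}"                      # default to the current directory
    for f in "$run_dir"/rsl.*; do
        [ -f "$f" ] || continue            # skip if the glob matched nothing
        n=$(grep -ci 'cfl' "$f")           # -c counts lines, -i ignores case
        if [ "$n" -gt 0 ]; then
            printf '%s %s\n' "$n" "$f"
        fi
    done | sort -rn
}
```

Running `count_cfl /path/to/run/dir | head` from the shell shows which MPI ranks reported the most CFL lines. Those ranks' rsl files are the ones to open to see which domain and grid points are affected.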
 
Hi kwerner, I see what you mean, and this makes sense since I'm simulating over the Cascades. I changed some of the diffusion options and the model runs! Thanks for your reply, much appreciated.
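The thread does not record which diffusion settings were changed, so the fragment below is only an illustrative sketch of &dynamics entries in namelist.input that are often tried when CFL violations appear over steep terrain. The values are assumptions for a two-domain run, not the settings used in this case. Reducing time_step in &domains is another common first step.

```
&dynamics
 w_damping       = 1,           ! damp excessive vertical velocities
 epssm           = 0.5,         ! stronger off-centering for vertical sound waves (default 0.1)
 diff_6th_opt    = 2, 2,        ! 6th-order horizontal diffusion, no up-gradient diffusion
 diff_6th_factor = 0.12, 0.12,  ! strength of the 6th-order diffusion
/
```

These entries would be merged into the existing &dynamics block rather than added as a second block, and it is usually easier to see which change helps if they are tried one at a time.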
 