
Running WRF with 100 vertical levels

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

brianjs

Hello All,

I am attempting to run WRF (v3.8) with additional vertical levels, but the model crashes partway through the run. First, here is a rundown of what I am doing.

I am running a real-data case with WRF using one-way nesting. The parent 3-km grid (600 X 600 grid points, or 1800 X 1800 km) contains a 1-km nest (1000 X 1000 grid points, or 1000 X 1000 km), and nested within that is a 0.333-km grid (1800 X 1800 grid points, or 600 X 600 km). I output at hourly intervals. The parent grid runs from 12Z to 12Z (24 hours), and the inner nests run from 21Z to 09Z (I am simulating the 30 Jul 2018 central Plains nocturnal MCS event). I run with a time step of 2 seconds. I am running WRF on the UCAR-CISL-Cheyenne supercomputer using 150 nodes with 36 CPUs per node.
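As a rough check on the numbers above, this short Python sketch recomputes each domain's extent, the parent-to-nest grid ratio, and how many points a single 3-D field holds at 50 versus 100 vertical levels. The domain sizes, spacings, and 36 cores per node come from this post; the function name `domain_stats` is hypothetical, and the columns-per-rank figure assumes each domain is split perfectly evenly across all MPI ranks, which is an idealization rather than WRF's actual patch decomposition.

```python
from fractions import Fraction

# Nest layout as described in the post: (domain, grid points per side, dx in km).
# Fraction(1, 3) keeps the "0.333-km" spacing exact so extents come out whole.
DOMAINS = [
    ("d01", 600, Fraction(3)),
    ("d02", 1000, Fraction(1)),
    ("d03", 1800, Fraction(1, 3)),
]
CORES_PER_NODE = 36  # Cheyenne node size as described in the post


def domain_stats(levels, nodes):
    """Return (name, extent_km, parent/nest dx ratio, points per 3-D field,
    horizontal columns per rank) for each domain."""
    ranks = nodes * CORES_PER_NODE
    stats = []
    for i, (name, n, dx) in enumerate(DOMAINS):
        extent_km = n * dx                                   # side length of the square domain
        ratio = DOMAINS[i - 1][2] / dx if i > 0 else None    # d01 has no parent
        points_3d = n * n * levels                           # points in one 3-D field
        columns_per_rank = n * n / ranks                     # idealized even split (assumption)
        stats.append((name, extent_km, ratio, points_3d, columns_per_rank))
    return stats


if __name__ == "__main__":
    for levels in (50, 100):
        rows = domain_stats(levels, nodes=150)
        for name, extent, ratio, pts, cols in rows:
            print(f"{levels} levels {name}: {extent} km/side, ratio {ratio}, "
                  f"{pts:,} points per 3-D field, ~{cols:.0f} columns per rank")
        print(f"{levels} levels total: {sum(r[3] for r in rows):,} points per 3-D field")
```

Run as a script, it shows that going from 50 to 100 levels doubles every 3-D field: the 0.333-km nest alone goes from 162,000,000 to 324,000,000 points per 3-D field, and all three domains together go from 230,000,000 to 460,000,000.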

When all grids run with the default 50 vertical levels, everything works fine. With 100 vertical levels, the 3-km grid seems to run fine, but at 23Z (2 hours after the finer nests start) the model crashes. I get the following error in the wrf.out log file:

MPT ERROR: MPI_COMM_WORLD rank 3586 has terminated without calling MPI_Finalize()
aborting job

The rsl.out files don’t show much, but in the rsl.error.3586 file I found the following error:
MPT ERROR: Rank 3586(g:3586) received signal SIGSEGV(11).

My understanding is that this error can occur for many reasons. In the directory containing all the rsl files, I ran ‘grep cfl rsl*’ to look for CFL errors, but found none (though I have heard that WRF does not always report CFL errors explicitly). Since I have hourly restart files, I tried rerunning WRF with a time step of 1 second while also setting debug_level to ‘1000’, to see if that would help. I still get the same errors, with no additional information. I also tried running with 250 nodes in case the problem was memory allocation, but again got the same errors.

Even a rerun of just 1 hour (I attempt to rerun at 23Z) is very expensive in core-hours with my current configuration, and I am trying to be careful with my allotment while debugging. On Cheyenne, my WRF directories are:

WPS: /glade/scratch/brianjs/wrf_run/WRF_3.8_nestBuildMPThompsonTendencies/WPS
WRF: /glade/scratch/brianjs/wrf_run/WRF_3.8_nestBuildMPThompsonTendencies/WRFV3/run
In the WRF directory, the wrf.out and rsl files can be found in the ‘rsl_150’ folder.
Also in the WRF directory is my script for running wrf.exe (runwrf.tcsh).
In WPS, I have a plot showing my domains (domains_07_30_2018.png).
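The ‘grep cfl rsl*’ check can be extended to look for both CFL messages and the segmentation fault across every rsl file in one pass. This is a minimal Python sketch, assuming the rsl.out.* and rsl.error.* files sit together in one directory (such as the ‘rsl_150’ folder). The search strings come from this post, the case-insensitive CFL match is a deliberate widening of the case-sensitive grep, and `scan_rsl` is a hypothetical helper name.

```python
import glob
import os
import re
import sys

# Search strings taken from the post. "cfl" is matched case-insensitively
# (broader than the post's case-sensitive `grep cfl rsl*`).
PATTERNS = {
    "cfl": re.compile(r"cfl", re.IGNORECASE),
    "segfault": re.compile(r"SIGSEGV"),
}


def scan_rsl(directory="."):
    """Return {filename: {label: [matching lines]}} for rsl.* files with any hit."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(directory, "rsl.*"))):
        found = {label: [] for label in PATTERNS}
        with open(path, errors="replace") as fh:   # tolerate stray bytes in logs
            for line in fh:
                for label, rx in PATTERNS.items():
                    if rx.search(line):
                        found[label].append(line.rstrip("\n"))
        found = {label: lines for label, lines in found.items() if lines}
        if found:
            hits[os.path.basename(path)] = found
    return hits


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, found in scan_rsl(target).items():
        for label, lines in found.items():
            print(f"{name}: {len(lines)} {label} line(s); first: {lines[0]}")
```

For example, running it with the rsl directory as its argument (the script file name is up to you) lists each rsl file that mentions CFL or SIGSEGV, with the match count and the first matching line, so the failing rank's log stands out without opening thousands of files by hand.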

Any help that can be provided would be greatly appreciated!