Restart fails to run correctly

abrandi83 · Oct 3, 2023

Redirected here from the GitHub issues section.

Describe the bug
Whenever I try to restart a simulation from any of the available restart points, the job appears to start and keeps running until canceled, but no calculation is ever performed and no output file is produced.
If I type the following:
grep "Timing for main" rsl.out.0000 | grep " 1:"
the output is always empty.

In a properly running simulation usually the output has this form:
Timing for main: time 2022-03-27_21:41:24 on domain 1: 12.74597 elapsed seconds
Timing for main: time 2022-03-27_21:41:42 on domain 1: 12.73683 elapsed seconds
Timing for main: time 2022-03-27_21:42:00 on domain 1: 12.77265 elapsed seconds
Timing for main: time 2022-03-27_21:42:18 on domain 1: 12.74004 elapsed seconds
Timing for main: time 2022-03-27_21:42:36 on domain 1: 12.77451 elapsed seconds
Timing for main: time 2022-03-27_21:42:54 on domain 1: 12.79238 elapsed seconds
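
The grep check above can be wrapped in a small helper that counts the completed time steps a domain has logged. This is only a sketch: the function name `count_timesteps` and its default arguments are hypothetical and not part of WRF, and it assumes the rsl lines have the "on domain N:" form shown above.

```shell
# Hypothetical helper (not part of WRF): count the "Timing for main" lines
# that a given domain has written to an rsl file.
count_timesteps() {
    local rsl_file="${1:-rsl.out.0000}"
    local domain="${2:-1}"
    # grep -c prints 0 but returns non-zero when nothing matches,
    # so "|| true" keeps the function from failing in that case.
    grep "Timing for main" "$rsl_file" | grep -c "on domain *${domain}:" || true
}

# Usage: a count of 0 after a restart means no time step has completed yet.
if [ -f rsl.out.0000 ]; then
    echo "domain 1 time steps: $(count_timesteps rsl.out.0000 1)"
fi
```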

When checking the rsl.out.000* and rsl.error.000* files, the only clue I get is the following:
module_io.F: in wrf_read_field
Warning BAD MEMORY ORDER |ZZ| for |ISEEDARR_MULT3D| in ext_ncd_read_field wrf_io.F90
input_wrf.F reading 2d integer iseedarr_mult3d Status = -19
input_wrf.F reading 0d logical is_cammgmp_used
This appears at an early stage in rsl.out.0000; however, the log shows that the model keeps reading the wrf_restart files after that.

To Reproduce
Steps to reproduce the behavior:

I'm using WRF V4.5 compiled with gcc and openmpi
Relevant namelist options:

&time_control
restart = .true.,
restart_interval = 4320,
io_form_history = 2
io_form_restart = 2
io_form_input = 2
io_form_boundary = 2
debug_level = 10000000,
&domains
time_step = 18,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 4,
e_we = 151, 127, 85, 79
e_sn = 92, 118, 52, 76
e_vert = 85, 85, 85, 85,
p_top_requested = 10000,
num_metgrid_levels = 34,
num_metgrid_soil_levels = 4,
dx = 3000.0, 1000.0, 333.0, 333.0,
dy = 3000.0, 1000.0, 333.0, 333.0,
grid_id = 1, 2, 3, 4,
parent_id = 1, 1, 2, 2,
i_parent_start = 1, 16, 54, 20
j_parent_start = 1, 17, 60, 21
parent_grid_ratio = 1, 3, 3, 3,
parent_time_step_ratio = 1, 3, 3, 3,
feedback = 0,
smooth_option = 0
smooth_cg_topo = .true.
! physics_suite = 'CONUS'
mp_physics = 16, 16, 16, 16,
cu_physics = 1, 0, 0, 0,
ra_lw_physics = 4, 4, 4, 4,
ra_sw_physics = 4, 4, 4, 4,
bl_pbl_physics = 2, 2, 2, 2,
sf_sfclay_physics = 2, 2, 2, 2,
sf_surface_physics = 4, 4, 4, 4,
radt = 9, 9, 9, 9,
bldt = 0, 0, 0, 0,
cudt = 5, 0, 0, 0,
icloud = 0,
num_land_cat = 61,
sf_urban_physics = 3, 3, 3, 3,
use_wudapt_lcz = 1,

No output is produced.

Expected behavior
I would expect the simulation to resume from the restart point, as usual.

Attachments
sbatch script used to run the job pasted below (can't see a proper attachment button anywhere):

#!/bin/bash
#SBATCH --qos bigmem
#SBATCH --mem=50000
#SBATCH -c 5
#SBATCH --nodes 1
##SBATCH --exclusive
#SBATCH --ntasks 5
#SBATCH --cpus-per-task 1
#SBATCH -t 15-00:00:00
#SBATCH -J Leman_August_50m
module load gcc openmpi
BASE=$HOME/software/wrf
PNETCDF_ROOT=$BASE/pnetcdf-install
NETCDF_ROOT=$BASE/netcdf-install
SZIP_ROOT=$BASE/szip-install
HDF5_ROOT=$BASE/hdf5-install
#WRF_ROOT=$BASE/WRFV4.5
#export PATH=$WRF_ROOT/main:$NETCDF_ROOT/bin:$PNETCDF_ROOT/bin:$HDF5_ROOT/bin:$SZ_ROOT/bin:$PATH
WRF_ROOT=/ssoft/spack/syrah/v1/opt/spack/linux-rhel8-icelake/gcc-11.3.0/wrf-4.5-3k4uylttabrety2iu2au24lc2thhycx6/main
export PATH=$WRF_ROOT/main:$PATH
export LD_LIBRARY_PATH=$NETCDF_ROOT/lib:$PNETCDF_ROOT/lib:$HDF5_ROOT/lib:$SZ_ROOT/lib64:$SZIP_ROOT/lib:$LD_LIBRARY_PATH
ulimit -s unlimited
#run real.exe -j 6
srun wrf.exe -j 6

Additional context
I started experiencing this issue as soon as I moved to my current institution on a new supercomputer with a slightly different architecture than the previous one I used to work on. Never had this issue before.
The resident IT staff is unresponsive on this issue/behavior.

kwerner · Oct 5, 2023

Hi,
If everything is the same, except the architecture (e.g., same namelist.input, same input data, dates, domain sizes, physics, etc.), then unfortunately it is probably just an issue with the environment and you may have to keep pushing for IT to respond.

That being said, I can try to take a look. First, can you set debug_level = 0 and run it again? This namelist entry was removed from newer versions of WRF because it rarely provides anything useful and simply adds a lot of extra junk to the rsl files, making them difficult to read. After you do that, assuming it still fails, please package all of the rsl* files into a single *.tar file and attach that so I can take a look. Please also attach the full namelist.input file. Thanks!
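
For reference, packaging the rsl files could look something like the sketch below. The function name `pack_rsl` and the archive name `rsl_files.tar` are hypothetical; the only assumption is that the rsl.out.* and rsl.error.* files sit in the WRF run directory.

```shell
# Hypothetical helper: bundle all rsl.* files from a run directory
# into a single tar archive.
pack_rsl() {
    local run_dir="${1:-.}"
    local archive="${2:-rsl_files.tar}"
    # Run tar inside the run directory so the archive holds bare file names;
    # the archive itself is written relative to the caller's directory.
    ( cd "$run_dir" && tar -cf - rsl.* ) > "$archive"
}

# Usage (from the WRF run directory):
#   pack_rsl . rsl_files.tar
#   tar -tf rsl_files.tar    # list the contents to confirm
```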

abrandi83 · Oct 6, 2023

Hi,
Thank you for replying and for being willing to look at this issue.

First of all, I'd like to clarify that everything is now different from the simulations I used to run at the previous facility.
I just meant to say that I never had this issue with restarts before.

I set the debug level to 0 and reran the job from the first available restart point (there are several but the outcome is the same, regardless).
I'm attaching the rsl (out and error) files and the full namelist.input file.

Thank you for the assistance,
Aldo

abrandi83 · Oct 9, 2023

Hi,

I installed version 4.5.1 and tried to restart a simulation from a fresh set of restart files, which were also produced by running a simulation from scratch with version 4.5.1.
The outcome is exactly the same.
I'm attaching rsl* and namelist.input files for this run too.

Hope this helps.

Thanks,
Aldo

kwerner · Oct 11, 2023

Thanks for sending that. Prior to running the restart, how long was the original simulation?

There are a few things I've noticed that could cause issues, or are just generally not following best practices.
1) Your domain sizes are too small. You should never use domains smaller than 100x100 (e_we and e_sn).
2) Your processor decomposition is probably not the best. You are using 5 processors, which decomposes as 1x5. It's best to use something that is closer to a square (it doesn't have to be a perfect square).
3) I notice your d01 is 3 km, which is a pretty high resolution. Depending on the resolution of your input data, you may need another domain around the 3 km domain. We recommend no more than about a 5:1 ratio between the resolution of the input data and the parent domain.
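
Points 1 and 2 can be checked quickly with a rough sketch like the one below. Both function names are hypothetical, and `nearest_square_split` simply picks the factor pair closest to a square; it is an assumption that this approximates what WRF does, since the model's own decomposition also accounts for patch-size constraints.

```shell
# Hypothetical helper: factor pair of a task count that is closest to a square.
nearest_square_split() {
    local ntasks="$1"
    local x=1
    local best=1
    while [ $(( x * x )) -le "$ntasks" ]; do
        if [ $(( ntasks % x )) -eq 0 ]; then
            best=$x
        fi
        x=$(( x + 1 ))
    done
    echo "${best}x$(( ntasks / best ))"
}

# Hypothetical helper: print each comma-separated grid dimension that falls
# below a minimum (100 in the advice above), one value per line.
small_dims() {
    local min="$1"
    local values="$2"
    echo "$values" | tr ',' '\n' |
        awk -v min="$min" '$1 != "" && $1 + 0 < min { print $1 }'
}

nearest_square_split 5                 # prints 1x5
nearest_square_split 6                 # prints 2x3
small_dims 100 "151, 127, 85, 79"      # e_we: prints 85 and 79
small_dims 100 "92, 118, 52, 76"       # e_sn: prints 92, 52 and 76
```

With the namelist posted above, every domain has at least one dimension under 100, and 5 tasks cannot be split any closer to a square than 1x5.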
