
error in wrf.exe

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

kaelel18

Hi everyone. I am running WRF on an HPC system. WRF runs for about a day and then abruptly fails with an error.

When I use this slurm script,

Code:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=32000
#SBATCH --time=168:00:00
#SBATCH --partition=batch

# set stack size to unlimited
ulimit -s unlimited 

# Place commands to load environment modules here
module load wrf/3.9.1-intel-mpich

# MAIN
mpirun -n 64 real.exe && mpirun -n 64 wrf.exe

WRF gives an error telling me to reduce the number of processors:

Code:
    ------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:     645
Submit the real program again with fewer processors
-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0


I reduced the number of processors to 16, but then I get:

Code:
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 596: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO.

What do you think the problem is? Here's my namelist.input, by the way:

Code:
&time_control
run_days                 = 31,
run_hours                = 0,
run_minutes              = 0,
run_seconds              = 0,
start_year               = 2016,     2016,     2016,     2016,
start_month              = 12,       12,       12,       12,
start_day                = 1,        1,        1,        1,
start_hour               = 00,       00,       00,       00,
start_minute             = 00,       00,       00,       00,
start_second             = 00,       00,       00,       00,
end_year                 = 2016,     2016,     2016,     2016,
end_month                = 12,       12,       12,       12,
end_day                  = 31,       31,       31,       31,
end_hour                 = 17,       17,       17,       17,
end_minute               = 00,       00,       00,       00,
end_second               = 00,       00,       00,       00,
interval_seconds         = 21600,
input_from_file          = .true.,   .true.,   .true.,
history_interval         = 180,       60,       60,       60,
frames_per_outfile       = 1,        1,        1,        1,
restart                  = .false.,
restart_interval         = 5000,
io_form_history          = 2,
io_form_restart          = 2,
io_form_input            = 2,
io_form_boundary         = 2,
debug_level              = 0, 
auxhist2_outname         = "winds_d<domain>_<date>",
auxhist3_interval        = 0,0,15,15,
io_form_auxhist2         = 2,
frames_per_auxhist2      = 1
/

&domains
eta_levels               = 1.000, 0.9947, 0.9895, 0.9843, 0.979,
                           0.9739, 0.9684, 0.9626, 0.9564, 0.9498,
                           0.9426, 0.9348, 0.9262, 0.9167, 0.9062, 
                           0.8946, 0.8816, 0.8671, 0.8509, 0.833, 
                           0.813, 0.7909, 0.7667, 0.7402, 0.7116,
                           0.6809, 0.6483, 0.6141, 0.5785, 0.5419,
                           0.5047, 0.4672, 0.4299, 0.3931, 0.357, 
                           0.322, 0.2883, 0.256, 0.2253, 0.1963,
                           0.169, 0.1435, 0.1171, 0.0952, 0.0753, 
                           0.0571, 0.0407, 0.0257, 0.0122, 0.000,
time_step                = 125,
time_step_fract_num      = 0,
time_step_fract_den      = 1,
max_dom                  = 4,
e_we                     = 26,       76,      276,      456,
e_sn                     = 37,      126,      231,      671,
e_vert                   = 50,       50,       50,       50,
p_top_requested          = 5000,
num_metgrid_levels       = 32,
num_metgrid_soil_levels  = 4,
dx                       = 25000,     5000,     1000,      200,
dy                       = 25000,     5000,     1000,      200,
grid_id                  = 1,        2,        3,        4,
parent_id                = 1,        1,        2,        3,
i_parent_start           = 1,        6,       12,      121,
j_parent_start           = 1,        6,       17,       52,
parent_grid_ratio        = 1,        5,        5,        5,
parent_time_step_ratio   = 1,        5,        5,        5,
feedback                 = 1,
smooth_option            = 0,
/

&physics
mp_physics               = 6,        6,        6,        6,
ra_lw_physics            = 1,        1,        1,        1,
ra_sw_physics            = 1,        1,        1,        1,
radt                     = 30,       30,       30,       30,
sf_sfclay_physics        = 1,        1,        1,        1,
sf_surface_physics       = 2,        2,        2,        2,
bl_pbl_physics           = 1,        1,        1,        1,
bldt                     = 0,        0,        0,        0,
cu_physics               = 1,        1,        0,        0,
cudt                     = 5,        5,        5,        2,
isfflx                   = 1,
ifsnow                   = 0,
icloud                   = 1,
surface_input_source     = 1,
num_soil_layers          = 4,
sf_urban_physics         = 0,        0,        0,        0,
maxiens                  = 1,
maxens                   = 3,
maxens2                  = 3,
maxens3                  = 16,
ensdim                   = 144,
/

&fdda
/

&dynamics
w_damping                = 1,
diff_opt                 = 1,
km_opt                   = 4,
diff_6th_opt             = 0,        0,        0,        0,
diff_6th_factor          = 0.12,     0.12,     0.12,     0.12,
base_temp                = 290.,
damp_opt                 = 0,
zdamp                    = 5000.,    5000.,    5000.,     5000,
dampcoef                 = 0.2,      0.2,      0.2,      0.2,
khdif                    = 0,        0,        0,        0,
kvdif                    = 0,        0,        0,        0,
non_hydrostatic          = .true.,   .true.,   .true.,   .true.,
moist_adv_opt            = 1,        1,        1,        1,
scalar_adv_opt           = 1,        1,        1,        1,
/

&bdy_control
spec_bdy_width           = 5,
spec_zone                = 1,
relax_zone               = 4,
specified                = .true.,  .false.,  .false.,  .false.,
nested                   = .false.,   .true.,   .true.,   .true.,
/

&grib2
/

&namelist_quilt
nio_tasks_per_group      = 0,
nio_groups               = 1,
/

[Attachment: error1.png]
 
With such a small number of grid points, i.e.,
e_we = 26, 76, 276, 456,
e_sn = 37, 126, 231, 671,

64 processors are far too many for this case. Please reduce the number of processors to 2, or increase the number of grid points.
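
As a rough back-of-the-envelope check (a sketch only; WRF's actual domain decomposition may differ slightly), 64 ranks split d01 into patches of only a few grid points on a side, well below the commonly quoted minimum of roughly 10 x 10 grid points per processor:

Code:
# rough sanity check (assumption: ranks are split ~evenly in x and y)
e_we=26; e_sn=37            # d01 grid points from the namelist above
nx=8; ny=8                  # 64 ranks ~ 8 x 8 patches
echo "patch ~ $(( e_we / nx )) x $(( e_sn / ny )) grid points"      # ~3 x 4
echo "max ranks for d01 ~ $(( (e_we / 10) * (e_sn / 10) ))"         # ~6 with a 10x10 minimum patch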
 
Hi. I reduced the number of processors to 16 for real.exe and used 64 for wrf.exe, and it worked well. The problem is that a 1-day simulation with the namelist above takes more than 2 days of wall-clock time, and I need to run it for 1 month. How do I improve the performance of my WRF run? Below is the hardware specification of the HPC.

Code:
Total number of nodes: 48
Superserver 2027PR (48 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz)
256 Gigabytes of RAM
2 x 10 Gbps ethernet
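
For reference, here is roughly how I split them in the job script (a sketch; the SBATCH and module settings are unchanged from the original script):

Code:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=32000
#SBATCH --time=168:00:00
#SBATCH --partition=batch

# set stack size to unlimited
ulimit -s unlimited

module load wrf/3.9.1-intel-mpich

# fewer ranks for real.exe (small d01), more for wrf.exe
mpirun -n 16 real.exe && mpirun -n 64 wrf.exe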
 
The problem is that the size of your domains differs so much. d01 is only 26x37, while d04 is 456x671. The number of processors you would need to run d04 efficiently is much larger than the number you are allowed to use for d01. Take a look at this FAQ, which describes the problem and advises on how to choose a good number of processors: http://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=5082

I should also note that we never advise having a domain smaller than about 100x100 grid cells. A smaller domain isn't large enough to give weather systems time to propagate through the domain and to spin up effects from the model and its physics. You would essentially be running the model on the boundary conditions alone, without much additional calculation. Your d01 is very small and therefore will likely not provide any useful output. I would advise increasing d01 and d02 and then, based on the link I provided above, finding a number of processors that works best for your application. It may take a few small tests to pinpoint the number that is best for you. You can run short simulations and then extrapolate to estimate the total amount of time a longer run will take.
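
For example, something along these lines (a rough sketch; the 6-hour test length and rank count are just placeholders) times a short test run and scales it up linearly:

Code:
# rough timing sketch (placeholder values): run a short test, then scale up
test_hours=6                         # set the namelist to a 6-hour test first
target_hours=$(( 31 * 24 ))          # the full ~1-month simulation
start=$SECONDS
mpirun -n 64 wrf.exe                 # short test run
elapsed=$(( SECONDS - start ))
echo "test: ${elapsed}s; full run estimate: ~$(( elapsed * target_hours / test_hours / 3600 )) hours"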
 
Hi kwerner. I thought of asking about having fewer than 100 grid points in this thread rather than in the other one. I have a problem with it, though, and would like to ask for your advice. At first, I configured my d01 with dx = dy = 25 km and more than 100 grid points. I was told that FNL data at 1-degree spatial resolution should only be downscaled by a ratio of about 5, hence the 25 km. The problem is that my domain is so large it reaches other countries too (see attachment ncview.HGT_M.ps).

When I reduced the number of grid points, I was able to place the d01 boundaries just right (see attachment ncview.HGT_M1.ps).

I know, based on your recommendation, that the number of grid points should be > 100. Is there any way I can reduce the size of my domain while keeping a grid spacing of 25 km and more than 100 grid points?
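
Just to put rough numbers on it (approximate, ignoring map-projection effects), 25 km spacing with 100+ grid points already spans well over 2000 km, which is why my d01 reaches the neighbouring countries:

Code:
# approximate domain extent: width ~ (n_points - 1) * dx
dx_km=25; n_points=100
echo "~$(( (n_points - 1) * dx_km )) km across"    # ~2475 km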
 
Hi,
I actually moved this back to this post so that we can keep things organized and because the other topic is regarding a new type of static data, and not necessarily a domain size/namelist.input issue.

It shouldn't really matter if d01 overlaps another country. In the end, I suspect you're only interested in d04 for analysis. It's quite common for domains to overlap places we aren't interested in. If you still feel this would be a problem, can you explain more about why? Thanks.
 
Hi kwerner. I'd like to ask for your advice regarding the configurations above. Would it be better to use GDAS/FNL at 0.25 degrees so that I can reduce my parent domain to around 9 km and keep it from being so big? Will that work? Thanks.
 
If you run the case with the quarter-degree GFS data as input, then I think it is better to reduce max_dom to 3. You can set the grid intervals to be
dx = 5000, 1000, 200
max_dom = 3

You can also set the grid numbers to be approximately the same for the parent and child domains. It doesn't really matter if the parent domain is fairly large, because it won't cause a large increase in computation time.
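
The numbers work out roughly as follows (approximate; 0.25 degree is about 27 km near the equator), so a 5 km parent domain keeps the downscaling ratio near the factor of 5 discussed earlier in this thread:

Code:
# approximate check: 1 degree ~ 111 km, so 0.25-degree input ~ 27 km
input_km=$(( 111 / 4 ))
parent_dx_km=5
echo "downscaling ratio ~ $(( input_km / parent_dx_km ))"    # ~5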
 