WRF Simulation Stalls During Restart File Writing? (WRFV4.6)

Dear WRF Community,

I'm encountering an issue with my WRF simulation where the program appears to stall during execution. The details are as follows:

Configuration:
- WRF Version: [WRFV4.6]
- Parameterization scheme: PBL5, MP38, CU16
- Computational Resources: 5 nodes, 320 cores in total on a supercomputer
- FDDA: 0 (in one case), 2 (in another case)
- namelist.input:
Code:
 &time_control
 run_days                            = 4,
 run_hours                           = 0,
 run_minutes                         = 0,
 run_seconds                         = 0,

 start_year                          = 2021, 2021,
 start_month                         = 07,   07,
 start_day                           = 21,   21,
 start_hour                          = 12,   12,
 start_minute                        = 00,   00,
 start_second                        = 00,   00,
 end_year                            = 2021, 2021,
 end_month                           = 07,   07,
 end_day                             = 25,   25,
 end_hour                            = 12,   12,
 end_minute                          = 00,   00,
 end_second                          = 00,   00,

 interval_seconds                    = 10800,
 input_from_file                     = .true.,.true.,
 adjust_output_times                 = .false.
 history_interval_h                  = 3,   3,
 history_begin_h                     = 6,
 frames_per_outfile                  = 1000,  1000,
 restart                             = .false.,
 override_restart_timers             = .false.,
 output_ready_flag                   = .true.,
 restart_interval                    = 1440,
 io_form_history                     = 2,
 io_form_restart                     = 2,
 io_form_input                       = 2,
 io_form_boundary                    = 2,
 debug_level                         = 0,
 auxinput4_inname                    = "wrflowinp_d<domain>",
 auxinput4_interval                  = 360,
 io_form_auxinput4                   = 2,
 io_form_auxinput2                   = 2,
 diag_print                          = 2,
 output_diagnostics                  = 1,
 auxhist3_outname                    = "wrfxtrm_d<domain>_<date>"
 auxhist3_interval                   = 1440, 1440
 frames_per_auxhist3                 = 100, 100
 io_form_auxhist3                    = 2
 /

 &domains
 time_step                           = 27,
 time_step_fract_num                 = 0,
 time_step_fract_den                 = 1,
 max_dom                             = 2,
 e_we                                = 307,   556,
 e_sn                                = 334,   628,
 e_vert                              = 47,    47,
 eta_levels(1:47)                    = 1.000, 0.995, 0.990,
                                       0.985, 0.980, 0.970,
                                       0.960, 0.945, 0.930,
                                       0.910, 0.890, 0.855,
                                       0.820, 0.785, 0.750,
                                       0.715, 0.680, 0.645,
                                       0.610, 0.575, 0.540,
                                       0.505, 0.470, 0.440,
                                       0.410, 0.380, 0.350,
                                       0.325, 0.300, 0.275,
                                       0.250, 0.230, 0.210,
                                       0.190, 0.170, 0.155,
                                       0.140, 0.125, 0.110,
                                       0.095, 0.080, 0.065,
                                       0.050, 0.035, 0.020,
                                       0.010, 0.000,
 p_top_requested                     = 5000,
 num_metgrid_levels                  = 34,
 num_metgrid_soil_levels             = 4,
 dx                                  = 9000, 3000
 dy                                  = 9000, 3000
 grid_id                             = 1,     2,
 parent_id                           = 1,     1,
 i_parent_start                      = 1,     62,
 j_parent_start                      = 1,     62,
 parent_grid_ratio                   = 1,     3, 
 parent_time_step_ratio              = 1,     3, 
 feedback                            = 1,
 smooth_option                       = 0,
/

 &physics
 mp_physics                          = 38, 38,
 !wsm6 wsm7 new Thompson 6 24 38
 write_thompson_mp38table            = .true.,
 do_radar_ref                        = 1,
 acc_phy_tend                        = 1,     1,
 ra_lw_physics                       = 4,     4,       
 ra_sw_physics                       = 4,     4,       
 radt                                = 12,    12,   
 sf_sfclay_physics                   = 2, 2,       
 sf_surface_physics                  = 2,     2,
 sf_urban_physics                    = 1,     1,
 num_soil_layers                     = 4,
 ifsnow                              = 1,
 sst_update                          = 1,
 bl_pbl_physics                      = 2, 2,
 cu_physics                          = 16,     0,
 sf_ocean_physics                    = 1,
 windfarm_opt                        = 0,     0,
 windfarm_ij                         = 1,
 windfarm_wake_model                 = 0,     0,
 windfarm_overlap_method             = 0,     0,
 !set value to 0 or 2
 isftcflx                            = 1,
 /

 &fdda
 grid_fdda                           = 0, 0,
 gfdda_inname                        = "wrffdda_d<domain>",
 gfdda_interval_m                    = 180,
 io_form_gfdda                       = 2,
 if_no_pbl_nudging_ph                = 1,
 if_no_pbl_nudging_t                 = 1,
 if_no_pbl_nudging_q                 = 1,
 if_no_pbl_nudging_uv                = 0,
 if_zfac_uv                          = 1,
 k_zfac_uv                           = 20,
 gph                                 = 0.0003,
 gt                                  = 0.0003,
 gq                                  = 0.0001,
 guv                                 = 0.0003,
 xwavenum                            = 3,
 ywavenum                            = 3,
 /

 &dynamics
 w_damping                           = 0,
 diff_opt                            = 1,      1,
 km_opt                              = 4,      4, 
 mix_isotropic                       = 0,      0,
 diff_6th_factor                     = 0.12,   0.12,
 base_temp                           = 290.
 damp_opt                            = 3,
 zdamp                               = 5000.,  5000.,
 dampcoef                            = 0.2,    0.2,
 non_hydrostatic                     = .true., .true.,
 moist_adv_opt                       = 1,      1,
 scalar_adv_opt                      = 1,      1,
 /

 &bdy_control
 spec_bdy_width                      = 5,
 spec_zone                           = 1,
 relax_zone                          = 4,
 specified                           = .true., .false.,
 nested                              = .false., .true.,
 /

 &namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 2,
 /
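To narrow down which restart write coincides with the stall, it can help to list the model times at which restart files are due. The sketch below only does the date arithmetic for the `start_*`, `run_days` and `restart_interval` values shown in the namelist above; the helper name `restart_times` is made up for illustration, and whether WRF also writes a restart file exactly at the final time is an assumption here, not something confirmed in this thread.

```python
from datetime import datetime, timedelta


def restart_times(start, run_days, restart_interval_min):
    """Return the model times at which a restart file would be due.

    Hypothetical helper: it only steps through the run in multiples of
    restart_interval (minutes) and does not read any WRF files.
    """
    end = start + timedelta(days=run_days)
    step = timedelta(minutes=restart_interval_min)
    times = []
    t = start + step
    while t <= end:  # assumption: the final time counts as a restart time
        times.append(t)
        t += step
    return times


# Values copied from the &time_control block above.
start = datetime(2021, 7, 21, 12)
for t in restart_times(start, run_days=4, restart_interval_min=1440):
    # WRF restart files follow the wrfrst_d<domain>_<date> naming pattern.
    print(t.strftime("wrfrst_d02_%Y-%m-%d_%H:%M:%S"))
```

Comparing this list with the last timestamp in the rsl files shows whether the run stops at the first restart write or at a later one.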


Issue Description:
1. The program stops producing output, but system processes show it's still running.
2. In one case (FDDA0), the simulation appears to stall while writing the wrfrst_d02 file. After forcibly terminating the program, I noticed that the file size had increased by only 8 KB.
3. In another case (FDDA2), the program also stalled at a certain point, but I couldn't identify the specific cause.
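The symptom described above (processes still running, output file no longer growing) can be watched for automatically. The following is a minimal sketch that polls a file's size and returns once it has stopped growing; the function name `wait_until_stalled` and its default polling values are illustrative assumptions, and a file that has simply finished writing will look the same as a stalled one, so the result still has to be checked against the rsl logs.

```python
import os
import time


def wait_until_stalled(path, poll_s=30.0, idle_polls=10):
    """Poll the size of `path` and return its size once it stops growing.

    The file is treated as stalled after `idle_polls` consecutive polls
    (each `poll_s` seconds apart) without any increase in size.
    """
    last = os.path.getsize(path)
    idle = 0
    while idle < idle_polls:
        time.sleep(poll_s)
        size = os.path.getsize(path)
        if size > last:
            last, idle = size, 0  # still growing: reset the idle counter
        else:
            idle += 1
    return last


# Example call (the file name is hypothetical; adjust it to the run):
# size = wait_until_stalled("wrfrst_d02_2021-07-22_12:00:00")
# print(f"no growth detected, file size is {size} bytes")
```

The same function can be pointed at an rsl.error.* file to see whether the log also stops advancing at the same time as the restart file.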



Attached Files:
1. namelist.input (for the FDDA0 and FDDA2 case)
2. Relevant rsl output files (some data in the middle was deleted to reduce the upload size)
3. Screenshot showing the stall during wrfrst_d02 writing

Questions:
1. What could be causing the program to stall during the restart file writing?
2. Are there known issues with this particular combination of parameterization schemes?
3. How can I diagnose the root cause of the stall in the FDDA2 case?
4. What additional debugging steps or information would be helpful to resolve this issue?

Any insights or suggestions would be greatly appreciated. I'm happy to provide any additional information that might be helpful in diagnosing this problem.

Thank you for your time and expertise.

Junius.
 

Attachments

  • namelist-fdda0.input (6.6 KB)
  • rsl.error-fdda2.0000 (6.9 MB)
  • rsl.error-fdda0.0000.txt (4.5 MB)
Junius,
We are aware of this issue; several users have reported the same problem.
Could you please run the case with more processors? Our earlier tests indicate that running with more processors solves the problem in most cases, although not always.
Please give it a try; we hope it works for you. We have not yet figured out the reason for this model behavior.
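For anyone trying this suggestion, a rough sketch of how the per-task patch size changes with the task count may be useful when picking a new core count. It assumes the MPI tasks are arranged in the most nearly square two-factor grid, which is an assumption about the default decomposition when `nproc_x` and `nproc_y` are not set; the helper names are hypothetical, and 640 is only an illustrative doubling of the 320 cores in the original post.

```python
import math


def near_square_factors(ntasks):
    """Return (nx, ny) with nx * ny == ntasks and nx <= ny, as square as possible."""
    nx = math.isqrt(ntasks)
    while ntasks % nx:
        nx -= 1
    return nx, ntasks // nx


def patch_sizes(e_we, e_sn, ntasks):
    """Approximate grid points per patch in each direction (assumed layout)."""
    nx, ny = near_square_factors(ntasks)
    return nx, ny, e_we / nx, e_sn / ny


# Domain sizes (e_we, e_sn) taken from the &domains block in the first post.
for name, e_we, e_sn in [("d01", 307, 334), ("d02", 556, 628)]:
    for ntasks in (320, 640):  # 640 is an illustrative value, not a recommendation
        nx, ny, px, py = patch_sizes(e_we, e_sn, ntasks)
        print(f"{name}, {ntasks} tasks: {nx} x {ny} task grid, "
              f"about {px:.1f} x {py:.1f} points per patch")
```

The coarse domain d01 is the one to watch when increasing the task count, since its patches shrink fastest and very small patches can cause problems of their own.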
 
I have to say you are right! It did help!
Thanks so much!
 