
Segmentation fault while running physics scheme tests with NDOWN inputs

Arty

Hello,

After successfully running a "Base" physics configuration (see table below) multiple times, for high-resolution topography and number-of-vertical-levels experiments, I needed to try other physics parameterizations. Based on an extensive literature review and my case's characteristics (a tropical, oceanic, mountainous island), I selected two physics scheme configurations, Test1 and Test2, as shown in the table below:

&Physics              Base                   Test1 (WD_KF_CAM)    Test2 (WS_TK_RRTMG)
mp_physics            6. WSM6                16. WDM6             6. WSM6
cu_physics            2. BMJ                 1. KF                6. TDK
ra_lw_physics         1. RRTM                3. CAM               4. RRTMG
ra_sw_physics         1. Dudhia              3. CAM               4. RRTMG
sf_surface_physics    1. Thermal Diffusion   2. Noah LSM          2. Noah LSM
sf_sfclay_physics     1. Revised MM5         1. Revised MM5       1. Revised MM5
pbl_physics           1. YSU                 1. YSU               1. YSU
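
(For reference, a quick way to compare the &physics blocks of two runs directly in the namelists is to diff that section; the file names below are just placeholders:)

Code:
# Compare the &physics sections of two namelists
# (namelist.input.test1 / namelist.input.test2 are placeholder names)
diff <(sed -n '/&physics/,/^ *\//p' namelist.input.test1) \
     <(sed -n '/&physics/,/^ *\//p' namelist.input.test2)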

I did not find any documented incompatibility between the physics schemes selected within either of the two new configurations.

However, both runs crash. I checked that the crashes are not due to a CFL violation.
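
In case it helps, the check I mean is simply searching the rsl files for CFL warnings, along these lines:

Code:
# WRF writes CFL warnings to the rsl files when the CFL criterion
# is violated; no matches here means no such warning was logged
grep -i cfl rsl.error.* rsl.out.*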

Nevertheless, I tried decreasing the timestep from 40 seconds to 36 seconds. Both runs still crash, at the 7th and 8th step respectively (equivalent to approx. 4-5 minutes of model time). See below:

Code:
cat rsl.error.0007 | tail -n 20 | head -n 10
d01 2013-09-01_00:04:12 calling inc/HALO_EM_PHYS_A_inline.inc
d01 2013-09-01_00:04:12 Top of Radiation Driver
d01 2013-09-01_00:04:12 calling inc/HALO_PWP_inline.inc
d01 2013-09-01_00:04:12 calling inc/HALO_CUP_G3_IN_inline.inc
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source      
wrfexe             000000000288329D  for__signal_handl     Unknown  Unknown
libpthread-2.19.s  00002AAAADE23870  Unknown               Unknown  Unknown
wrfexe             000000000229590C  Unknown               Unknown  Unknown
wrfexe             000000000228C156  Unknown               Unknown  Unknown
Code:
cat rsl.error.0012 | tail -n 20 | head -n 10
d01 2013-09-01_00:04:48 Top of Radiation Driver
d01 2013-09-01_00:04:48 CALL cldfra1
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source      
wrfexe             000000000288329D  for__signal_handl     Unknown  Unknown
libpthread-2.19.s  00002AAAADE23870  Unknown               Unknown  Unknown
wrfexe             0000000001C6FD44  Unknown               Unknown  Unknown
wrfexe             0000000001C0B286  Unknown               Unknown  Unknown
wrfexe             0000000001C05C28  Unknown               Unknown  Unknown
wrfexe             0000000001BFBACC  Unknown               Unknown  Unknown

I then tried decreasing the timestep to 30 seconds; the simulations go on for two more steps each, to the 9th and 10th (again, around 4-5 minutes). When reducing the timestep to 10 seconds for Test1, I get further (until the 29th step), but the run still crashes around 5 minutes in.

I also investigated the wrfinput files (the script does not handle numbers in scientific notation very well, but it is good enough for a quick overview):

Code:
./minmax.sh wrfinput_d01_2013_09
Variable        Minimum         Maximum
---------------------------------------------
T2:             296.2937,       301.6903,
T:              -6.958842e-05,  201.2966,
PSFC:           91929.33,       101260.6,
P:              0.6750414,      1259.358,
U10:            -8.037662,      -1.940215,
V10:            -2.003774,      4.477623,
U:              -10.00002,      27.52425,
V:              -9.141862e-05,  17.43193,
W:              -9.000254e-05,  9.999888e-05,
SST:            296.163,        303.1039,
TSK:            296.163,        303.1039,
TH2:            295.2446,       301.8733,
Q2:             0.01450976,     0.01995717,
QCLOUD:         0               9.9431e-06,
QVAPOR:         0.0001000018,   9.999984e-07,

./minmax.sh wrfinput_d02_2013_09
Variable        Minimum         Maximum
---------------------------------------------
T2:             297.4885,       301.6901,
T:              -8.41996e-05,   194.4953,
PSFC:           91930.78,       101171.6,
P:              0.8658642,      1170.362,
U10:            -8.000412,      -1.995662,
V10:            -2.003796,      4.477623,
U:              -9.000722,      22.21929,
V:              -9.379655e-05,  12.23109,
W:              -9.000978e-06,  9.999872e-05,
SST:            296.1638,       302.3132
TSK:            296.1638,       302.3132
TH2:            297.0318,       301.8732,
Q2:             0.01605123,     0.01889061,
QCLOUD:         0               9.998767e-08,
QVAPOR:         0.0001000005,   9.999933e-07,
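
For context, a minimal sketch of the kind of min/max script used here, reconstructed hypothetically with ncdump and awk (this is not the actual minmax.sh, which apparently compares values as text and therefore struggles with scientific notation):

Code:
#!/bin/bash
# Hypothetical sketch of a min/max helper for WRF NetCDF files
# (not the actual minmax.sh). Usage: ./minmax.sh <file> [VAR ...]
file=$1; shift
vars=${*:-T2 T PSFC P U10 V10 U V W SST TSK TH2 Q2 QCLOUD QVAPOR}
printf '%-15s %-15s %s\n' 'Variable' 'Minimum' 'Maximum'
echo '---------------------------------------------'
for v in $vars; do
  # ncdump prints the values after a "data:" line; strip the punctuation
  # and let awk (which parses 1e-05 and the like numerically) keep the
  # running minimum and maximum
  ncdump -v "$v" "$file" | sed -e '1,/data:/d' -e "s/$v =//" -e 's/[;,}]/ /g' |
  awk -v var="$v" '{ for (i = 1; i <= NF; i++) {
        if (min == "" || $i + 0 < min + 0) min = $i + 0
        if (max == "" || $i + 0 > max + 0) max = $i + 0 } }
      END { printf "%-15s %-15s %s\n", var":", min",", max"," }'
done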

Using ncview and looking at some variables did not help either: I cannot find anything wrong in the wrfinput files. As for the other inputs: the wrflowinp files are unchanged since the previous "Base" runs (they contain only VEGFRA and SST). The wrfbdy files obtained via NDOWN unfortunately cannot be visualized with ncview; however, I followed the whole previously successful process without any sign of an error and am fairly confident the problem does not originate in this file.

Lastly, the first wrfout* files (approx. 15 MB each) contain no time record and cannot be read by ncview.
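
A quick way to see whether any time record made it into those truncated files is to dump the Times variable; the file name below is only an example:

Code:
# List whatever time records were written to the early wrfout file
# (adjust the file name to the actual output)
ncdump -v Times wrfout_d01_2013-09-01_00:00:00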

Attached below are the tar files containing all rsl.error* files as well as the namelist.input for each of the two configurations.

Thank you for your time.

EDIT: it seems, once again, that the problem originates from NDOWN's wrfinput* files. The config. works fine when initialized with REAL's outputs (even though the config. used in the REAL namelist is not exactly the same as the one run in WRF).
 

Attachments

  • Test1_debug10.gz (2.4 MB)
  • Test2_debug10.gz (2.6 MB)
The problem definitely arises from the wrfinput* files. Initially, I ran a two-domain, two-way-nested run. With a different physics configuration, I explored two options:

1) Using NDOWN's wrfinput file and running the first domain only.

2) Extracting REAL's wrfinput* files as grids in order to regrid the original wrfout file and create wrfinput* files from it (using a script of my own based on NCO & CDO, which has proven effective with the "Base" configuration above); a rough sketch of this idea is shown below.
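
Roughly, the idea of option 2 looks like the following (a hypothetical sketch, not the actual script; WRF's curvilinear XLAT/XLONG grid generally needs extra preparation before CDO will remap onto it):

Code:
# Hypothetical sketch of option 2 (not the actual NCO/CDO script).
# Take the target grid from REAL's wrfinput_d02, then remap the donor
# wrfout onto it with bilinear interpolation.
cdo griddes wrfinput_d02 > d02.grid
cdo remapbil,d02.grid wrfout_d01_2013-09-01_00:00:00 d02_regridded.nc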

Unfortunately, neither of these approaches was successful. The only way I managed to get the computation running was by using REAL's wrfinput* files (based on CFSR), along with NDOWN's wrfbdy* files.

While it's not a major issue, as I'm confident that the initial conditions are quickly "blended" away given a sufficient spin-up time, I would still prefer to succeed in using the "right" initial conditions, i.e. ones based on the wrfout* files that I'm using for the LBCs.

Any hints?
 
If the model crashes immediately, that often indicates the input data are wrong. I believe you are right that the wrfinput is not correct. I wonder whether you followed exactly the same steps described in the WRF User's Guide for running ndown?
Below is the link describing how to run ndown:

Hello M. Cheng,

I did follow the User's Guide for the main instructions (see the bold comment here; maybe this has since been addressed, but as I'm working with the old WRF V3.6 I may not be aware of recent updates...). Also, NDOWN's functionality could be improved (see Point 2 here).

My case is somewhat special: I use wrfout* files from another person's previous work from 2018, and I only have access to his wrfout* files. I'm running a two-domain, two-way-nested run forced by these old wrfout* files via NDOWN. But NDOWN only provides a wrfinput for d01, so I need to create a wrfinput for d02 (by interpolating NDOWN's d01 wrfinput): this step proved to work fine for the "Base" configuration. This also leaves room for another NDOWN improvement.
I'm also doubling the vertical resolution using NDOWN's vertical refinement, which prevents me from using horizontal interpolation from the 33-vertical-level wrfout* files to a 65-vertical-level wrfinput_d02 file (as I successfully did before when creating 33-level wrfinput files).
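
(For reference, the vertical refinement referred to here is NDOWN's namelist option; assuming the V3.x option name vert_refine_fact, it can be checked with:)

Code:
# Show the vertical refinement factor ndown was run with
# (assumes the V3.x &domains option is named vert_refine_fact)
grep -i vert_refine namelist.input
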
Finally, the process I follow worked for the "Base" configuration, so this looks more like the physics being the problem here. For example, I noticed that surface physics options cannot easily be interchanged: REAL has to be run again, and I can't use inputs created with sf_surface_physics = 1 if sf_surface_physics = 2 is set in the namelist (and vice versa), even for runs that don't use NDOWN's outputs at all.
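
(One way to see which surface physics option a given input file was built for is to look at its global attributes, e.g.:)

Code:
# The land-surface option used when the input was created is recorded
# as a global attribute of the wrfinput file
ncdump -h wrfinput_d01 | grep -i SF_SURFACE_PHYSICS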

Also, one should keep in mind that the computation for both new configurations does run for a few steps, so it's not the classic immediate segfault where the run doesn't even start.

I'll run some other tests, varying the physics configuration slightly from "Base" to try to detect where it starts failing. Unfortunately, as a PhD student, I must focus on more urgent matters, even though I really like digging into WRF. For now, I see no other way to understand what's wrong and what the cause is.
 
WRFV3.6.1 is a pretty old version of WRF. It is not our top priority to debug issues in such an old version.

Is it possible for you to update the code to a newer version, e.g., WRFV4.5?
 

Of course, I understand that V3.6 is old. Initially I was on V4.2, but troubles of (then) unknown origin forced me to switch to and test V3.6 for compatibility reasons: the wrfout* files I was, and still am, using were created with that version. However, now that I have fixed what may have been the cause of multiple failures, I guess I could go back to the previously used version, or maybe the latest one available.

I wonder, however, whether it would be possible for NDOWN to take care of wrflowinp files as well as sub-domains in the future? As mentioned in the latest User's Guide and the previous ones, it is indeed cumbersome to create wrfinput files for nested domains, since it requires running a short WRF simulation and repeating the NDOWN process. That said, I can easily understand that jumping across two successive ratio-5 horizontal refinements could lead to an inaccurate interpolation on the finer grid.
In my case (tropical islands in the middle of the South Pacific Ocean), I use two successive ratio-3 horizontal refinements to go from 21 km to 7 km and then 2.333 km. Am I right in thinking that interpolating with such an overall ratio (9) is not too big a deal? For example, I could also try creating a wrfinput file for my d02 via NDOWN by setting a ratio of 9 instead of 3 in the namelist and pointing wrfndi to the correct domain's file.

Finally, please note that your support is much appreciated; despite my wish to improve the workflow, I enjoy working and learning with WRF.
 