
(RESOLVED) Unknown origin segmentation fault --> input data from ndown are corrupted on some variables

Arty

Hello,

Edit: I first thought the problem came from the reference lat/lon in the wrfbdy_d02 output from ndown.exe, but after another dedicated test run (a totally different configuration, CFSR-forced), it appears that wrfbdy_d02 always carries the reference lat/lon of d01 instead of d02 (see the second post below). Could anyone confirm whether this is normal?

Regarding the most common causes of segfaults:
1) I ran several tests with fewer and with more processors: same result, so that is not the problem. OK
2) I checked the disk space: OK
3) Input data seemed OK; WPS, REAL and NDOWN all reported success: OK --> NOK. Probable origin of the segfault: corrupted ndown outputs on some variables (notably the pressures and QVAPOR). I had only checked T2, TSK, SST, U10 and V10, which were fine, so I didn't look any further, mea culpa (see the range-check sketch after this list).
4) I checked for CFL errors: OK
5) I checked the memory size: OK
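For reference, here is the kind of quick range check I could have run on the ndown outputs from the start (a minimal sketch using the netCDF4 Python package; the file name, the variable list and the plausible ranges are only illustrative assumptions):

Code:
# Quick range check of a few ndown output fields (illustrative sketch;
# the file name, variable list and plausible ranges are assumptions).
from netCDF4 import Dataset

expected = {
    "PSFC":   (3.0e4, 1.1e5),   # surface pressure [Pa]
    "QVAPOR": (0.0, 0.05),      # water vapor mixing ratio [kg/kg]
    "T2":     (180.0, 340.0),   # 2-m temperature [K]
}

with Dataset("wrfinput_d02") as nc:
    for name, (lo, hi) in expected.items():
        data = nc.variables[name][:]
        vmin, vmax = float(data.min()), float(data.max())
        status = "OK" if lo <= vmin and vmax <= hi else "SUSPECT"
        print(f"{name:8s} min={vmin:.3e} max={vmax:.3e} -> {status}")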

More info: I work on the "Datarmor" supercomputer. I suspected a problem caused by ref_lat/ref_lon because I had already had a segfault, probably caused by a typo in those variables in a previous run's namelist (see: (RESOLVED) forrtl: severe (174): SIGSEGV, segmentation fault occurred --> namelist typo).


In case anyone has an idea: ndown.exe normally outputs two files, wrfbdy_d02 and wrfinput_d02. These two gridded datasets should (as I understood it) be on the same grid, but here only wrfinput_d02 is correctly centered:

> nch wrfinput_d02 | gall (nch is an alias for ncdump -h <file>; gall greps the attributes of interest)
:WEST-EAST_GRID_DIMENSION = 121 ;
:SOUTH-NORTH_GRID_DIMENSION = 121 ;
:BOTTOM-TOP_GRID_DIMENSION = 33 ;
DX = 7000.f ;
DY = 7000.f ;
:USE_THETA_M = 0 ;
:GWD_OPT = 0 ;
:GRID_ID = 2 ;
:I_PARENT_START = 318 ;
:J_PARENT_START = 65 ;
DT = 40.f ;
:CEN_LAT = -17.62045f ; --> OK
:CEN_LON = -149.5606f ; --> OK
> nch wrfbdy_d02 | gall
:WEST-EAST_GRID_DIMENSION = 121 ;
:SOUTH-NORTH_GRID_DIMENSION = 121 ;
:BOTTOM-TOP_GRID_DIMENSION = 33 ;
DX = 7000.f ;
DY = 7000.f ;
:USE_THETA_M = 0 ;
:GWD_OPT = 0 ;
:GRID_ID = 2 ;
:I_PARENT_START = 318 ;
:J_PARENT_START = 65 ;
DT = 40.f ;
:CEN_LAT = -17.90401f ; --> NOK
:CEN_LON = -172.7851f ; --> NOK

Note that only CEN_LAT and CEN_LON differ from the wrfinput_d02 grid.
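
For anyone wanting to reproduce the comparison without my aliases, here is a minimal Python sketch with the netCDF4 package (same file names as above; the attribute list is just the ones I am looking at):

Code:
# Compare a few global attributes between the two ndown outputs
# (minimal sketch; the attribute list is just the ones shown above).
from netCDF4 import Dataset

attrs = ["CEN_LAT", "CEN_LON", "I_PARENT_START", "J_PARENT_START", "GRID_ID"]

with Dataset("wrfinput_d02") as a, Dataset("wrfbdy_d02") as b:
    for name in attrs:
        va, vb = a.getncattr(name), b.getncattr(name)
        flag = "" if va == vb else "   <-- differs"
        print(f"{name:16s} wrfinput={va}  wrfbdy={vb}{flag}")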

Moreover, I don't see any error in the ndown.exe rsl* files (which made me doubt that ndown was the cause of my problem):

> scs (equivalent to grep 'SUCCESS' rsl.e*)
rsl.error.0000: ndown_em: SUCCESS COMPLETE NDOWN_EM INIT
rsl.error.0001: ndown_em: SUCCESS COMPLETE NDOWN_EM INIT
rsl.error.0002: ndown_em: SUCCESS COMPLETE NDOWN_EM INIT
rsl.error.0003: ndown_em: SUCCESS COMPLETE NDOWN_EM INIT
... and so on for all rsl.error*
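
(The same check without the scs alias could be scripted as below; a rough sketch where the keywords are just common WRF log markers, not an exhaustive list.)

Code:
# Scan all rsl.error.* files for success and common failure markers
# (rough sketch; the keywords are typical WRF log strings, not exhaustive).
import glob

keywords = ("SUCCESS", "FATAL", "SIGSEGV", "cfl")

for path in sorted(glob.glob("rsl.error.*")):
    with open(path, errors="ignore") as f:
        for line in f:
            if any(k in line for k in keywords):
                print(f"{path}: {line.rstrip()}")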

Of course, when I try to run WRF with these as inputs (after renaming *_d02 to *_d01 as necessary), it fails and WRF's rsl* files read:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Please help.

Short description of the configurations:
Coarse is composed of 3 domains, with d01 centered on 17.90401°S / 172.7851°W and nested d02/d03 both centered on 17.62045°S / 149.5606°W.
Fine is composed of 2 domains, with d01/d02 respectively equivalent to Coarse's d02/d03 above, hence also centered on 17.62045°S / 149.5606°W.
See the namelist.input* attached.
 

Attachments

  • namelist.input (Fine-real).txt
    8.7 KB · Views: 4
  • namelist.input (Coarse-real).txt
    8.7 KB · Views: 2
  • namelist.input (Coarse-ndown).txt
    8.7 KB · Views: 3
Apparently "ill centered" wrfbdy file seems not to be a problem ; I ran another configration test with success ; I ran d01 on CFSR only for 5 days and then ran new_d01 (nested d02 from previous d01) via ndown.exe on wrfout* of previous run.

nch wrfinput_d01_2005_02 | gall
:WEST-EAST_GRID_DIMENSION = 121 ;
:SOUTH-NORTH_GRID_DIMENSION = 121 ;
:BOTTOM-TOP_GRID_DIMENSION = 33 ;
DX = 2500.f ;
DY = 2500.f ;
:USE_THETA_M = 1 ;
:GWD_OPT = 1 ;
:GRID_ID = 2 ;
:I_PARENT_START = 61 ;
:J_PARENT_START = 41 ;
DT = 15.f ;
:CEN_LAT = -17.50002f ; --> Original d02 cen_lat
:CEN_LON = -148.0853f ; --> Original d02 cen_lon (small shift of 20*dx)
nch wrfbdy_d01_2005_02 | gall
:WEST-EAST_GRID_DIMENSION = 121 ;
:SOUTH-NORTH_GRID_DIMENSION = 121 ;
:BOTTOM-TOP_GRID_DIMENSION = 33 ;
DX = 2500.f ;
DY = 2500.f ;
:USE_THETA_M = 1 ;
:GWD_OPT = 1 ;
:GRID_ID = 2 ;
:I_PARENT_START = 61 ;
:J_PARENT_START = 41 ;
DT = 15.f ;
:CEN_LAT = -17.50001f ; --> Original d01 cen_lat
:CEN_LON = -149.5f ; --> Original d01 cen_lon
 
Some more info: I checked the difference between the original wrfout* files I'm feeding to ndown (from WRF 3.6.1) and the wrfinput_d01 of my run (WPS and WRF 4.2.1). Here's the result (original vs. new configuration):

diff original.txt new.txt
2,4c2,4
< TITLE = " OUTPUT FROM WRF V3.6.1 MODEL" ;
< START_DATE = "2005-01-01_000000" ;
< SIMULATION_START_DATE = "1980-01-01_000000" ;
---
> TITLE = " OUTPUT FROM REAL_EM V4.2.1 PREPROCESSOR" ;
> START_DATE = "2005-01-01_060000" ;
> SIMULATION_START_DATE = "2005-01-01_060000" ;
10c10,18
< STOCH_FORCE_OPT = 0 ;
---
> AERCU_OPT = 0 ;
> AERCU_FCT = 1.f ;
> IDEAL_CASE = 0 ;
> DIFF_6TH_SLOPEOPT = 0 ;
> AUTO_LEVELS_OPT = 2 ;
> DIFF_6TH_THRESH = 0.1f ;
> DZBOT = 20.f ;
> DZSTRETCH_S = 1.3f ;
> DZSTRETCH_U = 1.1f ;
15c23
< DAMPCOEF = 0.1f ;
---
> DAMPCOEF = 0.2f ;
18,20c26,28
< MP_PHYSICS = 2 ;
< RA_LW_PHYSICS = 3 ;
< RA_SW_PHYSICS = 3 ;
---
> MP_PHYSICS = 6 ;
> RA_LW_PHYSICS = 1 ;
> RA_SW_PHYSICS = 1 ;
22,24c30,32
< SF_SURFACE_PHYSICS = 2 ;
< BL_PBL_PHYSICS = 9 ;
< CU_PHYSICS = 7 ;
---
> SF_SURFACE_PHYSICS = 1 ;
> BL_PBL_PHYSICS = 1 ;
> CU_PHYSICS = 2 ;
34a43,46
> USE_THETA_M = 0 ;
> USE_MAXW_LEVEL = 0 ;
> USE_TROP_LEVEL = 0 ;
> GWD_OPT = 0 ;
36,65c48
< SHCU_PHYSICS = 2 ;
< MFSHCONV = 0 ;
< FEEDBACK = 1 ;
< SMOOTH_OPTION = 2 ;
< SWRAD_SCAT = 1.f ;
< W_DAMPING = 0 ;
< DT = 48.f ;
< RADT = 100.f ;
< BLDT = 0.f ;
< CUDT = 5.f ;
< AER_OPT = 0 ;
< SWINT_OPT = 0 ;
< AER_TYPE = 1 ;
< AER_AOD550_OPT = 1 ;
< AER_ANGEXP_OPT = 1 ;
< AER_SSA_OPT = 1 ;
< AER_ASY_OPT = 1 ;
< AER_AOD550_VAL = 0.12f ;
< AER_ANGEXP_VAL = 1.3f ;
< AER_SSA_VAL = 0.f ;
< AER_ASY_VAL = 0.f ;
< MOIST_ADV_OPT = 1 ;
< SCALAR_ADV_OPT = 2 ;
< TKE_ADV_OPT = 2 ;
< DIFF_6TH_OPT = 0 ;
< DIFF_6TH_FACTOR = 0.12f ;
< OBS_NUDGE_OPT = 0 ;
< BUCKET_MM = -1.f ;
< BUCKET_J = -1.f ;
< PREC_ACC_DT = 0.f ;
---
> SF_SURFACE_MOSAIC = 0 ;
67,75c50
< ISFTCFLX = 1 ;
< ISHALLOW = 1 ;
< ISFFLX = 1 ;
< ICLOUD = 1 ;
< ICLOUD_CU = 0 ;
< TRACER_PBLMIX = 1 ;
< SCALAR_PBLMIX = 0 ;
< GRAV_SETTLING = 0 ;
< DFI_OPT = 0 ;
---
> SIMULATION_INITIALIZATION_TYPE = "REAL-DATA CASE" ;
88,92c63,68
< GRID_ID = 2 ;
< PARENT_ID = 1 ;
< I_PARENT_START = 47 ;
< J_PARENT_START = 15 ;
< PARENT_GRID_RATIO = 5 ;
---
> GRID_ID = 1 ;
> PARENT_ID = 0 ;
> I_PARENT_START = 1 ;
> J_PARENT_START = 1 ;
> PARENT_GRID_RATIO = 1 ;
> DT = 120.f ;
95c71
< TRUELAT1 = -10.f ;
---
> TRUELAT1 = -17.90401f ;
97,98c73,74
< MOAD_CEN_LAT = -10.00001f ;
< STAND_LON = 201.6f ;
---
> MOAD_CEN_LAT = -17.90401f ;
> STAND_LON = -172.7851f ;
101c77
< GMT = 0.f ;
---
> GMT = 6.f ;
106,111c82,87
< MMINLU = "USGS" ;
< NUM_LAND_CAT = 24 ;
< ISWATER = 16 ;
< ISLAKE = -1 ;
< ISICE = 24 ;
< ISURBAN = 1 ;
---
> MMINLU = "MODIFIED_IGBP_MODIS_NOAH" ;
> NUM_LAND_CAT = 21 ;
> ISWATER = 17 ;
> ISLAKE = 21 ;
> ISICE = 15 ;
> ISURBAN = 13 ;
112a89,90
> HYBRID_OPT = 2 ;
> ETAC = 0.1f ;

Are there any parameters that must necessarily be exactly the same? Apart from "use_theta_m" and "gwd_opt", I didn't run into any issue during WPS/REAL/NDOWN, and I corrected those two.
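
For completeness, the same attribute comparison can be scripted instead of diffing ncdump output; a minimal sketch with the netCDF4 Python package (the two file names are placeholders for the files compared above):

Code:
# Diff the global attributes of the old wrfout against the new wrfinput
# (sketch; the file names are placeholders for the two files compared above).
from netCDF4 import Dataset

with Dataset("wrfout_d02_original") as old, Dataset("wrfinput_d01") as new:
    old_attrs = {k: old.getncattr(k) for k in old.ncattrs()}
    new_attrs = {k: new.getncattr(k) for k in new.ncattrs()}

    for key in sorted(set(old_attrs) | set(new_attrs)):
        a, b = old_attrs.get(key), new_attrs.get(key)
        if a is None:
            print(f"only in new:      {key} = {b}")
        elif b is None:
            print(f"only in original: {key} = {a}")
        elif a != b:
            print(f"differs:          {key}: {a} -> {b}")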
 
Hi,
There are several things in your diffs that are concerning.
1. Why is the simulation start date different? And the start time is different, as well.
2. All of the landuse information is completely different (e.g., it looks like you're using MODIS for one simulation and USGS for the other)
3. If you are going to use older model data, you must turn off the hybrid option (which is set as default in V4.0+).
4. According to your namelist.input (coarse-ndown), you are trying to run ndown with 3 domains. This should only be set to 2. You should only run ndown from d01 to d02; then, after running wrf for d02, you run everything again for d02 to d03. Are you following the instructions in the WRF Users' Guide (see Running ndown for three or more domains)?

All that aside, can you let me know your reason for choosing to run ndown, as opposed to just simulating the 3 domains together? Typically ndown is used when you've run a very long simulation (years) for a coarse grid and later decide you want to add a finer-resolution grid, or when the sizes of the domains are so different that it's impossible to use an appropriate number of processors that satisfies the rules for each domain. Neither of those reasons seems to apply to your case.
 
Hi kwerner,

I gladly admit my case is not simple, to say the least. First I'd like to answer your questions/remarks.

1) I'm using "old" 30-years simulations from 2018 (that I didn't configure nor run myself). For my configuration tests, I only run 5 days in early 2005 ; when the configuration works well, I'll extend the period.

2) Could this be a "fatal" problem?

3) Thanks, I did not know that; I had only read about "force_use_old_data".

4) I've tried both configurations (Coarse with 3 domains, and with only 2). I understand there's no need to run the 3rd domain, as it's already taken care of in the "Fine" configuration. I followed (and read more than once) the User's Guide on that part and also the step-by-step tutorial, but I should note it doesn't make cases like this one easy.
I should also point out that, strangely, I never got any error about this 3rd domain. But I'll keep the Coarse configuration to 2 domains only from now on.

Why use ndown?

I think my answer to 1) already hints at that: I don't have access to the other data that would be needed; it's already great that I can use the wrfout* at 21 km over my region of interest. As you described it, ndown seems well suited to my case (old coarse run, later decision to refine).
You should also know that I looked into UPP as you advised me earlier this year, but there were incompatibilities for my use case; moreover, that tool is unfortunately no longer updated. So it seems I'm "stuck" (in every sense) with ndown, with no other option that I know of. But that's OK, as I'm confident I'll make it work with some help.

Do my answers help a little? Please feel free to ask for any details or data you need.

Thank you for your time, I really appreciate it.

Edit: I just tried it; adding hybrid_opt = 0 to the WRF 4.2.1 namelists does not suffice to avoid the segfault.
 
New update: I'm trying to match my namelist to that of the old run. I'm still encountering errors about parameter compatibility, so I keep iterating on the namelist, but no segfault (for the time being). Currently I'm getting this:

tuning parameters zm_convi: tau 600.000000000000
d01 2005-02-02_06:00:00 Input data is acceptable to use:
d01 2005-02-02_06:00:00 Input data processed for aux input 4 for domain 1
d01 2005-02-02_06:00:00 Input data is acceptable to use:
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS 73 IE 96 JS 49 JE 72
WRF NUMBER OF TILES = 1
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 137
**** ZM_CONV IENTROPY: Tmix did not converge ****
-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 13


But I'll open a new thread if I don't find a solution by myself before burning out. In the meantime, any advice is welcome, of course.
 
Hi,

1) I'm using "old" 30-years simulations from 2018 (that I didn't configure nor run myself). For my configuration tests, I only run 5 days in early 2005 ; when the configuration works well, I'll extend the period.

1) It may be necessary to start from the initial date/time for things to run smoothly. Unfortunately the ndown program is very fickle and everything has to be perfectly aligned for it to work out. I'm not certain, but I believe the dates/times have to be the same. You could test this to see.

2) Could this be a "fatal" problem?

2) I believe it could potentially be related. It would be an interesting test to see if you still get the segmentation fault issue when you're using the same landuse data as was used for the original 30-year simulation.

4) I've tried both configurations (Coarse with 3 domains, and with only 2). I understand there's no need to run the 3rd domain, as it's already taken care of in the "Fine" configuration. I followed (and read more than once) the User's Guide on that part and also the step-by-step tutorial, but I should note it doesn't make cases like this one easy.

I should also point out that, strangely, I never got any error about this 3rd domain. But I'll keep the Coarse configuration to 2 domains only from now on.

4) I meant that you should only run ndown for 2 domains at a time. You can run the full program later from d02 - d03, but just not at the same time as configuring d01-d02. Maybe you understood that, but I just wanted to make sure!

Regarding your most recent error:
Code:
**** ZM_CONV IENTROPY: Tmix did not converge ****

Are you now switching to the Zhang-McFarlane cu scheme (#7)? This error indicates some sort of issue with that, but I'm not sure what it means. Do you have the original namelist used in the previous 30-year simulation? If you do, can you attach that, along with the latest namelist.input you're using? Can you also let me know which version of WRF you're running? Thanks!
 
1) It may be necessary to start from the initial date/time for things to run smoothly. Unfortunately the ndown program is very fickle and everything has to be perfectly aligned for it to work out. I'm not certain, but I believe the dates/times have to be the same. You could test this to see.
This would be highly problematic, as the 30-year run (35 years, to be accurate) starts in 1980 and I only need to start in 1995 at the earliest, for both computational-time constraints and meteo-station data availability. Is there any way to "cheat", for example by changing the run start date in the wrfout* NetCDF files (see the rough sketch just below)?
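To make the question concrete, here is roughly what I have in mind (a purely illustrative sketch with the netCDF4 Python package; the file name and target date are made up, and whether ndown/WRF would then accept the modified file is exactly what I am unsure about):

Code:
# Purely illustrative: overwrite the SIMULATION_START_DATE global attribute
# in a COPY of a wrfout file. This only shows the mechanics of editing the
# attribute; whether ndown/WRF then behaves correctly is untested.
from netCDF4 import Dataset

path = "wrfout_d01_copy"          # work on a copy, never the original
new_start = "1995-01-01_060000"   # assumed target start date

with Dataset(path, "r+") as nc:
    print("before:", nc.getncattr("SIMULATION_START_DATE"))
    nc.setncattr("SIMULATION_START_DATE", new_start)
    print("after: ", nc.getncattr("SIMULATION_START_DATE"))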
2) I believe it could potentially be related. It would be an interesting test to see if you still get the segmentation fault issue when you're using the same landuse data as was used for the original 30-year simulation.
I'm still working on it. I'll get back to this once I figure out how to link the right directories.
4) I meant that you should only run ndown for 2 domains at a time. You can run the full program later from d02 - d03, but just not at the same time as configuring d01-d02. Maybe you understood that, but I just wanted to make sure!
Yes, that was clear to me, thank you.
Regarding your most recent error:
Code:
**** ZM_CONV IENTROPY: Tmix did not converge ****

Are you now switching to the Zhang-McFarlane cu scheme (#7)? This error indicates some sort of issue with that, but I'm not sure what it means. Do you have the original namelist used in the previous 30-year simulation? If you do, can you attach that, along with the latest namelist.input you're using? Can you also let me know which version of WRF you're running? Thanks!
Indeed: I got the namelist from the previous/original 30-year run and edited my namelist to be as close as possible to it. I started changing one variable at a time until I got this ZM_CONV... error; then I made all the changes I could at once, so all the physics and related options are now the same, and I'm back to the segfault.
The only remaining differences now concern land use, and also all the aerosol variables (but maybe I should have copied those too? Do they require WRF-Chem?).

See the namelists attached. I'm currently running under WRF 4.2.1.

Last but not least: even though I had already checked the ndown output files with ncview and noticed nothing abnormal in T2, SST, TSK, U10 and V10, I realized today that the surface pressure values are far off (on the order of 10^22), as are other variables such as QVAPOR (10^19). It's a real pity I didn't look past the five variables I had checked... But now the origin of the problem seems to move from WRF to NDOWN.
Any thoughts on why some variables in the ndown outputs seem correct whereas others, like PSFC, are way off?

Edit: I opened a new thread on that matter.
 

Attachments

  • actual_namelist.txt
    8.8 KB · Views: 9
  • original_namelist.txt
    7.6 KB · Views: 2
I ran a test with ndown, starting the nest domain at a later time. This should work as long as the d02 start date/time in your real.exe run is that later time; then, when you run ndown.exe, both domains should be set to that later start time.

Regarding the actual_namelist.txt file and the original_namelist.txt file, are you using different input data as well? I notice that num_metgrid_levels is 18 in the original namelist and 38 in the new one. I'm not sure how the model is going to handle that.

I see several new namelist parameters added to the new namelist. I would suggest starting with an identical namelist to the original, and only modifying it to account for the new domain you are trying to insert. Just try to start with something simple to see if you can even get this to work, and then try adding other parameters slowly to see where the problems arise. Trying to use several new options, plus FDDA, and potentially coupling(?), etc. will make it very difficult to track down any issues.
 
I ran a test with ndown, starting the nest domain at a later time. This should work as long as the d02 start date/time in your real.exe run is that later time; then, when you run ndown.exe, both domains should be set to that later start time.
That's the case. I haven't had any issue with the start time since I corrected the start hour (from the default 00:00 to 06:00, as the old wrfout* start at 06:00).
Regarding the actual_namelist.txt file and the original_namelist.txt file, are you using different input data as well? I notice that num_metgrid_levels is 18 in the original namelist and 38 in the new one. I'm not sure how the model is going to handle that.
Indeed: I run WPS with CFSR data on 38 levels. I think NCEP2 was used in the previous runs, but I should ask to be sure. Nevertheless, I don't see how this could be problematic: after all, WPS/REAL works fine, and the only NDOWN inputs from the previous runs are the wrfout*, which are on 33 eta levels that I duplicated in my configuration and which are applied to the REAL outputs. Could you explain your thoughts on the matter?
I see several new namelist parameters added to the new namelist. I would suggest starting with an identical namelist to the original, and only modifying it to account for the new domain you are trying to insert. Just try to start with something simple to see if you can even get this to work, and then try adding other parameters slowly to see where the problems arise. Trying to use several new options, plus FDDA, and potentially coupling(?), etc. will make it very difficult to track down any issues.
Could you please confirm, to your knowledge, whether the set of aer* variables (for aerosol/GHG effects, as I understand) and the other remaining ones won't be a problem? I didn't set anything that looked to me like it belonged to WRF-Chem (but I may be completely wrong about that), and that is why, for now, I didn't copy/paste the exact same namelist.

Thank you.
 
Sorry for the delay. I was trying to run some tests to see if I could replicate your case with two different input data types, and I can't figure out how you were able to get real.exe to work with both datasets if you were running real.exe for 2 domains. The issue is that you have to set num_metgrid_levels in the namelist, and since there are 2 values (one for each domain), this isn't possible: that namelist parameter does not allow multiple column entries. I then tried to run real separately for each case, but eventually ran into the issue that the i_parent_start locations are not correct for the nest, unless I start the nest at 1, 1, which isn't a reasonable solution.
 
No problem, thank you.

I created another thread focused on the corrupted ndown.exe output (HERE), where I attached three of the wrfout* files I'm working with (see the cloud link) and all the files (namelists, etc.) that I felt were necessary. Feel free to ask for any information if you wish to test running ndown with those files.
 