Segmentation fault after N time steps in real.exe with ERA5 boundary conditions

Tim Raupach

Hi all,

I'm attempting to run a three-month simulation with WRF using ERA5 boundary conditions. I have successfully generated the metgrid files and real.exe runs fine for small time subsets. When I try to run real.exe for the full simulation time it segfaults, always during loop number 626. The segmentation fault is always in the last parallel process.

Things I have tried testing:
- Corruption in specific metgrid file? No, because different start times produce the same crash in the 626th step.
- I don't think the file system is running out of disk space.
- Using different numbers of processors and running in serial or with mpirun gives the same result.

At the point the crash occurs, real.exe is trying to write some output:

Code:
CALL wrf_message ( 0 , 'Troubles, the interpolating order is too large for this few input values' )
CALL wrf_message ( 0 , 'This is usually caused by bad pressures' )
CALL wrf_message ( 0 , 'At this (i,j), look at the input value of pressure from metgrid' )
CALL wrf_message ( 0 , 'The surface pressure and the sea-level pressure should be reviewed, also from metgrid' )
CALL wrf_message ( 0 , 'Finally, ridiculous values of moisture can mess up the vertical pressures, especially aloft' )
CALL wrf_message ( 0 , 'The variable type is ' // var_type // '. This is not a unique identifer, but a type of field' )
CALL wrf_message ( 0 , 'Check to see if all time periods with this data fail, or just this one' )

However, I can make the code run through the same (i,j,time) combination by changing the start time, so it doesn't look like bad pressures or moisture. The var_type is U, for what it's worth.

The error occurs in lagrange_setup in module_initialize_real.F. I put some debugging print statements into the Fortran code and saw that where kinterp_end is usually ~35, at the point it crashes it is 2. The only place I can see kinterp_end being reduced in the initialisation code is where zap_close_levels is taken into account (which, as I understand it, removes levels that are too close together). The mystery is why levels would get closer over time, why so many would be removed, and why this always happens at loop number 626. The negative values in the f array (see output below), and the fact that there are only two values in both the f and p arrays, also differ from previous loops.

I could run for shorter simulation times and use restart files, but I want to understand the error to ensure that it is not related to bad inputs, for example.

Here is output in rsl.out:

Code:
Domain  1: Current date being processed: 2013-12-16_01:00:00.0000, which is loop # 626 out of 1753
 configflags%julyr, %julday, %gmt:        2013         350   1.000000   
 Yes, this special data is acceptable to use: OUTPUT FROM METGRID V4.4
 Input data is acceptable to use:
 metgrid input_wrf.F first_date_input = 2013-12-16_01:00:00
 metgrid input_wrf.F first_date_nml = 2013-11-20_00:00:00
Timing for input          2 s.
         flag_soil_layers read from met_em file is  1
Using sfcprs  to compute psfc
 all_dim =            2
 order =            2
 i,j =          236         232
 p array =    2.705182       4.605170   
 f array =  -0.9515590      -48.48717   
 p target=    11.51083       11.50411       11.49554       11.48465   
   11.47089       11.45367       11.43230       11.40609       11.37438   
   11.33709       11.29708       11.25706       11.21705       11.17703   
   11.13701       11.09700       11.05698       11.01695       10.97693   
   10.93691       10.89689       10.85687       10.81685       10.77684   
   10.73682       10.69680       10.65679       10.61678       10.57677   
   10.53676       10.49675       10.45674       10.41674       10.37674   
   10.33674       10.29674       10.25674       10.21675       10.17676   
   10.13677       10.09678       10.05680       10.01681       9.976829   
   9.936844       9.896859       9.856874       9.816891       9.776906   
   9.736921       9.696937       9.656953       9.616968       9.576983   
   9.536999       9.497014       9.457030       9.417046       9.377061   
   9.337076       9.297092       9.257108       9.217123       9.177138   
   9.137155       9.097170       9.057185       9.017200       8.977216   
   8.937232       8.897247       8.857263       8.817278       8.777293   
   8.737309       8.697325       8.657340       8.617356       8.577372   
   8.537386
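
A note on reading the p arrays in this output: the first p target value, 11.51083, is close to ln(100000), which suggests the arrays hold the natural logarithm of pressure in Pa. That is an inference from the numbers, not something confirmed in this thread. Under that assumption, a minimal Python sketch converts the logged values back to Pa so the two remaining input levels can be compared with the target range:

```python
import math

# Values copied from the rsl.out listing above.
p_array = [2.705182, 4.605170]        # the two remaining input levels
p_target_ends = [11.51083, 8.537386]  # first and last target levels

def log_p_to_pa(log_p):
    """Convert ln(pressure) to pressure in Pa (assumes natural log of Pa)."""
    return math.exp(log_p)

input_pa = [log_p_to_pa(v) for v in p_array]
target_pa = [log_p_to_pa(v) for v in p_target_ends]

print("input levels (Pa):", [round(v, 1) for v in input_pa])
print("target range (Pa):", [round(v) for v in target_pa])

# True when every remaining input level is at a lower pressure than the
# lowest-pressure target level, i.e. lies outside the target range.
inputs_above_targets = max(input_pa) < min(target_pa)
print("all input levels above target range:", inputs_above_targets)
```

If the assumption holds, the two remaining input levels come out at roughly 15 Pa and 100 Pa, while the target levels span roughly 99,800 Pa down to 5,100 Pa, so both input levels would lie outside the range being interpolated to.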

Here is the output in rsl.error:

Code:
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41d7000)
==== backtrace (tid:1040727) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000003d5263a __intel_avx_rep_memcpy()  ???:0
 2 0x0000000003cbee06 for_trim()  ???:0
 3 0x000000000090fa85 wrf_message_()  WRF_v4.4/WRF/frame/module_wrf_error.f90:151
 4 0x00000000004c9450 module_initialize_real_mp_lagrange_setup_()  WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:6137
 5 0x00000000004c87e6 module_initialize_real_mp_vert_interp_()  WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:6051
 6 0x0000000000472172 module_initialize_real_mp_init_domain_rk_()  WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:2594
 7 0x000000000043faa1 module_initialize_real_mp_init_domain_()  WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:53
 8 0x000000000041520c med_sidata_input_()  WRF_v4.4/WRF/main/real_em.f90:470
 9 0x0000000000413a6c MAIN__()  WRF_v4.4/WRF/main/real_em.f90:247
10 0x0000000000412ba2 main()  ???:0
11 0x000000000003ad85 __libc_start_main()  ???:0
12 0x0000000000412aae _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
real.exe           0000000003C5CB83  Unknown               Unknown  Unknown
libpthread-2.28.s  0000146341B2FCF0  Unknown               Unknown  Unknown
real.exe           0000000003D5263A  Unknown               Unknown  Unknown
real.exe           0000000003CBEE06  Unknown               Unknown  Unknown
real.exe           000000000090FA85  wrf_message_              151  module_wrf_error.f90
real.exe           00000000004C9450  module_initialize        6137  module_initialize_real.f90
real.exe           00000000004C87E6  module_initialize        6051  module_initialize_real.f90
real.exe           0000000000472172  module_initialize        2594  module_initialize_real.f90
real.exe           000000000043FAA1  module_initialize          53  module_initialize_real.f90
real.exe           000000000041520C  med_sidata_input_         470  real_em.f90
real.exe           0000000000413A6C  MAIN__                    247  real_em.f90
real.exe           0000000000412BA2  Unknown               Unknown  Unknown
libc-2.28.so       000014634158DD85  __libc_start_main     Unknown  Unknown
real.exe           0000000000412AAE  Unknown               Unknown  Unknown

Any advice would be much appreciated.

Thanks,
Tim
 
Hi Tim,
What version of the model are you using? What are the sizes of the wrfbdy_d01 and wrfinput* files at the time it's stopping? Can you also attach your namelist.input file? Thanks!
 
Hi Karl,

Thanks for your reply. I'm using WRF v4.4. Sizes of those files are:

Code:
6.8G    wrfbdy_d01
105M    wrfinput_d01

And the namelist is attached.
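
For context on the wrfbdy_d01 size, here is a rough Python sketch that uses the dates and loop counts from the rsl.out listing earlier in the thread to derive the boundary interval and project the size of a complete file. It assumes the file grows linearly and that about 625 periods had been written when real.exe stopped at loop 626; both are assumptions, and the result is only an order-of-magnitude estimate.

```python
from datetime import datetime

FMT = "%Y-%m-%d_%H:%M:%S"

# Values taken from the rsl.out listing and the file sizes in this thread.
start = datetime.strptime("2013-11-20_00:00:00", FMT)    # first_date_nml
current = datetime.strptime("2013-12-16_01:00:00", FMT)  # date at loop 626
loop_at_crash = 626
total_loops = 1753
size_at_crash_gb = 6.8  # reported size of wrfbdy_d01

# Interval between boundary times, derived from the dates above.
interval_s = (current - start).total_seconds() / (loop_at_crash - 1)

# Assumption: one boundary period written per completed loop, linear growth.
periods_written = loop_at_crash - 1
per_period_gb = size_at_crash_gb / periods_written
projected_gb = per_period_gb * (total_loops - 1)

print(f"boundary interval: {interval_s:.0f} s")
print(f"approx. size per period: {per_period_gb * 1024:.1f} MB")
print(f"projected full wrfbdy_d01: {projected_gb:.1f} GB")
```

The derived interval is one hour, and the projected size of a full boundary file is on the order of 19 GB under these assumptions.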

Thanks for any help you can offer!

Cheers,
Tim
 

Attachments: namelist.input (3.8 KB)
Hi all,

As a follow-up to my question: I checked whether the fields in wrfbdy_d01 generated using different start times match for a given (common) time step, and found that the fields differ. Is it normal for the values inside wrfbdy_d01 to depend on the simulation start time? My understanding was that real.exe only does vertical interpolation of the boundary conditions, so I was surprised that they changed when different start times were used.
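
A check of this kind can be sketched in Python as below. This is only an illustration of the comparison logic, not the script used here: the function name, the dictionary layout, and the sample values are all hypothetical, and reading the fields out of wrfbdy_d01 (for example with a NetCDF library) is not shown.

```python
def max_abs_diff_by_time(run_a, run_b):
    """Return {timestamp: max |a - b|} for timestamps present in both runs.

    run_a and run_b map a timestamp string to a flat list of field values.
    """
    common = sorted(set(run_a) & set(run_b))
    result = {}
    for t in common:
        a, b = run_a[t], run_b[t]
        if len(a) != len(b):
            raise ValueError(f"field size mismatch at {t}")
        result[t] = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return result

# Made-up placeholder values, only to exercise the function.
run_a = {
    "2013-12-16_00:00:00": [1.0, 2.0],
    "2013-12-16_01:00:00": [3.0, 4.0],
}
run_b = {
    "2013-12-16_01:00:00": [3.0, 4.5],
    "2013-12-16_02:00:00": [5.0, 6.0],
}

diffs = max_abs_diff_by_time(run_a, run_b)
for timestamp, diff in diffs.items():
    print(timestamp, diff)
```

Only timestamps present in both runs are compared, so a mismatch in which time index is being read from each file shows up as a missing or unexpected key rather than as a spurious difference.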

Thanks for your help,
Tim
 
Hi Tim,
No, the boundary conditions should be the same for the same time stamp since the data come from the first-guess input data. Can you show me the differences you're seeing?

Regarding your existing wrfbdy file, that is pretty large, and though there isn't a limit if you're using NetCDF v4+, there can still be limitations in your environment. I would recommend talking to a systems administrator at your institution to determine why you seem to be unable to write a file larger than 6.8GB. Alternatively, there is an option that may be useful for you. Read here about using multiple lateral condition files.
 
Hi Kelly,

Thanks for the reply. I have rechecked and the boundary conditions are indeed the same for the same time with different start dates - so there seems to be no problem there after all. Thanks for the advice about using multiple lateral condition files. I will give that a try.

By the way, apologies for getting your name wrong in an earlier post.

Best wishes,
Tim
 
Hi all,
I am also facing this problem with WRF-Chem. It failed with a similar issue, but only with MOZART-MOSAIC, not with MOZART, which is computationally less costly.
Please suggest any changes that may be required.
 