Tim Raupach
New member
Hi all,
I'm attempting to run a three-month simulation with WRF using ERA5 boundary conditions. I have successfully generated the metgrid files and real.exe runs fine for small time subsets. When I try to run real.exe for the full simulation time it segfaults, always during loop number 626. The segmentation fault is always in the last parallel process.
Things I have tried testing:
- Corruption in specific metgrid file? No, because different start times produce the same crash in the 626th step.
- I don't think the file system is running out of disk space.
- Using different numbers of processors and running in serial or with mpirun gives the same result.
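To see which input time a given loop number corresponds to for each start time tested, the mapping can be sketched as below. This is only an illustration: the helper name loop_to_date is made up, and the one-hour input interval is an assumption inferred from the dates reported in rsl.out further down (loop 626 falls 625 hours after the namelist start date).

```python
from datetime import datetime, timedelta


def loop_to_date(start, loop, interval_seconds=3600):
    """Return the input time processed at a 1-based loop number.

    Hypothetical helper, not part of WRF. interval_seconds=3600 is an
    assumption inferred from the rsl.out dates, not read from a namelist.
    """
    return start + timedelta(seconds=(loop - 1) * interval_seconds)


# first_date_nml as reported in rsl.out
start = datetime(2013, 11, 20, 0, 0, 0)

crash_time = loop_to_date(start, 626)
print(crash_time.strftime("%Y-%m-%d_%H:%M:%S"))  # 2013-12-16_01:00:00
```

Changing start to another tested start date gives the input time, and therefore the metgrid file, that is being read at loop 626 in that run.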
At the point the crash occurs, real.exe is trying to write some output:
Code:
CALL wrf_message ( 0 , 'Troubles, the interpolating order is too large for this few input values' )
CALL wrf_message ( 0 , 'This is usually caused by bad pressures' )
CALL wrf_message ( 0 , 'At this (i,j), look at the input value of pressure from metgrid' )
CALL wrf_message ( 0 , 'The surface pressure and the sea-level pressure should be reviewed, also from metgrid' )
CALL wrf_message ( 0 , 'Finally, ridiculous values of moisture can mess up the vertical pressures, especially aloft' )
CALL wrf_message ( 0 , 'The variable type is ' // var_type // '. This is not a unique identifer, but a type of field' )
CALL wrf_message ( 0 , 'Check to see if all time periods with this data fail, or just this one' )
However, I can make the code run through the same (i,j,time) combination by changing the start time, so it doesn't look like bad pressures or moisture. The var_type is U, for what it's worth. The error occurs in lagrange_setup in module_initialize_real.F. I put some debugging print statements into the Fortran code and saw that where kinterp_end is usually ~35, at the point it crashes it is 2. The only place I can see kinterp_end being reduced in the initialisation code is where zap_close_levels is taken into account (which, as far as I can tell, removes levels that are too close together), but the mystery is why levels would get closer over time and why so many would be removed, and always at loop number 626. The negative values in the f array (see output below), and the fact that there are only two values in both the f and p arrays, are also different from previous loops.
I could run for shorter simulation times and use restart files, but I want to understand the error to ensure that it is not related to bad inputs, for example.
Here is output in rsl.out:
Code:
Domain 1: Current date being processed: 2013-12-16_01:00:00.0000, which is loop # 626 out of 1753
configflags%julyr, %julday, %gmt: 2013 350 1.000000
Yes, this special data is acceptable to use: OUTPUT FROM METGRID V4.4
Input data is acceptable to use:
metgrid input_wrf.F first_date_input = 2013-12-16_01:00:00
metgrid input_wrf.F first_date_nml = 2013-11-20_00:00:00
Timing for input 2 s.
flag_soil_layers read from met_em file is 1
Using sfcprs to compute psfc
all_dim = 2
order = 2
i,j = 236 232
p array = 2.705182 4.605170
f array = -0.9515590 -48.48717
p target= 11.51083 11.50411 11.49554 11.48465
11.47089 11.45367 11.43230 11.40609 11.37438
11.33709 11.29708 11.25706 11.21705 11.17703
11.13701 11.09700 11.05698 11.01695 10.97693
10.93691 10.89689 10.85687 10.81685 10.77684
10.73682 10.69680 10.65679 10.61678 10.57677
10.53676 10.49675 10.45674 10.41674 10.37674
10.33674 10.29674 10.25674 10.21675 10.17676
10.13677 10.09678 10.05680 10.01681 9.976829
9.936844 9.896859 9.856874 9.816891 9.776906
9.736921 9.696937 9.656953 9.616968 9.576983
9.536999 9.497014 9.457030 9.417046 9.377061
9.337076 9.297092 9.257108 9.217123 9.177138
9.137155 9.097170 9.057185 9.017200 8.977216
8.937232 8.897247 8.857263 8.817278 8.777293
8.737309 8.697325 8.657340 8.617356 8.577372
8.537386
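One way to read these numbers, assuming the p values are natural logarithms of pressure in Pa (an assumption based on 4.605170 being ln(100) to within rounding, not something confirmed from the source code), is to convert them back. The sketch below does that for the values printed above; the names ln_to_pa and brackets are made up for illustration.

```python
import math

# Values copied from the rsl.out listing above.
p_array = [2.705182, 4.605170]
p_target_first = 11.51083
p_target_last = 8.537386


def ln_to_pa(ln_p):
    """Convert ln(pressure) to pressure, ASSUMING units of Pa."""
    return math.exp(ln_p)


def brackets(inputs, target):
    """True if target lies within the range spanned by the input values."""
    return min(inputs) <= target <= max(inputs)


for value in p_array:
    print(f"input  ln(p) = {value:9.6f} -> {ln_to_pa(value):10.2f} Pa")

for value in (p_target_first, p_target_last):
    print(f"target ln(p) = {value:9.6f} -> {ln_to_pa(value):10.2f} Pa")

# Do the two remaining input levels span the target levels at all?
print(brackets(p_array, p_target_first), brackets(p_array, p_target_last))
```

Under that assumption the two remaining input levels sit at roughly 15 Pa and 100 Pa, while the target levels run from roughly 99,800 Pa to roughly 5,100 Pa, so none of the targets lie between the inputs.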
Here is the output in rsl.error:
Code:
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41d7000)
==== backtrace (tid:1040727) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000003d5263a __intel_avx_rep_memcpy() ???:0
2 0x0000000003cbee06 for_trim() ???:0
3 0x000000000090fa85 wrf_message_() WRF_v4.4/WRF/frame/module_wrf_error.f90:151
4 0x00000000004c9450 module_initialize_real_mp_lagrange_setup_() WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:6137
5 0x00000000004c87e6 module_initialize_real_mp_vert_interp_() WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:6051
6 0x0000000000472172 module_initialize_real_mp_init_domain_rk_() WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:2594
7 0x000000000043faa1 module_initialize_real_mp_init_domain_() WRF_v4.4/WRF/main/../dyn_em/module_initialize_real.f90:53
8 0x000000000041520c med_sidata_input_() WRF_v4.4/WRF/main/real_em.f90:470
9 0x0000000000413a6c MAIN__() WRF_v4.4/WRF/main/real_em.f90:247
10 0x0000000000412ba2 main() ???:0
11 0x000000000003ad85 __libc_start_main() ???:0
12 0x0000000000412aae _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
real.exe 0000000003C5CB83 Unknown Unknown Unknown
libpthread-2.28.s 0000146341B2FCF0 Unknown Unknown Unknown
real.exe 0000000003D5263A Unknown Unknown Unknown
real.exe 0000000003CBEE06 Unknown Unknown Unknown
real.exe 000000000090FA85 wrf_message_ 151 module_wrf_error.f90
real.exe 00000000004C9450 module_initialize 6137 module_initialize_real.f90
real.exe 00000000004C87E6 module_initialize 6051 module_initialize_real.f90
real.exe 0000000000472172 module_initialize 2594 module_initialize_real.f90
real.exe 000000000043FAA1 module_initialize 53 module_initialize_real.f90
real.exe 000000000041520C med_sidata_input_ 470 real_em.f90
real.exe 0000000000413A6C MAIN__ 247 real_em.f90
real.exe 0000000000412BA2 Unknown Unknown Unknown
libc-2.28.so 000014634158DD85 __libc_start_main Unknown Unknown
real.exe 0000000000412AAE Unknown Unknown Unknown
Any advice would be much appreciated.
Thanks,
Tim