
Unexpected termination of wrf.exe: BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

Hilson Liu

New member
Hi,

I am running WRF-UCM for my three domains. The start time is set to 2023-07-15 and the end time to 2023-08-31, but after the simulation had run for about 20 hours it terminated unexpectedly as shown below. During the whole run there were no other errors or warnings, and I had run 'ulimit -s unlimited' before launching with 'mpirun -np 12 ./wrf.exe'. Before this simulation, a three-day run with the same domain settings completed successfully (in WRF-default and WRF-Solar). The meteorological data were obtained from ERA5 via CDS, including single-level and pressure-level data.
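For reference, the exact launch sequence was:
Code:
ulimit -s unlimited      # raise the stack limit first
mpirun -np 12 ./wrf.exe  # then launch WRF with 12 MPI ranks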

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 62817 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
....
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 62826 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

To help diagnose the problem, I have attached my files, including 'namelist.wps', 'namelist.input', and 'rsl.error.000*'. I changed the wrfout output path myself, and I am sure it has enough space (3.7 TB available) to store the wrfout files.

Looking forward to any replies! These unexpected terminations have really confused me.

Regards,
Hilson
 

Attachments

  • unexpected termination information 1203.zip (729.8 KB)
Hilson,

The error messages you posted don't really help, because they don't contain any useful information.

Can you recompile WRF in debug mode, i.e., './configure -D', and then rerun the case? If you have wrfrst files saved, it is fine to restart this case. I would expect the model to crash at the same time, but with detailed information about when and where it first crashed. That information will help identify possible reasons for the crash.

Since the model has run for ~20 hours, I don't think this is a data issue. There might be something wrong with the physics.
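
For reference, the debug rebuild is just the following (a minimal sketch; choose the same configure option as your original build):
Code:
./clean -a                         # remove the previous build
./configure -D                     # debug build: bounds checking and traceback enabled
./compile em_real >& compile.log   # rebuild the real-data executables
and a restart only needs the &time_control section of namelist.input adjusted (a sketch; the start time must match the timestamp of an existing wrfrst file):
Code:
&time_control
 restart          = .true.,   ! start from wrfrst_d0* instead of wrfinput_d0*
 restart_interval = 1440,     ! keep writing restart files (minutes)
/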
 
Hi,

Thank you very much for your reply!
I recompiled WRF in debug mode as you suggested, but something strange happened: when I restarted the simulation, it crashed just a few minutes later. The error messages are the same; I have clipped them below:

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 157869 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

I have attached the 'rsl.error.0000' file from this run. This time the model crashed at a different place. The namelist.input is unchanged, but I have attached it as well.
Can these messages and files help you determine what is wrong in my settings, modules, or other potential parts?


Regards,
Hilson
 

Attachments

  • namelist.input (4.8 KB)
  • rsl.error.0000 (29.4 KB)
Hilson,

There is no error message in the rsl file you attached. Note that error messages don't necessarily appear in rsl.error.0000; they can be in any of the rsl files. Can you look through all your rsl files for helpful information?

If WRF is recompiled in debug mode, your log files should show when and where the model crashed.

In addition, this is a triply nested case, which makes it harder to debug what is wrong. Can you rerun this case over a single domain? It will run quickly, and the results will help narrow down possible issues.
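
To scan all the rsl files at once, something like this shell sketch works (run in the directory containing the rsl files):
Code:
# look for common failure markers across every rank's log
grep -iE "error|fatal|nan|segmentation|cfl" rsl.* | sort | uniq -c
# show where each rank stopped
tail -n 5 rsl.error.00*
For the single-domain test, setting max_dom = 1 in namelist.input is enough, provided wrfinput_d01 and wrfbdy_d01 already exist.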
 
Hi,

Following your suggestion, I tried again. This time I found some potentially useful error messages in 'rsl.error.0000':
Code:
 SOIL TEXTURE CLASSIFICATION = STAS FOUND          19  CATEGORIES
forrtl: severe (408): fort: (2): Subscript #1 of the array SLA_TABLE has value 58 which is greater than the upper bound of 27

Image              PC                Routine            Line        Source             
wrf.exe            0000000008772C72  module_sf_noahmpd        2155  module_sf_noahmpdrv.f90
wrf.exe            00000000089DBDA0  module_physics_in        3117  module_physics_init.f90
wrf.exe            0000000008984DF0  module_physics_in        1347  module_physics_init.f90
wrf.exe            00000000058EADBE  start_domain_em_         1067  start_em.f90
wrf.exe            0000000004CBA0CF  start_domain_             122  start_domain.f90
wrf.exe            00000000041AF987  med_initialdata_i         229  mediation_wrfmain.f90
wrf.exe            0000000000417573  module_wrf_top_mp         272  module_wrf_top.f90
wrf.exe            0000000000416EC3  MAIN__                     22  wrf.f90
wrf.exe            0000000000416E8D  Unknown               Unknown  Unknown
libc-2.28.so       000014788A0C5493  __libc_start_main     Unknown  Unknown
wrf.exe            0000000000416DAE  Unknown               Unknown  Unknown
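
For context, forrtl severe (408) is the Intel runtime's array bounds check, which the debug build enables. A minimal standalone sketch (a hypothetical program, not WRF code) reproduces the same class of abort:
Code:
! bounds_demo.f90 -- compile with: ifort -check bounds bounds_demo.f90
program bounds_demo
  implicit none
  real    :: sla_table(27)      ! sized like the 27-category USGS table
  integer :: vegtype
  sla_table = 1.0
  vegtype   = 58                ! an LCZ land-use index beyond the table bound
  print *, sla_table(vegtype)   ! aborts: "Subscript #1 ... upper bound of 27"
end program bounds_demo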

This time I ran the simulation over only the smallest of the initial three domains, i.e. dx=dy=1 km, e_we=178, e_sn=235. (I should note that the same error was reported when running wrf.exe with all three domains.)
I have also attached the 'namelist.wps', 'namelist.input', and 'rsl.error.0000' files below. The static geographic data has been changed from the previous MODIS LU/LC data to the 'CGLC_MODIS_LCZ_100m' dataset from the WRF website.
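To confirm what land-use metadata the new static data produced, the file headers can be checked like this (a quick sketch using the standard WRF file names):
Code:
ncdump -h geo_em.d01.nc | grep -Ei "MMINLU|NUM_LAND_CAT"
ncdump -h wrfinput_d01  | grep -Ei "MMINLU|NUM_LAND_CAT"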
Looking forward to your reply!

Regards,
Hilson
 

Attachments

  • namelist.input (4.3 KB)
  • rsl.error.0000 (139.7 KB)
  • namelist.wps (735 bytes)
Hi everyone,
I ran into the same problem with WRF v4.5.1. I recompiled in debug mode and got the same error message.

As the debug run shows, the error happens in phys/module_sf_noahmpdrv.F:
Code:
(...)
             masslai = 1000. / max(SLA_TABLE(IVGTYP(I,J)),1.0) ! conversion from lai to mass  (v3.7)


Diving into the code, the size of SLA_TABLE is defined as follows:
Code:
$ grep -i SLA_TABLE phys/*F
phys/module_sf_noahmpdrv.F:  parameters%SLA    =    SLA_TABLE(VEGTYPE)       !single-side leaf area per Kg [m2/kg]
phys/module_sf_noahmpdrv.F:             masslai = 1000. / max(SLA_TABLE(IVGTYP(I,J)),1.0) ! conversion from lai to mass  (v3.7)
phys/module_sf_noahmplsm.F:    REAL :: SLA_TABLE(MVT)         !single-side leaf area per Kg [m2/kg]
phys/module_sf_noahmplsm.F:    SLA_TABLE    = -1.0E36
phys/module_sf_noahmplsm.F:       SLA_TABLE(1:NVEG)  = SLA(1:NVEG)
Within phys/module_sf_noahmplsm.F, the size (NVEG) of the table comes from reading the file run/MPTABLE.TBL:
Code:
NAMELIST / noahmp_modis_veg_categories / VEG_DATASET_DESCRIPTION, NVEG
(...)
    if ( trim(DATASET_IDENTIFIER) == "USGS" ) then
       read(15,noahmp_usgs_veg_categories)
       read(15,noahmp_usgs_parameters)
    else if ( trim(DATASET_IDENTIFIER) == "MODIFIED_IGBP_MODIS_NOAH" ) then
       read(15,noahmp_modis_veg_categories)
       read(15,noahmp_modis_parameters)
(...)
Inside the file run/MPTABLE.TBL one finds:
Code:
$ cat run/MPTABLE.TBL | grep -B 2 NVEG
&noahmp_usgs_veg_categories
 VEG_DATASET_DESCRIPTION = "USGS"
 NVEG = 27
/
&noahmp_usgs_parameters
 ! NVEG = 27
--
&noahmp_modis_veg_categories
 VEG_DATASET_DESCRIPTION = "modified igbp modis noah"
 NVEG = 20
Therefore the table size is either 27 or 20.
I am also using the LCZ data from WUDAPT, which yields 61 land-use categories, as shown in my wrfinput_d0[1/2/3]:
Code:
$ ncdump -h wrfinput_d01
(...)
        :MMINLU = "MODIFIED_IGBP_MODIS_NOAH" ;
        :NUM_LAND_CAT = 61 ;
(...)
Looking into phys/module_sf_noahmpdrv.F, there is already code identifying the urban case:
Code:
(...)
  parameters%URBAN_FLAG = .FALSE.
  IF( VEGTYPE == ISURBAN_TABLE    .or. VEGTYPE == LCZ_1_TABLE .or. VEGTYPE == LCZ_2_TABLE .or. &
             VEGTYPE == LCZ_3_TABLE      .or. VEGTYPE == LCZ_4_TABLE .or. VEGTYPE == LCZ_5_TABLE .or. &
             VEGTYPE == LCZ_6_TABLE      .or. VEGTYPE == LCZ_7_TABLE .or. VEGTYPE == LCZ_8_TABLE .or. &
             VEGTYPE == LCZ_9_TABLE      .or. VEGTYPE == LCZ_10_TABLE .or. VEGTYPE == LCZ_11_TABLE ) THEN
      parameters%URBAN_FLAG = .TRUE.
  ENDIF
Looking for these values in phys/surface_driver.F, one finds where these categories are handled when the urban option is activated:
Code:
   USE module_sf_noahlsm,   only : LCZ_1,LCZ_2,LCZ_3,LCZ_4,LCZ_5,LCZ_6,LCZ_7,LCZ_8,LCZ_9,LCZ_10,LCZ_11
(...)
    IF(SF_URBAN_PHYSICS.eq.1) THEN
       DO j=j_start(ij),j_end(ij)                             !urban
         DO i=i_start(ij),i_end(ij)                           !urban
           IF(IVGTYP(I,J) == ISURBAN   .or. IVGTYP(I,J) == LCZ_1 .or. IVGTYP(I,J) == LCZ_2 .or. &
             IVGTYP(I,J) == LCZ_3      .or. IVGTYP(I,J) == LCZ_4 .or. IVGTYP(I,J) == LCZ_5 .or. &
             IVGTYP(I,J) == LCZ_6      .or. IVGTYP(I,J) == LCZ_7 .or. IVGTYP(I,J) == LCZ_8 .or. &
             IVGTYP(I,J) == LCZ_9      .or. IVGTYP(I,J) == LCZ_10 .or. IVGTYP(I,J) == LCZ_11 )THEN
I realized that in our namelist.input we have:
Code:
 sf_urban_physics                    =  0,  0,  3,
But all my wrfinput_d0[1/2/3] files have 61 land-use categories, which I guess might be the source of the error; a possible consistent setup is sketched below.
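If so, a consistent configuration would look something like the following sketch of the relevant &physics entries (the values are assumptions to test, not a confirmed fix):
Code:
&physics
 num_land_cat     = 61,         ! must match NUM_LAND_CAT in wrfinput_d0*
 sf_urban_physics = 3,  3,  3,  ! WRF expects the same urban option on all domains
/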
I will perform some additional tests and will let you know.

Lluís
 
@Lluís @Hilson,
I am sorry for the late reply; it took us some time to fix the possible issues shown in your cases. Please take a look at the document here, and let me know if you still have problems.

Note that the bug fix is not officially released and we may still need to conduct more tests to make sure it works as expected. Your feedback will be helpful. Thanks for reporting this issue.
 