
MPT ERROR

This post is from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

Lintao

New member
Hi All,
I am running WRF, and sometimes I get an error message like:
MPT ERROR: MPI_COMM_WORLD rank 28 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
Normally I change nothing and simply submit the same job again, and it succeeds. However, I would like to solve this problem because it wastes computing hours and wall time. Looking forward to getting a solution here. Thank you. Lintao
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
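When a run dies this way, a useful first check is the per-rank log of the rank that MPT names (rank 28 here); with a segmentation fault (signal 11) that file often just stops without an explicit error message. A minimal sketch, assuming the rsl files sit in the run directory and follow the usual rsl.error.NNNN naming:

# Inspect the end of the failing rank's log (MPT reported rank 28 in this case)
tail -n 20 rsl.error.0028
# Compare against the last time step reached by rank 0
tail -n 5 rsl.error.0000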
(1) The rsl.error.0000 file begins normally:
taskid: 0 hostname: r11i4n35
module_io_quilt_old.F 2931 F
Quilting with 2 groups of 8 I/O tasks.
Ntasks in X 24 , ntasks in Y 24
--- WARNING: traj_opt is zero, but num_traj is not zero; setting num_traj to zero.
--- NOTE: grid_fdda is 0 for domain 1, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain 1, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain 1, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: bl_pbl_physics /= 4, implies mfshconv must be 0, resetting
Need MYNN PBL for icloud_bl = 1, resetting to 0
*************************************
No physics suite selected.
Physics options will be used directly from the namelist.
*************************************
--- NOTE: RRTMG radiation is in use, setting: levsiz=59, alevsiz=12, no_src_types=6
--- NOTE: num_soil_layers has been set to 4
WRF V3.9.1.1 MODEL
*************************************
Parent domain
ids,ide,jds,jde 1 470 1 470
ims,ime,jms,jme -4 27 -4 27
ips,ipe,jps,jpe 1 20 1 20
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 47831736 bytes allocated
RESTART run: opening wrfrst_d01_1999-04-01_00:00:00 for reading
Domain 1 Setting history stream 24 for xtime
Domain 1 Setting history stream 24 for t2
Domain 1 Setting history stream 24 for rainnc
Domain 1 Setting history stream 24 for i_rainnc
Domain 1 Setting history stream 24 for rainc
Domain 1 Setting history stream 24 for i_rainc
Timing for processing restart file for domain 1: 13.45700 elapsed seconds
Max map factor in domain 1 = 1.13. Scale the dt in the model accordingly.
SOIL TEXTURE CLASSIFICATION = STAS FOUND 19 CATEGORIES
ThompMP: read qr_acr_qg.dat instead of computing
ThompMP: read qr_acr_qs.dat instead of computing
ThompMP: read freezeH2O.dat instead of computing
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrf2d_d01_1999-04-01_00:00:00 for domain 1: 0.07608 elapsed seconds
Time in file: 1999-01-01_00:00:00
Time on domain: 1999-04-01_00:00:00
**WARNING** Time in input file not equal to time on domain **WARNING**
**WARNING** Trying next time in file wrflowinp_d01 ...
Time in file: 1999-01-01_06:00:00
Time on domain: 1999-04-01_00:00:00
**WARNING** Time in input file not equal to time on domain **WARNING**
**WARNING** Trying next time in file wrflowinp_d01 ...
Time in file: 1999-01-01_12:00:00
Time on domain: 1999-04-01_00:00:00

(2) However, the rsl.error.0000 file ends abnormally: the elapsed seconds for writing the WRF output files and for the main time steps become much longer (a short sketch for extracting these timings follows the excerpt):
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrf2d_d01_1999-04-08_23:30:00 for domain 1: 0.02153 elapsed seconds
Timing for main: time 1999-04-08_23:31:00 on domain 1: 0.79337 elapsed seconds
Timing for main: time 1999-04-08_23:32:00 on domain 1: 0.14642 elapsed seconds
Timing for main: time 1999-04-08_23:33:00 on domain 1: 0.13997 elapsed seconds
Timing for main: time 1999-04-08_23:34:00 on domain 1: 0.14366 elapsed seconds
Timing for main: time 1999-04-08_23:35:00 on domain 1: 0.15100 elapsed seconds
Timing for main: time 1999-04-08_23:36:00 on domain 1: 0.14327 elapsed seconds
Timing for main: time 1999-04-08_23:37:00 on domain 1: 0.14526 elapsed seconds
Timing for main: time 1999-04-08_23:38:00 on domain 1: 0.14353 elapsed seconds
Timing for main: time 1999-04-08_23:39:00 on domain 1: 0.14659 elapsed seconds
Timing for main: time 1999-04-08_23:40:00 on domain 1: 0.15301 elapsed seconds
Timing for main: time 1999-04-08_23:41:00 on domain 1: 0.78110 elapsed seconds
Timing for main: time 1999-04-08_23:42:00 on domain 1: 0.14358 elapsed seconds
Timing for main: time 1999-04-08_23:43:00 on domain 1: 0.14528 elapsed seconds
Timing for main: time 1999-04-08_23:44:00 on domain 1: 0.14630 elapsed seconds
Timing for main: time 1999-04-08_23:45:00 on domain 1: 0.14771 elapsed seconds
Timing for main: time 1999-04-08_23:46:00 on domain 1: 0.14317 elapsed seconds
Timing for main: time 1999-04-08_23:47:00 on domain 1: 0.14344 elapsed seconds
Timing for main: time 1999-04-08_23:48:00 on domain 1: 0.14372 elapsed seconds
Timing for main: time 1999-04-08_23:49:00 on domain 1: 0.14800 elapsed seconds
Timing for main: time 1999-04-08_23:50:00 on domain 1: 0.16002 elapsed seconds
Timing for main: time 1999-04-08_23:51:00 on domain 1: 0.77700 elapsed seconds
Timing for main: time 1999-04-08_23:52:00 on domain 1: 0.14343 elapsed seconds
Timing for main: time 1999-04-08_23:53:00 on domain 1: 0.14285 elapsed seconds
Timing for main: time 1999-04-08_23:54:00 on domain 1: 0.14292 elapsed seconds
Timing for main: time 1999-04-08_23:55:00 on domain 1: 0.15241 elapsed seconds
Timing for main: time 1999-04-08_23:56:00 on domain 1: 0.14321 elapsed seconds
Timing for main: time 1999-04-08_23:57:00 on domain 1: 0.14362 elapsed seconds
Timing for main: time 1999-04-08_23:58:00 on domain 1: 0.14432 elapsed seconds
Timing for main: time 1999-04-08_23:59:00 on domain 1: 0.14314 elapsed seconds
Timing for main: time 1999-04-09_00:00:00 on domain 1: 0.15723 elapsed seconds
mediation_integrate.G 1728 DATASET=HISTORY
mediation_integrate.G 1729 grid%id 1 grid%oid 4
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrfout_d01_1999-04-09_00:00:00 for domain 1: 0.39498 elapsed seconds
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrf2d_d01_1999-04-09_00:00:00 for domain 1: 3.86019 elapsed seconds
d01 1999-04-09_00:00:00 Input data processed for aux input 4 for domain 1
Timing for processing lateral boundary for domain 1: 0.83362 elapsed seconds
Timing for main: time 1999-04-09_00:01:00 on domain 1: 5.89514 elapsed seconds
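A quick way to pull these timings out of the log and rank the slow steps is a grep/awk pass (a minimal sketch, assuming the standard rsl.error.0000 wording shown above; the awk field numbers simply match that wording):

# Ten slowest main time steps: the timing value is the 9th field, the timestamp the 5th
grep "Timing for main" rsl.error.0000 | awk '{print $9, $5}' | sort -rn | head
# Ten slowest output writes: value is the 3rd-from-last field, the file path the 4th
grep "Timing for Writing" rsl.error.0000 | awk '{print $(NF-2), $4}' | sort -rn | head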

(3) Part of the namelist is shown below; a quick consistency check on the intervals follows the excerpt:
&time_control
run_days = 91,
run_hours = 0,
run_minutes = 0,
run_seconds = 0,
start_year = 1999, 2000, 2000,
start_month = 04, 01, 01,
start_day = 01, 24, 24,
start_hour = 00, 12, 12,
start_minute = 00, 00, 00,
start_second = 00, 00, 00,
end_year = 2016, 2000, 2000,
end_month = 01, 01, 01,
end_day = 01, 25, 25,
end_hour = 00, 12, 12,
end_minute = 00, 00, 00,
end_second = 00, 00, 00,
interval_seconds = 21600
input_from_file = .true.,.true.,.true.,
history_interval = 60, 60, 60,
frames_per_outfile = 1, 1000, 1000,
history_outname = "/glade/scratch/lintao/conus2/wrfout1999/wrfout_d<domain>_<date>"
restart = .true.,
restart_interval = 131040,
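As a consistency check on these intervals (a small shell-arithmetic sketch; restart_interval and history_interval are interpreted in minutes, the usual WRF convention):

# 91 days of run length expressed in minutes
echo $(( 91 * 24 * 60 ))   # 131040, matching restart_interval above
# history_interval = 60 with frames_per_outfile = 1 means one wrfout file per model hour on domain 1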

(4) Part of the job script is shown below; a check of the requested task counts follows it:
### Select 16 nodes with 36 CPUs each (576 compute MPI processes), plus 2 nodes with 8 CPUs each for the 16 I/O quilting tasks
#PBS -l select=16:ncpus=36:mpiprocs=36+2:ncpus=8:mpiprocs=8
rm /glade/scratch/lintao/lake_test/WRF_compiled_limit_rblim_plus/run/namelist.input
cp /glade/scratch/lintao/lake_test/settings/namelist.input_conusii_AMJ /glade/scratch/lintao/lake_test/WRF_compiled_limit_rblim_plus/run/namelist.input
cd /glade/scratch/lintao/lake_test/WRF_compiled_limit_rblim_plus/run
mpiexec_mpt ./wrf.exe
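The task counts in the select line can be checked against the decomposition and quilting setup reported in the rsl log above (Ntasks in X/Y = 24, and 2 groups of 8 I/O tasks); a minimal sketch:

# Compute ranks used by the model itself: 24 x 24 decomposition
echo $(( 24 * 24 ))            # 576
# Total MPI ranks requested: 16 nodes x 36, plus 2 nodes x 8 for the 16 I/O quilting tasks
echo $(( 16 * 36 + 2 * 8 ))    # 592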

Please let me know if you need any further information.
Thank you in advance.
Best regards,
Lintao
 
Hi,
Unfortunately, if runs sometimes fail and then resubmitting the exact same job succeeds, this is likely a system or environment problem rather than a problem with WRF. Based on the directories in your post, it looks like you are using NCAR's Cheyenne system. You should reach out to the CISL support group to see if they have any ideas about what may be occurring. I'm sorry I don't have a better solution.
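Until the root cause is tracked down, one possible workaround, given that resubmitting the unchanged job usually succeeds, is to wrap the launch line in a simple retry loop inside the job script. This is only a sketch; the retry count is a placeholder, and it assumes wrf.exe returns a nonzero exit status when MPT aborts the job:

# Retry the existing launch line a few times before giving up
MAX_TRIES=3
for try in $(seq 1 $MAX_TRIES); do
    mpiexec_mpt ./wrf.exe && break
    echo "wrf.exe failed on attempt $try, retrying..."
done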
 
Hi Lintao,
I ran into a similar problem. I just wanted to know how you finally fixed this memory-leak issue. I tried to contact the CISL support group, but unfortunately no one responded.
Thanks in advance.
 
@daisyhjt,
If it's been more than a week since you contacted CISL support, try to reach out again.
 