Hi All,
I am running WRF, and sometimes I get the following error message:
MPT ERROR: MPI_COMM_WORLD rank 28 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
Normally I change nothing and simply resubmit the same job, and it succeeds. However, I need to solve this problem because it wastes computing hours and time. I look forward to finding a solution here. Thank you. Lintao
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(1) The rsl.error.0000 file begins normally:
taskid: 0 hostname: r11i4n35
module_io_quilt_old.F 2931 F
Quilting with 2 groups of 8 I/O tasks.
Ntasks in X 24 , ntasks in Y 24
--- WARNING: traj_opt is zero, but num_traj is not zero; setting num_traj to zero.
--- NOTE: grid_fdda is 0 for domain 1, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain 1, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain 1, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: bl_pbl_physics /= 4, implies mfshconv must be 0, resetting
Need MYNN PBL for icloud_bl = 1, resetting to 0
*************************************
No physics suite selected.
Physics options will be used directly from the namelist.
*************************************
--- NOTE: RRTMG radiation is in use, setting: levsiz=59, alevsiz=12, no_src_types=6
--- NOTE: num_soil_layers has been set to 4
WRF V3.9.1.1 MODEL
*************************************
Parent domain
ids,ide,jds,jde 1 470 1 470
ims,ime,jms,jme -4 27 -4 27
ips,ipe,jps,jpe 1 20 1 20
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 47831736 bytes allocated
RESTART run: opening wrfrst_d01_1999-04-01_00:00:00 for reading
Domain 1 Setting history stream 24 for xtime
Domain 1 Setting history stream 24 for t2
Domain 1 Setting history stream 24 for rainnc
Domain 1 Setting history stream 24 for i_rainnc
Domain 1 Setting history stream 24 for rainc
Domain 1 Setting history stream 24 for i_rainc
Timing for processing restart file for domain 1: 13.45700 elapsed seconds
Max map factor in domain 1 = 1.13. Scale the dt in the model accordingly.
SOIL TEXTURE CLASSIFICATION = STAS FOUND 19 CATEGORIES
ThompMP: read qr_acr_qg.dat stead of computing
ThompMP: read qr_acr_qs.dat instead of computing
ThompMP: read freezeH2O.dat stead of computing
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrf2d_d01_1999-04-01_00:00:00 for domain 1: 0.07608 elapsed seconds
Time in file: 1999-01-01_00:00:00
Time on domain: 1999-04-01_00:00:00
**WARNING** Time in input file not equal to time on domain **WARNING**
**WARNING** Trying next time in file wrflowinp_d01 ...
Time in file: 1999-01-01_06:00:00
Time on domain: 1999-04-01_00:00:00
**WARNING** Time in input file not equal to time on domain **WARNING**
**WARNING** Trying next time in file wrflowinp_d01 ...
Time in file: 1999-01-01_12:00:00
Time on domain: 1999-04-01_00:00:00
(2) However, rsl.error.0000 ends abnormally: the last entries show much longer elapsed times for writing the WRF output and for the main time step:
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrf2d_d01_1999-04-08_23:30:00 for domain 1: 0.02153 elapsed seconds
Timing for main: time 1999-04-08_23:31:00 on domain 1: 0.79337 elapsed seconds
Timing for main: time 1999-04-08_23:32:00 on domain 1: 0.14642 elapsed seconds
Timing for main: time 1999-04-08_23:33:00 on domain 1: 0.13997 elapsed seconds
Timing for main: time 1999-04-08_23:34:00 on domain 1: 0.14366 elapsed seconds
Timing for main: time 1999-04-08_23:35:00 on domain 1: 0.15100 elapsed seconds
Timing for main: time 1999-04-08_23:36:00 on domain 1: 0.14327 elapsed seconds
Timing for main: time 1999-04-08_23:37:00 on domain 1: 0.14526 elapsed seconds
Timing for main: time 1999-04-08_23:38:00 on domain 1: 0.14353 elapsed seconds
Timing for main: time 1999-04-08_23:39:00 on domain 1: 0.14659 elapsed seconds
Timing for main: time 1999-04-08_23:40:00 on domain 1: 0.15301 elapsed seconds
Timing for main: time 1999-04-08_23:41:00 on domain 1: 0.78110 elapsed seconds
Timing for main: time 1999-04-08_23:42:00 on domain 1: 0.14358 elapsed seconds
Timing for main: time 1999-04-08_23:43:00 on domain 1: 0.14528 elapsed seconds
Timing for main: time 1999-04-08_23:44:00 on domain 1: 0.14630 elapsed seconds
Timing for main: time 1999-04-08_23:45:00 on domain 1: 0.14771 elapsed seconds
Timing for main: time 1999-04-08_23:46:00 on domain 1: 0.14317 elapsed seconds
Timing for main: time 1999-04-08_23:47:00 on domain 1: 0.14344 elapsed seconds
Timing for main: time 1999-04-08_23:48:00 on domain 1: 0.14372 elapsed seconds
Timing for main: time 1999-04-08_23:49:00 on domain 1: 0.14800 elapsed seconds
Timing for main: time 1999-04-08_23:50:00 on domain 1: 0.16002 elapsed seconds
Timing for main: time 1999-04-08_23:51:00 on domain 1: 0.77700 elapsed seconds
Timing for main: time 1999-04-08_23:52:00 on domain 1: 0.14343 elapsed seconds
Timing for main: time 1999-04-08_23:53:00 on domain 1: 0.14285 elapsed seconds
Timing for main: time 1999-04-08_23:54:00 on domain 1: 0.14292 elapsed seconds
Timing for main: time 1999-04-08_23:55:00 on domain 1: 0.15241 elapsed seconds
Timing for main: time 1999-04-08_23:56:00 on domain 1: 0.14321 elapsed seconds
Timing for main: time 1999-04-08_23:57:00 on domain 1: 0.14362 elapsed seconds
Timing for main: time 1999-04-08_23:58:00 on domain 1: 0.14432 elapsed seconds
Timing for main: time 1999-04-08_23:59:00 on domain 1: 0.14314 elapsed seconds
Timing for main: time 1999-04-09_00:00:00 on domain 1: 0.15723 elapsed seconds
mediation_integrate.G 1728 DATASET=HISTORY
mediation_integrate.G 1729 grid%id 1 grid%oid 4
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrfout_d01_1999-04-09_00:00:00 for domain 1: 0.39498 elapsed seconds
Timing for Writing /glade/scratch/lintao/conus2/wrfout1999/wrf2d_d01_1999-04-09_00:00:00 for domain 1: 3.86019 elapsed seconds
d01 1999-04-09_00:00:00 Input data processed for aux input 4 for domain 1
Timing for processing lateral boundary for domain 1: 0.83362 elapsed seconds
Timing for main: time 1999-04-09_00:01:00 on domain 1: 5.89514 elapsed seconds
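To put a number on the slowdown, the "Timing for main" lines can be scanned for steps that take far longer than usual. The following is only a minimal sketch: the function names and the factor-of-10 threshold are my own choices for illustration, and it assumes the timing lines look like the ones quoted above.

```python
import re
import statistics

# Matches lines such as:
# "Timing for main: time 1999-04-08_23:31:00 on domain 1: 0.79337 elapsed seconds"
TIMING_RE = re.compile(
    r"Timing for main: time (\S+) on domain\s+(\d+):\s+([\d.]+) elapsed seconds"
)


def parse_main_timings(lines):
    """Return (timestamp, seconds) pairs for every 'Timing for main' line."""
    timings = []
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            timings.append((match.group(1), float(match.group(3))))
    return timings


def slow_steps(timings, factor=10.0):
    """Return the steps slower than `factor` times the median step time.

    The factor of 10 is an arbitrary illustrative threshold.
    """
    if not timings:
        return []
    median = statistics.median(seconds for _, seconds in timings)
    return [(stamp, seconds) for stamp, seconds in timings
            if seconds > factor * median]


def report(path):
    """Print the slow steps found in one rsl file (path is supplied by the caller)."""
    with open(path) as handle:
        for stamp, seconds in slow_steps(parse_main_timings(handle)):
            print(f"{stamp}  {seconds:.5f} s")
```

Calling `report("rsl.error.0000")` in the run directory would list the outlying steps. Comparing against the median keeps the regular 0.78 s steps every ten minutes from being flagged along with the final 5.9 s step.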
(3) Part of the namelist is shown below:
&time_control
run_days = 91,
run_hours = 0,
run_minutes = 0,
run_seconds = 0,
start_year = 1999, 2000, 2000,
start_month = 04, 01, 01,
start_day = 01, 24, 24,
start_hour = 00, 12, 12,
start_minute = 00, 00, 00,
start_second = 00, 00, 00,
end_year = 2016, 2000, 2000,
end_month = 01, 01, 01,
end_day = 01, 25, 25,
end_hour = 00, 12, 12,
end_minute = 00, 00, 00,
end_second = 00, 00, 00,
interval_seconds = 21600
input_from_file = .true.,.true.,.true.,
history_interval = 60, 60, 60,
frames_per_outfile = 1, 1000, 1000,
history_outname = "/glade/scratch/lintao/conus2/wrfout1999/wrfout_d<domain>_<date>"
restart = .true.,
restart_interval = 131040,
(4) Part of the job script is shown below:
### Select 16 nodes with 36 CPUs, for 576 MPI processes
#PBS -l select=16:ncpus=36:mpiprocs=36+2:ncpus=8:mpiprocs=8
rm /glade/scratch/lintao/lake_test/WRF_compiled_limit_rblim_plus/run/namelist.input
cp /glade/scratch/lintao/lake_test/settings/namelist.input_conusii_AMJ /glade/scratch/lintao/lake_test/WRF_compiled_limit_rblim_plus/run/namelist.input
cd /glade/scratch/lintao/lake_test/WRF_compiled_limit_rblim_plus/run
mpiexec_mpt ./wrf.exe
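As a sanity check on the job layout, the task counts reported in the log can be compared with the PBS request. The sketch below is only arithmetic on the numbers quoted in this post. The function names are made up for illustration, and it assumes that each I/O (quilt) task occupies one MPI rank in addition to the compute ranks.

```python
def mpi_task_budget(ntasks_x, ntasks_y, io_groups, io_tasks_per_group):
    """Return (compute ranks, I/O ranks, total ranks) for a quilted run."""
    compute = ntasks_x * ntasks_y
    io = io_groups * io_tasks_per_group
    return compute, io, compute + io


def pbs_mpi_procs(chunks):
    """Total MPI ranks requested; chunks is a list of (nodes, mpiprocs per node)."""
    return sum(nodes * procs for nodes, procs in chunks)


def mean_tile_width(grid_points, ntasks):
    """Average number of grid points per task along one dimension."""
    return grid_points / ntasks


# Numbers taken from the log and job script above.
compute, io, total = mpi_task_budget(24, 24, 2, 8)  # "Ntasks in X 24 , ntasks in Y 24"
requested = pbs_mpi_procs([(16, 36), (2, 8)])       # select=16:...mpiprocs=36+2:...mpiprocs=8
print(compute, io, total, requested)
print(round(mean_tile_width(470, 24), 1))           # ide = 470, 24 tasks in X
```

Under that assumption the request and the decomposition agree (576 compute ranks plus 16 I/O ranks), and each tile is roughly 20 grid points wide, which matches the "ips,ipe,jps,jpe 1 20 1 20" line in the log.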
Please let me know if you need any further information.
Thank you in advance.
Best regards,
Lintao