Dear all,
in my case, WRF (v4.4) is not writing the restart file. I am using OpenMPI, and the error is apparently related to it. Using 60 nodes with 128 tasks per node, I am running a domain of 2200 x 2200 grid cells at 4 km resolution. At first I thought the error was related to I/O quilting, but I have now disabled it:
&namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 1,
/
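(For reference, a quilting-enabled configuration would instead look something like the block below; the counts are illustrative, not values I have tested:)

&namelist_quilt
 nio_tasks_per_group = 4,
 nio_groups = 2,
/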
Here are the last lines of the rst.error.0000 file:
Timing for main (dt= 17.45): time 2021-09-01_23:59:42 on domain 1: 0.16857 elapsed seconds
Top of Radiation Driver
CLWRFdiag - T2; tile: 1 T2clmin: 285.6248 T2clmax: 286.7840 TT2clmin: 1373.818 TT2clmax: 30.68833 T2clmean: 286.1159 T2clstd: 0.3615128
Timing for main (dt= 17.44): time 2021-09-02_00:00:00 on domain 1: 0.16765 elapsed seconds
med_hist_out: opened wrfout_d01_2021-09-02_00:00:00 as DATASET=HISTORY
Timing for Writing wrfout_d01_2021-09-02_00:00:00 for domain 1: 88.84060 elapsed seconds
med_hist_out: opened wrfxtrm_d01_2021-09-02_00:00:00 as DATASET=AUXHIST3
Timing for Writing wrfxtrm_d01_2021-09-02_00:00:00 for domain 1: 14.97954 elapsed seconds
med_restart_out: opening rstWRF_d01_2021-09-02_00:00:00 for writing
[ad2-1006:1821090:0:1821090] rndv.c:459 Assertion `status == UCS_OK' failed
/usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/rndv/rndv.c: [ ucp_rndv_progress_rma_zcopy_common() ]
...
456
457 if (req->send.mdesc == NULL) {
458 status = ucp_send_request_add_reg_lane(req, lane);
==> 459 ucs_assert_always(status == UCS_OK);
460 }
461
462 rsc_index = ucp_ep_get_rsc_index(ep, lane);
==== backtrace (tid:1821090) ====
0 0x000000000004d5e0 ucp_rndv_progress_rma_zcopy_common() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/rndv/rndv.c:459
1 0x000000000004e867 ucp_request_try_send() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/core/ucp_request.inl:302
2 0x000000000004e867 ucp_request_send() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/core/ucp_request.inl:327
3 0x000000000004e867 ucp_rndv_req_send_rma_get() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/rndv/rndv.c:826
4 0x000000000004e867 ucp_rndv_receive() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/rndv/rndv.c:1397
5 0x000000000006661e ucp_tag_recv_common() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/tag/tag_recv.c:134
6 0x000000000006661e ucp_tag_recv_common() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/tag/tag_recv.c:137
7 0x000000000006661e ucp_tag_recv_nbx() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/ucx-1.11.1/src/ucp/tag/tag_recv.c:224
8 0x00000000000044d1 mca_pml_ucx_recv() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:636
9 0x000000000000516d mca_coll_basic_gatherv_intra() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mca/coll/basic/coll_basic_gatherv.c:88
10 0x00000000000746ba PMPI_Gatherv() /usr/local/apps/hpcx-openmpi/2.9.0/INTEL/2021.4/sources/openmpi-gitclone/ompi/mpi/c/profile/pgatherv.c:196
11 0x0000000000ae039a collect_on_comm0_() ???:0
12 0x0000000000869e91 wrf_patch_to_global_real_() ???:0
13 0x0000000001761890 collect_generic_and_call_pkg_() ???:0
14 0x000000000175f579 collect_real_and_call_pkg_() ???:0
15 0x000000000175dd8e collect_fld_and_call_pkg_() ???:0
16 0x000000000175d34d wrf_write_field1_() ???:0
17 0x000000000175ceff wrf_write_field_() ???:0
18 0x0000000001c930fb wrf_ext_write_field_() ???:0
19 0x000000000158b04f output_wrf_() ???:0
20 0x00000000014f7910 module_io_domain_mp_open_w_dataset_() ???:0
21 0x000000000161f010 med_last_solve_io_() ???:0
22 0x00000000005b9371 module_integrate_mp_integrate_() ???:0
23 0x00000000004171b1 module_wrf_top_mp_wrf_run_() ???:0
24 0x0000000000417164 MAIN__() ???:0
25 0x00000000004170e2 main() ???:0
26 0x000000000003acf3 __libc_start_main() ???:0
27 0x0000000000416fee _start() ???:0
=================================
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
wrf.exe 00000000031DDABB for__signal_handl Unknown Unknown
libpthread-2.28.s 00001482257A1CE0 Unknown Unknown Unknown
libc-2.28.so 0000148225418A9F gsignal Unknown Unknown
libc-2.28.so 00001482253EBE05 abort Unknown Unknown
libucs.so.0.0.0 000014820FA76B62 ucs_fatal_error_m Unknown Unknown
libucs.so.0.0.0 000014820FA76A6C ucs_fatal_error_f Unknown Unknown
libucp.so.0.0.0 000014821420B5E0 Unknown Unknown Unknown
libucp.so.0.0.0 000014821420C867 ucp_rndv_receive Unknown Unknown
libucp.so.0.0.0 000014821422461E ucp_tag_recv_nbx Unknown Unknown
mca_pml_ucx.so 00001482146714D1 mca_pml_ucx_recv Unknown Unknown
mca_coll_basic.so 000014820C22116D mca_coll_basic_ga Unknown Unknown
libmpi.so.40.30.1 00001482261C56BA MPI_Gatherv Unknown Unknown
wrf.exe 0000000000AE039A Unknown Unknown Unknown
wrf.exe 0000000000869E91 Unknown Unknown Unknown
wrf.exe 0000000001761890 Unknown Unknown Unknown
wrf.exe 000000000175F579 Unknown Unknown Unknown
wrf.exe 000000000175DD8E Unknown Unknown Unknown
wrf.exe 000000000175D34D Unknown Unknown Unknown
wrf.exe 000000000175CEFF Unknown Unknown Unknown
wrf.exe 0000000001C930FB Unknown Unknown Unknown
wrf.exe 000000000158B04F Unknown Unknown Unknown
wrf.exe 00000000014F7910 Unknown Unknown Unknown
wrf.exe 000000000161F010 Unknown Unknown Unknown
wrf.exe 00000000005B9371 Unknown Unknown Unknown
wrf.exe 00000000004171B1 Unknown Unknown Unknown
wrf.exe 0000000000417164 Unknown Unknown Unknown
wrf.exe 00000000004170E2 Unknown Unknown Unknown
libc-2.28.so 0000148225404CF3 __libc_start_main Unknown Unknown
wrf.exe 0000000000416FEE Unknown Unknown Unknown
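In case it helps to see the communication pattern the backtrace points at: as I understand it, with quilting disabled WRF gathers each decomposed field onto rank 0 via MPI_Gatherv (collect_on_comm0 / wrf_patch_to_global_real) before rank 0 writes the file, and it is this gather that dies inside the UCX rendezvous path. Below is a minimal C sketch of that pattern; the patch size is hypothetical and this is only an illustration, not WRF source.

/* Sketch of the gather-to-rank-0 pattern suggested by the backtrace.
 * Each rank owns a patch of the global field; rank 0 collects them all
 * with MPI_Gatherv and would then write the file. Sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int patch_len = 1000;                      /* hypothetical patch size */
    float *patch = malloc(patch_len * sizeof(float));
    for (int i = 0; i < patch_len; i++) patch[i] = (float)rank;

    /* Rank 0 sets up receive counts and displacements for the global field. */
    int *counts = NULL, *displs = NULL;
    float *global = NULL;
    if (rank == 0) {
        counts = malloc(nranks * sizeof(int));
        displs = malloc(nranks * sizeof(int));
        for (int r = 0; r < nranks; r++) {
            counts[r] = patch_len;
            displs[r] = r * patch_len;
        }
        global = malloc((size_t)nranks * patch_len * sizeof(float));
    }

    /* This collective is the MPI_Gatherv visible in the backtrace. */
    MPI_Gatherv(patch, patch_len, MPI_FLOAT,
                global, counts, displs, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    if (rank == 0) printf("gathered %d values on rank 0\n", nranks * patch_len);

    free(patch);
    if (rank == 0) { free(counts); free(displs); free(global); }
    MPI_Finalize();
    return 0;
}

With 60 x 128 = 7680 ranks, rank 0 has to receive thousands of rendezvous-sized messages per field, which may be what makes ucp_send_request_add_reg_lane fail (the assertion at rndv.c:459), though that is only my guess.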
My namelist.input file is attached.
Any help is highly appreciated.
Patrick