I am trying to run simulations with the 3-15 km variable-resolution mesh on 1024 cores (64 nodes * 16 cores/node). The simulation marched for nearly 6 hours and then stopped without any errors reported by MPAS. The only error messages are in the attached log.fcst; an excerpt is shown below. Would you mind taking a look to see what the reasons might be? Also attached are the log.atmosphere.out, namelist, and streams.
[ip-10-0-32-29:24545] 1 more process has sent help message help-mpi-btl-tcp.txt / peer hung up
[ip-10-0-32-29:24545] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-10-0-32-29:24571] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24577] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24569] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24584] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24578] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24583] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24581] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24575] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-46-122][[8560,1],913][btl_tcp_frag.c:135:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
Thanks in advance!