
3-15 km forecast stopped without errors from MPAS

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

xtian15

Member
I am trying to run simulations with the 3-15 km variable-resolution mesh on 1024 cores (64 nodes * 16 cores/node). The simulation marched forward for nearly 6 hours and then stopped without any errors reported by MPAS. The only error messages can be seen in the attached log.fcst, e.g.:
[ip-10-0-32-29:24545] 1 more process has sent help message help-mpi-btl-tcp.txt / peer hung up
[ip-10-0-32-29:24545] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-10-0-32-29:24571] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24577] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24569] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24584] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24578] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24583] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24581] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-32-29:24575] pml_ob1_sendreq.c:317 FATAL
[ip-10-0-46-122][[8560,1],913][btl_tcp_frag.c:135:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
Would you mind taking a look to see what the reasons might be? Also attached are the log.atmosphere.out, namelist, and streams files.
Thanks in advance!
 

Attachments

  • log.atmosphere.0000.out.txt (1.4 MB)
  • log.fcst.txt (28.5 KB)
  • namelist.atmosphere.txt (1.8 KB)
  • streams.atmosphere.txt (1.7 KB)
How much memory do you have available on each node? As discussed in this thread, a rough estimate for the memory requirement of an MPAS simulation with 55 vertical levels running in single precision with the default "mesoscale_reference" physics suite is around 0.175 MB per grid column. It looks like you've got about twice the number of vertical levels, and you're running in double precision, so that might just about quadruple the memory requirement. 6488066 columns across 64 nodes might need as much as 71 GB / node (6488066 columns * 0.175 MB/column * 4 / 64 nodes).
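
For anyone wanting to redo this estimate for a different mesh, node count, or configuration, here is a minimal sketch of the same arithmetic in Python. The 0.175 MB/column reference value, the level and precision scaling, and the 64-node count come from this thread; the function name, the assumed ~110 vertical levels, and the linear-in-levels scaling are illustrative assumptions, not part of MPAS itself.

# Rough per-node memory estimate for an MPAS-Atmosphere run, based on the
# rule of thumb above: ~0.175 MB per grid column for 55 vertical levels in
# single precision with the default "mesoscale_reference" physics suite.
def mpas_mem_per_node_gb(n_columns, n_nodes, n_levels=55,
                         double_precision=False,
                         mb_per_column_ref=0.175, ref_levels=55):
    scale = n_levels / ref_levels        # assume memory grows roughly linearly with levels
    if double_precision:
        scale *= 2.0                     # assume double precision roughly doubles storage
    total_mb = n_columns * mb_per_column_ref * scale
    return total_mb / n_nodes / 1000.0   # MB -> GB (1000 MB per GB, as in the estimate above)

# 3-15 km mesh: 6,488,066 columns; 64 nodes; roughly twice 55 levels; double precision
print(round(mpas_mem_per_node_gb(6488066, 64, n_levels=110, double_precision=True)))
# prints 71, i.e. roughly 71 GB per node, matching the figure above

Swapping in a different mesh size, node count, level count, or precision shows quickly whether a given configuration is likely to fit in the memory available per node.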
 