
WRF.exe hangs with restart_syscall - resuming interrupted poll - unfinished

elad

New member
Hello,

A colleague of mine is having problems with WRF v4.1.3 and also with v4.5.
After running for a seemingly random amount of time, the process hangs while still using CPU cores, and it produces no further output from that point on.

This is the strace output for the 'srun' process:
Bash:
strace: Process 3154776 attached with 5 threads
[pid 3154781] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 3154780] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 3154778] rt_sigtimedwait([HUP INT QUIT USR1 USR2 PIPE ALRM TERM CONT],  <unfinished ...>
[pid 3154776] futex(0x7bbe54, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3154779] restart_syscall(<... resuming interrupted rt_sigtimedwait ...>

This is the output from strace wrf.exe:
Bash:
strace -Tff -p 3155097
strace: Process 3155097 attached with 3 threads
[pid 3155097] [ Process PID=3155097 runs in x32 mode. ]
[pid 3155237] epoll_wait(32,  <unfinished ...>
[pid 3155233] epoll_wait(29,

I'll ask him to answer any follow-up questions that I can't answer myself.

TIA,
E.
 

Attachments

  • namelist.input
    4.8 KB · Views: 1
Hello,
Here's some more info:


Bash:
 Times =
  "2020-01-03_00:00:00",
  "2020-01-03_06:00:00",
  "2020-01-03_12:00:00",
  "2020-01-03_18:00:00",
  "2020-01-04_00:00:00",
  "2020-01-04_06:00:00",
  "2020-01-04_12:00:00",
  "2020-01-04_18:00:00",
  "2020-01-05_00:00:00",
  "2020-01-05_06:00:00",
  "2020-01-05_12:00:00",
  "2020-01-05_18:00:00",
  "2020-01-06_00:00:00",
  "2020-01-06_06:00:00",
  "2020-01-06_12:00:00",
  "2020-01-06_18:00:00",
  "2020-01-07_00:00:00",
  "2020-01-07_06:00:00",
  "2020-01-07_12:00:00",
  "2020-01-07_18:00:00",
  "2020-01-08_00:00:00",
  "2020-01-08_06:00:00",
  "2020-01-08_12:00:00",
  "2020-01-08_18:00:00",
  "2020-01-09_00:00:00",
  "2020-01-09_06:00:00",
  "2020-01-09_12:00:00",
  "2020-01-09_18:00:00" ;
}
 

Attachments

  • rsl.error.0000
    2.3 KB · Views: 1
Hi,
It's odd that the rsl file reports that the simulation was successful even though it never moved past the initial time. I have a few thoughts.

1) Because this is a fairly large domain (at least in the i direction), they may need to run with more than 80 processors. Take a look at this FAQ that discusses choosing an appropriate number of processors, based on domain size.

2) I'm not sure whether it could cause any issues, but I've never seen anyone set frames_per_outfile to such a large number. Ask them to decrease the value from 500,000 to 1,000. All the times for this simulation should still fit in a single file with that setting.

3) There are a lot of advanced output settings in this namelist (e.g., several different aux* settings, io_fields). Sometimes, to narrow down an issue, running with just a basic namelist can rule out the possibility that the problem is related to one of those settings.

4) Often when a simulation stops/stalls right at the beginning, it can mean the input data is junky. Have them check the met_em* files for any NaN values or unreasonable data. Check all variables and all levels.
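
For the check in point 4, here is a minimal Python sketch of the kind of scan I mean. It assumes the values of each met_em variable have already been read into flat sequences of floats (reading the files themselves is left out). The function name and the 1.0e30 threshold for "unreasonable" values are my own assumptions, not anything defined by WRF or WPS.

```python
import math

def find_bad_values(fields, limit=1.0e30):
    """Count NaN, infinite, or implausibly large values per variable.

    `fields` maps a variable name to an iterable of floats.
    `limit` is an assumed threshold for "unreasonable" magnitudes.
    """
    report = {}
    for name, values in fields.items():
        bad = sum(
            1 for v in values
            if math.isnan(v) or math.isinf(v) or abs(v) > limit
        )
        if bad:
            report[name] = bad
    return report

# Hypothetical values for illustration, not taken from the attached files.
sample = {
    "TT": [280.0, float("nan"), 275.5],
    "PSFC": [101325.0, 100900.0],
}
print(find_bad_values(sample))
```

Any variable that shows up in the report is worth inspecting level by level.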

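For point 1, a rough way to sanity-check a processor count against the domain size. This sketch assumes the commonly quoted rule of thumb of at most about 100 x 100 and at least about 25 x 25 grid points per processor; please check the FAQ for the actual guidance. The function name and the grid dimensions in the example are hypothetical.

```python
import math

def processor_range(e_we, e_sn):
    """Return (smallest, largest) suggested processor counts for one domain.

    Assumed rule of thumb (not taken from this thread):
      smallest ~ (e_we / 100) * (e_sn / 100)
      largest  ~ (e_we / 25)  * (e_sn / 25)
    """
    smallest = max(1, math.floor((e_we / 100) * (e_sn / 100)))
    largest = max(1, math.floor((e_we / 25) * (e_sn / 25)))
    return smallest, largest

# Hypothetical grid dimensions, not read from the attached namelist.input.
low, high = processor_range(e_we=1000, e_sn=400)
print(f"suggested range: {low} to {high} processors")
```

For nested runs, the calculation would need to be repeated for each domain.
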
If none of this is helpful, please have them package all the rsl* files into a single *.tar file and attach that so I can take a look.
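
If it helps, packaging the rsl* files can be done with a few lines of Python. This is only a sketch: the function name and the default archive name rsl_files.tar are my own choices, and it assumes the files sit in the run directory under the usual rsl.out.* / rsl.error.* names.

```python
import glob
import os
import tarfile

def pack_rsl_files(run_dir=".", archive="rsl_files.tar"):
    """Bundle every rsl.* file in run_dir into one uncompressed tar archive."""
    paths = sorted(glob.glob(os.path.join(run_dir, "rsl.*")))
    with tarfile.open(archive, "w") as tar:
        for path in paths:
            # Store only the file name so the archive unpacks flat.
            tar.add(path, arcname=os.path.basename(path))
    return [os.path.basename(p) for p in paths]

# Example call from the run directory:
#   pack_rsl_files()
```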

You mention they are having trouble with V4.1.3 and V4.5. Is the issue the same for both versions?
 
Hi @kwerner,
Thank you for your detailed reply.
It was indeed the number of CPUs that caused the problem.
My colleague managed to complete the simulation successfully with 192 CPUs.


Thank you again for your help!
 