WRF.exe hangs with restart_syscall - resuming interrupted poll - unfinished

elad · Sep 19, 2023

Hello,

A colleague of mine is having problems with WRF v4.1.3, and also v4.5.
After a seemingly random amount of run time, the process simply hangs while still using CPU cores, and it produces no further output from that point on.

This is the strace output for the 'srun' process:

Bash:

strace: Process 3154776 attached with 5 threads
[pid 3154781] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 3154780] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 3154778] rt_sigtimedwait([HUP INT QUIT USR1 USR2 PIPE ALRM TERM CONT],  <unfinished ...>
[pid 3154776] futex(0x7bbe54, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3154779] restart_syscall(<... resuming interrupted rt_sigtimedwait ...>

This is the strace output for wrf.exe:

Bash:

strace -Tff -p 3155097
strace: Process 3155097 attached with 3 threads
[pid 3155097] [ Process PID=3155097 runs in x32 mode. ]
[pid 3155237] epoll_wait(32,  <unfinished ...>
[pid 3155233] epoll_wait(29,
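
To make traces like the ones above easier to compare between runs, here is a minimal Python sketch that tallies which system call each thread is sitting in. The helper name `blocked_syscalls` and the regular expressions are my own assumptions, not part of strace or WRF. It only summarizes what strace printed and does not diagnose the cause of the hang.

```python
import re
from collections import Counter

# Matches lines such as "[pid 3154776] futex(0x7bbe54, ..."
LINE_RE = re.compile(r"\[pid\s+(\d+)\]\s+(\w+)\(")
# Matches the call named inside "<... resuming interrupted poll ...>"
RESUME_RE = re.compile(r"resuming interrupted (\w+)")


def blocked_syscalls(strace_text):
    """Map each thread id to the last system call strace reported for it."""
    states = {}
    for line in strace_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip banner lines and anything that is not a syscall
        pid, call = int(match.group(1)), match.group(2)
        if call == "restart_syscall":
            # Report the call that is being resumed, when strace names it.
            resumed = RESUME_RE.search(line)
            if resumed:
                call = resumed.group(1)
        states[pid] = call
    return states


# The 'srun' trace from this post, copied verbatim.
SAMPLE = """\
strace: Process 3154776 attached with 5 threads
[pid 3154781] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 3154780] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 3154778] rt_sigtimedwait([HUP INT QUIT USR1 USR2 PIPE ALRM TERM CONT],  <unfinished ...>
[pid 3154776] futex(0x7bbe54, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3154779] restart_syscall(<... resuming interrupted rt_sigtimedwait ...>
"""

states = blocked_syscalls(SAMPLE)
for call, count in Counter(states.values()).most_common():
    print(f"{count} thread(s) in {call}")
```

Saving the output of `strace -ff -p <pid>` to a file and passing its contents to `blocked_syscalls` gives the same summary for the wrf.exe ranks.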

I'll also ask him to answer any follow-up questions that I am not able to answer myself.

TIA,
E.

elad · Sep 20, 2023

Hello,
Here's some more info:

Bash:

 Times =
  "2020-01-03_00:00:00",
  "2020-01-03_06:00:00",
  "2020-01-03_12:00:00",
  "2020-01-03_18:00:00",
  "2020-01-04_00:00:00",
  "2020-01-04_06:00:00",
  "2020-01-04_12:00:00",
  "2020-01-04_18:00:00",
  "2020-01-05_00:00:00",
  "2020-01-05_06:00:00",
  "2020-01-05_12:00:00",
  "2020-01-05_18:00:00",
  "2020-01-06_00:00:00",
  "2020-01-06_06:00:00",
  "2020-01-06_12:00:00",
  "2020-01-06_18:00:00",
  "2020-01-07_00:00:00",
  "2020-01-07_06:00:00",
  "2020-01-07_12:00:00",
  "2020-01-07_18:00:00",
  "2020-01-08_00:00:00",
  "2020-01-08_06:00:00",
  "2020-01-08_12:00:00",
  "2020-01-08_18:00:00",
  "2020-01-09_00:00:00",
  "2020-01-09_06:00:00",
  "2020-01-09_12:00:00",
  "2020-01-09_18:00:00" ;
}

kwerner · Sep 21, 2023

Hi,
It's odd that the rsl file printed that the simulation was successful, but that it didn't move past the initial time. I have a few thoughts.

1) Because this is a fairly large domain (or at least in the i direction), they may need to run with more processors than 80. Take a look at this FAQ that discusses choosing an appropriate number of processors, based on domain size.

2) I'm not sure if it could cause any issues, but I've never seen anyone set frames_per_outfile to such a large number. Ask them to decrease the value from 500,000 to 1,000. All the times for this simulation should still fit in a single file with that setting.

3) There are a lot of advanced output settings in this namelist (e.g., several different aux* settings, io_fields). Sometimes, to narrow down an issue, running with just a basic namelist can rule out the possibility that the problem is related to one of those settings.

4) Often when a simulation stops/stalls right at the beginning, it can mean the input data is junky. Have them check the met_em* files for any NaN values or unreasonable data. Check all variables and all levels.
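
The input check in item 4 can be scripted. Below is a minimal Python sketch, not an official WRF tool. The function names and the `limit` threshold of 1.0e30 are assumptions chosen for illustration, and `load_float_variables` assumes the third-party `netCDF4` and `numpy` packages are installed. The demo data at the bottom is made up.

```python
import math


def find_bad_values(variables, limit=1.0e30):
    """Return {name: count} for variables holding NaN, inf, or huge values.

    `variables` maps a variable name to a flat sequence of floats.
    `limit` is an arbitrary cutoff for "unreasonable" magnitudes (assumption).
    """
    bad = {}
    for name, values in variables.items():
        count = sum(
            1 for v in values if not math.isfinite(v) or abs(v) >= limit
        )
        if count:
            bad[name] = count
    return bad


def load_float_variables(path):
    """Read every floating-point variable of a netCDF file into flat lists.

    Assumes the netCDF4 and numpy packages are available; they are imported
    here so the checker above can be used without them.
    """
    import numpy as np
    from netCDF4 import Dataset

    out = {}
    with Dataset(path) as ds:
        for name, var in ds.variables.items():
            # Skip character/integer variables such as Times.
            if getattr(var.dtype, "kind", "") == "f":
                # Masked (missing) entries become NaN so they get counted.
                out[name] = np.ma.filled(var[:], np.nan).ravel().tolist()
    return out


# Made-up values standing in for fields read from one met_em* file.
demo = {
    "TT": [280.0, float("nan")],
    "PSFC": [101325.0],
    "UU": [float("inf"), 1e35, 3.0],
}
report = find_bad_values(demo)
print(report)
```

For a real file, `find_bad_values(load_float_variables(path))` would be run once per met_em* file. This only flags non-finite or extreme values, so fields that are finite but physically implausible still need a visual check.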

If none of this is helpful, please have them package all the rsl* files into a single *.tar file and attach that so I can take a look.

You mention they are having trouble with V4.1.3 and V4.5. Is the issue the same for both versions?

elad · Sep 26, 2023

Hi @kwerner,
Thank you for your detailed reply.
It was indeed the number of CPUs that caused the hang.
My colleague managed to complete the simulation successfully with 192 CPUs.

Thank you again for your help!
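
For anyone who finds this thread with the same symptom: the FAQ mentioned earlier gives, as far as I recall, a rule of thumb of roughly (e_we/100) * (e_sn/100) processors as a lower bound and (e_we/25) * (e_sn/25) as an upper bound. Please check the FAQ itself before relying on this. The Python sketch below just does that arithmetic. The function name and the example grid size are hypothetical, since the namelist for this case is not reproduced in the thread.

```python
def processor_range(e_we, e_sn):
    """Suggested (min, max) MPI task count for a domain of e_we x e_sn points.

    Assumes the rule of thumb:
      min ~ (e_we / 100) * (e_sn / 100), rounded up
      max ~ (e_we / 25)  * (e_sn / 25),  rounded down
    """
    points = e_we * e_sn
    fewest = max(1, -(-points // (100 * 100)))  # ceiling division
    most = max(1, points // (25 * 25))          # floor division
    return fewest, most


# Hypothetical grid dimensions, NOT the ones from this case.
fewest, most = processor_range(1000, 400)
print(f"use between {fewest} and {most} processors")
```

For a nested run, the lower bound would be computed from the largest domain and the upper bound from the smallest one, so that no domain ends up with tiles that are too large or too small.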

kwerner · Sep 26, 2023

That's great news! Thank you for posting an update to the issue.
