Persistent MPI crash ("recv failed: Connection reset by peer")

TiagoFCR

Hello,


I'm running WRF v4.6.1 with two domains (15 km and 3 km), using GFS 0.25° data as input. The model runs successfully with GFS 00z, 06z, and 18z initializations, but consistently crashes only when initialized with GFS 12z.

There is no explicit CFL or physics error in the logs. The last line in rsl.error.* is typically:
d01 YYYY-MM-DD_HH:MM:SS Input data is acceptable to use:

After that, the model stops advancing in simulation time — MPI processes remain active, but no further output is generated. The following MPI error appears:
recv(19) failed: Connection reset by peer (104)

I’m launching the model using:
/usr/bin/mpirun -n 26 --bind-to core --map-by core ./wrf.exe

This issue has occurred consistently since April 28.
From late February until then, I was running 12z GFS initializations without problems.
Since April 28, only a few 12z runs have completed successfully.

I’ve already tested multiple changes in namelist.input, including different combinations of PBL, microphysics, and cumulus schemes, but the issue persists.
I will attach namelist.input and example rsl.error.* files.

Has anyone seen similar behavior? Any suggestions for debugging or workarounds would be appreciated.

Thanks in advance!
 

Attachments

  • namelist.input (4.8 KB)
  • rsl.error.0000 (976.5 KB)
  • rsl.out.0000 (976 KB)

Update:
Initially, the issue occurred only with GFS 12z initializations. However, I’m now seeing the same behavior (hangs with no time progress, followed by recv failed: Connection reset by peer) regardless of the initialization time — including 00z, 06z and 18z runs. This suggests the problem is no longer specific to 12z data.


If anyone has ideas, suggestions, or has experienced something similar, I’d really appreciate any input — I’m running out of options to test.
 
Hi, I have a few thoughts about this.

1) Some datasets worked before but now fail, which could point to an issue with your computing environment. Is it also possible that you're out of disk space where the WRF output is being written?
2) For the size of your domains (290x190 and 371x301), you likely need to use a lot more than 26 processors. See Choosing an Appropriate Number of Processors for guidance.
3) Because you're using 26 processors, the domain decomposition is 2x13, which is pretty unbalanced. I'm not sure how much this could impact the run, but when you choose the number of processors, it may be best to pick a value whose closest factor pair is nearly equal, so that the decomposition is closer to square (it doesn't need to be a perfect square).
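
To make point 3 concrete, here is a minimal Python sketch that lists the closest factor pair for a few candidate process counts. It assumes the default layout is the factor pair nearest to a square, which matches the 2x13 split quoted above for 26 processes; the function name and the candidate counts are illustrative only, not part of WRF.

```python
import math


def closest_factor_pair(nprocs: int) -> tuple[int, int]:
    """Return (a, b) with a * b == nprocs and a <= b, as close to square as possible."""
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    # Walk down from floor(sqrt(nprocs)); the first divisor found gives the
    # most balanced pair. a == 1 always divides, so the loop always returns.
    for a in range(math.isqrt(nprocs), 0, -1):
        if nprocs % a == 0:
            return a, nprocs // a


# Candidate process counts near the 26 used in this thread (illustrative).
for n in (24, 25, 26, 28, 30):
    a, b = closest_factor_pair(n)
    print(f"{n:3d} processes -> {a} x {b} (aspect ratio {b / a:.2f})")
```

With the same hardware, 24 or 25 processes would give a 4x6 or 5x5 layout instead of 2x13. The WRF namelist also documents nproc_x and nproc_y in &domains for setting the layout explicitly; check the user guide for your version before relying on them.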
 
Thanks a lot for the suggestions.

I can confirm it's not related to disk space — I’ve checked and there's sufficient space available on the output partitions.
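
For anyone repeating the disk space check before a run, a minimal sketch using Python's standard library follows. The run directory and the 50 GiB threshold are placeholder assumptions, not values from this thread, and the check only reports free space as the operating system sees it, so per-user quotas may not be reflected.

```python
import shutil


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str, required_gib: float) -> bool:
    """True if at least `required_gib` GiB are free under `path`."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    run_dir = "."        # hypothetical: replace with the WRF run/output directory
    needed_gib = 50.0    # hypothetical threshold; size it to your expected output
    print(f"{run_dir}: {free_gib(run_dir):.1f} GiB free")
    if not has_room(run_dir, needed_gib):
        print("Warning: less free space than the assumed requirement")
```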

At the moment, I don’t have much more computing capacity available to significantly increase the number of cores, but I’ll try to adjust the processor layout based on your guidance and see if that helps.

I’ll follow up with feedback after testing.
 