
WRF-Chem and InfiniBand (or lack thereof)

gerrypkan

Howdy Folks:

I have a quick question. A few weeks ago, our cluster administrators replaced their InfiniBand switch with plain Ethernet. Since then I have submitted several WRF-Chem jobs to the cluster. A few ran without issues, but most failed with one of the following errors:

1) "Local protection" - This usually happens at the very beginning of the run.
Code:
ib_mlx5_log.c:167  Local protection on mlx5_0:1/RoCE (synd 0x4 vend 0x52 hw_synd 0/146)
ib_mlx5_log.c:167  DCI QP 0x3f45 wqe[58]: SEND s-e [rqpn 0x5590 rmac a0:88:c2:c6:d3:aa sgix 3 dgid ::ffff:10.12.1.168 tc 106] [va 0x197ef280 len 8256 lkey 0x5c2c0a]

2) "Transport retry count exceeded" - This usually happens in the middle of the run.
Code:
ib_mlx5_log.c:143  Transport retry count exceeded on mlx5_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
ib_mlx5_log.c:143  DCI QP 0x1525 wqe[0]: SEND s-e [rqpn 0x223e4 rmac 50:6b:4b:cb:be:ce sgix 0 dgid fe80::526b:4bff:fecb:bece tc 0] [inl len 18]
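
To see which of the two failures each job hit, and on which device, the job logs could be tallied with something like the sketch below. The regular expression is inferred only from the two sample messages above, so it is an assumption about the log format, and `classify_errors` is a hypothetical helper name, not part of WRF or any library.

```python
import re
from collections import Counter

# Pattern inferred from the two sample messages in this post (an assumption);
# other library versions may format these lines differently.
ERROR_RE = re.compile(
    r"ib_mlx5_log\.c:\d+\s+(?P<error>.+?) on (?P<device>\S+?)/(?P<link>\S+) "
    r"\(synd (?P<synd>0x[0-9a-fA-F]+)"
)


def classify_errors(lines):
    """Count (error, device, link layer, syndrome) tuples found in log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:  # detail lines such as "DCI QP ..." do not match and are skipped
            key = (match["error"], match["device"], match["link"], match["synd"])
            counts[key] += 1
    return counts
```

Passing an open log file (for example `classify_errors(open(path))`, where `path` is whichever file the job's stderr is written to) returns a `Counter` keyed by error type, which makes it easier to report to the administrators how often each failure occurs.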

Based on the backtrace that follows each error, the offending calls come from the RSL_Lite module, and there appear to be explicit dependencies on the mlx5 RoCE transport (RDMA over Converged Ethernet, which carries the InfiniBand transport protocol over Ethernet). That raises two questions:

1) Is WRF (or WRF-Chem) developed for InfiniBand communication specifically?
2) If not, how should I modify configure.wrf so that WRF builds and runs correctly on a plain-Ethernet HPC setup?

Thanks in advance, Gerry.

P.S. - We are currently in very close correspondence with our cluster administration, and we are still working on a resolution as WRF(-Chem) is the only application that is showing this problem.
 