Howdy Folks:
I have a quick question. A few weeks ago our cluster administrators replaced the InfiniBand switch with plain Ethernet. Since then I have tried to run several WRF-Chem jobs on the cluster. A small number ran without issues, but most failed with one of the following errors:
1) "Local protection" - This usually happens at the very beginning of the run.
Code:
ib_mlx5_log.c:167 Local protection on mlx5_0:1/RoCE (synd 0x4 vend 0x52 hw_synd 0/146)
ib_mlx5_log.c:167 DCI QP 0x3f45 wqe[58]: SEND s-e [rqpn 0x5590 rmac a0:88:c2:c6:d3:aa sgix 3 dgid ::ffff:10.12.1.168 tc 106] [va 0x197ef280 len 8256 lkey 0x5c2c0a]
2) "Transport retry count exceeded" - This usually happens in the middle of the run.
Code:
ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
ib_mlx5_log.c:143 DCI QP 0x1525 wqe[0]: SEND s-e [rqpn 0x223e4 rmac 50:6b:4b:cb:be:ce sgix 0 dgid fe80::526b:4bff:fecb:bece tc 0] [inl len 18]
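For anyone comparing how often each failure shows up across job logs, here is a minimal sketch that tallies the error summary lines. It assumes every error follows the same `ib_mlx5_log.c:<line> <error> on <device>/<link> (synd ...)` layout as the two excerpts above; the function name `tally_ucx_errors` and the regular expression are illustrative, not part of WRF or of the MPI stack.

```python
import re
from collections import Counter

# Matches only the summary line of each error, for example:
#   ib_mlx5_log.c:167 Local protection on mlx5_0:1/RoCE (synd 0x4 ...)
# The follow-up "DCI QP ..." detail lines are deliberately not matched.
ERROR_LINE = re.compile(
    r"ib_mlx5_log\.c:\d+\s+(?P<error>.+?) on (?P<device>\S+?)/(?P<link>\S+) \(synd"
)


def tally_ucx_errors(lines):
    """Count (error, device, link layer) triples found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[(match["error"], match["device"], match["link"])] += 1
    return counts


# The four log lines quoted in this post, used as sample input.
sample = [
    "ib_mlx5_log.c:167 Local protection on mlx5_0:1/RoCE (synd 0x4 vend 0x52 hw_synd 0/146)",
    "ib_mlx5_log.c:167 DCI QP 0x3f45 wqe[58]: SEND s-e [rqpn 0x5590 rmac a0:88:c2:c6:d3:aa "
    "sgix 3 dgid ::ffff:10.12.1.168 tc 106] [va 0x197ef280 len 8256 lkey 0x5c2c0a]",
    "ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)",
    "ib_mlx5_log.c:143 DCI QP 0x1525 wqe[0]: SEND s-e [rqpn 0x223e4 rmac 50:6b:4b:cb:be:ce "
    "sgix 0 dgid fe80::526b:4bff:fecb:bece tc 0] [inl len 18]",
]

for (error, device, link), n in tally_ucx_errors(sample).items():
    print(f"{n} x {error} on {device} ({link})")
```

To scan a real job log, pass an open file object instead of `sample`. The `/RoCE` suffix captured as the link layer is what shows that the failing traffic goes over RDMA on the Ethernet fabric.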
Based on the backtrace that follows each error, the failing calls originate in the RSL_Lite module, and the log lines suggest an explicit dependency on the mlx5 RoCE transport (RDMA over Converged Ethernet, which carries the InfiniBand transport protocol over Ethernet). That raises two questions:
1) Is WRF (or WRF-Chem) developed for InfiniBand communication specifically?
2) If not, how should I modify configure.wrf so that WRF is built for a plain Ethernet HPC setup?
Thanks in advance, Gerry.
P.S. - We are in close contact with our cluster administrators and are still working on a resolution, since WRF(-Chem) is the only application showing this problem.