Dear Kwerner,
I ran a series of tests on two different HPC systems and got the same pattern of results on both.
Runs with different numbers of cores but the same NTASK_X produce identical results (wrfinput_d* and wrfout*), whereas runs with different NTASK_X produce different results. For example, runs with 1, 2, 3, 5 and 7 cores (NTASK_X = 1 and NTASK_Y = 1, 2, 3, 5, 7) give identical results. Runs with 12 and 21 cores (NTASK_X = 3 and NTASK_Y = 4, 7) also agree with each other, but differ from the results obtained with NTASK_X = 1. There seems to be a problem in how the model splits the domain along the X axis. Does this information help in solving the problem?
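
The grouping of the runs follows from how each core count factors into NTASK_X x NTASK_Y. Below is a minimal Python sketch, assuming the split picks the factor pair closest to square with NTASK_X <= NTASK_Y. That rule is an assumption chosen because it matches the values listed above; it is not copied from the WRF source, and the function name is hypothetical.

```python
import math


def closest_to_square_split(total_tasks):
    """Return (ntask_x, ntask_y) with ntask_x * ntask_y == total_tasks.

    Assumed rule (not taken from the WRF code): choose the factor pair
    closest to square, with ntask_x <= ntask_y.
    """
    if total_tasks < 1:
        raise ValueError("total_tasks must be a positive integer")
    # Search downward from floor(sqrt(n)); the first divisor found is the
    # largest factor not exceeding the square root.
    for ntask_x in range(math.isqrt(total_tasks), 0, -1):
        if total_tasks % ntask_x == 0:
            return ntask_x, total_tasks // ntask_x


# Core counts used in the tests described above.
for cores in (1, 2, 3, 5, 7, 12, 21):
    ntask_x, ntask_y = closest_to_square_split(cores)
    print(f"{cores:2d} cores: NTASK_X = {ntask_x}, NTASK_Y = {ntask_y}")
```

Under this assumption the prime core counts all give NTASK_X = 1, while 12 and 21 both give NTASK_X = 3, which is the same grouping seen in the results.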