Hello,
We are running parallelization tests and using the script referenced in "How many processors should I use to run WRF?" to determine the correct number of nodes. We are using four domains and running ndown to drive the nested domains. We have been able to run without issues up to the third domain.
To run WRF for the 4th domain, the Python script mentioned above returns a value of 58 nodes (each of our nodes has 48 cores).
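For reference, here is a minimal sketch of how we understand the calculation behind that FAQ page. It assumes the commonly cited rule of thumb of roughly 25 to 100 grid points per MPI task in each horizontal direction; the grid dimensions in the example are placeholders, not our actual d04 values.

import math

def processor_range(e_we, e_sn):
    """Return (min, max) MPI task counts for an e_we x e_sn domain,
    assuming ~100 grid points per task (lower bound) and ~25 grid
    points per task (upper bound) in each horizontal direction."""
    smallest = math.ceil(e_we / 100) * math.ceil(e_sn / 100)
    largest = math.floor(e_we / 25) * math.floor(e_sn / 25)
    return smallest, largest

def nodes_needed(tasks, cores_per_node=48):
    """Convert an MPI task count to whole nodes (48 cores per node here)."""
    return math.ceil(tasks / cores_per_node)

# Placeholder domain size; replace with the real e_we/e_sn of d04.
lo, hi = processor_range(1200, 1450)
print(f"tasks: {lo} to {hi}; nodes at the upper bound: {nodes_needed(hi)}")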
The problem we are seeing is that the process starts, writes a few lines (as shown in the attached rsl file), and then hangs: it keeps running but produces no further output, results, or log messages. It is eventually killed when the scheduled wall-clock time runs out. We let it run for 24 hours just to be certain that nothing was happening.
We have done several tests with different configurations, and the run finally completed when we used 40 nodes and a time_step of 2 (lower than we had before). Our questions are: why is there such a large difference between the node count suggested by the script and the one that actually worked for us? And why did we not get an error about it (previously we have gotten errors when using more nodes than needed)? Could this be related to some other issue, such as node memory?
Thanks for your help!