To whom it may concern,
I am running 10-minute forecasts for 40 different ensemble members on the Stampede2 system. Unfortunately, I have encountered a problem: not all ensemble members completed, even though they share nearly identical configurations. The errors appear to occur in the final I/O portion of the code: in particular, the failing members were not able to write out the netCDF forecast files for the second forecast step. I have come across this problem on other supercomputing systems in the past, and the fix has usually involved reducing the number of cores per node, which gives each MPI process on a given node access to more memory. To test this hypothesis, I created a working (debug) directory for one of the failing members (member 4) and varied the number of cores per node, trying 25, 34, and 50, on the development partition of the KNL compute nodes. Unfortunately, even the 25 cores/node test failed to produce the required netCDF files.
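For reference, the cores-per-node changes were made through the Slurm directives in each test's job submission script. A minimal sketch of what I varied is below; the job name, node count, and wall-clock limit are only illustrative, and the exact values used in each test are in the job.slurm files described in the next paragraph:

    #!/bin/bash
    #SBATCH -J wrf_mem04_test        # job name (illustrative)
    #SBATCH -p development           # KNL development partition
    #SBATCH -N 2                     # number of nodes (illustrative)
    #SBATCH --ntasks-per-node=25     # varied across tests: 25, 34, 50
    #SBATCH -t 02:00:00              # wall-clock limit (illustrative)

    # Fewer MPI tasks per node leaves more memory per rank,
    # which is the hypothesis being tested here.
    ibrun ./wrf.exe > wrf.output 2>&1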
The directories in which those tests were conducted are under the path /scratch/06421/hch/systematic_experiments/test2knl_AERI_11july2015. In this directory you will find subdirectories named test_*, which contain the experiments described above. Note that the mem01 directory contains results from the successful member, while mem04 is the member in which the WRF model fails to run to completion. Within each of those subdirectories you will also find the job submission script, named job.slurm. The standard output/error files of interest are rsl.out.0000, rsl.error.0000, and wrf.output. I am currently running a final ‘extreme’ experiment, in which the WRF model for member 4 is run in a single core/single node configuration. The subdirectory for this experiment is test4_mem04.
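For completeness, the single core/single node run in test4_mem04 is submitted with directives roughly like the following (again only a sketch; the actual script is test4_mem04/job.slurm):

    #SBATCH -p development    # KNL development partition
    #SBATCH -N 1              # single node
    #SBATCH -n 1              # single MPI task

    ibrun ./wrf.exe > wrf.output 2>&1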
I would greatly appreciate it if you could look into the problem!
Regards,
Hristo