Hello All,
I am attempting to run WRF (v3.8) with additional vertical levels, but I get errors partway through the run. First, here is a rundown of what I am doing.
I am running a real-data simulation with WRF, using one-way nesting. The parent 3-km grid (600 x 600 grid points, or 1800 x 1800 km) contains a nested 1-km grid (1000 x 1000 grid points, or 1000 x 1000 km), and nested within that, a 0.333-km grid (1800 x 1800 grid points, or 600 x 600 km). I output at hourly intervals; the parent grid runs from 12Z to 12Z (24 hours), and the inner nests run from 21Z to 09Z (I am simulating the 30 Jul 2018 central Plains nocturnal MCS event). I use a time step of 2 seconds. I am running WRF on the UCAR/CISL Cheyenne supercomputer, using 150 nodes with 36 CPUs per node.
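Roughly, the relevant part of my namelist.input &domains section looks like the following (the i_parent_start/j_parent_start values are placeholders rather than my exact settings, time_step here is the parent-domain step, and e_vert = 100 reflects the 100-level change described below; the dimensions and ratios match the setup above):

 &domains
  time_step              = 2,
  max_dom                = 3,
  e_we                   = 600, 1000, 1800,
  e_sn                   = 600, 1000, 1800,
  e_vert                 = 100, 100, 100,
  dx                     = 3000, 1000, 333.333,
  dy                     = 3000, 1000, 333.333,
  grid_id                = 1, 2, 3,
  parent_id              = 1, 1, 2,
  i_parent_start         = 1, 150, 150,
  j_parent_start         = 1, 150, 150,
  parent_grid_ratio      = 1, 3, 3,
  parent_time_step_ratio = 1, 3, 3,
  feedback               = 0,
 /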
When running all grids with the default 50 vertical levels, everything works fine. When running with 100 vertical levels, the 3-km grid seems to run fine, but once the run reaches 23Z (2 hours after the finer nested grids start), the model crashes. I get the following error in the wrf.out log file:
MPT ERROR: MPI_COMM_WORLD rank 3586 has terminated without calling MPI_Finalize()
aborting job
The rsl.out files don't show much, but in the rsl.error.3586 file I found the following error:
MPT ERROR: Rank 3586(g:3586) received signal SIGSEGV(11).
My understanding is that this error can occur for many reasons. In the directory containing all of the rsl files, I ran 'grep cfl rsl*' to fish out CFL errors, but could not find any (though I have heard that WRF does not always explicitly report CFL errors); the checks I ran are sketched after the file listing below. I have hourly restart files, so I tried restarting wrf with a time step of 1 second and with debug_level set to 1000 to see if that would provide more detail; I still get the same errors, with no additional information. I also tried running with 250 nodes to rule out a memory allocation issue, but again, same errors. Even restarting for a single hour (I restart at 23Z), testing with my current model configuration is very expensive in core-hours, and I am trying to be careful with my allotment while debugging. On Cheyenne, my WRF directories are as follows:
WPS: /glade/scratch/brianjs/wrf_run/WRF_3.8_nestBuildMPThompsonTendencies/WPS
WRF: /glade/scratch/brianjs/wrf_run/WRF_3.8_nestBuildMPThompsonTendencies/WRFV3/run
In the WRF directory, the wrf.out and rsl files can be found in the 'rsl_150' folder.
Also in the WRF directory is my script for running wrf.exe (runwrf.tcsh).
In WPS, I have a plot showing my domains (domains_07_30_2018.png).
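For completeness, these are roughly the commands I used to look for CFL errors and to pull the end of the log from the failing rank (run from the WRF run directory; the logs are in rsl_150):

 cd rsl_150
 # Look for CFL violations reported in any rsl file
 grep -i cfl rsl.*
 # List which ranks reported the segmentation fault
 grep -l SIGSEGV rsl.error.*
 # Inspect the end of the log from the rank that crashed
 tail -n 40 rsl.error.3586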
Any help that can be provided would be greatly appreciated!