Hi all
I'm trying to get to the bottom of some weird issues when running WRF.
- I was initially trying to run it on a large-memory 64-core VM. WRF fails to complete in both serial and dmpar modes
- I ran it on a cluster I have access to. With the same data / namelist files, WRF completed when run in serial mode (both on a login node and when submitted as a job), but failed when run as a submitted MPI job
- real.exe always completes successfully
- in all cases, I ran as a 4 core job (this is the max number of cores the domain size would be able to run on)
- given that the model does complete in one scenario (serial on the cluster), I assume that the namelist file and input data are OK, so I won't upload them here yet.
- on the VM, I have tried compiling with both OpenMPI and MPICH, and I get the same errors with both
For the serial run on the VM, I compiled in debug mode. The end of the output is:
Code:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f4ed9683960 in ???
#1 0x7f4ed9682ac5 in ???
#2 0x7f4ed937051f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x56518e261d35 in __module_advect_em_MOD_advect_scalar_pd
at /data/wrf/Build_WRF/wrf_serial/dyn_em/module_advect_em.f90:7657
#4 0x56518c114379 in __module_em_MOD_rk_scalar_tend
at /data/wrf/Build_WRF/wrf_serial/dyn_em/module_em.f90:1265
#5 0x56518b99a16a in solve_em_
at /data/wrf/Build_WRF/wrf_serial/dyn_em/solve_em.f90:3041
#6 0x56518b69e79f in solve_interface_
at /data/wrf/Build_WRF/wrf_serial/share/solve_interface.f90:141
#7 0x56518a2fc230 in __module_integrate_MOD_integrate
at /data/wrf/Build_WRF/wrf_serial/frame/module_integrate.f90:325
#8 0x56518a2e9257 in __module_wrf_top_MOD_wrf_run
at ../main/module_wrf_top.f90:326
#9 0x56518a2e7e70 in wrf
at /data/wrf/Build_WRF/wrf_serial/main/wrf.f90:29
#10 0x56518a2e7ed4 in main
at /data/wrf/Build_WRF/wrf_serial/main/wrf.f90:6
Floating point exception (core dumped)
Are there any other sensible things for me to try at the moment? Is it worth me uploading all the output files from one of the MPI runs?