
WRF segmentation fault only on some hardware

cw34

Hi all

I'm trying to get to the bottom of some weird issues when running WRF.


- I was initially trying to run it on a large-memory 64-core VM. There, WRF fails to complete in both serial and dmpar modes.
- I then ran it on a cluster I have access to. With the same data and namelist files, WRF completed when run in serial mode (both on a login node and when submitted as a job), but failed when run as a submitted MPI job.
- real.exe always completes successfully.
- In all cases, I ran as a 4-core job (this is the maximum number of cores the domain size can run on).
- Given that the model does complete in one scenario (serial on the cluster), I assume that the namelist file and input data are OK, so I won't upload them here yet.
- On the VM, I have tried compiling with both OpenMPI and MPICH, and I get the same errors with both.

I compiled in debug mode for the serial run on the VM. The end of the output is:

Code:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f4ed9683960 in ???
#1  0x7f4ed9682ac5 in ???
#2  0x7f4ed937051f in ???
    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x56518e261d35 in __module_advect_em_MOD_advect_scalar_pd
    at /data/wrf/Build_WRF/wrf_serial/dyn_em/module_advect_em.f90:7657
#4  0x56518c114379 in __module_em_MOD_rk_scalar_tend
    at /data/wrf/Build_WRF/wrf_serial/dyn_em/module_em.f90:1265
#5  0x56518b99a16a in solve_em_
    at /data/wrf/Build_WRF/wrf_serial/dyn_em/solve_em.f90:3041
#6  0x56518b69e79f in solve_interface_
    at /data/wrf/Build_WRF/wrf_serial/share/solve_interface.f90:141
#7  0x56518a2fc230 in __module_integrate_MOD_integrate
    at /data/wrf/Build_WRF/wrf_serial/frame/module_integrate.f90:325
#8  0x56518a2e9257 in __module_wrf_top_MOD_wrf_run
    at ../main/module_wrf_top.f90:326
#9  0x56518a2e7e70 in wrf
    at /data/wrf/Build_WRF/wrf_serial/main/wrf.f90:29
#10  0x56518a2e7ed4 in main
    at /data/wrf/Build_WRF/wrf_serial/main/wrf.f90:6
Floating point exception (core dumped)
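
To make the trace easier to read, here is a minimal Python sketch that pulls the frames out of a backtrace in the format shown above and reports the innermost frame that has both a resolved symbol and a Fortran source file. The helper names `parse_backtrace` and `first_model_frame` are made up for illustration and are not part of WRF or any tool. It assumes each frame line looks like `#N  0x... in symbol`, optionally followed by an `at file:line` line, and the sample is a three-frame excerpt of the trace.

```python
import re

# Frame header, e.g. "#3  0x56518e261d35 in __module_advect_em_MOD_advect_scalar_pd"
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+)")
# Location line that may follow a frame, e.g. "    at /path/module_advect_em.f90:7657"
LOC_RE = re.compile(r"^\s+at (.+):(\d+)$")


def parse_backtrace(text):
    """Return a list of (frame_number, symbol, source_file, line_number) tuples.

    source_file and line_number are None when no "at file:line" line follows a frame.
    """
    frames = []
    for line in text.splitlines():
        frame = FRAME_RE.match(line)
        if frame:
            frames.append([int(frame.group(1)), frame.group(2), None, None])
            continue
        loc = LOC_RE.match(line)
        if loc and frames:
            # Attach the location to the most recently seen frame.
            frames[-1][2] = loc.group(1)
            frames[-1][3] = int(loc.group(2))
    return [tuple(f) for f in frames]


def first_model_frame(frames):
    """Return the innermost frame with a resolved symbol and a .f90 source file."""
    for frame in frames:
        _, symbol, source, _ = frame
        if symbol != "???" and source is not None and source.endswith(".f90"):
            return frame
    return None


# Excerpt of the backtrace above (frames #2 to #4 only).
SAMPLE = """\
#2  0x7f4ed937051f in ???
    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x56518e261d35 in __module_advect_em_MOD_advect_scalar_pd
    at /data/wrf/Build_WRF/wrf_serial/dyn_em/module_advect_em.f90:7657
#4  0x56518c114379 in __module_em_MOD_rk_scalar_tend
    at /data/wrf/Build_WRF/wrf_serial/dyn_em/module_em.f90:1265
"""

if __name__ == "__main__":
    print(first_model_frame(parse_backtrace(SAMPLE)))
```

On the excerpt, the selected frame is #3, `advect_scalar_pd` at `module_advect_em.f90:7657`, which is the first WRF routine below the signal handler frames.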


Are there any other sensible things for me to try at the moment? Is it worth me uploading all the output files from one of the MPI runs?
 