
Running Idealized WRF on CORI

jltorchinsky

New member
Hello,

I am attempting to run WRF on Cori at NERSC and am running into some issues. Here is all of the relevant information I can think of:

WRF Version: 4.4 (c11bb76939647c4073e9a105ae00faaef55ca7fd)
Modules:
1) modules/3.2.11.4
2) darshan/3.3.1
3) craype-network-aries
4) gcc/11.2.0
5) craype/2.7.10
6) cray-mpich/7.7.19
7) craype-haswell
8) craype-hugepages2M
9) cray-libsci/20.09.1
10) udreg/2.3.2-7.0.3.1_3.16__g5f0d670.ari
11) ugni/6.0.14.0-7.0.3.1_6.4__g8101a58.ari
12) pmi/5.0.17
13) dmapp/7.1.1-7.0.3.1_3.21__g93a7e9f.ari
14) gni-headers/5.0.12.0-7.0.3.1_3.9__gd0d73fe.ari
15) xpmem/2.2.27-7.0.3.1_3.10__gada73ac.ari
16) job/2.2.4-7.0.3.1_3.17__g36b56f4.ari
17) dvs/2.12_2.2.224-7.0.3.1_3.14__gc77db2af
18) alps/6.6.67-7.0.3.1_3.21__gb91cd181.ari
19) rca/2.2.20-7.0.3.1_3.18__g8e3fb5b.ari
20) atp/3.14.9
21) perftools-base/21.12.0
22) PrgEnv-gnu/6.0.10
23) openmpi/4.1.2
24) cray-netcdf-hdf5parallel/4.8.1.1

To configure and build WRF, I followed the instructions located here. In particular, in the topmost directory, I ran

Code:
./configure
./compile em_b_wave &> log.compile

Within ./configure, I selected options 34 (dmpar for GNU) and 1 (basic for nesting). I've attached the compilation log in case it offers any insight into what may be going wrong. At the end, it reports that the executables ideal.exe and wrf.exe were successfully built, and sure enough they are in the main subdirectory.
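
For reference, the same configure and compile steps can be scripted; this is a minimal sketch, assuming ./configure accepts its two menu answers on standard input (the option numbers are from my configure menu and may differ between WRF versions):

Code:
# Answer the two configure prompts non-interactively: 34 = dmpar for GNU, 1 = basic nesting
printf '34\n1\n' | ./configure
# Build the idealized baroclinic wave case and capture the full build log
./compile em_b_wave &> log.compile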

I've attempted to run ideal.exe both from the run subdirectory and via an sbatch script in scratch space, and both give the same error:
Code:
> mpirun -np 4 ideal.exe 
 starting wrf task            3  of            4
 starting wrf task            2  of            4
 starting wrf task            0  of            4
 starting wrf task            1  of            4
[cori03:13704] *** An error occurred in MPI_Comm_create_keyval
[cori03:13704] *** reported by process [1165492225,0]
[cori03:13704] *** on communicator MPI_COMM_WORLD
[cori03:13704] *** MPI_ERR_ARG: invalid argument of some other kind
[cori03:13704] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cori03:13704] ***    and potentially your MPI job)

I suspect this is an incompatibility between the version of MPI on Cori and the version of MPI that WRF was developed against. Can anybody help diagnose this further and advise on how to resolve it?
 

Attachments

  • log.compile
    816.8 KB · Views: 2
Hi,
I have moved this post to the Idealized cases section of the forum, as this issue is not related to compiling the model.

The error you are seeing on the screen simply indicates that something went wrong at run time. To see the specific error, you need to look in the rsl.error.0000 file. However, if there is no specific or helpful error mentioned there, the issue is likely that you are running ideal.exe on multiple processors; ideal.exe must be run on a single processor. Only when you run wrf.exe can you use multiple processors for the 3-D cases. Try ideal.exe with just 1 processor and see if that gets you past this issue.
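
For example, something along these lines (only a sketch; the launcher, flags, and processor counts depend on your system):

Code:
# ideal.exe must be run on a single processor
mpirun -np 1 ./ideal.exe
# wrf.exe can then be run on multiple processors for the 3-D cases
mpirun -np 4 ./wrf.exe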
 
Thank you, kwerner, for moving this post to the proper forum! I am new to this system, so the assistance is much appreciated.

I was able to run WRF in serial when configuring with option 32 (serial, GNU (gfortran/gcc)), which is a good sign!
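
For the serial build, the run was roughly as follows (a sketch; the exact run directory and case setup may differ):

Code:
# In the serial build there is no MPI launcher; run the two steps directly
cd test/em_b_wave
./ideal.exe
./wrf.exe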

I have run the following code in the WRF directory to get a fresh configure and compile:
Code:
./clean -a
./configure
./compile em_b_wave &> log.compile

selecting options 34 and 1 as before. Everything else is also as before. log.compile reports that the executables were successfully built.

I've attached the body of the jobscript in jobscript-knl.txt, ensuring that I request enough resources for the srun commands and that ideal.exe is run with a single process.
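
For context, the jobscript is roughly of the following form (a simplified sketch rather than the exact attached file; the node count, constraint, queue, and time limit shown here are placeholders):

Code:
#!/bin/bash
#SBATCH --nodes=4              # placeholder resource request
#SBATCH --constraint=knl       # KNL nodes, as in jobscript-knl.txt
#SBATCH --qos=debug            # placeholder queue
#SBATCH --time=00:30:00        # placeholder time limit

# ideal.exe is run with a single process
srun -n 1 ./ideal.exe

# wrf.exe is then run with multiple processes
srun -n 4 ./wrf.exe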

I've also attached the rsl.error and rsl.out files. The error file produced by Slurm is nearly identical to before:

Code:
 starting wrf task            0  of            1
[nid02516:229630] *** An error occurred in MPI_Comm_create_keyval
[nid02516:229630] *** reported by process [3891855361,0]
[nid02516:229630] *** on communicator MPI_COMM_WORLD
[nid02516:229630] *** MPI_ERR_ARG: invalid argument of some other kind
[nid02516:229630] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nid02516:229630] ***    and potentially your MPI job)
srun: Job 60189302 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 60189302
 starting wrf task            2  of            4
 starting wrf task            1  of            4
 starting wrf task            3  of            4
 starting wrf task            0  of            4
srun: error: nid02516: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=60189302.1
srun: error: nid02519: task 3: Terminated
srun: error: nid02518: task 2: Terminated
srun: error: nid02517: task 1: Terminated
srun: Force Terminated StepId=60189302.1

I can't find the source again, but I read somewhere that WRF expects MPI-1 or MPI-2, whereas Cori has Open MPI 4.1.2. Could that be causing the issue?
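
In case it is useful, here is what I can check on my end (a rough sketch of diagnostics; the output will depend on how the executables were linked):

Code:
# Check which MPI library the executables are actually linked against
ldd main/ideal.exe | grep -i mpi
ldd main/wrf.exe | grep -i mpi

# Report the MPI launcher version and the MPI-related modules in the environment
mpirun --version
module list 2>&1 | grep -i mpi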
 

Attachments

  • log.compile
    816.8 KB · Views: 0
  • jobscript-knl.txt
    373 bytes · Views: 3
  • rsl.out.0001.txt
    912 bytes · Views: 1
  • rsl.out.0002.txt
    912 bytes · Views: 0
  • rsl.out.0003.txt
    912 bytes · Views: 0
  • rsl.out.0000.txt
    912 bytes · Views: 1
  • rsl.error.0003.txt
    949 bytes · Views: 0
  • rsl.error.0002.txt
    949 bytes · Views: 0
  • rsl.error.0001.txt
    949 bytes · Views: 1
  • rsl.error.0000.txt
    1.3 KB · Views: 2
Hi,
The failed MPI run is definitely an issue with the MPI library. Please contact your computing center's support staff for more information. Unfortunately, without access to Cori at NERSC, we cannot help fix the problem.
 
Hello,
I am having the same problem. Could I ask whether it has been solved, and if so, how? Thank you very much!
 