WPS Compile Issue


cwrenn
I have a problem that seems to be related to the compiler. I just finished installing WRF4.3 but have not had much luck compiling WPS.
The very first error message that appears in the output file “compile_wps.log” indicates that the problem has something to do with the compiler (e.g., "gfortran: error: big_endian: No such file or directory"). Why, for example, am I receiving GNU compiler error messages when I loaded an Intel module?

The first thing that I did was load the module data/netCDF-WRF/C-4.6.2_CXX-4.3.0_F-4.4.2_p-1.9.0-intel-2018.5.274. The WRF4.3 compilation completed successfully, but the WPS compilation failed.

When I ran which mpif90, it returned /opt/apps/software/mpi/impi/2018.4.274-iccifort-2018.5.274-GCC-6.3.0-2.26/bin64/mpif90, which surprised me because it suggested that a different (GNU) compiler was being used; that would also help explain the GNU "gfortran: error" messages that appear in compile_wps.log.

When I configured WPS, I chose the Linux x86_64 Intel compiler (dmpar) option, i.e., number 19, and the configure.wps file clearly states that the Intel compiler was selected.

Since I just compiled WRF and WPS successfully on another supercomputer, I don't think the problem is WRF4.3 itself, though I'm not sure.

I can send my configure.wps and compile_wps.log files if there is any interest in getting to the bottom of the problem. I'm not sure why I'm getting GNU ("gfortran") compiler error messages when I loaded an Intel module and chose the Intel compiler during configuration. Any help would be appreciated; I'm not even sure what to ask our IT staff to check.

The steps that I used to compile WPS were as follows:

1. $ cd /WPS-4.3 /where the configure and compile commands are for WPS/
2. $ module avail /data /to get a clean copy of the module name/
3. $ module load data/netCDF-WRF/C-4.6.2_CXX-4.3.0_F-4.4.2_p-1.9.0-intel-2018.5.274
4. $ set SHELL = /bin/csh
5. $ echo $SHELL
6. $ setenv WRF_DIR /home/myusername/WRFV4.3/WRF-4.3 /so WPS can find the wrf files/
7. $ srun -p sandbox -c 2 --mem=48G -t 120 --pty /bin/bash
8. $ ./clean -a
9. $ ./configure /number 19 (Linux x86_64 Intel compiler (dmpar)) is selected/
10. $ ./compile >& compile_wps.log
11. $ more compile_wps.log /where the first error messages that appear are “gfortran: error: big_endian: No such file or directory” and “gfortran: error: unrecognized command line option ‘-convert’”/
12. $ which mpif90 /which returned /opt/apps/software/mpi/impi/2018.4.274-iccifort-2018.5.274-GCC-6.3.0-2.26/bin64/mpif90/
 
Hi,
Yes, I would like to take a look at these four files:
configure.wrf
wrf compile log
configure.wps
wps compile log

Even if you're using an Intel compiler, many of the routines in the WRF code are written in Fortran, so you still must have a Fortran compiler, and the error can still be related to that. I can hopefully determine more when I see your files. Thanks!
 
Hey, kwerner! Thanks for getting back to me. The way that particular problem was "solved" was by comparing two separate configure.wps files from two separate machines, one that compiled successfully and one that did not. A simple diff of the two configure files showed that one had the switches -f90=$(SFC) and -cc=$(SCC) on the MPI wrapper lines, and the other did not:

On the machine that compiled successfully:
DM_FC = mpif90 -f90=$(SFC)
DM_CC = mpicc -cc=$(SCC)

On the machine that failed:
DM_FC = mpif90
DM_CC = mpicc

Once I edited the configure.wps file, the expected executables were generated (geogrid.exe, ungrib.exe, and metgrid.exe), though I will admit that the compile_wps.log file did not end with the expected SUCCESS statement (nor did it generate the util/plotgrids_new executables).

Since then (this morning, actually), I benefited greatly from one of your responses to a 31 July 2019 post, WRF V4.0 MPI run does not work with too many processors. In that response, you gave rule-of-thumb formulae for the number of processors as a function of domain grid dimensions (sketched in Python below):

For your smallest-sized domain:
((e_we)/25) * ((e_sn)/25) = most amount of processors you should use

For your largest-sized domain:
((e_we)/100) * ((e_sn)/100) = least amount of processors you should use
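
Spelled out as a quick Python sketch, with my domain dimensions plugged in (d03 is my smallest domain and d02 my largest; the values are from my namelist):

Code:
# The two rules of thumb above, applied to my three domains.
e_we = [180, 211, 112]
e_sn = [180, 211, 106]

# Smallest domain: the most processors I should use.
most = (min(e_we) / 25) * (min(e_sn) / 25)

# Largest domain: the least processors I should use.
least = (max(e_we) / 100) * (max(e_sn) / 100)

print(most, least)  # ~18.9952 and ~4.4521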


For my configuration, the numbers generated by the formulae were 18.9952 and 4.4521, which I found ironic, as the long-running problem seemed to have arisen when I chose --tasks-per-node=19 (suggested by our IT people) and --nodes=7. (I have since changed the slurm script to read --tasks-per-node=20, though my job has not yet exited the queue.)

First of all, I'd like to say thanks for all of your cogent answers to so many inquiries that I have read in this Forum. Secondly, could you provide a little more detail on how to think about processor distribution, MPI, and WRF4.0 more intuitively? I am intrigued by the idea of the constrained processor decomposition required by WRF4.0 and how MPI tries to do its job. Could you explain with more hardware- and MPI-related specificity? For instance, I am also trying to run jobs on Cheyenne, but I have no insight into the (MPT?) syntax; the relevant line is #PBS -l select=2:ncpus=36:mpiprocs=36. Since I have no insight into the syntax or the hardware-related specifics of the decomposition, I have been unsuccessful in increasing the speed of the computation: select=3 and select=4 do not work with ncpus=36:mpiprocs=36. Any additional information on how to think intelligently about these constrained combinatorics puzzles would be greatly appreciated, as I need a much better grasp of this for the diverse types of wrf jobs that I plan to run in the future.
Mahalo nui!
 
Hi,
I'm so glad that you were able to get past the compiling problem, and thank you so much for updating the post with the method you used to resolve it. This may help someone else in the future!

I first would like to clarify my previous post regarding Intel and Fortran compilers. I spoke to one of our software engineers, who explained it in a much more eloquent manner:
Intel provides a suite of C, C++ and Fortran compilers that are usually referred to collectively as the Intel compilers; similarly, there is a GNU compiler suite that includes C, C++, and Fortran compilers. In the case of the Intel compilers, the C, C++, and Fortran compilers are named, respectively, icc, icpc, and ifort, while in the case of the GNU compilers, the names of the specific compilers are gcc, g++, and gfortran.

So, if the user has loaded an Intel compiler module on their system, they should be using the ifort Fortran compiler and not the gfortran Fortran compiler. In other words, "gfortran" is the name of the GNU Fortran compiler, rather than the generic name of any Fortran compiler.
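
If it helps, here is a quick way to see which of these front ends are actually found on your PATH (a small illustrative Python snippet; running "which" on each name from the shell tells you the same thing):

Code:
# Print the full path of each compiler front end found on PATH (None if absent).
import shutil

for name in ("ifort", "icc", "icpc", "gfortran", "gcc", "g++", "mpif90"):
    print(f"{name:8s} -> {shutil.which(name)}")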


Now to your question about processor decomposition, it may help if you attach your namelist.input file so that I can see the size of your domains. Thanks!
 
hey, kwerner--it's good to hear from you again. your software engineer's response seems sort of like a zen koan; I'll think about it. What my second query asked was whether there's a more insightful way to think about the relationship between how MPI manages tasks that are constrained by WRF4.3 and what role, if any, the slurm daemon/scheduler might play in sorting it all out. On Cheyenne, I can only run 72 tasks/processors, and on the UH-HPC I can run as many as 80 tasks. When I put my numbers into your rules of thumb, I got 4.45 and 18.99 (see the attached namelist.input file). thanks again.
 

Attachments

  • Namelist_input.txt (5.6 KB)
This reply is regarding the number of processors that you can use for WRF.

Here is a sample namelist.input file that has the required entries to determine the maximum number of MPI processes that are allowed.

Code:
&domains
 max_dom                             = 3,
 e_we                                = 180,    211,   112,
 e_sn                                = 180,    211,   106,

Using this namelist, here are the steps:

  • Build the WRF code with the DM (distributed memory) option, which uses the MPI library. Otherwise, you are running on one processor for serial builds, or with a maximum of a single node for OpenMP.
  • The minimum number of cells in each direction for a decomposed domain is 10. Look in all domains for the minimum value of e_we and e_sn. In the namelist snippet above, those values are e_we=112 and e_sn=106.
  • Use integer division for e_we/10 and e_sn/10. In this case that gives 112/10=11 and 106/10=10.
  • The largest number of computational MPI processes (that means that we are excluding any of the quilt I/O servers) is the product of these values: 11x10=110. This maximum value is always achievable if the MPI decomposition namelist entries are used.
  • For this particular namelist, in the &domains namelist record, set nproc_x=11 and nproc_y=10.
  • You need to modify your job submission script to handle the requested number of MPI processes. It is not completely crazy to somewhat tune the domain sizes to the hardware. For example, making e_we=120 allows a decomposition of 12x9=108 MPI processes, which fits evenly across three 36-core nodes.
  • Note that the minimum e_we could be on a different domain than e_sn. (A short sketch of this arithmetic follows this list.)
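
To make the arithmetic above concrete, here is a minimal Python sketch of the same rule (the helper name is mine, purely illustrative):

Code:
# Maximum computational MPI processes, per the integer-division rule above.
def max_mpi_procs(e_we, e_sn, min_cells=10):
    nproc_x = min(e_we) // min_cells  # smallest e_we across all domains
    nproc_y = min(e_sn) // min_cells  # smallest e_sn across all domains
    return nproc_x, nproc_y, nproc_x * nproc_y

# Values from the sample namelist above.
print(max_mpi_procs([180, 211, 112], [180, 211, 106]))  # (11, 10, 110)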
 
Aloha davegill!
This is awesome news! I can't wait to try it. My preliminary runs are winding down; as soon as they finish, I will implement your suggestions. I had chosen dmpar when I configured WRF4.3, so I think I'm running on all the cores.

Do you mind if I ask you or one of your colleagues a related question? For some reason, I have not been able to get Cygwin/X, the X11 server (which plotgrids_new.ncl requires in order to display), to install properly on my PC, even though I followed the steps in the User's Guide. One thing that I did notice as I tried to install it was that the Unix/binary option never appeared, and every time I installed Cygwin, the "X" icon never appeared. I have written a couple of times to the ncl-install email address given in the User's Guide, but there doesn't seem to be anyone there.

I really appreciate your having gotten in touch with me about my processor distribution problem. I am in the process of designing a new series of experiments that will require careful optimization of the ntasks and the tiles to try and attain maximum performance and overall wrf output quality.

Mahalo nui!
Chris
 
Part of the "searchability" capability that we want to enable on this forum comes down to keeping content closely associated with the thread's title.

Please take the new cygwin issue and repost it as a new question to the forum.
 
WRF & MPAS-A Support Forum--

When I included nproc_x = 11 and nproc_y = 10 in the &domains record of namelist.input, I got the following error message when I tried to run real.exe:

module_io_quilt_old.F 2931 T
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 5711
check comm_start, nest_pes_x, nest_pes_y settings in namelist for comm

The error seems to be related to the attempted decomposition. Any suggestions?
Thanks
 
I tried to make the same changes to the namelist.input file on the UH-HPC and got a similar, if slightly more detailed, error message when trying to run real:

taskid: 0 hostname: node-0005
module_io_quilt_old.F 2931 T
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 5711
check comm_start, nest_pes_x, nest_pes_y settings in namelist for comm
-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

Thanks
 
What is the number of processors that you are able to use per node? Can you use an entire node? If so, specify various factorizations of the "single node" core count. For example, if you have 40 cores, try nproc_x and nproc_y as: 8,5; 5,8; 10,4; 4,10. This error may simply be a disconnect between your job script (where you specify how many MPI tasks to launch) and the namelist specification of the number of MPI tasks.
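
As a sketch, here is one way to enumerate those factor pairs for a given task count (illustrative Python, not part of WRF):

Code:
# All (nproc_x, nproc_y) pairs whose product equals the MPI task count.
def factor_pairs(ntasks):
    return [(x, ntasks // x) for x in range(1, ntasks + 1) if ntasks % x == 0]

print(factor_pairs(40))
# [(1, 40), (2, 20), (4, 10), (5, 8), (8, 5), (10, 4), (20, 2), (40, 1)]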
 
Dave--Thanks. I can't wait to try it. I can run on 4 nodes with 20 tasks each without the redistribution that you suggest; each node has 20 cores. I don't think it has anything to do with the slurm script: these error messages come when I try to run real.exe.
 
Well, I tried various combinations of nproc_x and nproc_y and nothing worked, not even the default values -1, -1.

I did find an interesting 2020 paper, Moreno et al., "Analysis of a New MPI Process Distribution for the Weather Research and Forecasting (WRF) Model" (https://downloads.hindawi.com/journals/sp/2020/8148373.pdf), which presents evidence that near-square X x Y distributions (e.g., 6 x 6) are sub-optimal compared with more elongated (skewed) ones (e.g., 4 x 20).

I was surprised at the paucity of information on nproc_x and nproc_y in both Version 3 (2012) and Version 4 (2019) of the ARW Modeling System User's Guide.

It seems that some other information needs to be provided in namelist.input before the values of nproc_x and nproc_y can be changed.

I never had this problem when I did numerical experiments with WRF3.7, where I could use up to 400 processors. I usually chose 300 or 140; it was fast, and six-month rainfall runs could be validated with good results. Hopefully WRF4.3 will provide more quantitatively accurate rainfall, though it runs at a snail's pace: a one-month simulation takes one week of wall time, whereas with WRF3.7 I could simulate 6 months in 3 days.
 
I just selected some new domain settings (for a different experiment), and I was curious what you would calculate to be the maximum number of processors for this configuration. Alternatively, can these domain settings be jiggled around a little to increase the maximum processor count?

i_parent_start = 1, 52, 101,
j_parent_start = 1, 54, 104,
e_we = 180, 250, 220,
e_sn = 180, 241, 220,

The reason that I ask is because the maximum number of processors that I have been able to run on so far is 240 (12 x 20). However, compared to the earlier experiment (4 x 20), the performance is even worse: instead of taking 6 weeks to simulate 6 months, with these domains it will take 9 weeks.
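
As a sanity check on my own question, here is davegill's integer-division rule from earlier in the thread applied to these values (my arithmetic, not an official answer, and it ignores how tasks pack onto nodes):

Code:
# The earlier integer-division rule applied to the new domain settings above.
e_we = [180, 250, 220]
e_sn = [180, 241, 220]
print((min(e_we) // 10) * (min(e_sn) // 10))  # 18 * 18 = 324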
 
**Note: Attached script has been updated to include most recent modifications. The most recent version will always be available here.

Hi,
I'm attaching a python script I wrote that helps you determine the max number of processors you can use, based on the grid size and how many processors are in each node on your system. You'll need to make some modifications to tailor this script to your particular run and environment set-up, but hopefully it will help you.

As for reconfiguring the size of the domains - yes, you can resize them to find your max potential, as long as it still works for you. You will need to re-run geogrid, metgrid, and real once you find the best domain size.
 

Attachments

  • number_of_procs.py (4.2 KB)
kwerner--Thanks for the python script. I'll see if I can use it to modify and optimize my new wrf4.3 simulation. I am also planning to repeat the same run with wrf3.7 to see if I can increase the number of processors; I had great luck with wrf3.7. Because of its considerable speed and performance, I was able to do extensive optimization runs, which included processor counts and a series of rainfall validation experiments. There still may be one catch: since the uh-hpc upgrade, the tmi fabric is no longer supported. Comparing all of the different fabrics, tmi well outperformed the others. Now shm:ofi is being used, so I still have to check how much that may be degrading the computational performance. Mahalo nui loa for all of your useful and timely help! Chris.
 
kwerner--I entered various e_we and e_sn values into number_of_procs.py, and all of the values it produced were larger than what I have actually been able to run. Any idea why?

For the HPC (A), according to number_of_procs.py, my first experiment (Exp 1) should be able to run on 140 processors and 7 nodes, but the most I could run was 80 processors and 4 nodes. In Exp 2, using the smallest values (180, 180), I should be able to run on 300 processors and 15 nodes; instead it is running on 240 processors and 12 nodes.

For Cheyenne (B), your script suggests that for my first experiment (Exp 1) I should be able to run 108 processors and 3 nodes, but I tried those values and wrf failed. The most I could run was 72 processors and 2 nodes. In Exp 2, using the smallest values (180,180), I should be able to run 540 processors and 15 nodes (I haven't tried to run this experiment on Cheyenne yet).

A. For processors/node = 20

Exp 1.
e_we = 180, 211, 112,
e_sn = 180, 211, 106,

using 112, 106 --> processors = 140, nodes = 7

Exp 2.
e_we = 180, 250, 220,
e_sn = 180, 241, 220,

using 180, 180 --> processors = 300, nodes = 15

using 220, 220 --> processors = 620, nodes = 31

(Interesting thing is that 300 processors with 15 nodes and 140 processors with 7 nodes is what I had successfully run historically using WRF3.7.)

B. For processors/node = 36

Exp 1.
e_we = 180, 211, 112,
e_sn = 180, 211, 106,

using 112, 106 --> processors = 108, nodes = 3

Exp 2.
e_we = 180, 250, 220,
e_sn = 180, 241, 220,

using 180, 180 --> processors = 540, nodes = 15

using 220, 220 --> processors = 540, nodes = 15
 
Hi,
I apologize, as there was an error in the script. Find the final line of the script:
Code:
cores = (cores + cores)
and change that to
Code:
cores = (cores + cores_orig)

That should provide better numbers for you. Make sure you're also modifying the value for "cores" in line 14 to either 20 or 36, depending on the machine.
 
kwerner--thanks for the corrected python script. The new numbers are interesting. For Cheyenne, in Exp 1:
e_we = 180, 211, 112,
e_sn = 180, 211, 106,

using 112, 106 --> [processors = 72, nodes = 2], which is what's running now. No change.

For the HPC
Exp 1.
e_we = 180, 211, 112,
e_sn = 180, 211, 106,

using 112, 106 --> [processors = 100, nodes = 5], suggesting that I can add a node for this run [previously ran with processors = 80, nodes = 4].

Exp 2.
e_we = 180, 250, 220,
e_sn = 180, 241, 220,

using 180, 180 --> [processors = 180, nodes = 9], which implies that I'm now running too many processors [processors = 240, nodes = 12]. When I restart the run, I'll change the number of processors to 180 and the number of nodes to 9, and I'll let you know how it goes.
Mahalo nui!
 