
Real.exe bad file descriptor


sym04110

I am currently running the latest version of WRF with the polar modifications (http://polarmet.osu.edu/PWRF/) on Cheyenne. I installed WRF and WPS to my work directory earlier this week and have been able to run all of WPS successfully. However, when I try to run real.exe, I get the following error:

Code:
ctrl_vsend/writev failed: Bad file descriptor

No other output is generated. I do not get an rsl.error or an rsl.out file.

I am submitting this to the scheduler using the script attached. I've also attached both of my namelist files.

Thank you,
Sarah
 

Attachments

  • namelist.input.txt (2.4 KB)
  • namelist.wps (779 bytes)
  • WRF_real.script.txt (269 bytes)
Hi Sarah,
Since we have access to Cheyenne, do you mind providing the path to your WRF directory? That will make it easier for us to take a look at your files. Thanks!
 
The path to my work directory is: /glade/work/sarahm
WRF is in /glade/work/sarahm/WRF/run
WPS is in /glade/work/sarahm/WPS

The path to my home directory is: /glade/u/home/sarahm


Thank you!
 
I think the problem may be that you are issuing this as a serial execution, while the WRF code is built with the distributed-memory option. I'm not sure which MPI you used, but in my batch script, I use:
Code:
mpiexec_mpt ./real.exe
to execute real. I use mpt/2.19.
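For reference, a batch script for real.exe on Cheyenne might look something like this (a minimal sketch; the job name, project code, walltime, and select line are placeholders to adjust for your own setup):
Code:
#!/bin/bash
#PBS -N real_run
#PBS -A <project_code>
#PBS -q regular
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=20:mpiprocs=20

# load the same MPI the executables were built against
module load mpt/2.19

# change to the WRF run directory and launch real with MPI
cd /glade/work/sarahm/WRF/run
mpiexec_mpt ./real.exe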
 
You're right, that was my issue. Thank you for the reply!

After running real.exe, when I try to run wrf.exe I get the following output:

Code:
 starting wrf task            2  of           20
 starting wrf task            3  of           20
 starting wrf task            9  of           20
 starting wrf task            1  of           20
 starting wrf task            0  of           20
 starting wrf task            4  of           20
 starting wrf task            5  of           20
 starting wrf task            6  of           20
 starting wrf task            7  of           20
 starting wrf task            8  of           20
 starting wrf task           14  of           20
 starting wrf task           16  of           20
 starting wrf task           18  of           20
 starting wrf task           12  of           20
 starting wrf task           17  of           20
 starting wrf task           13  of           20
 starting wrf task           11  of           20
 starting wrf task           10  of           20
 starting wrf task           19  of           20
 starting wrf task           15  of           20
MPT ERROR: MPI_COMM_WORLD rank 7 has terminated without calling MPI_Finalize()
	aborting job
MPT: Received signal 11

Is this termination due to a similar type of issue? My job submission script is the following (with my project code removed):

Code:
#!/bin/bash
#PBS -N pwrf_nicefeb
#PBS -A xxxxxxxx
#PBS -j oe                                
#PBS -o pwrf_nicefeb.log
#PBS -q regular                           
#PBS -l walltime=02:00:00
#PBS -l select=1:ncpus=20:mpiprocs=20

cd /glade/work/sarahm
cd WRF/run

mpiexec_mpt ./wrf.exe
 
Hi,
It's difficult to say why the run is terminating, but it is no longer due to the batch script: the model is stopping after it begins to run (take a look at one of your rsl* files). The likely problem is that your domain 01 is too small (yours is 50x50, and we don't want you to run with anything smaller than 100x100), and/or that you're using too many processors for the size of the domain. Take a look at this FAQ, which discusses how to choose the right number of processors based on the domain size:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=5082
and also take a look at this page, which can help you set up your domain in a reasonable way:
https://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wps.html
Click on the namelist variables for detailed descriptions, as well as best practices.
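If I'm reading that FAQ correctly, its rule of thumb works out roughly like this (illustrative arithmetic for a 100x100 domain, the smallest size we recommend):
Code:
fewest processors: (e_we/100) * (e_sn/100) = 1 * 1 = 1
most processors:   (e_we/25)  * (e_sn/25)  = 4 * 4 = 16
So for a domain that size, anywhere from 1 to 16 processors is reasonable; 20 tasks on a 50x50 domain is well past the upper bound.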

Another thing to mention: I see this in your rsl.error.* files:
Code:
open_hist_w : error opening /glade/work/sarahm/wrfout/wrfout_d01_2015-02-01_00:00:00 for writing. ***
You have this set in your namelist.input file:
Code:
history_outname = '/glade/work/sarahm/wrfout/wrfout_d<domain>_<date>',
so the model is trying to write to that location. When I go into /glade/work/sarahm/, I don't see a directory called 'wrfout.' This error is not going to stop the model from running, but you aren't going to get any data written anywhere if this directory doesn't exist. You should create that directory before running.
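You can create it with something like:
Code:
# create the output directory the model expects before launching the run
mkdir -p /glade/work/sarahm/wrfout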
 
Hi,

Thank you for the response and for looking through my error files. You're right, that directory did not exist; I've created it now.

I expanded my domains so none of them are smaller than 100 x 100 and read through the documentation about selecting processors. For my new domain, 16 processors seemed appropriate, but I'm still getting a similar error. I experimented with the number of processors, and my run still terminates immediately.

This is the last line of my rsl.error.0000 file:
Code:
MPT: Program /glade/work/sarahm/WRF/main/wrf.exe, Rank 0, Process 29915: Core dump on signal SIGSEGV(11) suppressed.

and my job submission output gives this:
Code:
MPT ERROR: MPI_COMM_WORLD rank 3 has terminated without calling MPI_Finalize()
	aborting job
MPT: Received signal 11

Do you still feel that this is an issue with the domain and processor selection? Could I have done something incorrectly when compiling the model?

Thanks again!
 
Hi,
No, it's unlikely that is the cause. When the model just stops, it can be tough to track down the exact issue.
1) First, check to make sure you have space where you are trying to write the files. Issue a 'gladequota' command to see how much space is left on your /glade/work disk. That disk is smaller than some of the other places (like /glade/scratch), so it's worth checking.
2) Often, when the model stops soon after starting, it's a problem with the input data. You can glance at your input files to see if anything looks "off" about them. If not, I would suggest running just a single domain first to see if that runs. If so, then try 2 domains, and if that runs, you know the problem is with d03. That still doesn't tell you what the problem is, but it's helpful for further investigation.
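For example, to test d01 alone, you can temporarily set max_dom = 1 in namelist.input (a minimal sketch; assuming you currently run three domains):
Code:
&domains
 max_dom = 1,    ! temporarily run only domain 01; restore to 3 for the full nest
 ...
/
Then bump max_dom back up (to 2, and then 3) as each test succeeds.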
 