
Real.exe bad file descriptor


sym04110

I am currently running the latest version of WRF with the polar modifications (http://polarmet.osu.edu/PWRF/) on Cheyenne. I installed WRF and WPS to my work directory earlier this week and have been able to run all of WPS successfully. However, when I try to run real.exe, I get the following error:

Code:
ctrl_vsend/writev failed: Bad file descriptor

No other output is generated. I do not get an rsl.error or an rsl.out file.

I am submitting this to the scheduler using the script attached. I've also attached both of my namelist files.

Thank you,
Sarah
 

Attachments

  • namelist.input.txt (2.4 KB)
  • namelist.wps (779 bytes)
  • WRF_real.script.txt (269 bytes)
Hi Sarah,
Since we have access to Cheyenne, do you mind providing the path to your WRF directory? That will make it easier for us to take a look at your files. Thanks!
 
The path to my work directory is: /glade/work/sarahm
WRF is in /glade/work/sarahm/WRF/run
WPS is in /glade/work/sarahm/WPS

The path to my home directory is: /glade/u/home/sarahm


Thank you!
 
I think the problem may be that you are issuing this as a serial execution, while the WRF code is built with the distributed-memory option. I'm not sure which MPI you used, but in my batch script, I use:
Code:
mpiexec_mpt ./real.exe
to execute real. I use mpt/2.19.
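For reference, a batch script for real.exe on Cheyenne might look something like this (a minimal sketch; the job name, project code, walltime, and select line are placeholders to adjust for your own setup):
Code:
#!/bin/bash
#PBS -N real_run
#PBS -A <project_code>
#PBS -q regular
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=20:mpiprocs=20

# load the same MPI the executables were built against
module load mpt/2.19

# change to the WRF run directory and launch real with MPI
cd /glade/work/sarahm/WRF/run
mpiexec_mpt ./real.exe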
 
You're right, that was my issue. Thank you for the reply!

After running real.exe, when I try to run wrf.exe I get the following output:

Code:
 starting wrf task            2  of           20
 starting wrf task            3  of           20
 starting wrf task            9  of           20
 starting wrf task            1  of           20
 starting wrf task            0  of           20
 starting wrf task            4  of           20
 starting wrf task            5  of           20
 starting wrf task            6  of           20
 starting wrf task            7  of           20
 starting wrf task            8  of           20
 starting wrf task           14  of           20
 starting wrf task           16  of           20
 starting wrf task           18  of           20
 starting wrf task           12  of           20
 starting wrf task           17  of           20
 starting wrf task           13  of           20
 starting wrf task           11  of           20
 starting wrf task           10  of           20
 starting wrf task           19  of           20
 starting wrf task           15  of           20
MPT ERROR: MPI_COMM_WORLD rank 7 has terminated without calling MPI_Finalize()
	aborting job
MPT: Received signal 11

Is this termination due to a similar type of issue? My job submission script is the following (with my project code removed):

Code:
#!/bin/bash
#PBS -N pwrf_nicefeb
#PBS -A xxxxxxxx
#PBS -j oe                                
#PBS -o pwrf_nicefeb.log
#PBS -q regular                           
#PBS -l walltime=02:00:00
#PBS -l select=1:ncpus=20:mpiprocs=20

cd /glade/work/sarahm
cd WRF/run

mpiexec_mpt ./wrf.exe
 
Hi,
It's difficult to say why the run is terminating, but it is no longer due to the batch script: the model is stopping after it begins to run (take a look at one of your rsl* files). The likely problem is that your domain 01 is too small (yours is 50x50, and we don't want you to run with anything smaller than 100x100), and/or that you're using too many processors for the size of the domain. Take a look at this FAQ, which discusses how to choose the right number of processors based on the domain size:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=5082
and also take a look at this page, which can help you set up your domain in a reasonable way:
https://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wps.html
Click on the namelist variables for detailed descriptions, as well as best practices.
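If I'm reading that FAQ correctly, its rule of thumb works out roughly like this (illustrative arithmetic for a 100x100 domain, the smallest size we recommend):
Code:
fewest processors: (e_we/100) * (e_sn/100) = 1 * 1 = 1
most processors:   (e_we/25)  * (e_sn/25)  = 4 * 4 = 16
So for a domain that size, anywhere from 1 to 16 processors is reasonable; 20 tasks on a 50x50 domain is well past the upper bound.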

Another thing to mention: I see this in your rsl.error.* files:
Code:
open_hist_w : error opening /glade/work/sarahm/wrfout/wrfout_d01_2015-02-01_00:00:00 for writing. ***
You have this set in your namelist.input file:
Code:
history_outname = '/glade/work/sarahm/wrfout/wrfout_d<domain>_<date>',
so the model is trying to write to that location. When I go into /glade/work/sarahm/, I don't see a directory called 'wrfout.' This error is not going to stop the model from running, but you aren't going to get any data written anywhere if this directory doesn't exist. You should create that directory before running.
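You can create it with something like:
Code:
# create the output directory the model expects before launching the run
mkdir -p /glade/work/sarahm/wrfout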
 
Hi,

Thank you for the response and for looking through my error files. You're right, that directory did not exist; I've created it now.

I expanded my domains so none of them are smaller than 100 x 100 and read through the documentation about selecting processors. For my new domain, 16 processors seemed appropriate, but I'm still getting a similar error. I experimented with the number of processors, and my run still terminates immediately.

This is the last line of my rsl.error.0000 file:
Code:
MPT: Program /glade/work/sarahm/WRF/main/wrf.exe, Rank 0, Process 29915: Core dump on signal SIGSEGV(11) suppressed.

and my job submission output gives this:
Code:
MPT ERROR: MPI_COMM_WORLD rank 3 has terminated without calling MPI_Finalize()
	aborting job
MPT: Received signal 11

Do you still feel that this is an issue with the domain and processor selection? Could I have done something incorrectly when compiling the model?

Thanks again!
 
Hi,
No, it's unlikely that is the cause. When the model just stops, it can be tough to track down the exact issue.
1) First, check to make sure you have space where you are trying to write the files. Issue a 'gladequota' command to see how much space is left on your /glade/work disk. That disk is smaller than some of the other places (like /glade/scratch), so it's worth checking.
2) Often, when the model stops soon after starting, it's a problem with the input data. You can glance at your input files to see if anything looks "off" about them. If not, I would suggest running just a single domain first to see if that runs. If so, then try 2 domains, and if that runs, you know the problem is with d03. That still doesn't tell you what the problem is, but it's helpful for further investigation.
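For example, to test d01 alone, you can temporarily set max_dom = 1 in namelist.input (a minimal sketch; assuming you currently run three domains):
Code:
&domains
 max_dom = 1,    ! temporarily run only domain 01; restore to 3 for the full nest
 ...
/
Then bump max_dom back up (to 2, and then 3) as each test succeeds.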
 