FATAL ERROR: data in update host clause was not found on device 1: name=areacell

To whom it may concern,

I compiled MPAS-A 6xx-openacc with the PGI compiler from the NVIDIA HPC SDK 22.9, on an NVIDIA 2080 Super GPU. During model integration, when I run "mpiexec -n 4 ./atmosphere_model", I get the following error:
FATAL ERROR: data in update host clause was not found on device 1: name=areacell
file:/hdd/rabbani/MPAS-A-PGI-OACC/MPAS-Model/src/core_atmosphere/physics/mpas_atmphys_driver_sfclayer.F driver_sfclayer line:962

I have attached all namelists and log files.

Why does this problem occur, and how can I solve it?

Thanks in advance.
 

Attachments

  • mpirun.log (23.1 KB)
  • log.atmosphere.role03.0000.out.txt (10.2 KB)
  • streams.atmosphere.txt (1.5 KB)
  • namelist.atmosphere.txt (1.8 KB)

gdicker

The atmosphere/v6.x-openacc branch was developed with lagged radiation as the intended use, but your code was run with synchronous radiation. To enable lagged radiation, you will want to consult the "Running" page for the GPU-enabled MPAS branch (here: Running — GPU-enabled MPAS-Atmosphere v6,gpu documentation). This page will also point you to the description of the lagged radiation feature if you're interested.

Please follow up if you have any other questions or problems!
 
Dear @gdicker
Thanks for your reply.

Following the instructions on that page, I set:
export MPAS_DYNAMICS_RANKS_PER_NODE=8
export MPAS_RADIATION_RANKS_PER_NODE=4
mpirun -np 12 ./atmosphere_model

but it still fails and shows:
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[20851,1],3]
Exit code: 1
and log.atmosphere.role02.0000.out stops at:
Begin timestep 2014-09-10_00:00:00
----------------------------------------------------------------------
Read 'surface' input stream valid at 2014-09-10_00:00:00
Timing for stream input: 0.480074E-02 s

----------------------------------------------------------------------
--- time to update background surface albedo, greeness fraction.
--- time to run the LW radiation scheme L_RADLW =T
--- time to run the SW radiation scheme L_RADSW =T
--- time to run the convection scheme L_CONV =T
--- time to apply limit to accumulated rainc and rainnc L_ACRAIN =F
--- time to apply limit to accumulated radiation diags. L_ACRADT =F
--- time to calculate additional physics_diagnostics =F
The same thing happens with MPAS_DYNAMICS_RANKS_PER_NODE=4 and MPAS_RADIATION_RANKS_PER_NODE=2, etc.

Can you please help me solve this issue? If you need any more information, please let me know.
 

gdicker

@rabbanidu93,

I think I need some more context to figure this out. In your reply could you please attach your full mpirun.log, log.atmosphere.role01.0000.out, log.atmosphere.role02.0000.out, and one log.atmosphere.*.err for each role (if present)?

To keep the logs shorter, you might want to run with only 2 ranks for RADIATION and DYNAMICS while debugging.
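
As a rough sketch of that reduced setup: the snippet below assumes a single node, reads the suggestion as 2 ranks per role, and assumes the total passed to mpirun is the sum of the two per-role counts (as in the 8 + 4 = 12 command earlier in the thread). The helper name launch_cmd is hypothetical, and it only prints the command instead of running the model.

```shell
# Hypothetical helper: builds the mpirun command from the two per-role
# rank counts. Assumption: total ranks = dynamics ranks + radiation ranks
# on a single node, matching the 8 + 4 = 12 launch shown earlier.
launch_cmd() {
    dyn=${MPAS_DYNAMICS_RANKS_PER_NODE:?set MPAS_DYNAMICS_RANKS_PER_NODE first}
    rad=${MPAS_RADIATION_RANKS_PER_NODE:?set MPAS_RADIATION_RANKS_PER_NODE first}
    echo "mpirun -np $((dyn + rad)) ./atmosphere_model"
}

# Reduced rank counts for debugging (2 per role is one reading of the advice).
export MPAS_DYNAMICS_RANKS_PER_NODE=2
export MPAS_RADIATION_RANKS_PER_NODE=2

# Print the command; run it by hand once it looks right.
launch_cmd
```

Printing the command first makes it easy to check that the -np value matches the two environment variables before starting a run.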
 
Dear @gdicker ,

I have attached all log files. There is no log.atmosphere.*.err file.

I have again set:
export MPAS_DYNAMICS_RANKS_PER_NODE=8
export MPAS_RADIATION_RANKS_PER_NODE=4
mpirun -np 12 ./atmosphere_model

and I am getting:
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21333,1],10]
Exit code: 1
 

Attachments

  • log.atmosphere.role02.0000.out.txt (15.2 KB)
  • log.atmosphere.role01.0000.out.txt (8.8 KB)

gdicker

@rabbanidu93,

You might want to try narrowing down the error some more. When I've had to do this, the NVCOMPILER_ACC_NOTIFY environment variable was helpful. It's documented here in Section 6.6 for NVHPC SDK v22.2 (though any version should work). Note that this can be VERY verbose, so you should really limit the number of ranks you run with. A successful run of a different OpenACC branch with NVCOMPILER_ACC_NOTIFY=7 had about 2.5 million lines in my equivalent of your mpirun.log.

That said, if you find another method of debugging more successful, please share that as well!
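
For reference, NVCOMPILER_ACC_NOTIFY is a bitmask, so a value of 7 combines three kinds of messages. Below is a minimal sketch of how that value is composed. The NOTIFY_* variable names are made up for readability; the bit meanings follow the NVHPC documentation as I understand it, so check them against your SDK version. The model launch itself is left commented out.

```shell
# Illustrative names for the NVCOMPILER_ACC_NOTIFY bits
# (names are hypothetical; values per the NVHPC documentation).
NOTIFY_LAUNCH=1    # kernel launches
NOTIFY_DATA=2      # data transfers
NOTIFY_REGION=4    # region entry/exit
NOTIFY_WAIT=8      # wait operations / device synchronizations
NOTIFY_ALLOC=16    # device memory allocates and deallocates

# Launches + data transfers + region entry/exit gives 7.
export NVCOMPILER_ACC_NOTIFY=$((NOTIFY_LAUNCH | NOTIFY_DATA | NOTIFY_REGION))
echo "NVCOMPILER_ACC_NOTIFY=$NVCOMPILER_ACC_NOTIFY"

# Then run with few ranks and capture everything in one file, e.g.:
# mpirun -np 4 ./atmosphere_model > mpirun.log 2>&1
# and look at the lines just before the failure:
# grep -n -B 20 "Failing in Thread" mpirun.log
```

Starting with fewer bits (for example only data transfers) may keep the log manageable if the full output is too large to search.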
 

MyAtmosphere

Any update on this thread? I'm hitting the same error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I do get history and diag files with some data, though.
 