
Inserting write-statements into PIO

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

Carl Ponder

I'm trying to debug a problem I'm seeing with MPAS-A running with PIO (and other components).
I'm able to insert statements like this
Code:
call mpas_log_write('Debug checkpoint 2')
into the MPAS-A source, and see the corresponding output.
But if I insert this into the PIO source I don't see any output
Code:
write(6,*) "Debug checkpoint 2.1"
suggesting to me that MPAS-A is re-mapping the I/O streams somehow.
Are there channels I can use with fortran/write and C/printf to make sure the output makes it to the screen?
 
Nor did these show up:
Code:
USE ISO_FORTRAN_ENV, ONLY : ERROR_UNIT, OUTPUT_UNIT
....
write(OUTPUT_UNIT,*) "Debug 2.0 (output)"
flush(unit=OUTPUT_UNIT)
write(ERROR_UNIT,*) "Debug 2.0 (output)"
flush(unit=ERROR_UNIT)
 
Before the introduction of the logging module in MPAS v6.0, we did re-direct stderr and stdout to log files; but, since MPAS v6.0, this is no longer the case. I've had no problems using simple "write(0,*)" statements to debug PIO in the past; is it possible that there's an issue with the batch system not capturing stdout/stderr from the jobs you're running?
 
(also logged as an MPAS-A issue on this page
https://github.com/MPAS-Dev/MPAS-Model/issues/609
but closed, to carry the discussion here instead)
 
Our SLURM system has never shown this sort of problem.
I'm using the OpenACC version of MPAS-A which may be a snapshot prior to your un-doing the re-direction.
I believe that streams 2 & 3 usually make it to the screen in C, are there any other Fortran units that I can try using?
 
Is execution getting to the output statements in PIO? Does a write statement in a higher-level MPAS function, such as the run function in mpas_atm_core.F or mpas_init_atm_core.F, produce output? Is it possible to open a file and write to it?

I'm using the OpenACC version of MPAS-A which may be a snapshot prior to your un-doing the re-direction.

If this is the case, then this would most likely be the cause of the problem you are experiencing. Can you confirm if you are using pre or post v6.0?

I believe that streams 2 & 3 usually make it to the screen in C, are there any other Fortran units that I can try using?

Values 1 and 2 are the output streams for C (stdout and stderr, respectively). Either should produce output (if those streams aren't being redirected). If stdout and stderr aren't being redirected, then units 0 (stderr) and 6 (stdout) should produce output to the terminal in Fortran.
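For the C side of the question, here is a minimal sketch of a checkpoint helper that writes directly to file descriptor 2 (stderr) with the POSIX write() call, so the message does not sit in a stdio buffer if the process crashes right afterwards. The name debug_checkpoint and the "debug: " prefix are hypothetical and not part of PIO or MPAS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of PIO or MPAS): format one checkpoint
 * line and write it straight to file descriptor 2 (stderr), bypassing
 * stdio buffering. Returns the number of bytes written, or -1 on error. */
int debug_checkpoint(const char *msg)
{
    char line[256];
    assert(msg != NULL);

    int len = snprintf(line, sizeof line, "debug: %s\n", msg);
    if (len < 0)
        return -1;
    if ((size_t)len >= sizeof line)
        len = (int)sizeof line - 1;   /* message was truncated to fit */

    ssize_t written = write(STDERR_FILENO, line, (size_t)len);
    return (int)written;
}
```

Calling debug_checkpoint("checkpoint 2.1") from a C routine should put the line on the terminal (or in the batch system's stderr file) as long as stderr is not redirected. With plain stdio, fprintf(stderr, ...) followed by fflush(stderr) serves the same purpose.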
 
The routine in the PIO is called by the MPAS-A routine where this was called:

call mpas_log_write('Debug checkpoint 2')

The above did work. Once it got down into the PIO, though, I can't get any output through.
I'll try working with the MPAS-A 7.0 code instead, it's showing the same failure as the OpenACC code.
Have you ever tested it with PGI 20.5 and OpenMPI 4.0.4?
 
Same issue with MPAS-A 7.0 source code.
This call from the file MPAS-Model-7.0/src/framework/mpas_io.F

1328 call mpas_log_write('Carl-debug checkpoint 1')
1329 call PIO_initdecomp(handle % ioContext % pio_iosystem, pio_type, dimlist, compdof, new_decomp % decomphandle % pio_iodesc)
1330 call mpas_log_write('Carl-debug checkpoint 2')

gives output from the first line but evidently the call never returns.
Putting write-statements into the PIO_initdecomp doesn't produce any output.

Also, I'd tried to use write statements in the other MPAS-A snapshot I have, but haven't been able to get any output from there either.
What unit is the mpas_log_write function using?
 
The OpenACC version of MPAS-A is based on MPAS v6.1; that, combined with the fact that the mpas_log_write function exists, suggests that there shouldn't be an issue of having code that still redirects stdout/stderr.

As a test, you could try adding a write(0,*) statement before and after the call to mpas_init in src/driver/mpas.F. If you can see the write statement before the call to mpas_init, but not the write statement after, that would suggest that stderr is somehow getting redirected by MPAS; but, if you can't see either write statement, then it suggests an issue with the batch system or with buffering of stdout/stderr.
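If buffering is the suspect on the C side, one way to rule it out is to switch the streams to unbuffered mode before anything is printed. This is only a sketch using the standard setvbuf() call. The helper name make_unbuffered is hypothetical, and it does nothing about Fortran units, which the Fortran runtime buffers separately.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: put a stdio stream into unbuffered mode so each
 * print reaches the file descriptor immediately. Should be called before
 * any other operation on the stream. Returns 0 on success. */
int make_unbuffered(FILE *stream)
{
    assert(stream != NULL);
    return setvbuf(stream, NULL, _IONBF, 0);
}
```

A typical use would be to call make_unbuffered(stdout) and make_unbuffered(stderr) at the very start of the program, then rerun the failing case and check whether the missing messages appear.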
 
The unit numbers used by stand-alone MPAS-Atmosphere aren't hard-wired anywhere in the code, but are determined at runtime. Fortunately, they should be set deterministically. This code block in mpas_log.F sets the Fortran unit numbers used by the log files.
 
Using this
Code:
write(0,*) "Carl-debug: Checkpoint 0.1"
call mpas_init()
write(0,*) "Carl-debug: Checkpoint 0.2"
I get the first output. The second doesn't show up because the mpas_init is dying.
But it does still work when it gets down to here in ./MPAS-Model-7.0/src/framework/mpas_io.F
Code:
   1327       dimlist(ndims) = field_cursor % fieldhandle % dims(ndims) % dimsize
   1328           call mpas_log_write('Carl-debug checkpoint 1')
   1329    write(0,*) "Carl-debug: Checkpoint 1"
   1330       call PIO_initdecomp(handle % ioContext % pio_iosystem, pio_type, dimlist, compdof, new_decomp % decomphandle % pio_iodesc)
but similar write statements don't work inside the PIO call.

That being said, maybe the failure is in the de-referencing of the parameters rather than inside the PIO call itself.
The failure-messages I see from the execution are these
Code:
+ mpirun -n 1 -N 1 ../MPAS-Model-7.0/atmosphere_model.2020-06-08
 Carl-debug: Checkpoint 0.1
 Carl-debug: Checkpoint 1
[prm-dgx-04:26367:0:26367] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fffd3055ff8)
==== backtrace (tid:  26367) ====
 0  /gpfs/fs1/SHARE/Utils/UCX/1.8.0/GCC-BASE-7.4.0_CUDA-11.0.1.0_450.36.06/lib/libucs.so.0(ucs_handle_error+0x124) [0x7f1386910de4]
 1  /gpfs/fs1/SHARE/Utils/UCX/1.8.0/GCC-BASE-7.4.0_CUDA-11.0.1.0_450.36.06/lib/libucs.so.0(+0x24215) [0x7f1386911215]
 2  /gpfs/fs1/SHARE/Utils/UCX/1.8.0/GCC-BASE-7.4.0_CUDA-11.0.1.0_450.36.06/lib/libucs.so.0(+0x244a9) [0x7f13869114a9]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7f138904e890]
 4  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9de67c]
 5  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9ea283]
 6  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9ecc84]
 7  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9e6e0f]
 8  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9c2069]
 9  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9c2202]
10  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x9045bf]
11  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x91aa2e]
12  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x917090]
13  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x40d99c]
14  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x40c24d]
15  ../MPAS-Model-7.0/atmosphere_model.2020-06-08() [0x40c1d3]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 26367 on node prm-dgx-04 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The UCX errors suggest that the failure has something to do with MPI, but given that the backtrace doesn't show any PIO layer, I'm thinking that it's dying on line 1330 of the MPAS-A code and the UCX messages are due to an unclean exit instead.
 
The backtrace details are a little different if I compile with DEBUG=on

[prm-dgx-30:74879:0:74879] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fff65ac2ff8)
==== backtrace (tid: 74879) ====
0 /gpfs/fs1/SHARE/Utils/UCX/1.8.0/GCC-BASE-7.4.0_CUDA-11.0.1.0_450.36.06/lib/libucs.so.0(ucs_handle_error+0x124) [0x7f911e064de4]
1 /gpfs/fs1/SHARE/Utils/UCX/1.8.0/GCC-BASE-7.4.0_CUDA-11.0.1.0_450.36.06/lib/libucs.so.0(+0x24215) [0x7f911e065215]
2 /gpfs/fs1/SHARE/Utils/UCX/1.8.0/GCC-BASE-7.4.0_CUDA-11.0.1.0_450.36.06/lib/libucs.so.0(+0x244a9) [0x7f911e0654a9]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7f91207a2890]
4 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(pioassert+0xc) [0x15fd55c]
5 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(coord_to_lindex+0x63) [0x1609163]
6 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(box_rearrange_create+0x994) [0x160bb64]
7 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(PIOc_InitDecomp+0x8bf) [0x1605cef]
8 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(piolib_mod_pio_initdecomp_internal_+0x3d9) [0x15e0f49]
9 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(piolib_mod_pio_initdecomp_dof_i8_+0xd2) [0x15e10e2]
10 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(mpas_io_mpas_io_set_var_indices_+0x6431) [0x13bf051]
11 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(mpas_bootstrapping_mpas_io_setup_cell_block_fields_+0x8d6) [0x14846a6]
12 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(mpas_bootstrapping_mpas_bootstrap_framework_phase1_+0x16b6) [0x147d966]
13 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(mpas_subdriver_mpas_init_+0x3c6b) [0x44211b]
14 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(MAIN_+0xae) [0x43e41e]
15 ../MPAS-Model-7.0/atmosphere_model.2020-06-08(main+0x33) [0x43e353]

Maybe the PIO layer was not reported before because the PIO library is compiled-in as a ".a" archive instead of a ".so" shared-object, so the backtrace shows the execution being in the MPAS-A source rather than the PIO.
 
It looks like the reason the writes aren't working is because I put them in the wrong subroutine.
They're called through an overloaded interface and the call-stack is referring to a different one.
I'm going to close this issue. Hopefully I'll make more progress figuring out where the failure is.
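To illustrate how this kind of mistake happens: with an overloaded (generic) interface, one name maps to several specific routines, and the argument types decide which one runs. The DEBUG backtrace above shows the i8 variant (piolib_mod_pio_initdecomp_dof_i8_) being the one executed. The sketch below is only an analogy in C using C11 _Generic, not PIO code, and every name in it is hypothetical. A print statement placed in the i4 routine never fires when the caller passes 8-byte integers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Two specific routines behind one generic name, loosely mirroring the
 * i4 / i8 variants of a Fortran generic interface. Hypothetical names. */
int initdecomp_i4(const int32_t *dof)
{
    assert(dof != NULL);
    fprintf(stderr, "debug: in initdecomp_i4\n");
    return 4;
}

int initdecomp_i8(const int64_t *dof)
{
    assert(dof != NULL);
    fprintf(stderr, "debug: in initdecomp_i8\n");
    return 8;
}

/* The generic name: the compiler selects the specific routine from the
 * pointer type of the argument, at compile time. */
#define initdecomp(dof) _Generic((dof),        \
    int32_t *:       initdecomp_i4,            \
    const int32_t *: initdecomp_i4,            \
    int64_t *:       initdecomp_i8,            \
    const int64_t *: initdecomp_i8)(dof)
```

Calling initdecomp with a pointer to int64_t data prints only the i8 message, so a debug line added to initdecomp_i4 would stay silent. In the Fortran case, the symbol names in a backtrace from a DEBUG build indicate which specific routine to instrument.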
 
Thanks very much for following-up. It's good to know that there isn't any redirection of stdout/stderr at fault. I remember running into similar problems when debugging PIO with print statements in the past, too -- it took me a few wrong guesses before I put my write statements in the correct implementation of an overloaded interface.

There isn't really a "close" option for discussion threads, so no worries there.
 