Allocation error when number of trajectories is increased beyond 1000?

smhitch · Feb 4, 2021

Hi all,

I've successfully done a run with the online trajectory code in V4.2 (yay!) for fewer than 1000 trajectories. I wanted to make what I thought would be two relatively simple modifications.

1) Increase the buffer from 1000 to 1080, so that trajectories are written to the output files at times more in sync with my wrfout and restart files. I did this simply by modifying the vals_max parameter in /share/module_trajectory.F.
2) Increase the number of trajectories. I started with a modest increase to 1500. I modified this in /share/module_trajectory.F, and then modified the module_initialize_real.F code trajectory initialization section to reflect this increase for my needs (in the same format as the working version, just with more points). I also created wrfinput_traj_d0x files that fit my needs (in the same format of the working version), but for a subsample of the # of trajectories just as a trial (as I did when I was testing my current working version).
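
The alignment arithmetic behind item 1 can be sketched as follows. This is only an illustration: it assumes the trajectory buffer is flushed once every vals_max model time steps, and the 20 s time step and 6 h output interval are hypothetical values chosen for the example, not taken from this run. The function names are made up for the sketch.

```python
def steps_per_output(history_interval_s, time_step_s):
    """Number of model time steps between two consecutive output times."""
    if history_interval_s % time_step_s != 0:
        raise ValueError("output interval is not a whole number of time steps")
    return history_interval_s // time_step_s


def flush_steps(vals_max, n_steps):
    """Step numbers at which a full buffer would be written (assumed behaviour)."""
    return list(range(vals_max, n_steps + 1, vals_max))


def is_aligned(vals_max, n_per_output):
    """True if every output time coincides with a buffer flush."""
    return n_per_output % vals_max == 0


# Hypothetical setup: 20 s time step, output every 6 hours.
n = steps_per_output(6 * 3600, 20)
print(n, is_aligned(1000, n), is_aligned(1080, n))
```

Under these assumed numbers, a buffer of 1000 drifts out of step with the output times, while 1080 flushes exactly when the output is written.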

When I compile both of these, initially, the compile log has some comments that don't usually occur about implicit declarations, and a fatal error: 'compilation aborted for module_alloc_space_6.f90 (code 1)'. If I try and compile a second time, the executables are built successfully...

As a standalone change, the first modification works *most* of the time. Every once in a while, I get an allocation error. A colleague has suggested that perhaps I may be near the bounds of the memory allocated for specific arrays? But this approaches the bounds of my WRF, FORTRAN, and MPI knowledge....

I mention the first goal and error, because when I try and implement the second goal, I consistently get an allocation error:
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14c108c6a2e0)
==== backtrace (tid: 19948) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x000000000139f8dc module_trajectory_mp_trajectory_init_() ???:0
2 0x0000000001edad07 start_domain_em_() ???:0
3 0x0000000001a4b121 start_domain_() ???:0
4 0x0000000001489273 med_initialdata_input_() ???:0
5 0x00000000004177f2 module_wrf_top_mp_wrf_init_() ???:0
6 0x00000000004166a4 MAIN__() ???:0
7 0x0000000000416622 main() ???:0
8 0x00000000000237b3 __libc_start_main() ???:0
9 0x000000000041652e _start() ???:0
=================================

When I try and run in debug mode, I get the following:

forrtl: severe (408): fort: (3): Subscript #4 of the array RFIELD_4D has value 0 which is less than the lower bound of 1
Image PC Routine Line Source
wrf.exe 000000000B3D8EC6 Unknown Unknown Unknown
wrf.exe 000000000263FFAD module_trajectory 1097 module_trajectory.f90
wrf.exe 0000000002614EBC module_trajectory 498 module_trajectory.f90
wrf.exe 0000000003C336B2 start_domain_em_ 2344 start_em.f90
wrf.exe 00000000030ED068 start_domain_ 121 start_domain.f90
wrf.exe 00000000028A15D3 Unknown Unknown Unknown
wrf.exe 0000000000413705 module_wrf_top_mp 271 module_wrf_top.f90
wrf.exe 000000000041304F MAIN__ 23 wrf.f90
wrf.exe 0000000000412FE2 Unknown Unknown Unknown
libc-2.28.so 000015482FDC57B3 __libc_start_main Unknown Unknown
wrf.exe 0000000000412EEE Unknown Unknown Unknown

This suggests that the index table, p%index_table(n,dm), for this field (I think it's currently moist vars, but seems to also error on dyn vars) is not contiguous, but it's unclear to me why/if it's supposed to be like that...

I tried to trace this through the files and have come up with a couple of ideas, but would love any input you might have (and apologies if I've gotten the language wrong here, I'm still learning):
1) am I missing a location that I need to make an adjustment to account for the increase to 1500 trajectories?
2) I noticed limits of 1000 in several places with IO, and wondered if there is some kind of arbitrary limit that I'm hitting...
Am I asking the right questions?

I saw one of the other posts on trajectories, and was encouraged by the statement "The developers of the trajectory code wanted the default value to be 1000 to give users a good starting point for the number of trajectories they should use," which makes me think perhaps I've just missed something and it's possible to have more than 1000 trajectories!

Any advice/discussion would be appreciated.

Cheers,
Stacey

davegill · Feb 8, 2021

Stacey,

It sure seems like you are doing this the right way, based on the mods you made. Let's start from the beginning, and take a look at the errors in the build to see if those give us a clue.

Code:

./clean -a
./configure -D

Build the code to run on ONE processor.

Code:

./compile em_real -j 1 >& foo

Build the code so that only one processor runs the build command.

Put the build log in an attachment for us to review.

MatzeG · May 7, 2021

As far as I understand it, you have to change traj_max in module_trajectory to increase the maximum number of trajectories. However, it seems that the trajectory module is not suited for running a very large number of trajectories. I tried running 65000 trajectories on 512 cores, and WRF got stuck somewhere inside the trajectory initialization in start_em.F for 1.5 hours until I killed the job. Unfortunately, the nodes couldn't actually kill the processes (and had to be rebooted, I think). Maybe it was just a memory problem, but I don't think so. Does anybody know why these problems occur? With 7000 trajectories on 512 cores it works now.

Ming Chen · May 12, 2021

I guess it is also unrealistic to specify starting and ending dates as well as original locations for thousands of trajectories ....

MatzeG · May 25, 2021

Why should that be unrealistic? The input file can easily be produced automatically, e.g. with Python...
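
A minimal sketch of what such a script could look like, generating 1500 start points on a regular grid. The column layout of the output lines (start time, stop time, latitude, longitude, height) is a placeholder assumed for illustration and is not the actual wrfinput_traj_d0x format; the function names, dates, and coordinates are likewise hypothetical and would need to be adapted.

```python
from itertools import product


def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop (inclusive)."""
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def grid_start_points(lats, lons, levels):
    """Every (lat, lon, level) combination, latitude varying slowest."""
    return list(product(lats, lons, levels))


def format_lines(points, start_time, stop_time):
    """One text line per trajectory; the column order is an assumed placeholder."""
    return [
        f"{start_time} {stop_time} {lat:.4f} {lon:.4f} {lev:.1f}"
        for lat, lon, lev in points
    ]


# Hypothetical domain: 10 latitudes x 15 longitudes x 10 heights = 1500 points.
points = grid_start_points(
    linspace(30.0, 40.0, 10),
    linspace(-110.0, -100.0, 15),
    linspace(1000.0, 10000.0, 10),
)
lines = format_lines(points, "2021-02-04_00:00:00", "2021-02-05_00:00:00")
print(len(lines))
print(lines[0])
```

Writing the lines to a file is then a single loop; the part that needs care is matching the layout the trajectory code actually reads.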
