Description
Summary of Problem
This report follows a discussion that we started on Gitter. It concerns a potential performance issue, possibly related to caching. To recap, I have two independent programs, prog1 and prog2, and I want to measure their execution times one after the other, independently, within the same main. To do so, I have the following code structure:
use Time;  // provides the stopwatch type

proc main()
{
  {
    var t1: stopwatch;
    t1.start();
    prog1();
    t1.stop();
    writeln("t1 = ", t1.elapsed());
  }
  {
    var t2: stopwatch;
    t2.start();
    prog2();
    t2.stop();
    writeln("t2 = ", t2.elapsed());
  }
  return 0;
}
When I execute both blocks (as shown above), the execution time of the first block is 2 to 3 times longer than when it is executed alone (with the second block commented out). Note that I am using (and would like to keep) Chapel 2.1.0 for this code, and that both programs involve a single CPU task that performs computation on a GPU device.
The full code is attached to this report (as .txt because .chpl is not accepted). I did not manage to write a simpler reproducer, but the code should be relatively easy to understand. On a system equipped with an AMD EPYC 7513 (Zen 3, x86_64) and an NVIDIA A100-SXM4-40GB (40 GiB), this gives me (in seconds):
t1 = 14.8412
t2 = 14.0312
when both blocks are executed, and t1 = 4.79889 when the second one is commented out.
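One way to narrow this down might be to check whether the slowdown follows prog1 or follows the position of the block. The sketch below is only an illustration under assumptions: `swapOrder` is a hypothetical config constant that is not in the attached code, and the empty `prog1`/`prog2` bodies are stand-ins for the real solvers.

```chapel
use Time;

// Hypothetical stand-ins for the two solvers in the attached file.
proc prog1() { }
proc prog2() { }

// Hypothetical switch (not in the attached code): run prog2 before prog1.
config const swapOrder = false;

proc main() {
  var t1, t2: stopwatch;

  if swapOrder {
    // prog2 first, then prog1
    t2.start(); prog2(); t2.stop();
    writeln("t2 = ", t2.elapsed());
    t1.start(); prog1(); t1.stop();
    writeln("t1 = ", t1.elapsed());
  } else {
    // original order: prog1 first, then prog2
    t1.start(); prog1(); t1.stop();
    writeln("t1 = ", t1.elapsed());
    t2.start(); prog2(); t2.stop();
    writeln("t2 = ", t2.elapsed());
  }
}
```

Running once as `./nqueensGpu.out` and once as `./nqueensGpu.out --swapOrder=true` would show whether t1 stays large regardless of the order, which would point at something other than run-time interference from the second block.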
The programs are two versions of a GPU-accelerated N-Queens solver, in which tree nodes are managed in a pool data structure and many data exchanges occur between the CPU and the GPU. In prog1, the arrays are (de)allocated at each iteration, while in prog2 I use class wrappers to create "permanent" arrays in GPU memory (inspired by https://github.com/chapel-lang/chapel/blob/main/test/gpu/native/basics/outOfOnArr.chpl). @bradcray suggested a third version using Chapel's on gpuLocale var …; to allocate memory independently of iterations/scopes, but this produces a segfault with Chapel 2.1.0.
Is this issue currently blocking your progress?
No
Steps to Reproduce
Source Code:
The code is given in the attached file.
Compile command:
chpl nqueensGpu.chpl -o nqueensGpu.out --fast
--fast optimization flag enabled?
'yes'
Execution command:
./nqueensGpu.out
Configuration Information
- Output of chpl --version: 2.1.0
- Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia *
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
- Back-end compiler and version, e.g. gcc --version or clang --version: gcc (Spack GCC) 12.2.0
- (For Cray systems only) Output of module list: