
Profilers on Ookami

Several profiling tools are available on Ookami.


gprof

You can find the documentation here. gprof collects profiling data about a program while it runs. To profile your program, add the -pg option both when compiling and when linking.

cc -g -c myprog.c utils.c -pg
cc -o myprog myprog.o utils.o -pg

Run the program as usual (./myprog). It produces its normal output, but it may run somewhat slower because of the time spent collecting and writing the profile data.

The profile data is written to a file called gmon.out in the directory from which you ran the program. The program must exit normally (by returning from main or calling exit) for this file to be written.

Now run gprof to interpret the data. gprof prints a flat profile and a call graph to standard output. You can redirect this output to a file to save it:

gprof ./myprog > gprof_output.txt


perf

You can find the documentation here. perf is a simple command-line profiler. It does not require any special compiler flags.

perf supports many measurable events. Run perf list to print the full list of events that can be passed to the -e flag. For example, the following command reports the elapsed time and the number of CPU cycles of a program:

perf stat -e duration_time -e cpu-cycles ./myprog

Arm Forge tools

To use these tools, load the appropriate modules:

module load arm-modules/21
module load forge/21.0.1

Ookami has a limited number of licences for the Arm toolchain. When you compile with the Arm compilers or use the Forge tools, you may get an error saying that no licences are available. If this happens, wait until another user releases a licence and try again.


Arm MAP

Documentation can be found here.

Arm MAP is a source-level profiler that shows how much time was spent on each line of code.

Compile your code with -g so that MAP can attribute time to individual source lines. You can then generate a profile in one of two ways. The first is to run the program under MAP from the command line:

map --profile ./a.out

This produces a .map file that can later be loaded into the GUI.

Alternatively, you can generate the profile directly from the GUI. Make sure X11 forwarding is enabled (connect with ssh -X), then start MAP with the command map.

This opens the GUI, from which you can either load an existing profile or profile a program.



Arm Performance Reports

Documentation can be found here.

perf-report samples the program at regular intervals. Arm recommends collecting at least 1000 samples, so choose a test case that runs long enough to produce them. Running your executable with

perf-report ./a.out

produces an .html file that can be opened in a browser and gives a high-level analysis of the program (e.g. how much time is spent in compute, MPI, and I/O). It also produces a .txt file with the same information, which can be viewed in a text editor.

Cray tools

Cray provides several profiling tools in its tool suite. Start by loading the Cray Programming Environment module:

module load CPE



PAPI

The documentation can be found here.

PAPI provides a consistent interface and methodology for using the performance counter hardware found in most major microprocessors. In addition, PAPI provides access to a collection of components that expose performance measurement opportunities across the hardware and software stack.



PerfTools-lite

Documentation can be found here.

PerfTools-lite is a simplified, easy-to-use version of the Cray Performance Measurement and Analysis Tool set. It provides basic performance analysis information automatically.



CrayPat

Documentation can be found here.

CrayPat lets you instrument (or re-instrument) compiled binaries and select which aspects to profile, such as MPI and OpenMP API calls, shared memory, and synchronous and asynchronous I/O.



Cray Reveal

Documentation can be found here.

Cray Reveal uses the Cray CCE program library for source code analysis, combined with performance data collected by CrayPat. Reveal helps identify the most time-consuming loops and provides compiler feedback on dependencies and vectorization.


TAU

Documentation can be found here.

TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java, and Python. Recordings of a TAU webinar on Ookami are available in our recordings section.


Likwid

Documentation can be found here.

Likwid is an easy-to-install and easy-to-use tool suite of command-line applications and a library for performance-oriented programmers. Load it with:

module load likwid/5.1.1


OSACA

Documentation can be found here.

OSACA (Open Source Architecture Code Analyzer) takes an innermost loop kernel in assembly, automatically extracts its instructions, and predicts its runtime. This prediction includes throughput analysis and detection of the critical path and loop-carried dependencies. Load it with:

module load archiconda/3
source activate OSACA


Documentation can be found here.

To use this tool, you will probably have to switch off the PCP (Performance Co-Pilot) hardware counter collection.
You can do this by running the perfalloc command.