Vectorization for different compilers

Compilers can vectorize code automatically when the right flags are set. Here we show two examples of code compiled with the Cray, Arm, GNU and Fujitsu compilers.

Simple math functions

The investigated functions are: simple (Y = 2X + 3X^2), reciprocal, square root, exponential, sine and power. They are compiled with four different compilers (Cray, Arm, GNU and Fujitsu), with the compiler-specific vectorization flags turned on.

#include <cmath>    // std::sqrt, std::exp, std::sin, std::pow
#include <cstddef>  // size_t

void Xsimple(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = 2.0*x[i] + 3.0*x[i]*x[i];
}

void Xrecip(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = 1.0/x[i];
}

void Xsqrt(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::sqrt(x[i]);
}

void Xexp(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::exp(x[i]);
}

void Xsin(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::sin(x[i]);
}

void Xpow(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::pow(x[i],0.55);
}
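
The benchmark driver behind the runtimes discussed further down is not shown on this page. A minimal sketch of how such a kernel could be timed is given here; the helper name best_seconds, the input values and the best-of-N strategy are assumptions, not the harness actually used for the measurements.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Same kernel as above, repeated so that the sketch is self-contained.
void Xsimple(std::size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (std::size_t i = 0; i < n; i++) y[i] = 2.0*x[i] + 3.0*x[i]*x[i];
}

// Hypothetical helper: calls kernel(n, x, y) `repeats` times and returns the
// shortest wall-clock time of a single call, in seconds.
template <class Kernel>
double best_seconds(Kernel kernel, std::size_t n, int repeats) {
  std::vector<double> x(n), y(n);
  for (std::size_t i = 0; i < n; i++)
    x[i] = 1.0 + static_cast<double>(i) / static_cast<double>(n);  // inputs in [1, 2)

  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < repeats; r++) {
    const auto t0 = std::chrono::steady_clock::now();
    kernel(n, x.data(), y.data());
    const auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }

  // Read one result through a volatile so the computation is not optimized away entirely.
  if (n > 0) { volatile double sink = y[n / 2]; (void)sink; }
  return best;
}
```

A call such as best_seconds(Xsimple, 1 << 20, 10) would time the simple kernel on about one million elements. Taking the minimum over several runs is one common way to reduce noise from the first, cold run.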


Below you can find the compiler versions and flags (for vectorization and vectorization reports) used for this example.

Cray
module load CPE
Cray C++ : Version 10.0.2

with the compiler flags

-O3 -h aggress,flex_mp=tolerant,msgs,negmsgs,vector3,omp

Flag description:

O3
Optimization level 3

aggress
Provides greater opportunity to optimize loops that would otherwise be inhibited from optimization due to an internal compiler size limitation.

flex_mp=tolerant
Controls the aggressiveness of optimizations which may affect floating-point and complex repeatability when the application requires identical results when varying the number of ranks or threads. Tolerant uses the most aggressive optimization and yields the highest performance, but results may not be sufficiently repeatable for some applications.

msgs
Causes the compiler to write optimization messages to stderr.

negmsgs
Causes the compiler to generate messages to stderr that indicate why optimizations such as vectorization or inlining did not occur in a given instance.

vector3
Specifies the level of automatic vectorization to be performed. Vectorization results in dramatic performance improvements with a small increase in object code size. Vectorization directives are unaffected by this option. A value of 3 specifies aggressive vectorization.

omp
OMP support
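
The repeatability concern behind flex_mp can be illustrated with a small sketch. Floating-point addition is not associative, so an optimization that reorders a sum (for example when vectorizing it or splitting it across threads) can change the result. The function name and the input values are purely illustrative and are not related to the Cray implementation.

```cpp
#include <vector>

// Adds the elements strictly from left to right.
double sum_in_order(const std::vector<double>& v) {
  double s = 0.0;
  for (double t : v) s += t;
  return s;
}
```

With IEEE double arithmetic and no reassociating optimization, sum_in_order({1e17, 1.0, -1e17}) yields 0.0 because the 1.0 is lost when it is added to 1e17, whereas sum_in_order({1e17, -1e17, 1.0}) yields 1.0. The same three numbers therefore give two different sums depending only on the order of the additions.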

Arm

module load arm-modules/21
Arm C/C++/Fortran Compiler version 21.0

with the compiler flags

-Ofast -ffp-contract=fast -Wall -Rpass=loop-vectorize -march=armv8.2-a+sve -mcpu=a64fx -armpl -fopenmp

Flag description:

Ofast
Enables all the optimizations from level 3, including those performed with the -ffp-mode=fast armclang option. This level also performs other aggressive optimizations that might violate strict compliance with language standards. -Ofast implies -ffast-math.

ffp-contract=fast
If you set -ffp-contract=fast, fused floating-point contractions are always used and the compiler ignores the 'STDC FP_CONTRACT' pragma setting.

Wall
Enable all warnings.

Rpass=loop-vectorize
Enable vectorization report

march=armv8.2-a+sve
Specifies architecture and extensions.

mcpu=a64fx
Select CPU architecture.

armpl
Use the 'Generic' SVE library from Arm Performance Libraries.

fopenmp
Enable OpenMP
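
To illustrate what -ffp-contract=fast permits, the following sketch writes the contraction of the simple kernel out by hand: the last multiply and the add are fused into a single std::fma call, which rounds only once. The function names are hypothetical, and whether the compiler actually emits a fused instruction for the unfused version depends on the flags and the target.

```cpp
#include <cmath>

// Written as in the Xsimple kernel: separate multiplies and an add.
double simple_separate(double x) { return 2.0*x + 3.0*x*x; }

// The same expression with the last multiply and the add fused by hand:
// fma(a, b, c) computes a*b + c with a single rounding at the end.
double simple_fused(double x) { return std::fma(3.0*x, x, 2.0*x); }
```

For inputs where every intermediate value is exactly representable, such as x = 2.0, both versions return the same value (16.0). In general the fused form may differ from the unfused one in the last bit, which is why contraction is controlled by a flag.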

GNU

module load gcc/10.3.0
gcc (GCC) 10.3.0

with the compiler flags

-Ofast -Wall -mtune=a64fx -mcpu=a64fx -march=armv8.2-a+sve -fopt-info-vec -fopenmp

Flag description:

Ofast, Wall, mcpu=a64fx, march=armv8.2-a+sve
see descriptions above

mtune=a64fx
Tunes the generated code for the given CPU type.

fopt-info-vec
Output vectorization report.

Fujitsu

module load fujitsu/compiler/4.5
FCC (FCC) 4.5.0 20210304

with the compiler flags

-Kfast -KSVE -Koptmsg=2

Flag description:

Kfast
Optimization

KSVE
Vectorization

Koptmsg=2
Output vectorization report.

When compiling, the compiler output suggests that Fujitsu, Cray and Arm vectorize all functions, whereas GNU cannot vectorize exp, sin and pow.

Function                  Fujitsu  Cray  Arm  GNU
Simple (Y = 2X + 3X^2)    Yes      Yes   Yes  Yes
Reciprocal                Yes      Yes   Yes  Yes
Square root               Yes      Yes   Yes  Yes
Exponential               Yes      Yes   Yes  No
Sin                       Yes      Yes   Yes  No
Power                     Yes      Yes   Yes  No

However, the runtimes of the functions give a more complex picture (see Figure 1). The Fujitsu and Cray compilers vectorize everything as claimed. The Arm compiler also vectorizes all functions as claimed, but some of them not in the most efficient way: the assembly code shows that it uses the DIV and SQRT instructions rather than the more efficient Newton algorithm. The GNU compiler vectorizes only the simple function; contrary to what its compiler output suggests, recip and sqrt are not vectorized.
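
The Newton algorithm mentioned above replaces a division or square root by a rough estimate that is refined with a few multiply-add steps. The sketch below shows the idea in scalar C++. It is not the code any of these compilers generate: the helper names are hypothetical, and the single-precision seed is only a stand-in for a hardware estimate instruction, so it assumes that x is positive and within single-precision range.

```cpp
#include <cmath>

// One Newton step for the root of f(y) = 1/y - x, so y converges to 1/x.
// It uses only multiplies and a subtraction, no division.
double recip_step(double x, double y) { return y * (2.0 - x * y); }

// One Newton step for the root of f(y) = 1/y^2 - x, so y converges to 1/sqrt(x).
double rsqrt_step(double x, double y) { return y * (1.5 - 0.5 * x * y * y); }

// Hypothetical helper: reciprocal from a low-precision seed plus two Newton steps.
double recip_newton(double x) {
  double y = static_cast<double>(1.0f / static_cast<float>(x));  // stand-in seed
  y = recip_step(x, y);
  y = recip_step(x, y);
  return y;
}

// Hypothetical helper: square root computed as x * (1/sqrt(x)).
double sqrt_newton(double x) {
  double y = static_cast<double>(1.0f / std::sqrt(static_cast<float>(x)));  // stand-in seed
  y = rsqrt_step(x, y);
  y = rsqrt_step(x, y);
  return x * y;
}
```

Each step roughly doubles the number of correct digits, so a seed with single-precision accuracy reaches close to double precision after two steps. The steps consist of multiplies and adds only, which is what makes this approach attractive compared with a long-latency divide or square-root instruction.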

Figure 1: Runtimes of the simple math functions for different compilers.

The Fujitsu compiler gives the best results. To highlight the differences in runtime, Figure 2 shows the runtimes normalized to the Fujitsu compiler.

Figure 2: Runtimes of the simple math functions for different compilers scaled to the Fujitsu compiler.
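
The scaling used in Figure 2 amounts to dividing every runtime by the runtime of the reference compiler for the same function. A minimal sketch of that arithmetic, with a hypothetical function name and no measured data, could be:

```cpp
#include <cstddef>
#include <vector>

// Divides each runtime by the runtime at index `baseline`.
// The baseline entry itself becomes 1.0; values above 1.0 are slower than the baseline.
std::vector<double> normalize_to(const std::vector<double>& runtimes, std::size_t baseline) {
  const double ref = runtimes.at(baseline);  // throws std::out_of_range on a bad index
  std::vector<double> out;
  out.reserve(runtimes.size());
  for (double t : runtimes) out.push_back(t / ref);
  return out;
}
```

Here runtimes would hold the timings of one function for all compilers, and baseline would be the position of the Fujitsu entry.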