vectorization

This is an old revision of the document!

# Vectorization

Starting in the late nineties, Intel integrated Single-Instruction-Multiple-Data (SIMD) machine instructions into their commodity CPU product line. These SIMD instructions allow to apply the same mathematical operation ( multiply, add … ) to 2, 4 or even more data values at a time. The set of data values is also called “vector” (not to be confused with an algebraic vector). The theoretical performance gain is equal to the amount of vector units a CPU can hold.

Although the SIMD instruction sets have been around in every-day CPUs for quite some time, theses extended functionalities could only be accessed by programmers willing to interleave their high-level C or C++ source code with sections of quasi-assembler instructions. Modern compilers, like the GNU Compiler Collection (GCC), now include the ability to transform regular C or C++ source code into vector operations. This process is called auto-vectorization.

Note: This page refers to the GCC. For other compilers see their respective documentation. Intel provides a good guide, which is in its major parts also applicable to other compilers.

To estimate the performance gains you can achieve with auto-vectorized code, you first need to understand the capabilities of the CPU you are compiling for and how many vectors it can process. Therefore, run the following command (on a specific host):

\$ cat /proc/cpuinfo
...
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
...

Among other CPU properties, you will find information about the SIMD capabilities. If a certain string is listed here, your CPU supports this feature:

Name Register Size Amount of floats Amount of doubles
mmx (*) 64 bit 0 0
sse2 128 bit 4 2
ssse3 128 bit 4 2
sse4_1 128 bit 4 2
sse4_2 128 bit 4 2
avx 256 bit 8 4

* the first vector units on Intel processors could hold only two itegers.

All Intel and AMD 64-bit CPUs at least support the sse2 instruction set. The AVX instruction set has been introduced in 2011 with the Sandy Bridge architecture and is available on the latest generation of Mac Book Pro laptops.

The AVX instruction set can apply mathematical operations on four double precision floating point values simultaneously. Thus the maximum performance gain your application can achieve is times 4 compared to the version which is not using the vector units.

The steady evolution of the size of the vector units is a symptom of the new trend of the microarchitecture evolution. Ten years ago, the clock frequency was almost doubling on a yearly basis. Somehow a clock frequency limit has been reached and manufacturers opted for bigger and bigger vector units and an increasing number of cores per die.

In order to distribute calculations to the vector units of the CPU, the compiler has to have a solid understanding of the dependencies and side effects of your source code. But foremost, the compiler must be able to detect sections of your source-code where the Single-Instruction-Multiple-Data paradigm can be applied. The simplest case is a loop which performs a calculation on an array:

for ( unsigned int i = 0; i < ArraySize; ++ i)
{
c[i] = a[i] * b[i];
}

Try this example yourself using one of our current gcc modules. The command line argument you have to use is -ftree-vectorize.

Note: Loops which profit the most from auto-vectorization are the ones containing mathematical operations. Loops which are merely iterating over a collection of objects won't profit and are not even auto-vectorizable in most cases.

A list of loop constructs which can be auto-vectorized can be found here.

Once you compile your code with the option -fopt-info-vec-missed[=logfile] GCC will give you a detailed report of all loops in your program and the reason why vectorization failed or why it has been sub-optimal. In contrast,-fopt-info-loop-optimized=[logfile] sums the number of vectorized loops and their positions. Both options are mutually exclusive.

To turn on auto-vectorization use

-ftree-vectorize

This option is implicitly enabled if you compile with the optimization level -O3 or more.

Some loop constructs ( e.g. reduction of float values ) can only be vectorized if the compiler is allowed to change the order of mathematical operations. To allow this, use the flag

-ffast-math

This option is also enabled if you use the optimization level -Ofast. Be aware, that the fast-math option slightly modifies the error behavior in floating point operations. For details, please see http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Optimize-Options.html

Furthermore, you can explicitly state, which SIMD instruction set the compiler should use when generating the binary file. When compiling for the x86-64 architecture, GCC will use the SSE2 instruction set by default. If you want to enable the AVX instruction set, use the compiler flag:

-mavx

Be aware, that the machine you want to run the binary on must support the AVX instruction set. You can also let the compiler decide which instruction set is supported by the machine you are compiling on by using the compiler flag

-march=native

If you want to use C++11 features, like lambda expressions, be sure to enable the new standard.

• Use Structure-of-Arrays

Not only needs the compiler be able to analyze your loop iterations, but also the patterns in which the memory is accessed. Complex classes with nested arrays are therefore hard or impossible to auto-vectorize by the compiler. Use simple c-arrays or std::arrays ( introduced in C++11 ) to hold your data and allow the compiler to access your data in a continuous manner. This will also enable your program to take advantage of the various caches the CPU has.

If you need variable-sized arrays, use std::vector and extract a pointer to the first element to access the elements during a loop:

std::vector myData(23);
double * pMyData = &myData.front();
...
pMyData[i] += 5;
• Limit branching

Limit the branching within the for-loop to a minimum. GCC is able to translate some if statements to vectorized code, but only to a limited extend. When doing so, all possible branches are evaluated and the values of the branch not taken are discarded.

• Isolate functional parts

If you have complex calculations within the for-loop you want GCC to vectorize, consider splitting them into more simpler fragments, called kernels, by using the C++11 feature of lambda expressions. An introduction to this new C++ feature can be found here: http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B

Here is an example of a lambda function defined to compute and return the square of a double variable.

Definition:

auto kernel_square =
[] // capture nothing
( double const& a) // take 1 parameter by reference
->double // lambda function returns the square
{
return ( a * a );
};

And the invocation of the lambda function within a loop:

for ( unsigned int i = 0; i < size; ++ i)
{
ptr_square[i] = kernel_square( ptr_x[i] ) + kernel_square( ptr_y[i] ) + kernel_square( ptr_z[i] ) ;
}

Note that this code will be auto-vectorized by GCC. The call to the lambda function will not result in the overhead associated with regular functions, as the code will be fully inlined.

• No calls to external functions

Calls to external functions, like exp(), log() etc, break the vectorization of a loop. Therefore, you have to decide if the mathematical operations within your loop are complex enough to justify a split of the loop. This means, in the first loop you calculate the parameter for the exp() call and store this result in a temporary array. If done right, GCC will auto-vectorize this loop and you can profit from the speedup. The second loop will simply perform the calls to exp() and store the result.

If the function you want to call is controlled by you, attempt to formulate this function as a C++11 lambda-expression. An introduction to this new C++ feature can be found here: http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B

• Use plain-old integer counters in loops

Use plain-old integers counters to construct the for-loop and not the famed std::iterators, as these make it hard for GCC to analyze memory access and the properties of the the loop, like iteration count.

for (vector::iterator it = y.begin(); it != y.end(); it++)
{
(*it) += 23;
}

becomes

for (unsigned int i = 0; i < y.size(); ++i)
{
y[i] += 23;
}
• vectorization.1631624293.txt.gz