Gather and scatter operations are used in many domains. However, to use these types of functions on an SIMD architecture creates some programming challenges. As SIMD systems are optimized to work with memory laid out in a contiguous manner. Whereas a gather operation reads elements from memory and packs them in an SIMD register, the scatter operation unpacks the data and then writes to individual memory locations.
Typical coding for this will result in the non-optimal use of the SIMD instructions on an Intel Xeon Phi coprocessor. Gathers and scatters will result in more work than when memory that is being access is laid out in a contiguous manner. More cache line misses and more pages in memory will have to be accessed.
The Intel architecture, using the Streaming SIME Extensions (SSE) and the Intel Advanced Vector Extensions (AVX), gather and scatter operations would need to be performed with the scalar loads and stores. AVX2 and the Intel Initial Many Core (IMCI) instructions can also be used.
An example of this use is within the molecular dynamics domain. N-body simulations may use scatter and gather techniques to optimize the compute intensive portions of the applications. Using a number of the techniques mentioned below, a performance gain of 2X was observed on the miniMD application using the Intel Xeon processors or the Intel Xeon Phi coprocessors.
A number of optimization techniques can be used for improving gather and scatter operations.
- Improve the temporal and special locality
- Choosing the right data layout, Structure of Arrays or Arrays of Structures.
- Transposition between AoS and SoA.
- Amortize the costs of gatter/scatter.
Source: Intel Corporation, UK and Intel Corporation, USA