Sponsored Post
In an earlier post we talked about how Intel® VTune Amplifier provides an intuitive way to tune for CPU and GPU performance, multicore scalability, bandwidth efficiency, and so on. Intel Parallel Studio XE 2017 includes yet another equally powerful performance tool in its arsenal, Intel Advisor 2017.
Intel Advisor can help identify portions of code that could be good candidates for parallelization (both vectorization and threading). It can also help determine when it might not be appropriate to parallelize a section of code, depending on the platform, processor, and configuration it’s running on.
A valuable feature of Intel Advisor is its Roofline Analysis, which provides an intuitive and powerful representation of how to best address performance issues.
Roofline analysis asks the following:
- Is the application running optimally on the current hardware? If not, what is the most underutilized hardware resource?
- What limits performance? Is the application memory or compute bound?
- What is the right strategy to improve application performance?
The roofline model (1) reveals the gap between an application’s achieved performance and the compute and memory bandwidth ceilings of the computing platform it runs on. Roofline analysis measures two important performance parameters: arithmetic intensity, the number of floating-point operations per byte transferred between CPU and memory, and floating-point performance, measured in billions of floating-point operations per second (GFLOP/s).
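To make the two parameters concrete, here is a minimal Python sketch that computes them for a single loop. The function names, the DAXPY-style kernel, the iteration count, and the elapsed time are all illustrative assumptions, not Intel Advisor output or API. The byte count also assumes every operand travels to or from memory exactly once, which ignores cache reuse.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte transferred between CPU and memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def gflops_per_second(flops, seconds):
    """Floating-point performance in billions of FLOPs per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flops / seconds / 1e9


# Hypothetical DAXPY-style loop: y[i] = a * x[i] + y[i] on 8-byte doubles.
n = 100_000_000            # loop iterations (assumed)
flops = 2 * n              # one multiply and one add per iteration
bytes_moved = 3 * 8 * n    # read x[i], read y[i], write y[i]
elapsed_seconds = 0.25     # assumed wall-clock time, not a measurement

ai = arithmetic_intensity(flops, bytes_moved)
perf = gflops_per_second(flops, elapsed_seconds)
print(f"arithmetic intensity: {ai:.4f} FLOP/byte")
print(f"performance: {perf:.2f} GFLOP/s")
```

With these assumed numbers the loop performs 2 FLOPs for every 24 bytes moved, so its arithmetic intensity is 1/12 FLOP/byte. That low value is typical of streaming kernels.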
Intel Advisor plots these actual and expected parameters as a 2D graph with Arithmetic Intensity (FLOP/byte) along the X-axis and Floating-Point Performance (GFLOP/s) along the Y-axis. The roofline itself slants up to the right while performance is bound by memory access, then flattens at the point (often called the ridge point) where performance becomes compute bound and the line runs parallel to the X-axis. Where a loop in an application appears in the plot indicates its performance profile.
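The shape of that line follows from the basic roofline formula in the Williams, Waterman, and Patterson paper: attainable performance is the smaller of the compute peak and arithmetic intensity multiplied by memory bandwidth. A minimal Python sketch, where the function names and the platform numbers (1000 GFLOP/s peak, 100 GB/s bandwidth) are assumptions chosen for round arithmetic rather than figures for any real processor:

```python
def roofline_bound(ai, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s at arithmetic intensity `ai` (FLOP/byte)."""
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the slanted line meets the flat one."""
    return peak_gflops / bandwidth_gbs


PEAK_GFLOPS = 1000.0    # assumed compute ceiling
BANDWIDTH_GBS = 100.0   # assumed memory bandwidth ceiling

print("ridge point:", ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS), "FLOP/byte")
for ai in (0.5, 5.0, 10.0, 20.0):
    bound = roofline_bound(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"AI = {ai:5.1f} FLOP/byte -> at most {bound:7.1f} GFLOP/s")
```

Left of the ridge point the bound grows in proportion to arithmetic intensity (the slanted part). Right of it the bound stays at the compute peak (the flat part).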
The Intel Advisor roofline implementation provides even more insights than a standard roofline analysis by plotting more rooflines:
- Cache rooflines illustrate performance when all the data fits into the respective cache.
- Vector usage rooflines show the maximum achievable performance levels when vectorization is used effectively.
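One way to picture the extra rooflines is as a set of bandwidth ceilings (one per memory level) combined with a set of compute ceilings (scalar versus vectorized). The sketch below is an illustration only: the level names, the ceiling labels, and every number are assumptions, and Intel Advisor measures the actual values on the machine being profiled.

```python
# Assumed bandwidth ceilings in GB/s, one per memory level.
BANDWIDTH_GBS = {"L1": 2000.0, "L2": 800.0, "L3": 400.0, "DRAM": 100.0}

# Assumed compute ceilings in GFLOP/s.
COMPUTE_GFLOPS = {"scalar add": 50.0, "vector add": 400.0, "vector FMA": 800.0}


def ceilings(ai):
    """Roofline bound for every (memory level, compute ceiling) pair."""
    return {
        (level, kind): min(peak, ai * bandwidth)
        for level, bandwidth in BANDWIDTH_GBS.items()
        for kind, peak in COMPUTE_GFLOPS.items()
    }


bounds = ceilings(0.5)  # a loop with 0.5 FLOP/byte
for (level, kind), bound in sorted(bounds.items()):
    print(f"{level:>4} roof, {kind:<10} ceiling: {bound:6.1f} GFLOP/s")
```

For the same loop, the bound is far higher when its data fits in a cache than when it streams from DRAM, and vectorization only raises the bound where the bandwidth roof is not already the limit.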
Applications that perform close to the floating-point peak may be bound by the compute capabilities of the current platform. In that case, consider migrating to a highly parallel platform, such as the Intel Xeon Phi™, where the compute ceiling and memory throughput are higher. If, on the other hand, performance is well below the compute ceiling, consider an approach that makes better use of the vectorization capabilities of the processor.
Roofline analysis also exposes applications that are memory bound. To improve performance in this case, consider improving the algorithm or its implementation to perform more computations per data item, or migrating to a processor with higher memory bandwidth.
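The decision process described above can be sketched as a small Python helper. The function name, the 0.8 "close to the roof" threshold, and the platform numbers in the usage lines are assumptions made for illustration. This is a rough heuristic, not how Intel Advisor generates its guidance.

```python
def diagnose(ai, measured_gflops, peak_gflops, bandwidth_gbs, near=0.8):
    """Return (limiting resource, fraction of roof attained, advice)."""
    if ai <= 0 or peak_gflops <= 0 or bandwidth_gbs <= 0:
        raise ValueError("ai, peak_gflops and bandwidth_gbs must be positive")
    memory_roof = ai * bandwidth_gbs
    roof = min(peak_gflops, memory_roof)
    limit = "memory" if memory_roof < peak_gflops else "compute"
    fraction = measured_gflops / roof
    if fraction < near:
        advice = "well below the roof: improve vectorization and threading"
    elif limit == "compute":
        advice = "near the compute peak: consider a higher compute ceiling"
    else:
        advice = ("near the memory roof: raise arithmetic intensity "
                  "or move to higher memory bandwidth")
    return limit, fraction, advice


# Hypothetical platform: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
for ai, measured in ((0.5, 45.0), (20.0, 200.0), (20.0, 950.0)):
    limit, fraction, advice = diagnose(ai, measured, 1000.0, 100.0)
    print(f"AI={ai}: {limit} bound, {fraction:.0%} of roof -> {advice}")
```

The threshold is deliberately a parameter: how close counts as "close to the peak" is a judgment call that depends on the application.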
Intel Advisor also includes a vectorization analysis tool that identifies loops that will benefit most from vectorization, loops that are blocked from vectorization, and loops that could benefit by reorganizing data structures. Its deep analysis of already vectorized loops and SIMD instructions can answer critical questions such as: were the hottest loops vectorized, and if so, how efficiently was that done, and what can be improved? If loops were not vectorized, why? Vectorization Advisor displays vectorization-related diagnostics together in one place: CPU performance data, compiler diagnostics, SIMD instruction set used, source code, etc.
You can discover many important performance and design insights by combining Vectorization Advisor and Roofline analyses. Many applications fail to exploit the vectorization capabilities of the processor they run on. Vectorization Advisor can detect inefficient usage of SIMD instructions even in loops vectorized by the compiler. Some typical examples:
- Measured efficiency is significantly lower than the ideal value for the running processor.
- The instruction set in use is lower than supported by the running hardware (for example, SSE2 on a processor supporting AVX).
- Vectorization traits detection (for example, use of gather and scatter instructions when there are better alternatives).
- Non-uniform and unaligned data accesses (Memory Access Patterns analysis).
- Partial loop vectorization, when scalar peel or remainder takes noticeable CPU time.
Other bottlenecks discovered by Vectorization Advisor are displayed in a Recommendations panel.
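Several of the examples in the list above come down to simple arithmetic. The sketch below uses one common definition of vectorization efficiency (measured speedup divided by the number of vector lanes), which is an assumption here: Intel Advisor’s own efficiency metric may be computed differently. The function names and timings are hypothetical. The register widths are the architectural ones: 128 bits for SSE2 and 256 bits for AVX.

```python
def vector_lanes(register_bits, element_bits):
    """Number of elements one SIMD register holds."""
    return register_bits // element_bits


def vector_efficiency(scalar_seconds, vector_seconds, lanes):
    """Measured speedup as a fraction of the ideal speedup (= lanes)."""
    return (scalar_seconds / vector_seconds) / lanes


def remainder_iterations(trip_count, lanes):
    """Iterations left for a scalar remainder loop (ignores any peel loop)."""
    return trip_count % lanes


sse2_lanes = vector_lanes(128, 64)   # doubles per SSE2 register
avx_lanes = vector_lanes(256, 64)    # doubles per AVX register

# Assumed timings for one loop: 8.0 s scalar, 3.2 s vectorized with AVX.
efficiency = vector_efficiency(8.0, 3.2, avx_lanes)
leftover = remainder_iterations(1003, avx_lanes)

print(f"SSE2 lanes: {sse2_lanes}, AVX lanes: {avx_lanes}")
print(f"AVX efficiency: {efficiency:.1%}")
print(f"scalar remainder iterations for 1003 trips: {leftover}")
```

A loop compiled for SSE2 on an AVX-capable processor has half the lanes available for doubles, so its ideal speedup is halved before any other inefficiency is counted.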
Working together, Intel Advisor vector parallelism optimization analysis and memory-versus-compute roofline analysis offer a powerful tool for visualizing an application’s complete current and potential performance profile on a given platform.
Intel Advisor is an integral part of both the Cluster Edition and the Professional Edition of Intel Parallel Studio XE 2017.
Download your free 30-day trial of Intel® Parallel Studio XE
(1) Roofline modeling was first proposed by University of California at Berkeley researchers Samuel Williams, Andrew Waterman, and David Patterson in the paper “Roofline: An Insightful Visual Performance Model for Multicore Architectures” in 2009. Reference: http://sips.inesc-id.pt/~ilic/roofline.php and A. Ilic, F. Pratas, and L. Sousa, “Cache-aware Roofline model: Upgrading the loft,” IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, Jan. 2014.
This is an interesting infomercial for an Intel product, but it omits a critical point: in fact, the key factor that makes Intel Advisor possible. The roofline model on which Advisor is based was developed by Sam Williams, a computer scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. He developed the model while pursuing his Ph.D. in computer science at UC Berkeley. Williams’ initial model, which used bound and bottleneck analysis, was applied to the Cell processor, and in 2006 the performance results were published in a well-received Computing Frontiers paper. He then began working to visualize the performance bottlenecks. After several iterations, the resulting model (dubbed “Roofline” by David Patterson, Williams’ thesis advisor at the University of California, Berkeley) was published in Communications of the ACM in 2009. Since then, other Berkeley Lab researchers have added to the development of the model. They include Lenny Oliker, Terry Ligocki, Matt Cordery, Doug Doerfler, and Jack Deslippe. Appropriately, Intel has acknowledged Berkeley Lab’s central role in use cases prepared by the company.
http://insidehpc.com/2017/05/boosting-manycore-code-optimization-efforts-roofline-technology/