Sponsored Post
The OpenMP* API for shared-memory parallelism has been around for two decades. Back in 1997, the first version of the OpenMP (Open Multi-Processing) specification set out to provide an easy way to bring shared-memory parallelism to Fortran programs. Until then, programmers had to use a threading model like pthreads explicitly, or a distributed-memory framework like MPI. Both required restructuring an application to fit the model and then adding runtime library calls to implement parallelism. Both approaches were difficult and time consuming to program, and they meant maintaining a sequential version alongside various platform-dependent parallelized versions.
OpenMP took a different approach. It relied on pragmas: source code directives the programmer adds to tell the compiler which loops can be parallelized and how. The OpenMP designers realized that the programmer usually knows far more about the program than an always-cautious compiler can discover through static analysis of the source code. By keeping the directives simple and flexible, they let programmers think in parallel terms and rely on the compiler to sweat the implementation details. With OpenMP, the same program runs in parallel on any compiler that implements the OpenMP pragmas; on a compiler that does not, the pragmas are simply ignored and the program runs sequentially, eliminating the need to maintain separate sequential and parallel versions of the same code.
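For example, here is a minimal sketch of the directive-based style (the array size and values are illustrative, not taken from the article): a single pragma asks the compiler to divide the loop iterations among threads, and a compiler without OpenMP support simply ignores it and produces a sequential program.

```c
#include <stdio.h>

#define N 1000000   /* illustrative problem size */

int main(void)
{
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* The programmer asserts that the iterations are independent; the
     * compiler and runtime handle thread creation, work splitting, and
     * synchronization. Compiled without OpenMP support, the pragma is
     * ignored and the loop runs sequentially. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}
```

Built with an OpenMP-enabled compiler flag (for example, `gcc -fopenmp` or `icc -qopenmp`), the loop runs in parallel; built without it, the same source compiles and runs sequentially.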
This approach turned out to be very fortunate, even though the designers, an ever-growing consortium of vendors and researchers, could not have known it in 1997. Back then, multiprocessor systems were limited to two or four CPUs sharing memory. But by 2008, multicore processors were starting to dominate the field. Today, multicore and many-core processors with SIMD vector units have turned OpenMP into a must-have tool for achieving high performance on commodity off-the-shelf (COTS) hardware as well as on supercomputers.
The most recent versions of the OpenMP API include pragmas for tasking, SIMD programming, and offloading to accelerators and GPUs.
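To make those three concrete, here is a hedged sketch (the function names, data layout, and choice of device are assumptions for illustration, not part of the original article) of tasking, SIMD vectorization, and offload pragmas:

```c
#include <stddef.h>

/* SIMD: ask the compiler to vectorize the loop body. */
void scale(double *restrict x, const double *restrict y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        x[i] = 2.0 * y[i];
}

/* Tasking: each block becomes an independent unit of work that any
 * free thread can pick up. */
void scale_blocks(double **blocks, size_t nblocks, size_t blocklen)
{
    #pragma omp parallel
    #pragma omp single
    for (size_t b = 0; b < nblocks; b++) {
        #pragma omp task firstprivate(b)
        for (size_t i = 0; i < blocklen; i++)
            blocks[b][i] *= 2.0;
    }
}

/* Offload: map the data to an accelerator or GPU and run the loop there. */
void scale_on_device(double *x, const double *y, size_t n)
{
    #pragma omp target teams distribute parallel for map(to: y[0:n]) map(from: x[0:n])
    for (size_t i = 0; i < n; i++)
        x[i] = 2.0 * y[i];
}
```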
OpenMP is also a prime example of how hardware and software vendors, researchers, and academics, working together voluntarily, can design a standard that benefits the entire developer community. Today, most software vendors track OpenMP advances closely and implement the latest API features in their compilers and tools. With OpenMP, application portability is assured across the latest multicore systems, including Intel® Xeon Phi™ processors.
Still, industrial-strength HPC applications that are not fully parallelized and vectorized may actually run slower on the latest supercomputers. To reach the performance potential of these platforms, it is imperative to find those inefficiencies and fix them.
OpenMP enables a variety of parallelization strategies. But is the strategy you are using giving the best performance, or would a different one do better? To find out, you need tools that can relate performance to the OpenMP constructs in the code, show where parallel and sequential time is spent, and display ideal versus measured CPU utilization in parallel regions. With this kind of information, the programmer can discover where tuning the code will yield the biggest gain.
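As a simple illustration of why the choice of strategy matters (the loop bounds below are assumptions chosen to exaggerate the effect), consider a nested loop whose outer trip count is smaller than the number of cores: parallelizing only the outer loop starves most threads, while collapsing the nest exposes enough iterations to keep a many-core processor busy. Profiling data of the kind described above is what tells you which variant actually wins on your hardware.

```c
#define ROWS 8          /* fewer outer iterations than cores on a many-core CPU */
#define COLS 1000000

/* Strategy 1: parallelize the outer loop only. At most ROWS threads get work. */
void strategy_outer(double grid[ROWS][COLS])
{
    #pragma omp parallel for
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] *= 0.5;
}

/* Strategy 2: collapse the nest so the runtime can distribute
 * ROWS * COLS iterations across all available threads. */
void strategy_collapsed(double grid[ROWS][COLS])
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] *= 0.5;
}
```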
Intel® VTune™ Amplifier XE and Intel® Advisor, both part of Intel® Parallel Studio XE, provide the tools you need to thoroughly tune industrial-strength OpenMP codes. Starting from general CPU utilization, you can discover where the code is purely serial. The potential-gain metric directs the tuning focus to the parallel regions with the most to gain. Intel VTune Amplifier displays measured OpenMP performance metrics, such as overhead and waiting time, to help programmers find the root cause of performance problems such as load imbalance, non-optimal granularity, and memory latency. Intel Advisor offers hints for improving loop vectorization to ensure the full performance potential of the underlying hardware is being used.
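Load-imbalance and granularity problems of the kind these metrics expose often come down to the loop schedule. The sketch below is an illustration under stated assumptions (the work() kernel is hypothetical, and the chunk size of 64 is only a starting point to tune), not a prescription produced by the tools themselves.

```c
#include <stddef.h>

double work(size_t i);   /* hypothetical kernel whose cost varies widely with i */

/* Static schedule: iterations are split into equal chunks up front, so threads
 * that draw cheap iterations sit idle (waiting time) while others finish. */
double sum_static(size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (size_t i = 0; i < n; i++)
        sum += work(i);
    return sum;
}

/* Dynamic schedule: free threads grab the next chunk of 64 iterations.
 * The chunk size sets the granularity: too small adds scheduling overhead,
 * too large reintroduces imbalance. */
double sum_dynamic(size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(dynamic, 64)
    for (size_t i = 0; i < n; i++)
        sum += work(i);
    return sum;
}
```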
Together, the Intel C, C++, and Fortran compilers, Intel VTune Amplifier, and the other analysis tools that make up Intel Parallel Studio XE 2017 give you the best working environment for developing OpenMP programs that exploit the full potential of today's processors.
Download your free 30-day trial of Intel® Parallel Studio XE 2017.