This special report, sponsored by QCT, discusses how the company works with leading research and commercial organizations to lower the Total Cost of Ownership by supplying highly tuned applications that are optimized for leading-edge infrastructure.
This technology guide, insideHPC Special Report Optimize Your WRF Applications, shows how to get your results faster by partnering with QCT.
Benchmark Results
The first measurement compared WRF performance across popular compilers. Of the three compilers used to build WRF and its supporting libraries, the Intel® compiler performs best, leading the other two by more than 25%. Figure 2 shows the average execution time per WRF computation timestep: the Intel®-compiled WRF requires roughly 25% less execution time than the other two builds.
Next, the communication libraries were investigated. Figure 3 shows that the InfiniBand-tuned MVAPICH2 (v2.3.4) library decreases WRF execution time by roughly 5% compared with Intel® MPI (v2020 Update 1) and OpenMPI (v4.0.3).
Impact of Latency
Next, InfiniBand was compared with Ethernet. WRF was executed on 1, 4, 8, and 12 nodes over InfiniBand HDR and 10G Ethernet to examine the impact of interconnect latency on performance. Node-to-node latency for InfiniBand HDR starts at 1.01 microseconds for a 1-byte packet, while 10G Ethernet starts at 8.7 microseconds. WRF performs three times better over InfiniBand than over Ethernet on four nodes, and approximately six times better on 12 nodes. Figure 4 shows these results.
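Node-to-node latency figures such as those quoted above are typically measured with a simple ping-pong test. The minimal C/MPI sketch below (an illustration, not QCT's benchmark harness) bounces a 1-byte message between two ranks and reports the average one-way latency:

    /*
     * Minimal MPI ping-pong latency sketch: rank 0 and rank 1
     * exchange a 1-byte message many times; the average one-way
     * latency is half the average round-trip time. Run one rank
     * per node to approximate node-to-node latency.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char byte = 0;
        const int iters = 10000;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)  /* round trip / 2 = one-way latency */
            printf("avg one-way latency: %.2f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with one rank on each of two nodes (for example, mpirun -np 2 ./pingpong), the same source can be rebuilt against each MPI library and run over each fabric to make the latency comparison direct.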
OpenMP allows different cores to share the same segment of memory. WRF performance is best with OMP_NUM_THREADS=4 and drops by more than 10 percent when the thread count exceeds four. The improvement up to four threads is attributable to the four low-latency core groups within each 28-core socket shown in Figure 5: keeping an OMP thread group on one of these low-latency core groups improves WRF performance. WRF also divides its subdomains by the OMP thread count, so a thread count that does not divide 28 evenly produces a subdomain that must use cores on both sockets, which drastically increases core-to-core latency. The performance decrease beyond four threads shows the impact on WRF of the added latency from crossing CPU sockets; process affinity should therefore be arranged carefully across CPU cores to avoid this performance drop. Figure 6 below shows performance as a function of the number of OMP threads.
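Thread placement can be verified before launching WRF with a small OpenMP check. The sketch below (Linux-specific because of sched_getcpu(); it is not part of WRF) prints the core each thread runs on:

    /*
     * Thread-placement check: each OpenMP thread reports the core
     * it is running on, so you can confirm that adjacent threads
     * land on adjacent cores of one socket before launching WRF
     * with the same affinity settings.
     */
    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* sched_getcpu() is Linux-specific */
            printf("thread %d of %d running on core %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   sched_getcpu());
        }
        return 0;
    }

Built with an OpenMP-capable compiler (gcc -fopenmp or icc -qopenmp) and run with OMP_NUM_THREADS=4, OMP_PLACES=cores, and OMP_PROC_BIND=close (the standard OpenMP affinity variables), it should show four threads on adjacent cores of a single socket.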
Next, QCT investigated WRF performance with sub-NUMA clustering (SNC) turned on. SNC splits a single Xeon® 8280 CPU into two groups of cores, decreasing core-to-core latencies within each sub-NUMA domain, as shown in Figure 7. QCT found that turning on SNC increases WRF performance by 1-2 percent when fewer than two OMP threads are used. Performance deteriorates drastically at four threads, however, because four does not divide the 14 cores of a domain evenly, forcing thread groups to run across two sub-NUMA domains. These experiments show the importance of grouping the low-latency cores and avoiding imbalanced OpenMP partitioning of WRF subdomains.
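The divisibility argument can be made concrete with a short sketch. Assuming the layout described above, a 28-core socket split by SNC into two 14-core domains (cores 0-13 and 14-27), with each MPI rank pinned to a contiguous block of cores equal to its OMP thread count, the following counts how many rank blocks straddle the domain boundary:

    /*
     * Illustration of the divisibility argument (assumed layout:
     * one 28-core socket, SNC domains of 14 cores each, MPI ranks
     * pinned to contiguous core blocks). For each thread count that
     * divides 28 evenly, report any rank whose core block crosses
     * the sub-NUMA boundary at core 14.
     */
    #include <stdio.h>

    int main(void)
    {
        const int cores_per_socket = 28;
        const int snc_domain = 14;   /* cores per sub-NUMA domain */

        for (int threads = 1; threads <= 7; threads++) {
            if (cores_per_socket % threads != 0)
                continue;            /* skip uneven layouts */
            int crossings = 0;
            int ranks = cores_per_socket / threads;
            for (int r = 0; r < ranks; r++) {
                int first = r * threads;
                int last  = first + threads - 1;
                /* a block crosses when its first and last cores
                   sit in different 14-core domains */
                if (first / snc_domain != last / snc_domain)
                    crossings++;
            }
            printf("OMP=%d: %d of %d rank blocks cross a sub-NUMA boundary\n",
                   threads, crossings, ranks);
        }
        return 0;
    }

Only the four-thread layout produces a block (cores 12-15) that crosses the boundary, which matches the observed performance drop at OMP_NUM_THREADS=4.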
Summary of WRF Benchmarks
The performance of WRF v4.1.5 depends heavily on the compiler and on the latencies between processors and across the interconnect. The Intel® compiler shows excellent execution performance for Fortran codes. The tests on interconnect fabric and protocol, as well as on communication between CPU cores, show the impact of increased latency on WRF execution time. QCT highly recommends using InfiniBand and grouping adjacent OMP threads on low-latency memory (such as the cache on each CPU) to reduce the impact of intercommunication.
QCT Expertise
QCT can work with leading research and commercial organizations to lower the Total Cost of Ownership by supplying highly tuned applications that are optimized for leading-edge infrastructure. By reducing the time to solution, more applications can be executed, or higher resolutions can be used, on the same hardware. QCT also has experts who understand various HPC workloads in detail and can deliver turnkey systems that are ready to use. For customers that wish to modify source code or that develop their own applications, QCT supplies highly tuned libraries and extensive guidance on how to get the most out of your infrastructure, which includes not only servers but networking and storage as well.
For more information on how QCT can help you to maximize your HPC environments, please visit:
https://go.qct.io/solutions/data-analytic-platform/qxsmart-hpc-dl-solution/
Over the last several weeks, we've explored topics surrounding WRF applications and how you can get your results faster by partnering with QCT.
Download the complete insideHPC Special Report Optimize Your WRF Applications, courtesy of QCT.