This special report, sponsored by QCT, discusses how the company works with leading research and commercial organizations to lower the total cost of ownership by supplying highly tuned applications that are optimized for leading-edge infrastructure. By reducing the time to reach a solution, more applications can be executed, or higher resolutions can be used, on the same hardware. QCT also has experts who understand various HPC workloads in detail and can deliver turnkey systems that are ready to use. For customers that wish to modify source code or develop their own applications, QCT supplies highly tuned libraries and extensive guidance on how to get the most out of your infrastructure, which includes not only servers but networking and storage as well.
This technology guide, insideHPC Special Report Optimize Your WRF Applications, shows how to get your results faster by partnering with QCT.
Introduction to WRF
WRF is a regional weather model with users ranging from researchers to forecasters all over the globe. Noted for being a mature and sophisticated model for weather research, WRF produces initial weather conditions for environmental models, such as air quality models, small-scale Large Eddy Simulation (LES) models, and disaster assessment models. WRF is among the most significant workloads on major High-Performance Computing (HPC) systems, so understanding how WRF performs and behaves under different optimizations can increase HPC efficiency and reduce operating costs.
Similar to other weather and climate models, WRF discretizes the simulated area into a three-dimensional grid. The physical properties of each grid cell are dispatched to computational threads, which calculate their tendencies (the rate of change of those properties over the time step). After each time step finishes, the computed results propagate to neighboring grid cells both horizontally and vertically, depending on the calculated wind direction. WRF is highly parallelized: it supports the distributed-memory approach using MPI (via MPICH), the shared-memory approach using OpenMP, or a hybrid combination of both, as sketched below.
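To make the hybrid approach concrete, here is a minimal C sketch of the pattern: MPI ranks own horizontal patches of the grid, OpenMP threads compute tendencies on cells within a patch, and a halo exchange propagates results between patches. This is illustrative placeholder physics, not WRF source; all names and the tendency formula are invented for the example.

```c
/* Minimal sketch of the hybrid MPI + OpenMP pattern described above.
 * MPI ranks own horizontal patches; OpenMP threads work on cells
 * within a patch. Placeholder physics only -- not WRF code.
 * Build (MPICH): mpicc -O2 -fopenmp hybrid_sketch.c -o hybrid_sketch */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NX 64                       /* grid points per patch, x */
#define NY 64                       /* grid points per patch, y */

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double field[NY][NX];    /* one physical property on this patch */
    static double tend[NY][NX];     /* its tendency for this time step */
    static double halo[NX];         /* row received from a neighbor patch */

    /* Shared-memory part: threads compute tendencies cell by cell */
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            tend[j][i] = 0.1 * field[j][i];   /* placeholder "physics" */

    /* Distributed-memory part: exchange one halo row with neighbors
     * so results propagate to the adjacent patch after the step */
    int up   = (rank + 1) % nranks;
    int down = (rank - 1 + nranks) % nranks;
    MPI_Sendrecv(tend[NY - 1], NX, MPI_DOUBLE, up,   0,
                 halo,         NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("%d MPI ranks x %d OpenMP threads\n",
               nranks, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
```

A launch such as `OMP_NUM_THREADS=28 mpirun -np 24 ./hybrid_sketch` would, for example, place one rank per socket across twelve dual-socket 28-core nodes with one thread per core, the kind of hybrid placement evaluated in the benchmarks below.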
Characteristics of the workload
Because of the grid decomposition and the parallelization of WRF, a large amount of data is transferred between grid cells after each time step completes. Overall performance therefore depends on high memory bandwidth within each node and low latency across the interconnecting network. The output, a massive set of variables from all the grid cells, requires high storage bandwidth. QCT investigated how WRF performance is affected by latency and bandwidth both inside the processors and across the chosen interconnect.
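The memory-bandwidth sensitivity is easy to see with a STREAM-style triad, the standard proxy for kernels that stream more data per step than fits in cache. This small C sketch (array sizes and names are arbitrary) reports sustained bandwidth on a node:

```c
/* STREAM-style triad: a proxy for the memory-bandwidth-bound behavior
 * of grid-point models, which stream far more data per time step than
 * fits in cache. Build: cc -O3 -fopenmp triad.c -o triad */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 26)                /* ~512 MB per array, well past cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* two reads + one write per element */
    double t1 = omp_get_wtime();

    /* Three arrays of N doubles, each touched once */
    double gb = 3.0 * N * sizeof(double) / 1e9;
    printf("triad: %.1f GB/s\n", gb / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}
```

Once a kernel like this saturates the memory channels, adding cores no longer helps, which is why sustained per-node bandwidth rather than peak FLOPS often sets the pace for WRF.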
Benchmark settings
QCT ran the WRF benchmarks on a total of three QuantaPlex T42D-2U servers. Each T42D-2U server consists of four dual-socket computing nodes in a 2U form factor.
In total, twelve nodes were used to evaluate the scalability of WRF performance. Each node contains two Second-Generation Intel® Xeon® Platinum 8280 Scalable Processors (28 cores at 2.7 GHz base frequency) and 384 GB of DDR4-2933 memory, for a total of 56 cores and 384 GB of memory per node, or 224 cores and 1,536 GB of memory per T42D-2U system. Each node connects to the other computing nodes and to the storage nodes over 10 Gb/s Ethernet and 100 Gb/s InfiniBand HDR100 networks. The BeeGFS parallel file system is used as the underlying file system to maximize storage throughput. The hardware specification is listed in the inset image.
WRF settings
QCT used WRF v4.1.5 for the benchmark investigation. Following Kyle (2018), QCT created a new CONUS 2.5 km domain for version 4 of WRF. The CONUS 2.5 km domain, shown in Figure 1 below, consists of 1901 × 1301 horizontal grid points and 40 vertical layers, or roughly 98.9 million grid points per three-dimensional variable. Compute performance was measured as the average computation time per model time step, and I/O performance was measured as the average write time per output step; both can be extracted from the WRF logs, as sketched below.
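For readers reproducing these averages, here is a hedged post-processing sketch in C. It assumes the conventional WRF log output, in which rank 0 writes per-step lines such as "Timing for main: time ... on domain 1: 4.23456 elapsed seconds" to rsl.error.0000; the file name and line format should be verified against your own run.

```c
/* Average WRF's per-step elapsed times from an rsl log. Assumes the
 * usual log lines, e.g.
 *   Timing for main: time ... on domain 1: 4.23456 elapsed seconds
 * Pass a different pattern (e.g. "Timing for Writing") as argv[1]
 * to average the output steps instead. Build: cc -O2 avgstep.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *pattern = (argc > 1) ? argv[1] : "Timing for main";
    FILE *f = fopen("rsl.error.0000", "r");
    if (!f) { perror("rsl.error.0000"); return 1; }

    char line[512];
    double sum = 0.0;
    long steps = 0;
    while (fgets(line, sizeof line, f)) {
        if (!strstr(line, pattern)) continue;
        char *colon = strrchr(line, ':');   /* value follows last ':' */
        if (colon) { sum += strtod(colon + 1, NULL); steps++; }
    }
    fclose(f);

    if (steps)
        printf("%s: %.3f s average over %ld steps\n",
               pattern, sum / steps, steps);
    return 0;
}
```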
Compiler Options
QCT used three different compilers, each at the latest version available, to compile WRF and its dependent libraries (MVAPICH for MPI, NetCDF, and HDF5). The three compilers tested were the GNU compiler version 9.2.0, the AOCC compiler version 2.1.0 from AMD, and the Intel® Fortran compiler (part of Intel® Composer XE version 2020). The compiler flags that differ from the default WRF settings are listed below; a small vectorization sanity check follows the list:
- GNU compiler version 9.2.0 (gcc): -O3 (the default is -O2).
- AOCC compiler version 2.1.0 (aocc): -O3 and -Mbyteswapio. The default WRF GNU compiler settings were adapted to the Clang/Flang toolchain, with -O2 changed to -O3; -Mbyteswapio ensures the correct endianness of WRF input/output files.
- Intel® Fortran compiler version 19.1.1 (2020) (ifort): -xCORE-AVX512 (or -xHost on this system). This optimizes for the Cascade Lake Xeon® Platinum 8280, utilizing the full 512-bit SIMD instruction set.
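As a quick sanity check that each toolchain emits the intended SIMD code, a trivially vectorizable loop can be compiled with each compiler's vectorization report enabled. The compile lines below use C analogues of the Fortran flags above (WRF itself is built with the Fortran front ends), and the kernel is a throwaway example:

```c
/* saxpy.c: a trivially vectorizable loop for checking what each
 * compiler emits with the flags discussed above:
 *   gcc   -O3 -fopt-info-vec               saxpy.c
 *   clang -O3 -Rpass=loop-vectorize        saxpy.c   (AOCC is Clang-based)
 *   icc   -O3 -xCORE-AVX512 -qopt-report=2 saxpy.c
 * With -xCORE-AVX512 the loop should use 512-bit zmm registers. */
#include <stdio.h>

#define N 100000

int main(int argc, char **argv)
{
    (void)argv;
    static float x[N], y[N];
    for (int i = 0; i < N; i++) {
        x[i] = (float)(i + argc);   /* argc blocks constant folding */
        y[i] = 2.0f * i;
    }

    /* The loop the vectorizer should turn into SIMD code */
    for (int i = 0; i < N; i++)
        y[i] += 3.0f * x[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}
```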
Over the next few weeks we will explore these topics surrounding WRF applications and how you can get your results faster by partnering with QCT:
- Introduction, Servers and More, Understanding Climate Change
- Introduction to WRF
- Benchmark Results
Download the complete insideHPC Special Report Optimize Your WRF Applications, courtesy of QCT.