Added capabilities ease parallel programming across heterogeneous architectures
In decades past, the notion of “one workload, one processor” was the conventional wisdom. Today, though, that legacy perception falls by the wayside with the rise of heterogeneous computing. The latest technologies can now provide the exceptional speed and scalability needed to usher in an era of high performance computing (HPC), artificial intelligence (AI), advanced rendering, and Internet of Things (IoT) systems. While this evolution creates enormous opportunities across the industry, it also comes with application development challenges. The industry is responding with the oneAPI initiative to facilitate software development productivity while delivering performance across different architectures.
Data-centric workloads and applications have continued a path of diversification in recent years. Different hardware choices have separate silos of tools and development languages. Proprietary solutions and the variety of development languages limit the amount of “reusable” code. Data Parallel C++ (DPC++), the primary language of oneAPI, offers an evolution of ISO C++ to help alleviate that issue. It is intended to increase coding productivity and make programming across diverse architectures much easier.
Open specifications are critical. Without them, developers cannot predict a language’s evolution over time and its future compatibility. Therefore, innovation requires open development. When “walled gardens” are in place, innovation stagnates. The walls also restrict performance across architectures. C++ is a broadly adopted and robust language. However, it has some limitations because it has been defined primarily with CPUs in mind. Today’s standard cannot accommodate heterogeneous programming without additions.
The Khronos Group led the innovation of OpenCL™, enabling low-level and detailed programming across multiple architectures. Over time, though, members of the OpenCL™ consortium developed their own extensions, the standard effectively diverged, and proprietary languages held sway. Based on the OpenCL™ experience, the Khronos Group innovated to simplify the process further with the SYCL programming model. Like OpenCL™, it supports heterogeneous programming, but uses standard C++ constructs with single-source host and accelerator code.
“Accelerated computing has diversified over the past several years given advances in CPU, GPU, FPGA, and AI technologies,” said Joe Curley, senior director for oneAPI products at Intel. “This innovation drives the need for an open and cross-platform language that allows developers to realize the potential of new hardware, minimizes development cost and complexity, and maximizes reuse of their software investments.”
Today, open source implementations of the Khronos Group’s SYCL support hardware from vendors such as AMD, ARM (Mali), NVIDIA, and Intel.
Easing programming for heterogeneous architectures
DPC++, a cross-architecture language that is part of the industry’s oneAPI initiative, complements existing efforts to ease the heterogeneity challenge. DPC++ combines three elements (C++, SYCL, and extensions) to facilitate cross-architecture systems. It also serves as a standards-based language implementation that opens the door to introducing new features in future revisions of the SYCL specification. Intel, as a key participant in that effort, is also building implementations of oneAPI.
Joe Curley reinforced the commitment to openness in the language:
“DPC++ builds on open standards: C++ and SYCL. But the language also provides the ability to rapidly experiment and innovate through extensions, develop them, and establish a virtuous cycle into open standards bodies — like Khronos SYCL. DPC++ provides an open mechanism for the developers to evolve data parallel programming rapidly.”
DPC++ adoption is gaining steam already, as illustrated by Codeplay’s recently announced DPC++ compiler for NVIDIA GPUs.
New DPC++ extensions unleash SYCL and C++
Although it is still in beta, DPC++ offers nearly 30 new extensions to augment tools currently available. The provisional SYCL 2020 specification incorporates many of them:
- Unified Shared Memory (USM) defines pointer-based memory accesses and management interfaces. It provides the ability to create allocations that are visible and have consistent pointer values across both host and device(s). Different USM capability levels are defined, corresponding to varying degrees of device and implementation support.
- In-order queues define in-order semantics for queues, so that submitted commands execute in the order of submission, which streamlines common coding patterns.
- Optional lambda name eliminates the need to manually name lambdas that define kernels. It also simplifies coding and enables composability with libraries. Users can still name lambdas manually in scenarios like debugging or interfacing with a sycl::program object.
- Deduction guides simplify common code patterns and reduce code verbosity and length by enabling Class Template Argument Deduction (CTAD) from modern C++.
- Reductions improve productivity by providing the common reduction pattern without requiring developers to code it explicitly. Building them into the language enables optimized implementations to exist for combinations of device, runtime, and reduction properties.
- Sub-groups define a grouping of work items within a work-group. Work items in a sub-group can synchronize independently of work items in other sub-groups. Sub-groups, which commonly map to SIMD hardware, also expose communication operations across the work items in the group.
- Sub-group algorithms define the collective operations across work items in a sub-group that are available only for sub-groups. They also enable algorithms from the more generic “group algorithms” extension as sub-group aggregate operations.
- Enqueued barriers ease dependence creation and tracking for some common programming patterns. They allow coarser-grained synchronization within a queue without the need for manual creation of fine-grained dependencies.
- Extended atomics offer atomic operations aligned with C++20, including support for floating-point types and shorthand operators.
- Group algorithms define collective operations across groups of work items, including broadcast, reduce, and scan. They improve productivity by providing algorithms that do not need explicit coding, and they allow optimized implementations to exist for combinations of device and runtime.
- Group mask defines a type that can represent a set of work items from a group and also collective operations that create or operate on that type, such as ballot and count.
- Restrict all arguments defines an attribute that can apply to kernels (including lambda definitions of kernels), which signals that there will be no memory aliasing between any pointer arguments that are passed to or captured by a kernel. When the developer knows more about the kernel arguments than a compiler can infer or safely assume, this optimization attribute is most beneficial.
- Relaxed data layout removes the requirement of C++ standard layout types for data shared by a host and device(s). It requires device compilers to validate layout compatibility too.
- Queue shortcuts define kernel invocation functions directly on the queue classes. When dependencies and accessors do not need to be created within an additional command group scope, queue shortcuts simplify code patterns.
- Required work-group size is an attribute that enables optimizations based on additional user-driven information. It can be applied to kernels, including lambda definitions of kernels, and signals that the kernel will be invoked with a specific work-group size.
- Data flow pipes enable efficient first-in, first-out (FIFO) communication in DPC++, a mechanism commonly used when describing algorithms for spatial architectures such as FPGAs.
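To make the collective operations named in the list above more concrete, the following host-only C++ sketch models what broadcast, reduce, and scan compute over the work items of a group. It is not DPC++ code and does not run on a device: each vector element stands in for the value held by one work item, the function names (group_broadcast_model, group_reduce_model, group_inclusive_scan_model) are hypothetical helpers invented for this illustration, and addition is assumed as the combining operation.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-only model of group collectives. One vector element stands in for the
// value held by one work item. These helpers are hypothetical illustrations,
// not DPC++ or SYCL APIs.

// Broadcast: every work item receives the value held by work item `source`.
std::vector<int> group_broadcast_model(const std::vector<int>& items,
                                       std::size_t source) {
    assert(source < items.size());  // the source must be a member of the group
    return std::vector<int>(items.size(), items[source]);
}

// Reduce: combine the values of all work items into one result (a sum here).
int group_reduce_model(const std::vector<int>& items) {
    return std::accumulate(items.begin(), items.end(), 0);
}

// Inclusive scan: work item i receives the sum of the values of items 0..i.
std::vector<int> group_inclusive_scan_model(const std::vector<int>& items) {
    std::vector<int> out(items.size());
    std::partial_sum(items.begin(), items.end(), out.begin());
    return out;
}
```

For a group whose work items hold 3, 1, 4, 1, 5, the reduce model returns 14, the inclusive scan model returns 3, 4, 8, 9, 14, and broadcasting from work item 2 gives every work item the value 4. In DPC++ itself these operations are supplied by the language rather than written by hand, which is what allows an optimized implementation to be selected for a given device and runtime.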
“We’ve been working closely with Intel on defining oneAPI and using oneAPI for our own internal development and testing,” explained Hal Finkel, Lead for Compiler Technology and Programming Languages at Argonne National Laboratory’s Leadership Computing Facility. “oneAPI provides extended capabilities, such as supporting unified memory and reductions, above what is available in the current SYCL 1.2.1 specification, and these capabilities are essential for us. Our development of a Kokkos backend for DPC++/oneAPI, for example, relies on these additional features. We’re looking forward to updates to the SYCL specification, which we trust will contain important new features from DPC++ that address specific needs identified during these development activities.”
Added Joe Curley, “In only a few months, the DPC++ community has made enormous progress in both language design, architecture, and implementation. We encourage the community to join in the effort to open accelerated programming.”
To get started using DPC++, developers can access the language and APIs in two ways: through Intel® DevCloud or through the Intel® reference implementation (Intel® oneAPI toolkits).
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Other names and brands may be claimed as the property of others.