1 =============================
2 Offloading Design & Internals
3 =============================
11 This document describes the Clang driver and code generation steps for creating
12 offloading applications. Clang supports offloading to various architectures
13 using programming models like CUDA, HIP, and OpenMP. The purpose of this
14 document is to illustrate the steps necessary to create an offloading
15 application using Clang.
20 Clang supports OpenMP target offloading to several different architectures such
21 as NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated by
22 Clang and then executed using the ``libomptarget`` runtime and the associated
23 plugin for the target architecture, e.g. ``libomptarget.rtl.cuda``. This section
24 describes the steps necessary to create a functioning device image that can be
25 loaded by the OpenMP runtime. More information on the OpenMP runtimes can be
26 found at the `OpenMP documentation page <https://openmp.llvm.org>`__.
28 .. _Offloading Overview:
33 The goal of offloading compilation is to create an executable device image that
34 can be run on the target device. OpenMP offloading creates executable images by
35 compiling the input file for both the host and the target device. The output
36 from the device phase then needs to be embedded into the host to create a fat
37 object. A special tool then needs to extract the device code from the fat
38 objects, run the device linking step, and embed the final image in a symbol the
39 host runtime library can use to register the library and access the symbols on
45 The compiler performs the following high-level actions to generate OpenMP
48 * Compile the input file for the host to produce a bitcode file. Lower ``#pragma
49 omp target`` declarations to :ref:`offloading entries <Generating Offloading
50 Entries>` and create metadata to indicate which entries are on the device.
51 * Compile the input file for the target :ref:`device <Device Compilation>` using
52 the :ref:`offloading entry <Generating Offloading Entries>` metadata created
54 * Link the OpenMP device runtime library and run the backend to create a device
56 * Run the backend on the host bitcode file and create a :ref:`fat object file
57 <Creating Fat Objects>` using the device object file.
58 * Pass the fat object file to the :ref:`linker wrapper tool <Device Linking>`
59 and extract the device objects. Run the device linking action on the extracted
61 * :ref:`Wrap <Device Binary Wrapping>` the :ref:`device images <Device linking>`
62 and :ref:`offload entries <Generating Offloading Entries>` in a symbol that
63 can be accessed by the host.
64 * Add the :ref:`wrapped binary <Device Binary Wrapping>` to the linker input and
65 run the host linking action. Link with ``libomptarget`` to register and
68 .. _Generating Offloading Entries:
70 Generating Offloading Entries
71 -----------------------------
73 The first step in compilation is to generate offloading entries for the host.
74 This information is used to identify function kernels or global values that will
75 be provided by the device. Blocks contained in a ``#pragma omp target`` or
76 symbols inside a ``#pragma omp declare target`` directive will have offloading
77 entries generated. The following table shows the :ref:`offload entry structure
78 <table-tgt_offload_entry_structure>`.
80 .. table:: __tgt_offload_entry Structure
81 :name: table-tgt_offload_entry_structure
83 +---------+------------+------------------------------------------------------------------------+
84 | Type | Identifier | Description |
85 +=========+============+========================================================================+
86 | void* | addr | Address of global symbol within device image (function or global) |
87 +---------+------------+------------------------------------------------------------------------+
88 | char* | name | Name of the symbol |
89 +---------+------------+------------------------------------------------------------------------+
90 | size_t | size | Size of the entry info (0 if it is a function) |
91 +---------+------------+------------------------------------------------------------------------+
92 | int32_t | flags | Flags associated with the entry (see :ref:`table-offload_entry_flags`) |
93 +---------+------------+------------------------------------------------------------------------+
94 | int32_t | reserved | Reserved, to be used by the runtime library. |
95 +---------+------------+------------------------------------------------------------------------+
97 The address of the global symbol will be set to the device pointer value by the
98 runtime once the device image is loaded. The flags are set to indicate the
99 handling required for the offloading entry. If the offloading entry is an entry
100 to a target region it can have one of the following :ref:`entry flags
101 <table-offload_entry_flags>`.
103 .. table:: Target Region Entry Flags
104 :name: table-offload_entry_flags
106 +----------------------------------+-------+-----------------------------------------+
107 | Name | Value | Description |
108 +==================================+=======+=========================================+
109 | OMPTargetRegionEntryTargetRegion | 0x00 | Mark the entry as generic target region |
110 +----------------------------------+-------+-----------------------------------------+
111 | OMPTargetRegionEntryCtor | 0x02 | Mark the entry as a global constructor |
112 +----------------------------------+-------+-----------------------------------------+
113 | OMPTargetRegionEntryDtor | 0x04 | Mark the entry as a global destructor |
114 +----------------------------------+-------+-----------------------------------------+
116 If the offloading entry is a global variable, indicated by a non-zero size, it
117 will instead have one of the following :ref:`global
118 <table-offload_global_flags>` flags.
120 .. table:: Target Region Global
121 :name: table-offload_global_flags
123 +-----------------------------+-------+---------------------------------------------------------------+
124 | Name | Value | Description |
125 +=============================+=======+===============================================================+
126 | OMPTargetGlobalVarEntryTo | 0x00 | Mark the entry as a 'to' attribute (w.r.t. the to clause) |
127 +-----------------------------+-------+---------------------------------------------------------------+
128 | OMPTargetGlobalVarEntryLink | 0x01 | Mark the entry as a 'link' attribute (w.r.t. the link clause) |
129 +-----------------------------+-------+---------------------------------------------------------------+
131 The target offload entries are used by the runtime to access the device kernels
132 and globals that will be provided by the final device image. Each offloading
133 entry is set to use the ``omp_offloading_entries`` section. When the final
134 application is created the linker will provide the
135 ``__start_omp_offloading_entries`` and ``__stop_omp_offloading_entries`` symbols
136 which are used to create the :ref:`final image <Device Binary Wrapping>`.
138 This information is used by the device compilation stage to determine which
139 symbols need to be exported from the device. We use the ``omp_offload.info``
140 metadata node to pass this information device compilation stage.
142 Accessing Entries on the Device
143 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
145 Accessing the entries in the device is done using the address field in the
146 :ref:`offload entry<table-tgt_offload_entry_structure>`. The runtime will set
147 the address to the pointer associated with the device image during runtime
148 initialization. This is used to call the corresponding kernel function when
149 entering a ``#pragma omp target`` region. For variables, the runtime maintains a
150 table mapping host pointers to device pointers. Global variables inside a
151 ``#pragma omp target declare`` directive are first initialized to the host's
152 address. Once the device address is initialized we insert it into the table to
153 map the host address to the device address.
155 Debugging Information
156 ^^^^^^^^^^^^^^^^^^^^^
158 We generate structures to hold debugging information that is passed to
159 ``libomptarget``. This allows the front-end to generate information the runtime
160 library uses for more informative error messages. This is done using the
161 standard :ref:`identifier structure <table-ident_t_structure>` used in
162 ``libomp`` and ``libomptarget``. This is used to pass information and source
163 locations to the runtime.
165 .. table:: ident_t Structure
166 :name: table-ident_t_structure
168 +---------+------------+-----------------------------------------------------------------------------+
169 | Type | Identifier | Description |
170 +=========+============+=============================================================================+
171 | int32_t | reserved | Reserved, to be used by the runtime library. |
172 +---------+------------+-----------------------------------------------------------------------------+
173 | int32_t | flags | Flags used to indicate some features, mostly unused. |
174 +---------+------------+-----------------------------------------------------------------------------+
175 | int32_t | reserved | Reserved, to be used by the runtime library. |
176 +---------+------------+-----------------------------------------------------------------------------+
177 | int32_t | reserved | Reserved, to be used by the runtime library. |
178 +---------+------------+-----------------------------------------------------------------------------+
179 | char* | psource | Program source information, stored as ";filename;function;line;column;;\\0" |
180 +---------+------------+-----------------------------------------------------------------------------+
182 If debugging information is enabled, we will also create strings to indicate the
183 names and declarations of variables mapped in target regions. These have the
184 same format as the source location in the :ref:`identifier structure
185 <table-ident_t_structure>`, but the function name is replaced with the variable
188 .. _Device Compilation:
190 Offload Device Compilation
191 --------------------------
193 The input file is compiled for each active device toolchain. The device
194 compilation stage is performed differently from the host stage. Namely, we do
195 not generate any offloading entries. This is set by passing the
196 ``-fopenmp-is-target-device`` flag to the front-end. We use the host bitcode to
197 determine which symbols to export from the device. The bitcode file is passed in
198 from the previous stage using the ``-fopenmp-host-ir-file-path`` flag.
199 Compilation is otherwise performed as it would be for any other target triple.
201 When compiling for the OpenMP device, we set the visibility of all device
202 symbols to be ``protected`` by default. This improves performance and prevents a
203 class of errors where a symbol in the target device could preempt a host
206 The OpenMP runtime library is linked in during compilation to provide the
207 implementations for standard OpenMP functionality. For GPU targets this is done
208 by linking in a special bitcode library during compilation, (e.g.
209 ``libomptarget-nvptx64-sm_70.bc``) using the ``-mlink-builtin-bitcode`` flag.
210 Other device libraries, such as CUDA's libdevice, are also linked this way. If
211 the target is a standard architecture with an existing ``libomp``
212 implementation, that will be linked instead. Finally, device tools are used to
213 create a relocatable device object file that can be embedded in the host.
215 .. _Creating Fat Objects:
220 A fat binary is a binary file that contains information intended for another
221 device. We create a fat object by embedding the output of the device compilation
222 stage into the host as a named section. The output from the device compilation
223 is passed to the host backend using the ``-fembed-offload-object`` flag. This
224 embeds the device image into the ``.llvm.offloading`` section using a special
225 binary format that behaves like a string map. This binary format is used to
226 bundle metadata about the image so the linker can associate the proper device
227 linking action with the image. Each device image will start with the magic bytes
232 @llvm.embedded.object = private constant [1 x i8] c"\00", section ".llvm.offloading"
234 The device code will then be placed in the corresponding section one the backend
235 is run on the host, creating a fat object. Using fat objects allows us to treat
236 offloading objects as standard host objects. The final object file should
237 contain the following :ref:`offloading sections <table-offloading_sections>`. We
238 will use this information when :ref:`Device Linking`.
240 .. table:: Offloading Sections
241 :name: table-offloading_sections
243 +----------------------------------+------------------------------------------------------------------------------+
244 | Section | Description |
245 +==================================+==============================================================================+
246 | omp_offloading_entries | Offloading entry information (see :ref:`table-tgt_offload_entry_structure`) |
247 +----------------------------------+------------------------------------------------------------------------------+
248 | .llvm.offloading | Embedded device object file for the target device and architecture |
249 +----------------------------------+------------------------------------------------------------------------------+
253 Linking Target Device Code
254 --------------------------
256 Objects containing :ref:`table-offloading_sections` require special handling to
257 create an executable device image. This is done using a Clang tool, see
258 :doc:`ClangLinkerWrapper` for more information. This tool works as a wrapper
259 over the host linking job. It scans the input object files for the offloading
260 section ``.llvm.offloading``. The device files stored in this section are then
261 extracted and passed to the appropriate linking job. The linked device image is
262 then :ref:`wrapped <Device Binary Wrapping>` to create the symbols used to load
263 the device image and link it with the host.
265 The linker wrapper tool supports linking bitcode files through link time
266 optimization (LTO). This is used whenever the object files embedded in the host
267 contain LLVM bitcode. Bitcode will be embedded for architectures that do not
268 support a relocatable object format, such as AMDGPU or SPIR-V, or if the user
269 requested it using the ``-foffload-lto`` flag.
271 .. _Device Binary Wrapping:
273 Device Binary Wrapping
274 ----------------------
276 Various structures and functions are used to create the information necessary to
277 offload code on the device. We use the :ref:`linked device executable <Device
278 Linking>` with the corresponding offloading entries to create the symbols
279 necessary to load and execute the device image.
284 Several different structures are used to store offloading information. The
285 :ref:`device image structure <table-device_image_structure>` stores a single
286 linked device image and its associated offloading entries. The offloading
287 entries are stored using the ``__start_omp_offloading_entries`` and
288 ``__stop_omp_offloading_entries`` symbols generated by the linker using the
289 :ref:`table-tgt_offload_entry_structure`.
291 .. table:: __tgt_device_image Structure
292 :name: table-device_image_structure
294 +----------------------+--------------+----------------------------------------+
295 | Type | Identifier | Description |
296 +======================+==============+========================================+
297 | void* | ImageStart | Pointer to the target code start |
298 +----------------------+--------------+----------------------------------------+
299 | void* | ImageEnd | Pointer to the target code end |
300 +----------------------+--------------+----------------------------------------+
301 | __tgt_offload_entry* | EntriesBegin | Begin of table with all target entries |
302 +----------------------+--------------+----------------------------------------+
303 | __tgt_offload_entry* | EntriesEnd | End of table (non inclusive) |
304 +----------------------+--------------+----------------------------------------+
306 The target :ref:`target binary descriptor <table-target_binary_descriptor>` is
307 used to store all binary images and offloading entries in an array.
309 .. table:: __tgt_bin_desc Structure
310 :name: table-target_binary_descriptor
312 +----------------------+------------------+------------------------------------------+
313 | Type | Identifier | Description |
314 +======================+==================+==========================================+
315 | int32_t | NumDeviceImages | Number of device types supported |
316 +----------------------+------------------+------------------------------------------+
317 | __tgt_device_image* | DeviceImages | Array of device images (1 per dev. type) |
318 +----------------------+------------------+------------------------------------------+
319 | __tgt_offload_entry* | HostEntriesBegin | Begin of table with all host entries |
320 +----------------------+------------------+------------------------------------------+
321 | __tgt_offload_entry* | HostEntriesEnd | End of table (non inclusive) |
322 +----------------------+------------------+------------------------------------------+
327 :ref:`table-global_variables` lists various global variables, along with their
328 type and their explicit ELF sections, which are used to store device images and
331 .. table:: Global Variables
332 :name: table-global_variables
334 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
335 | Variable | Type | ELF Section | Description |
336 +================================+=====================+=========================+=========================================================+
337 | __start_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | Begin symbol for the offload entries table. |
338 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
339 | __stop_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | End symbol for the offload entries table. |
340 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
341 | __dummy.omp_offloading.entry | __tgt_offload_entry | .omp_offloading_entries | Dummy zero-sized object in the offload entries |
342 | | | | section to force linker to define begin/end |
343 | | | | symbols defined above. |
344 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
345 | .omp_offloading.device_image | __tgt_device_image | .omp_offloading_entries | ELF device code object of the first image. |
346 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
347 | .omp_offloading.device_image.N | __tgt_device_image | .omp_offloading_entries | ELF device code object of the (N+1)th image. |
348 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
349 | .omp_offloading.device_images | __tgt_device_image | .omp_offloading_entries | Array of images. |
350 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
351 | .omp_offloading.descriptor | __tgt_bin_desc | .omp_offloading_entries | Binary descriptor object (see :ref:`binary_descriptor`) |
352 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
354 .. _binary_descriptor:
356 Binary Descriptor for Device Images
357 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
359 This object is passed to the offloading runtime at program startup and it
360 describes all device images available in the executable or shared library. It
361 is defined as follows:
365 __attribute__((visibility("hidden")))
366 extern __tgt_offload_entry *__start_omp_offloading_entries;
367 __attribute__((visibility("hidden")))
368 extern __tgt_offload_entry *__stop_omp_offloading_entries;
369 static const char Image0[] = { <Bufs.front() contents> };
371 static const char ImageN[] = { <Bufs.back() contents> };
372 static const __tgt_device_image Images[] = {
374 Image0, /*ImageStart*/
375 Image0 + sizeof(Image0), /*ImageEnd*/
376 __start_omp_offloading_entries, /*EntriesBegin*/
377 __stop_omp_offloading_entries /*EntriesEnd*/
381 ImageN, /*ImageStart*/
382 ImageN + sizeof(ImageN), /*ImageEnd*/
383 __start_omp_offloading_entries, /*EntriesBegin*/
384 __stop_omp_offloading_entries /*EntriesEnd*/
387 static const __tgt_bin_desc BinDesc = {
388 sizeof(Images) / sizeof(Images[0]), /*NumDeviceImages*/
389 Images, /*DeviceImages*/
390 __start_omp_offloading_entries, /*HostEntriesBegin*/
391 __stop_omp_offloading_entries /*HostEntriesEnd*/
395 Global Constructor and Destructor
396 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
398 The global constructor (``.omp_offloading.descriptor_reg()``) registers the
399 device images with the runtime by calling the ``__tgt_register_lib()`` runtime
400 function. The constructor is explicitly defined in ``.text.startup`` section and
401 is run once when the program starts. Similarly, the global destructor
402 (``.omp_offloading.descriptor_unreg()``) calls ``__tgt_unregister_lib()`` for
403 the destructor and is also defined in ``.text.startup`` section and run when the
409 This section contains a simple example of generating offloading code using
410 OpenMP offloading. We will use a simple ``ZAXPY`` BLAS routine.
416 using complex = std::complex<double>;
418 void zaxpy(complex *X, complex *Y, complex D, std::size_t N) {
419 #pragma omp target teams distribute parallel for
420 for (std::size_t i = 0; i < N; ++i)
421 Y[i] = D * X[i] + Y[i];
425 const std::size_t N = 1024;
426 complex X[N], Y[N], D;
427 #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N])
431 This code is compiled using the following Clang flags.
433 .. code-block:: console
435 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 zaxpy.cpp -c
437 The output section in the object file can be seen using the ``readelf`` utility.
438 The ``.llvm.offloading`` section has the ``SHF_EXCLUDE`` flag so it will be
439 removed from the final executable or shared library by the linker.
443 $ llvm-readelf -WS zaxpy.o
445 [Nr] Name Type Address Off Size ES Flg Lk Inf Al
446 [11] omp_offloading_entries PROGBITS 0000000000000000 0001f0 000040 00 A 0 0 1
447 [12] .llvm.offloading PROGBITS 0000000000000000 000260 030950 00 E 0 0 8
450 Compiling this file again will invoke the ``clang-linker-wrapper`` utility to
451 extract and link the device code stored at the section named
452 ``.llvm.offloading`` and then use entries stored in
453 the section named ``omp_offloading_entries`` to create the symbols necessary for
454 ``libomptarget`` to register the device image and call the entry function.
456 .. code-block:: console
458 $ clang++ -fopenmp -fopenmp-targets=nvptx64 zaxpy.o -o zaxpy
461 We can see the steps created by clang to generate the offloading code using the
462 ``-ccc-print-phases`` option in Clang. This matches the description in
463 :ref:`Offloading Overview`.
465 .. code-block:: console
467 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -ccc-print-phases zaxpy.cpp
468 # "x86_64-unknown-linux-gnu" - "clang", inputs: ["zaxpy.cpp"], output: "/tmp/zaxpy-host.bc"
469 # "nvptx64-nvidia-cuda" - "clang", inputs: ["zaxpy.cpp", "/tmp/zaxpy-e6a41b.bc"], output: "/tmp/zaxpy-07f434.s"
470 # "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/tmp/zaxpy-07f434.s"], output: "/tmp/zaxpy-0af7b7.o"
471 # "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/zaxpy-e6a41b.bc", "/tmp/zaxpy-0af7b7.o"], output: "/tmp/zaxpy-416cad.o"
472 # "x86_64-unknown-linux-gnu" - "Offload::Linker", inputs: ["/tmp/zaxpy-416cad.o"], output: "a.out"