llvm/docs/NVPTXUsage.rst

   1 =============================
   2 User Guide for NVPTX Back-end
   3 =============================
   4
   5 .. contents::
   6    :local:
   7    :depth: 3
   8
   9
  10 Introduction
  11 ============
  12
  13 To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
  14 along with a defined set of conventions used to represent GPU programming
  15 concepts. This document provides an overview of the general usage of the back-
  16 end, including a description of the conventions used and the set of accepted
  17 LLVM IR.
  18
  19 .. note::
  20
  21    This document assumes a basic familiarity with CUDA and the PTX
  22    assembly language. Information about the CUDA Driver API and the PTX assembly
  23    language can be found in the `CUDA documentation
  24    <http://docs.nvidia.com/cuda/index.html>`_.
  25
  26
  27
  28 Conventions
  29 ===========
  30
  31 Marking Functions as Kernels
  32 ----------------------------
  33
  34 In PTX, there are two types of functions: *device functions*, which are only
  35 callable by device code, and *kernel functions*, which are callable by host
  36 code. By default, the back-end will emit device functions. Metadata is used to
  37 declare a function as a kernel function. This metadata is attached to the
  38 ``nvvm.annotations`` named metadata object, and has the following format:
  39
  40 .. code-block:: text
  41
  42    !0 = !{<function-ref>, metadata !"kernel", i32 1}
  43
  44 The first parameter is a reference to the kernel function. The following
  45 example shows a kernel function calling a device function in LLVM IR. The
  46 function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
  47
  48 .. code-block:: llvm
  49
  50     define float @my_fmad(float %x, float %y, float %z) {
  51       %mul = fmul float %x, %y
  52       %add = fadd float %mul, %z
  53       ret float %add
  54     }
  55
  56     define void @my_kernel(float* %ptr) {
  57       %val = load float, float* %ptr
  58       %ret = call float @my_fmad(float %val, float %val, float %val)
  59       store float %ret, float* %ptr
  60       ret void
  61     }
  62
  63     !nvvm.annotations = !{!1}
  64     !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}
  65
  66 When compiled, the PTX kernel functions are callable by host-side code.
  67
  68
  69 .. _address_spaces:
  70
  71 Address Spaces
  72 --------------
  73
  74 The NVPTX back-end uses the following address space mapping:
  75
  76    ============= ======================
  77    Address Space Memory Space
  78    ============= ======================
  79    0             Generic
  80    1             Global
  81    2             Internal Use
  82    3             Shared
  83    4             Constant
  84    5             Local
  85    ============= ======================
  86
  87 Every global variable and pointer type is assigned to one of these address
  88 spaces, with 0 being the default address space. Intrinsics are provided which
  89 can be used to convert pointers between the generic and non-generic address
  90 spaces.
  91
  92 As an example, the following IR will define an array ``@g`` that resides in
  93 global device memory.
  94
  95 .. code-block:: llvm
  96
  97     @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
  98
  99 LLVM IR functions can read and write to this array, and host-side code can
 100 copy data to it by name with the CUDA Driver API.
 101
 102 Note that since address space 0 is the generic space, it is illegal to have
 103 global variables in address space 0.  Address space 0 is the default address
 104 space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
 105 variables.
 106
 107
 108 Triples
 109 -------
 110
 111 The NVPTX target uses the module triple to select between 32/64-bit code
 112 generation and the driver-compiler interface to use. The triple architecture
 113 can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
 114 operating system should be one of ``cuda`` or ``nvcl``, which determines the
 115 interface used by the generated code to communicate with the driver.  Most
 116 users will want to use ``cuda`` as the operating system, which makes the
 117 generated PTX compatible with the CUDA Driver API.
 118
 119 Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
 120
 121 Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
 122
 123
 124
 125 .. _nvptx_intrinsics:
 126
 127 NVPTX Intrinsics
 128 ================
 129
 130 Address Space Conversion
 131 ------------------------
 132
 133 '``llvm.nvvm.ptr.*.to.gen``' Intrinsics
 134 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 135
 136 Syntax:
 137 """""""
 138
 139 These are overloaded intrinsics.  You can use these on any pointer types.
 140
 141 .. code-block:: llvm
 142
 143     declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
 144     declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
 145     declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
 146     declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
 147
 148 Overview:
 149 """""""""
 150
 151 The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
 152 address space to a generic address space pointer.
 153
 154 Semantics:
 155 """"""""""
 156
 157 These intrinsics modify the pointer value to be a valid generic address space
 158 pointer.
 159
 160
 161 '``llvm.nvvm.ptr.gen.to.*``' Intrinsics
 162 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 163
 164 Syntax:
 165 """""""
 166
 167 These are overloaded intrinsics.  You can use these on any pointer types.
 168
 169 .. code-block:: llvm
 170
 171     declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
 172     declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
 173     declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
 174     declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
 175
 176 Overview:
 177 """""""""
 178
 179 The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
 180 address space to a pointer in the target address space.  Note that these
 181 intrinsics are only useful if the address space of the target address space of
 182 the pointer is known.  It is not legal to use address space conversion
 183 intrinsics to convert a pointer from one non-generic address space to another
 184 non-generic address space.
 185
 186 Semantics:
 187 """"""""""
 188
 189 These intrinsics modify the pointer value to be a valid pointer in the target
 190 non-generic address space.
 191
 192
 193 Reading PTX Special Registers
 194 -----------------------------
 195
 196 '``llvm.nvvm.read.ptx.sreg.*``'
 197 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 198
 199 Syntax:
 200 """""""
 201
 202 .. code-block:: llvm
 203
 204     declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
 205     declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
 206     declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
 207     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 208     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
 209     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
 210     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
 211     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
 212     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
 213     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
 214     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
 215     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
 216     declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
 217
 218 Overview:
 219 """""""""
 220
 221 The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
 222 special registers, in particular the kernel launch bounds.  These registers
 223 map in the following way to CUDA builtins:
 224
 225    ============ =====================================
 226    CUDA Builtin PTX Special Register Intrinsic
 227    ============ =====================================
 228    ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
 229    ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
 230    ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
 231    ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
 232    ============ =====================================
 233
 234
 235 Barriers
 236 --------
 237
 238 '``llvm.nvvm.barrier0``'
 239 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 240
 241 Syntax:
 242 """""""
 243
 244 .. code-block:: llvm
 245
 246   declare void @llvm.nvvm.barrier0()
 247
 248 Overview:
 249 """""""""
 250
 251 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 252 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
 253
 254
 255 Other Intrinsics
 256 ----------------
 257
 258 For the full set of NVPTX intrinsics, please see the
 259 ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
 260
 261
 262 .. _libdevice:
 263
 264 Linking with Libdevice
 265 ======================
 266
 267 The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
 268 implements many common mathematical functions. This library can be used as a
 269 high-performance math library for any compilers using the LLVM NVPTX target.
 270 The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
 271 there is a separate version for each compute architecture.
 272
 273 For a list of all math functions implemented in libdevice, see
 274 `libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
 275
 276 To accommodate various math-related compiler flags that can affect code
 277 generation of libdevice code, the library code depends on a special LLVM IR
 278 pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
 279 pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
 280 with constants based on the defined reflection parameters. Such conditional
 281 code often follows a pattern:
 282
 283 .. code-block:: c++
 284
 285   float my_function(float a) {
 286     if (__nvvm_reflect("FASTMATH"))
 287       return my_function_fast(a);
 288     else
 289       return my_function_precise(a);
 290   }
 291
 292 The default value for all unspecified reflection parameters is zero.
 293
 294 The ``NVVMReflect`` pass should be executed early in the optimization
 295 pipeline, immediately after the link stage. The ``internalize`` pass is also
 296 recommended to remove unused math functions from the resulting PTX. For an
 297 input IR module ``module.bc``, the following compilation flow is recommended:
 298
 299 1. Save list of external functions in ``module.bc``
 300 2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
 301 3. Internalize all functions not in list from (1)
 302 4. Eliminate all unused internal functions
 303 5. Run ``NVVMReflect`` pass
 304 6. Run standard optimization pipeline
 305
 306 .. note::
 307
 308   ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
 309   libdevice functions. It is possible to link two IR modules that have been
 310   linked against libdevice using different reflection variables.
 311
 312 Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
 313 often leave behind dead code of the form:
 314
 315 .. code-block:: llvm
 316
 317   entry:
 318     ..
 319     br i1 true, label %foo, label %bar
 320   foo:
 321     ..
 322   bar:
 323     ; Dead code
 324     ..
 325
 326 Therefore, it is recommended that ``NVVMReflect`` is executed early in the
 327 optimization pipeline before dead-code elimination.
 328
 329 The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
 330 of your pass manager; just use the following code when setting up your pass
 331 manager and the PassBuilder will use ``registerPassBuilderCallbacks`` to let
 332 NVPTXTargetMachine::registerPassBuilderCallbacks add the the pass to the
 333 pass manager:
 334
 335 .. code-block:: c++
 336
 337     std::unique_ptr<TargetMachine> TM = ...;
 338     PassBuilder PB(TM);
 339     ModulePassManager MPM;
 340     PB.parsePassPipeline(MPM, ...);
 341
 342 Reflection Parameters
 343 ---------------------
 344
 345 The libdevice library currently uses the following reflection parameters to
 346 control code generation:
 347
 348 ==================== ======================================================
 349 Flag                 Description
 350 ==================== ======================================================
 351 ``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
 352 ==================== ======================================================
 353
 354 The value of this flag is determined by the "nvvm-reflect-ftz" module flag.
 355 The following sets the ftz flag to 1.
 356
 357 .. code-block:: llvm
 358
 359     !llvm.module.flag = !{!0}
 360     !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
 361
 362 (``i32 4`` indicates that the value set here overrides the value in another
 363 module we link with.  See the `LangRef <LangRef.html#module-flags-metadata>`
 364 for details.)
 365
 366 Executing PTX
 367 =============
 368
 369 The most common way to execute PTX assembly on a GPU device is to use the CUDA
 370 Driver API. This API is a low-level interface to the GPU driver and allows for
 371 JIT compilation of PTX code to native GPU machine code.
 372
 373 Initializing the Driver API:
 374
 375 .. code-block:: c++
 376
 377     CUdevice device;
 378     CUcontext context;
 379
 380     // Initialize the driver API
 381     cuInit(0);
 382     // Get a handle to the first compute device
 383     cuDeviceGet(&device, 0);
 384     // Create a compute device context
 385     cuCtxCreate(&context, 0, device);
 386
 387 JIT compiling a PTX string to a device binary:
 388
 389 .. code-block:: c++
 390
 391     CUmodule module;
 392     CUfunction function;
 393
 394     // JIT compile a null-terminated PTX string
 395     cuModuleLoadData(&module, (void*)PTXString);
 396
 397     // Get a handle to the "myfunction" kernel function
 398     cuModuleGetFunction(&function, module, "myfunction");
 399
 400 For full examples of executing PTX assembly, please see the `CUDA Samples
 401 <https://developer.nvidia.com/cuda-downloads>`_ distribution.
 402
 403
 404 Common Issues
 405 =============
 406
 407 ptxas complains of undefined function: __nvvm_reflect
 408 -----------------------------------------------------
 409
 410 When linking with libdevice, the ``NVVMReflect`` pass must be used. See
 411 :ref:`libdevice` for more information.
 412
 413
 414 Tutorial: A Simple Compute Kernel
 415 =================================
 416
 417 To start, let us take a look at a simple compute kernel written directly in
 418 LLVM IR. The kernel implements vector addition, where each thread computes one
 419 element of the output vector C from the input vectors A and B.  To make this
 420 easier, we also assume that only a single CTA (thread block) will be launched,
 421 and that it will be one dimensional.
 422
 423
 424 The Kernel
 425 ----------
 426
 427 .. code-block:: llvm
 428
 429   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 430   target triple = "nvptx64-nvidia-cuda"
 431
 432   ; Intrinsic to read X component of thread ID
 433   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 434
 435   define void @kernel(float addrspace(1)* %A,
 436                       float addrspace(1)* %B,
 437                       float addrspace(1)* %C) {
 438   entry:
 439     ; What is my ID?
 440     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 441
 442     ; Compute pointers into A, B, and C
 443     %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
 444     %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
 445     %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
 446
 447     ; Read A, B
 448     %valA = load float, float addrspace(1)* %ptrA, align 4
 449     %valB = load float, float addrspace(1)* %ptrB, align 4
 450
 451     ; Compute C = A + B
 452     %valC = fadd float %valA, %valB
 453
 454     ; Store back to C
 455     store float %valC, float addrspace(1)* %ptrC, align 4
 456
 457     ret void
 458   }
 459
 460   !nvvm.annotations = !{!0}
 461   !0 = !{void (float addrspace(1)*,
 462                float addrspace(1)*,
 463                float addrspace(1)*)* @kernel, !"kernel", i32 1}
 464
 465
 466 We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
 467
 468 .. code-block:: text
 469
 470   # llc -mcpu=sm_20 kernel.ll -o kernel.ptx
 471
 472
 473 .. note::
 474
 475   If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
 476   in the module data layout string and use ``nvptx-nvidia-cuda`` as the
 477   target triple.
 478
 479
 480 The output we get from ``llc`` (as of LLVM 3.4):
 481
 482 .. code-block:: text
 483
 484   //
 485   // Generated by LLVM NVPTX Back-End
 486   //
 487
 488   .version 3.1
 489   .target sm_20
 490   .address_size 64
 491
 492     // .globl kernel
 493                                           // @kernel
 494   .visible .entry kernel(
 495     .param .u64 kernel_param_0,
 496     .param .u64 kernel_param_1,
 497     .param .u64 kernel_param_2
 498   )
 499   {
 500     .reg .f32   %f<4>;
 501     .reg .s32   %r<2>;
 502     .reg .s64   %rl<8>;
 503
 504   // %bb.0:                                // %entry
 505     ld.param.u64    %rl1, [kernel_param_0];
 506     mov.u32         %r1, %tid.x;
 507     mul.wide.s32    %rl2, %r1, 4;
 508     add.s64         %rl3, %rl1, %rl2;
 509     ld.param.u64    %rl4, [kernel_param_1];
 510     add.s64         %rl5, %rl4, %rl2;
 511     ld.param.u64    %rl6, [kernel_param_2];
 512     add.s64         %rl7, %rl6, %rl2;
 513     ld.global.f32   %f1, [%rl3];
 514     ld.global.f32   %f2, [%rl5];
 515     add.f32         %f3, %f1, %f2;
 516     st.global.f32   [%rl7], %f3;
 517     ret;
 518   }
 519
 520
 521 Dissecting the Kernel
 522 ---------------------
 523
 524 Now let us dissect the LLVM IR that makes up this kernel.
 525
 526 Data Layout
 527 ^^^^^^^^^^^
 528
 529 The data layout string determines the size in bits of common data types, their
 530 ABI alignment, and their storage size.  For NVPTX, you should use one of the
 531 following:
 532
 533 32-bit PTX:
 534
 535 .. code-block:: llvm
 536
 537   target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 538
 539 64-bit PTX:
 540
 541 .. code-block:: llvm
 542
 543   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 544
 545
 546 Target Intrinsics
 547 ^^^^^^^^^^^^^^^^^
 548
 549 In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
 550 read the X component of the current thread's ID, which corresponds to a read
 551 of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
 552 intrinsics.  A short list is shown below; please see
 553 ``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
 554
 555
 556 ================================================ ====================
 557 Intrinsic                                        CUDA Equivalent
 558 ================================================ ====================
 559 ``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
 560 ``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
 561 ``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
 562 ``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
 563 ``void @llvm.nvvm.barrier0()``                   __syncthreads()
 564 ================================================ ====================
 565
 566
 567 Address Spaces
 568 ^^^^^^^^^^^^^^
 569
 570 You may have noticed that all of the pointer types in the LLVM IR example had
 571 an explicit address space specifier. What is address space 1? NVIDIA GPU
 572 devices (generally) have four types of memory:
 573
 574 - Global: Large, off-chip memory
 575 - Shared: Small, on-chip memory shared among all threads in a CTA
 576 - Local: Per-thread, private memory
 577 - Constant: Read-only memory shared across all threads
 578
 579 These different types of memory are represented in LLVM IR as address spaces.
 580 There is also a fifth address space used by the NVPTX code generator that
 581 corresponds to the "generic" address space.  This address space can represent
 582 addresses in any other address space (with a few exceptions).  This allows
 583 users to write IR functions that can load/store memory using the same
 584 instructions. Intrinsics are provided to convert pointers between the generic
 585 and non-generic address spaces.
 586
 587 See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
 588
 589
 590 Kernel Metadata
 591 ^^^^^^^^^^^^^^^
 592
 593 In PTX, a function can be either a `kernel` function (callable from the host
 594 program), or a `device` function (callable only from GPU code). You can think
 595 of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
 596 function as a `kernel` function, we make use of special LLVM metadata. The
 597 NVPTX back-end will look for a named metadata node called
 598 ``nvvm.annotations``. This named metadata must contain a list of metadata that
 599 describe the IR. For our purposes, we need to declare a metadata node that
 600 assigns the "kernel" attribute to the LLVM IR function that should be emitted
 601 as a PTX `kernel` function. These metadata nodes take the form:
 602
 603 .. code-block:: text
 604
 605   !{<function ref>, metadata !"kernel", i32 1}
 606
 607 For the previous example, we have:
 608
 609 .. code-block:: llvm
 610
 611   !nvvm.annotations = !{!0}
 612   !0 = !{void (float addrspace(1)*,
 613                float addrspace(1)*,
 614                float addrspace(1)*)* @kernel, !"kernel", i32 1}
 615
 616 Here, we have a single metadata declaration in ``nvvm.annotations``. This
 617 metadata annotates our ``@kernel`` function with the ``kernel`` attribute.
 618
 619
 620 Running the Kernel
 621 ------------------
 622
 623 Generating PTX from LLVM IR is all well and good, but how do we execute it on
 624 a real GPU device? The CUDA Driver API provides a convenient mechanism for
 625 loading and JIT compiling PTX to a native GPU device, and launching a kernel.
 626 The API is similar to OpenCL.  A simple example showing how to load and
 627 execute our vector addition code is shown below. Note that for brevity this
 628 code does not perform much error checking!
 629
 630 .. note::
 631
 632   You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
 633   compile PTX to machine code (SASS) for a specific GPU architecture. Such
 634   binaries can be loaded by the CUDA Driver API in the same way as PTX. This
 635   can be useful for reducing startup time by precompiling the PTX kernels.
 636
 637
 638 .. code-block:: c++
 639
 640   #include <iostream>
 641   #include <fstream>
 642   #include <cassert>
 643   #include "cuda.h"
 644
 645
 646   void checkCudaErrors(CUresult err) {
 647     assert(err == CUDA_SUCCESS);
 648   }
 649
 650   /// main - Program entry point
 651   int main(int argc, char **argv) {
 652     CUdevice    device;
 653     CUmodule    cudaModule;
 654     CUcontext   context;
 655     CUfunction  function;
 656     CUlinkState linker;
 657     int         devCount;
 658
 659     // CUDA initialization
 660     checkCudaErrors(cuInit(0));
 661     checkCudaErrors(cuDeviceGetCount(&devCount));
 662     checkCudaErrors(cuDeviceGet(&device, 0));
 663
 664     char name[128];
 665     checkCudaErrors(cuDeviceGetName(name, 128, device));
 666     std::cout << "Using CUDA Device [0]: " << name << "\n";
 667
 668     int devMajor, devMinor;
 669     checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
 670     std::cout << "Device Compute Capability: "
 671               << devMajor << "." << devMinor << "\n";
 672     if (devMajor < 2) {
 673       std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
 674       return 1;
 675     }
 676
 677     std::ifstream t("kernel.ptx");
 678     if (!t.is_open()) {
 679       std::cerr << "kernel.ptx not found\n";
 680       return 1;
 681     }
 682     std::string str((std::istreambuf_iterator<char>(t)),
 683                       std::istreambuf_iterator<char>());
 684
 685     // Create driver context
 686     checkCudaErrors(cuCtxCreate(&context, 0, device));
 687
 688     // Create module for object
 689     checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
 690
 691     // Get kernel function
 692     checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
 693
 694     // Device data
 695     CUdeviceptr devBufferA;
 696     CUdeviceptr devBufferB;
 697     CUdeviceptr devBufferC;
 698
 699     checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
 700     checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
 701     checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
 702
 703     float* hostA = new float[16];
 704     float* hostB = new float[16];
 705     float* hostC = new float[16];
 706
 707     // Populate input
 708     for (unsigned i = 0; i != 16; ++i) {
 709       hostA[i] = (float)i;
 710       hostB[i] = (float)(2*i);
 711       hostC[i] = 0.0f;
 712     }
 713
 714     checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
 715     checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
 716
 717
 718     unsigned blockSizeX = 16;
 719     unsigned blockSizeY = 1;
 720     unsigned blockSizeZ = 1;
 721     unsigned gridSizeX  = 1;
 722     unsigned gridSizeY  = 1;
 723     unsigned gridSizeZ  = 1;
 724
 725     // Kernel parameters
 726     void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
 727
 728     std::cout << "Launching kernel\n";
 729
 730     // Kernel launch
 731     checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
 732                                    blockSizeX, blockSizeY, blockSizeZ,
 733                                    0, NULL, KernelParams, NULL));
 734
 735     // Retrieve device data
 736     checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
 737
 738
 739     std::cout << "Results:\n";
 740     for (unsigned i = 0; i != 16; ++i) {
 741       std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
 742     }
 743
 744
 745     // Clean up after ourselves
 746     delete [] hostA;
 747     delete [] hostB;
 748     delete [] hostC;
 749
 750     // Clean-up
 751     checkCudaErrors(cuMemFree(devBufferA));
 752     checkCudaErrors(cuMemFree(devBufferB));
 753     checkCudaErrors(cuMemFree(devBufferC));
 754     checkCudaErrors(cuModuleUnload(cudaModule));
 755     checkCudaErrors(cuCtxDestroy(context));
 756
 757     return 0;
 758   }
 759
 760
 761 You will need to link with the CUDA driver and specify the path to cuda.h.
 762
 763 .. code-block:: text
 764
 765   # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda
 766
 767 We don't need to specify a path to ``libcuda.so`` since this is installed in a
 768 system location by the driver, not the CUDA toolkit.
 769
 770 If everything goes as planned, you should see the following output when
 771 running the compiled program:
 772
 773 .. code-block:: text
 774
 775   Using CUDA Device [0]: GeForce GTX 680
 776   Device Compute Capability: 3.0
 777   Launching kernel
 778   Results:
 779   0 + 0 = 0
 780   1 + 2 = 3
 781   2 + 4 = 6
 782   3 + 6 = 9
 783   4 + 8 = 12
 784   5 + 10 = 15
 785   6 + 12 = 18
 786   7 + 14 = 21
 787   8 + 16 = 24
 788   9 + 18 = 27
 789   10 + 20 = 30
 790   11 + 22 = 33
 791   12 + 24 = 36
 792   13 + 26 = 39
 793   14 + 28 = 42
 794   15 + 30 = 45
 795
 796 .. note::
 797
 798   You will likely see a different device identifier based on your hardware
 799
 800
 801 Tutorial: Linking with Libdevice
 802 ================================
 803
 804 In this tutorial, we show a simple example of linking LLVM IR with the
 805 libdevice library. We will use the same kernel as the previous tutorial,
 806 except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
 807 Libdevice provides an ``__nv_powf`` function that we will use.
 808
 809 .. code-block:: llvm
 810
 811   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 812   target triple = "nvptx64-nvidia-cuda"
 813
 814   ; Intrinsic to read X component of thread ID
 815   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 816   ; libdevice function
 817   declare float @__nv_powf(float, float)
 818
 819   define void @kernel(float addrspace(1)* %A,
 820                       float addrspace(1)* %B,
 821                       float addrspace(1)* %C) {
 822   entry:
 823     ; What is my ID?
 824     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 825
 826     ; Compute pointers into A, B, and C
 827     %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
 828     %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
 829     %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
 830
 831     ; Read A, B
 832     %valA = load float, float addrspace(1)* %ptrA, align 4
 833     %valB = load float, float addrspace(1)* %ptrB, align 4
 834
 835     ; Compute C = pow(A, B)
 836     %valC = call float @__nv_powf(float %valA, float %valB)
 837
 838     ; Store back to C
 839     store float %valC, float addrspace(1)* %ptrC, align 4
 840
 841     ret void
 842   }
 843
 844   !nvvm.annotations = !{!0}
 845   !0 = !{void (float addrspace(1)*,
 846                float addrspace(1)*,
 847                float addrspace(1)*)* @kernel, !"kernel", i32 1}
 848
 849
 850 To compile this kernel, we perform the following steps:
 851
 852 1. Link with libdevice
 853 2. Internalize all but the public kernel function
 854 3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
 855 4. Optimize the linked module
 856 5. Codegen the module
 857
 858
 859 These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
 860 tools. In a complete compiler, these steps can also be performed entirely
 861 programmatically by setting up an appropriate pass configuration (see
 862 :ref:`libdevice`).
 863
 864 .. code-block:: text
 865
 866   # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
 867   # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
 868   # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx
 869
 870 .. note::
 871
 872   The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any
 873   undefined variables will default to zero. It is shown here for evaluation
 874   purposes.
 875
 876
 877 This gives us the following PTX (excerpt):
 878
 879 .. code-block:: text
 880
 881   //
 882   // Generated by LLVM NVPTX Back-End
 883   //
 884
 885   .version 3.1
 886   .target sm_20
 887   .address_size 64
 888
 889     // .globl kernel
 890                                           // @kernel
 891   .visible .entry kernel(
 892     .param .u64 kernel_param_0,
 893     .param .u64 kernel_param_1,
 894     .param .u64 kernel_param_2
 895   )
 896   {
 897     .reg .pred  %p<30>;
 898     .reg .f32   %f<111>;
 899     .reg .s32   %r<21>;
 900     .reg .s64   %rl<8>;
 901
 902   // %bb.0:                                // %entry
 903     ld.param.u64  %rl2, [kernel_param_0];
 904     mov.u32   %r3, %tid.x;
 905     ld.param.u64  %rl3, [kernel_param_1];
 906     mul.wide.s32  %rl4, %r3, 4;
 907     add.s64   %rl5, %rl2, %rl4;
 908     ld.param.u64  %rl6, [kernel_param_2];
 909     add.s64   %rl7, %rl3, %rl4;
 910     add.s64   %rl1, %rl6, %rl4;
 911     ld.global.f32   %f1, [%rl5];
 912     ld.global.f32   %f2, [%rl7];
 913     setp.eq.f32 %p1, %f1, 0f3F800000;
 914     setp.eq.f32 %p2, %f2, 0f00000000;
 915     or.pred   %p3, %p1, %p2;
 916     @%p3 bra  BB0_1;
 917     bra.uni   BB0_2;
 918   BB0_1:
 919     mov.f32   %f110, 0f3F800000;
 920     st.global.f32   [%rl1], %f110;
 921     ret;
 922   BB0_2:                                  // %__nv_isnanf.exit.i
 923     abs.f32   %f4, %f1;
 924     setp.gtu.f32  %p4, %f4, 0f7F800000;
 925     @%p4 bra  BB0_4;
 926   // %bb.3:                                // %__nv_isnanf.exit5.i
 927     abs.f32   %f5, %f2;
 928     setp.le.f32 %p5, %f5, 0f7F800000;
 929     @%p5 bra  BB0_5;
 930   BB0_4:                                  // %.critedge1.i
 931     add.f32   %f110, %f1, %f2;
 932     st.global.f32   [%rl1], %f110;
 933     ret;
 934   BB0_5:                                  // %__nv_isinff.exit.i
 935
 936     ...
 937
 938   BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
 939     mul.f32   %f90, %f107, 0f3FB8AA3B;
 940     cvt.rzi.f32.f32 %f91, %f90;
 941     mov.f32   %f92, 0fBF317200;
 942     fma.rn.f32  %f93, %f91, %f92, %f107;
 943     mov.f32   %f94, 0fB5BFBE8E;
 944     fma.rn.f32  %f95, %f91, %f94, %f93;
 945     mul.f32   %f89, %f95, 0f3FB8AA3B;
 946     // inline asm
 947     ex2.approx.ftz.f32 %f88,%f89;
 948     // inline asm
 949     add.f32   %f96, %f91, 0f00000000;
 950     ex2.approx.f32  %f97, %f96;
 951     mul.f32   %f98, %f88, %f97;
 952     setp.lt.f32 %p15, %f107, 0fC2D20000;
 953     selp.f32  %f99, 0f00000000, %f98, %p15;
 954     setp.gt.f32 %p16, %f107, 0f42D20000;
 955     selp.f32  %f110, 0f7F800000, %f99, %p16;
 956     setp.eq.f32 %p17, %f110, 0f7F800000;
 957     @%p17 bra   BB0_28;
 958   // %bb.27:
 959     fma.rn.f32  %f110, %f110, %f108, %f110;
 960   BB0_28:                                 // %__internal_accurate_powf.exit.i
 961     setp.lt.f32 %p18, %f1, 0f00000000;
 962     setp.eq.f32 %p19, %f3, 0f3F800000;
 963     and.pred    %p20, %p18, %p19;
 964     @!%p20 bra  BB0_30;
 965     bra.uni   BB0_29;
 966   BB0_29:
 967     mov.b32    %r9, %f110;
 968     xor.b32   %r10, %r9, -2147483648;
 969     mov.b32    %f110, %r10;
 970   BB0_30:                                 // %__nv_powf.exit
 971     st.global.f32   [%rl1], %f110;
 972     ret;
 973   }