doc/languages-frameworks/cuda.section.md

   1 # CUDA {#cuda}
   2
   3 CUDA-only packages are stored in the `cudaPackages` package set. This set
   4 includes the `cudatoolkit`, portions of the toolkit in separate derivations,
   5 `cudnn`, `cutensor` and `nccl`.
   6
   7 A package set is available for each CUDA version, for example
   8 `cudaPackages_11_6`. Each set contains matching versions of the packages listed
   9 above. Other versions of these packages that are packaged and compatible are
  10 available as well. For example, there can be a
  11 `cudaPackages.cudnn_8_3` package.
  12
  13 To use one or more CUDA packages in an expression, give the expression a `cudaPackages` parameter and, in case CUDA is optional, a `cudaSupport` parameter:
  14 ```nix
  15 { config
  16 , cudaSupport ? config.cudaSupport
  17 , cudaPackages ? { }
  18 , ...
  19 }: {}
  20 ```
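
As a rough sketch of how these parameters might be used in the body of an expression, the example below adds CUDA dependencies only when `cudaSupport` is enabled. The package name, version and `src` are placeholders, and the choice of `cuda_nvcc`, `cuda_cudart` and `cudnn` is an assumption made for illustration; pick the components your package actually needs.

```nix
# Sketch only: `example`, its version and `src` are placeholders.
{ config
, lib
, stdenv
, cudaSupport ? config.cudaSupport
, cudaPackages ? { }
}:

stdenv.mkDerivation {
  pname = "example";
  version = "0.0.0";
  src = ./.;

  # The CUDA compiler is a build-time tool, added only when CUDA is enabled.
  nativeBuildInputs = lib.optionals cudaSupport [ cudaPackages.cuda_nvcc ];

  # Libraries are taken from the same package set as the compiler.
  buildInputs = lib.optionals cudaSupport [
    cudaPackages.cuda_cudart
    cudaPackages.cudnn
  ];
}
```

Because Nix is lazy, the attributes of `cudaPackages` are not evaluated when `cudaSupport` is `false`, so the empty default does not cause errors in that case.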
  21
  22 When using `callPackage`, you can choose to pass in a different variant, e.g.
  23 when a different version of the toolkit suffices:
  24 ```nix
  25 {
  26   mypkg = callPackage ./mypkg.nix { cudaPackages = cudaPackages_11_5; }; # ./mypkg.nix is a placeholder path
  27 }
  28 ```
  29
  30 If another version of, say, `cudnn` or `cutensor` is needed, you can override the
  31 package set to make it the default. This guarantees you get a consistent package
  32 set:
  33 ```nix
  34 {
  35   mypkg = let
  36     cudaPackages = cudaPackages_11_5.overrideScope (final: prev: {
  37       cudnn = prev.cudnn_8_3;
  38     });
  39   in callPackage ./mypkg.nix { inherit cudaPackages; }; # ./mypkg.nix is a placeholder path
  40 }
  41 ```
  42
  43 The CUDA NVCC compiler requires flags to determine which hardware you
  44 want to target, in terms of SASS (real hardware) or PTX (JIT kernels).
  45
  46 By default, Nixpkgs tries to target the real architectures appropriate for the
  47 CUDA toolkit version, with PTX support for future hardware. Experienced
  48 users may optimize this configuration for a variety of reasons such as
  49 reducing binary size and compile time, supporting legacy hardware, or
  50 optimizing for specific hardware.
  51
  52 You may provide capabilities to add support or reduce binary size through
  53 `config` using `cudaCapabilities = [ "6.0" "7.0" ];` and
  54 `cudaForwardCompat = true;` if you want PTX support for future hardware.
  55
  56 Please consult [GPUs supported](https://en.wikipedia.org/wiki/CUDA#GPUs_supported)
  57 for your specific card(s).
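
As a minimal sketch, the options above could be passed when instantiating Nixpkgs as shown below. The capability values are the ones used in the text; `allowUnfree` is included on the assumption that the CUDA packages you need are unfree, and `<nixpkgs>` assumes a channel-based setup.

```nix
# Sketch: instantiate Nixpkgs with CUDA enabled for two specific capabilities.
import <nixpkgs> {
  config = {
    allowUnfree = true;       # assumed to be required for the CUDA packages
    cudaSupport = true;
    cudaCapabilities = [ "6.0" "7.0" ];
    cudaForwardCompat = true; # also request PTX support for future hardware
  };
}
```

Listing fewer capabilities generally reduces binary size and compile time, at the cost of supporting fewer GPUs.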
  58
  59 Library maintainers should consult [NVCC Docs](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/)
  60 and release notes for their software package.
  61
  62 ## Adding a new CUDA release {#adding-a-new-cuda-release}
  63
  64 > **WARNING**
  65 >
  66 > This section of the docs is still very much in progress. Feedback is welcome in GitHub Issues tagging @NixOS/cuda-maintainers or on [Matrix](https://matrix.to/#/#cuda:nixos.org).
  67
  68 The CUDA Toolkit is a suite of CUDA libraries and software meant to provide a development environment for CUDA-accelerated applications. Until the release of CUDA 11.4, NVIDIA had only made the CUDA Toolkit available as a multi-gigabyte runfile installer, which we provide through the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute. From CUDA 11.4 and onwards, NVIDIA has also provided CUDA redistributables (“CUDA-redist”): individually packaged CUDA Toolkit components meant to facilitate redistribution and inclusion in downstream projects. These packages are available in the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
  69
  70 All new projects should use the CUDA redistributables available in [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) in place of [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit), as they are much easier to maintain and update.
  71
  72 ### Updating CUDA redistributables {#updating-cuda-redistributables}
  73
  74 1. Go to NVIDIA's index of CUDA redistributables: <https://developer.download.nvidia.com/compute/cuda/redist/>
  75 2. Make a note of the new version of CUDA available.
  76 3. Run
  77
  78    ```bash
  79    nix run github:connorbaker/cuda-redist-find-features -- \
  80       download-manifests \
  81       --log-level DEBUG \
  82       --version <newest CUDA version> \
  83       https://developer.download.nvidia.com/compute/cuda/redist \
  84       ./pkgs/development/cuda-modules/cuda/manifests
  85    ```
  86
  87    This will download a copy of the manifest for the new version of CUDA.
  88 4. Run
  89
  90    ```bash
  91    nix run github:connorbaker/cuda-redist-find-features -- \
  92       process-manifests \
  93       --log-level DEBUG \
  94       --version <newest CUDA version> \
  95       https://developer.download.nvidia.com/compute/cuda/redist \
  96       ./pkgs/development/cuda-modules/cuda/manifests
  97    ```
  98
  99    This will generate a `redistrib_features_<newest CUDA version>.json` file in the same directory as the manifest.
 100 5. Update the `cudaVersionMap` attribute set in `pkgs/development/cuda-modules/cuda/extension.nix`.
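
The last step might look roughly like the sketch below. It assumes that `cudaVersionMap` maps a `<major>.<minor>` version to the full version of the corresponding manifest; the version numbers are illustrative only, so check the current contents of `extension.nix` before editing.

```nix
{
  # Assumed shape: keys are <major>.<minor>, values are full manifest versions.
  cudaVersionMap = {
    # ... existing entries ...
    "12.3" = "12.3.0"; # illustrative version numbers only
  };
}
```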
 101
 102 ### Updating cuTensor {#updating-cutensor}
 103
 104 1. Repeat the steps present in [Updating CUDA redistributables](#updating-cuda-redistributables) with the following changes:
 105    - Use the index of cuTensor redistributables: <https://developer.download.nvidia.com/compute/cutensor/redist>
 106    - Use the newest version of cuTensor available instead of the newest version of CUDA.
 107    - Use `pkgs/development/cuda-modules/cutensor/manifests` instead of `pkgs/development/cuda-modules/cuda/manifests`.
 108    - Skip the step of updating `cudaVersionMap` in `pkgs/development/cuda-modules/cuda/extension.nix`.
 109
 110 ### Updating supported compilers and GPUs {#updating-supported-compilers-and-gpus}
 111
 112 1. Update `nvcc-compatibilities.nix` in `pkgs/development/cuda-modules/` to include the newest release of NVCC, as well as any newly supported host compilers.
 113 2. Update `gpus.nix` in `pkgs/development/cuda-modules/` to include any new GPUs supported by the new release of CUDA.
 114
 115 ### Updating the CUDA Toolkit runfile installer {#updating-the-cuda-toolkit}
 116
 117 > **WARNING**
 118 >
 119 > While the CUDA Toolkit runfile installer is still available in Nixpkgs as the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute, its use is not recommended and it should be considered deprecated. Please migrate to the CUDA redistributables provided by the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
 120 >
 121 > To ensure packages relying on the CUDA Toolkit runfile installer continue to build, it will continue to be updated until a migration path is available.
 122
 123 1. Go to NVIDIA's CUDA Toolkit runfile installer download page: <https://developer.nvidia.com/cuda-downloads>
 124 2. Select the appropriate OS, architecture, distribution, version, and installer type.
 125
 126    - For example: Linux, x86_64, Ubuntu, 22.04, runfile (local)
 127    - NOTE: Typically, we use the Ubuntu runfile. It is unclear if the runfile for other distributions will work.
 128
 129 3. Take the link provided by the installer instructions on the webpage after selecting the installer type and get its hash by running:
 130
 131    ```bash
 132    nix store prefetch-file --hash-type sha256 <link>
 133    ```
 134
 135 4. Update `pkgs/development/cuda-modules/cudatoolkit/releases.nix` to include the release.
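
A new entry could be sketched as follows. The attribute names `version`, `url` and `sha256` are assumptions about the layout of `releases.nix`, and every value in angle brackets is a placeholder to be filled in from the previous steps; mirror the existing entries in the file rather than this sketch if they differ.

```nix
{
  # ... existing releases ...
  "<major>.<minor>" = {
    version = "<major>.<minor>.<patch>";
    url = "<link>"; # the runfile link taken from the download page
    sha256 = "<hash printed by nix store prefetch-file>";
  };
}
```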
 136
 137 ### Updating the CUDA package set {#updating-the-cuda-package-set}
 138
 139 1. Include a new `cudaPackages_<major>_<minor>` package set in `pkgs/top-level/all-packages.nix`.
 140
 141    - NOTE: Changing the default CUDA package set should occur in a separate PR, allowing time for additional testing.
 142
 143 2. Successfully build the closure of the new package set, updating `pkgs/development/cuda-modules/cuda/overrides.nix` as needed. Below are some common failures:
 144
 145 | Unable to ... | During ... | Reason | Solution | Note |
 146 | --- | --- | --- | --- | --- |
 147 | Find headers | `configurePhase` or `buildPhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contains the headers |
 148 | Find libraries | `configurePhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contains CMake configuration files |
 149 | Find libraries | `buildPhase` or `patchelf` | Missing dependency on a `lib` or `static` output | Add the missing dependency | The `lib` or `static` output typically contains the libraries |
 150
 151 Being unable to run the resulting binary is arguably the most complicated failure, as it could be caused by any combination of the previous reasons. This type of failure typically occurs when a library attempts to load or open a library it depends on but does not declare in its `DT_NEEDED` section. As a first step, ensure that dependencies are patched with [`autoAddDriverRunpath`](https://search.nixos.org/packages?channel=unstable&type=packages&query=autoAddDriverRunpath). Failing that, try running the application with [`nixGL`](https://github.com/guibou/nixGL) or a similar wrapper tool. If that works, it likely means that the application is attempting to load a library that is not in the `RPATH` or `RUNPATH` of the binary.
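
As a hedged illustration of the first step, `autoAddDriverRunpath` can be added to `nativeBuildInputs` so that its hook runs on the package's outputs. The package below is a placeholder, and the CUDA components listed are assumptions chosen for the example.

```nix
# Sketch only: `example`, its version and `src` are placeholders.
{ stdenv, autoAddDriverRunpath, cudaPackages }:

stdenv.mkDerivation {
  pname = "example";
  version = "0.0.0";
  src = ./.;

  nativeBuildInputs = [
    cudaPackages.cuda_nvcc
    # Hook that patches the built ELF files so that the driver libraries
    # can be located at run time.
    autoAddDriverRunpath
  ];

  buildInputs = [ cudaPackages.cuda_cudart ];
}
```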
 152
 153 ## Running Docker or Podman containers with CUDA support {#running-docker-or-podman-containers-with-cuda-support}
 154
 155 It is possible to run Docker or Podman containers with CUDA support. The recommended mechanism to perform this task is to use the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html).
 156
 157 The NVIDIA Container Toolkit can be enabled in NixOS as follows:
 158
 159 ```nix
 160 {
 161   hardware.nvidia-container-toolkit.enable = true;
 162 }
 163 ```
 164
 165 This will automatically enable a service that generates a CDI specification (located at `/var/run/cdi/nvidia-container-toolkit.json`) based on the auto-detected hardware of your machine. You can check this service by running:
 166
 167 ```ShellSession
 168 $ systemctl status nvidia-container-toolkit-cdi-generator.service
 169 ```
 170
 171 ::: {.note}
 172 Depending on what settings you had already enabled in your system, you might need to restart your machine in order for the NVIDIA Container Toolkit to generate a valid CDI specification for your machine.
 173 :::
 174
 175 Once a valid CDI specification has been generated for your machine at boot time, both Podman and Docker (> 25) will use this spec if you provide them with the `--device` flag:
 176
 177 ```ShellSession
 178 $ podman run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L
 179 GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
 180 GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: <REDACTED>)
 181 ```
 182
 183 ```ShellSession
 184 $ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L
 185 GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
 186 GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: <REDACTED>)
 187 ```
 188
 189 You can check all the identifiers that have been generated for your auto-detected hardware by checking the contents of the `/var/run/cdi/nvidia-container-toolkit.json` file:
 190
 191 ```ShellSession
 192 $ nix run nixpkgs#jq -- -r '.devices[].name' < /var/run/cdi/nvidia-container-toolkit.json
 193 0
 194 1
 195 all
 196 ```
 197
 198 ### Specifying what devices to expose to the container {#specifying-what-devices-to-expose-to-the-container}
 199
 200 You can choose which devices are exposed to your containers by using the identifiers in the generated CDI specification, as follows:
 201
 202 ```ShellSession
 203 $ podman run --rm -it --device=nvidia.com/gpu=0 ubuntu:latest nvidia-smi -L
 204 GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
 205 ```
 206
 207 You can repeat the `--device` argument as many times as necessary if you have multiple GPUs and want to pick which ones to expose to the container:
 208
 209 ```ShellSession
 210 $ podman run --rm -it --device=nvidia.com/gpu=0 --device=nvidia.com/gpu=1 ubuntu:latest nvidia-smi -L
 211 GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
 212 GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: <REDACTED>)
 213 ```
 214
 215 ::: {.note}
 216 By default, the NVIDIA Container Toolkit will use the GPU index to identify specific devices. You can change the way to identify what devices to expose by using the `hardware.nvidia-container-toolkit.device-name-strategy` NixOS attribute.
 217 :::
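
For instance, a configuration along these lines would switch away from index-based names. The value `"uuid"` is an assumption about what the option accepts; consult the option's documentation for the list of supported strategies.

```nix
{
  hardware.nvidia-container-toolkit = {
    enable = true;
    # Assumed value: identify devices by UUID instead of by index.
    device-name-strategy = "uuid";
  };
}
```

After changing the strategy, the identifiers listed in the generated CDI specification change accordingly, so `--device` arguments need to be updated to match.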
 218
 219 ### Using docker-compose {#using-docker-compose}
 220
 221 It's possible to expose GPUs to a `docker-compose` environment as well, with a `docker-compose.yaml` file like the following:
 222
 223 ```yaml
 224 services:
 225   some-service:
 226     image: ubuntu:latest
 227     command: sleep infinity
 228     deploy:
 229       resources:
 230         reservations:
 231           devices:
 232           - driver: cdi
 233             device_ids:
 234             - nvidia.com/gpu=all
 235 ```
 236
 237 In the same manner, you can pick specific devices that will be exposed to the container:
 238
 239 ```yaml
 240 services:
 241   some-service:
 242     image: ubuntu:latest
 243     command: sleep infinity
 244     deploy:
 245       resources:
 246         reservations:
 247           devices:
 248           - driver: cdi
 249             device_ids:
 250             - nvidia.com/gpu=0
 251             - nvidia.com/gpu=1
 252 ```