Hello there!

The Compatibility between CUDA, GPU, Base Image, and PyTorch

TL;DR

  • Host CUDA vs. base image CUDA: The CUDA version inside a runtime Docker image has no relationship with the CUDA version on the host machine. The only thing we need to care about is whether the driver version on the host supports the base image’s CUDA runtime. Check the driver compatibility here
  • PyTorch vs. CUDA: Each PyTorch release is compatible with one or a few specific CUDA versions, more precisely, CUDA runtime APIs. Check the compatibility matrix here
  • CUDA vs. GPU: Each GPU architecture is compatible with certain CUDA versions, more precisely, CUDA driver versions. Quick check here
  • PyTorch vs. GPU: PyTorch only supports the GPU architectures specified in TORCH_CUDA_ARCH_LIST at compile time

The relationship between the CUDA version, GPU architecture, and PyTorch version can be a bit complex but is crucial for the proper functioning of PyTorch-based deep learning tasks on a GPU.

Suppose you’re planning to deploy your awesome service on an NVIDIA A100-PCIE-40Gb server with CUDA 11.2 and Driver Version 460.32.03. You’ve built your service using PyTorch 1.12.1, and your Docker image is built on top of an NVIDIA base image, specifically nvidia/cuda:10.2-base-ubuntu20.04. How can you judge whether your service will run smoothly on that machine without trial and error?

To clarify this complicated compatibility problem, let’s quickly recap the key terms mentioned above.

Basic Concepts

GPU Architecture

NVIDIA regularly releases new generations of GPUs based on different architectures, such as Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and up to Hopper as of 2023. These architectures have different capabilities and features, specified by their Compute Capability version (e.g., sm_35, sm_60, sm_80, etc.). “sm” stands for “streaming multiprocessor,” a key GPU component responsible for carrying out computations. The number following “sm” represents the architecture’s version. We refer to it as the GPU code in the rest of this post.

For example, “sm_70” corresponds to the Tesla V100 GPU. When you specify a particular architecture with nvcc, the compiler will optimize your code for that architecture. As a result, your compiled code may not be fully compatible with GPUs based on different architectures.
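If PyTorch is already installed on the machine, a quick way to see which compute capability (i.e., which sm_XY) your GPU reports is to query it directly. A minimal sketch, assuming a CUDA-enabled PyTorch build and at least one visible GPU:

```python
# Sketch: query the compute capability ("sm" version) of the local GPU with PyTorch.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0: {torch.cuda.get_device_name(0)}")
    print(f"Compute capability: sm_{major}{minor}")  # e.g. sm_80 on an A100
else:
    print("No CUDA device visible to PyTorch")
```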

You can find more detailed explanations in this post.

CUDA Version

The terms “CUDA” and “CUDA Toolkit” often appear together. “CUDA XX.X” is shorthand for the version of the CUDA Toolkit. It serves as an interface between the software (like PyTorch) and the hardware (like an NVIDIA GPU).

The CUDA Toolkit includes:

  1. Libraries and Utilities: The CUDA Toolkit provides a collection of libraries and utilities that allow developers to build and profile CUDA-enabled applications, such as cuBLAS and the profiling tools.
  2. CUDA Runtime API: The Toolkit includes the CUDA runtime, which provides the application programming interface (API) used for tasks like allocating memory on the GPU, transferring data between the CPU and GPU, and launching kernels (compute functions) on the GPU. CUDA runtime APIs are generally designed to be forward-compatible with newer drivers.
  3. NVCC Compiler: The Toolkit includes the nvcc compiler for compiling CUDA code into GPU-executable code.

PyTorch Version

PyTorch releases are often tightly bound to specific CUDA versions for compatibility and performance reasons.
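If you are unsure which CUDA runtime a given PyTorch build is tied to, you can ask the build itself. A minimal sketch, assuming a CUDA-enabled PyTorch wheel is installed:

```python
# Print the versions a PyTorch build was compiled against.
import torch

print(torch.__version__)               # e.g. 1.12.1
print(torch.version.cuda)              # CUDA runtime the wheel was built with, e.g. '10.2'; None for CPU-only builds
print(torch.backends.cudnn.version())  # bundled cuDNN version as an integer
```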

Base Image

Copied from the NVIDIA Docker homepage:

base: Includes the CUDA runtime (cudart)

runtime: Builds on the base and includes the CUDA math libraries, and NCCL. A runtime image that also includes cuDNN is available.

devel: Builds on the runtime and includes headers, development tools for building CUDA images. These images are particularly useful for multi-stage builds.

Interrelation

CUDA and Base Image

The base image only contains the minimum required dependencies to deploy a pre-built CUDA application. Importantly, there’s no requirement for the CUDA version in the base image to match the CUDA version on the host machine.

Back to our deployment case

  • Our service is built on the nvidia/cuda:10.2-base-ubuntu20.04 image
  • The host machine has a CUDA driver that supports up to CUDA 11.2

In this setup, building the service with the nvidia/cuda:10.2-base-ubuntu20.04 image doesn’t mean a driver supporting CUDA 10.2 is installed inside the image; instead, the container relies on the host’s driver, which supports up to CUDA 11.2.

Therefore, the service container will use the CUDA 10.2 runtime API, and because the host driver (supporting up to CUDA 11.2) is backward-compatible with older CUDA runtime versions, the application should run without any issues.

Therefore, the single critical point you need to consider is:

Whether the driver version on the host supports the base image's CUDA runtime

The CUDA runtime version inside the container must be less than or equal to the highest CUDA version supported by the host’s driver; otherwise you might encounter compatibility issues, and the service will fail to start with an error message such as:

CUDA driver version is insufficient for CUDA runtime version

A compatibility matrix between CUDA versions and driver versions can be found here.
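As a rough self-check, you can read the host driver version from nvidia-smi and compare it with the minimum driver required by the runtime your image ships. A sketch: the mapping below is a hand-copied excerpt of that matrix (verify it against the official table), and nvidia-smi is assumed to be on the PATH:

```python
# Sketch: check whether the host driver is new enough for a given CUDA runtime.
import subprocess

MIN_DRIVER_FOR_RUNTIME = {   # excerpt, Linux x86_64 -- verify against NVIDIA's docs
    "10.2": "440.33",
    "11.0": "450.36.06",
    "11.2": "460.27.03",
    "11.7": "515.43.04",
}

def driver_version() -> tuple:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return tuple(int(x) for x in out.splitlines()[0].strip().split("."))

def driver_supports(runtime: str) -> bool:
    required = tuple(int(x) for x in MIN_DRIVER_FOR_RUNTIME[runtime].split("."))
    return driver_version() >= required

print(driver_supports("10.2"))   # True on a host with driver 460.32.03
```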

Besides, there is one more consideration you should not miss. According to line 16 in the Dockerfile of nvidia/cuda:10.2-base-ubuntu20.04:

ENV NVIDIA_REQUIRE_CUDA=cuda>=10.2

the base image requires the host to support at least CUDA 10.2.

So far,

  • The host supports CUDA 11.2 >= 10.2, so the base image is compatible with the host
  • The host driver 460.32.03 meets the minimum requirement of the CUDA 10.2 runtime

PyTorch and CUDA

Each PyTorch version is compatible with one or a few specific CUDA versions, or more precisely, with the corresponding CUDA runtime API versions. Using an incompatible version may lead to errors or sub-optimal performance.

The following is the Release Compatibility Matrix for PyTorch, copied from here:

PyTorch version | Stable CUDA               | Experimental CUDA
2.1             | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26
2.0             | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84
1.13            | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96
1.12            | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44

The official PyTorch webpage provides three examples of CUDA versions that are compatible with PyTorch 1.12, ranging from CUDA 10.2 to CUDA 11.6. Therefore, PyTorch 1.12.1 in our scenario passes the compatibility test.

So far so good, we have:

  • PyTorch 1.12 is compatible with CUDA 11.2

CUDA and GPU

Each GPU architecture is compatible with certain CUDA versions, or more precisely, CUDA driver versions. For Ampere, the compatibility is shown below, copied from this post:

Ampere (CUDA 11.1 and later)

  • SM80 or SM_80, compute_80 – NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
  • SM86 or SM_86, compute_86 – (from CUDA 11.1 onwards) Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A2000, A3000, RTX A4000, A5000, A6000, NVIDIA A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU

  • SM87 or SM_87, compute_87 – (from CUDA 11.4 onwards, introduced with PTX ISA 7.4 / Driver r470 and newer) – for Jetson AGX Orin and Drive AGX Orin only

We therefore draw the conclusion:

  • NVIDIA A100-PCIE-40Gb is compatible with CUDA 11.2

PyTorch and GPU

A particular version of PyTorch will be compatible only with the set of GPUs whose compatible CUDA versions overlap with the CUDA versions that PyTorch supports.

PyTorch libraries can be compiled from source code into two forms for each kernel: binary cubin objects and forward-compatible PTX assembly. Both cubin and PTX are generated for a specific target compute capability. A cubin generated for a certain compute capability is supported to run on any GPU with the same major revision and the same or higher minor revision of compute capability. For example, a cubin generated for compute capability 7.0 can run on a GPU with compute capability 7.5; however, a cubin generated for compute capability 7.5 cannot run on a GPU with compute capability 7.0, and a cubin generated for compute capability 7.x cannot run on a GPU with compute capability 8.x.

When the developers of PyTorch release a new version, they set an environment variable, TORCH_CUDA_ARCH_LIST, which is read by setup.py. With this variable, they specify which GPU architectures to build for, such as TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0". Remember that the numbers in TORCH_CUDA_ARCH_LIST are not CUDA versions; they refer to NVIDIA GPU architectures (compute capabilities), such as 7.5 for the Turing architecture and 8.x for the Ampere architecture.
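You can inspect, at runtime, which architectures the installed PyTorch binary was actually built for. A small sketch, assuming a CUDA-enabled build:

```python
import torch

# Architectures baked into this PyTorch build (reflects the TORCH_CUDA_ARCH_LIST
# used at compile time). Entries like 'sm_80' are cubins; 'compute_70' means PTX
# was embedded for that architecture.
print(torch.cuda.get_arch_list())
# e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70'] for the failing build in our scenario
```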

Here is a helpful table for reference, credit to dagelf:

nvcc tag                | TORCH_CUDA_ARCH_LIST | GPU Arch        | Year | e.g. GPU
sm_50, sm_52 and sm_53  | 5.0 5.1 5.3          | Maxwell support | 2014 | GTX 9xx
sm_60, sm_61, and sm_62 | 6.0 6.1 6.2          | Pascal support  | 2016 | 10xx, Pxxx
sm_70 and sm_72         | 7.0 7.2              | Volta support   | 2017 | Titan V
sm_75                   | 7.5                  | Turing support  | 2018 | most 20xx
sm_80, sm_86 and sm_87  | 8.0 8.6 8.7          | Ampere support  | 2020 | RTX 30xx, Axx[xx]
sm_89                   | 8.9                  | Ada support     | 2022 | RTX xxxx
sm_90, sm_90a           | 9.0 9.0a             | Hopper support  | 2022 | H100

Back to our scenario: we need to check whether PyTorch 1.12.1 is compatible with the NVIDIA Ampere GPU.

The quickest way to judge compatibility is to check whether the application binary already contains compatible GPU code. As long as the PyTorch libraries are built with GPU arch >= 8.0, or with PTX embedded, or both, in TORCH_CUDA_ARCH_LIST, they should work smoothly with the NVIDIA Ampere GPU architecture.
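Below is a minimal sketch of that check: it compares the running GPU's compute capability with the architectures compiled into the installed PyTorch build. It assumes at least one visible GPU, and the PTX handling is a simplification (embedded PTX for an older architecture can generally be JIT-compiled for a newer GPU):

```python
# Sketch: does the installed PyTorch build contain code usable by the local GPU?
import torch

major, minor = torch.cuda.get_device_capability(0)   # e.g. (8, 0) on an A100
arch_list = torch.cuda.get_arch_list()                # e.g. ['sm_70', 'compute_70', ...]

has_cubin = f"sm_{major}{minor}" in arch_list
# 'compute_XY' entries are embedded PTX, which can be JIT-compiled for newer GPUs.
has_ptx = any(
    a.startswith("compute_")
    and a.split("_")[1].isdigit()
    and int(a.split("_")[1]) <= major * 10 + minor
    for a in arch_list
)

if has_cubin or has_ptx:
    print("This PyTorch build should run on this GPU")
else:
    print(f"No sm_{major}{minor} cubin or compatible PTX in this build")
```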

If the PyTorch libraries you are using are neither compiled with the corresponding TORCH_CUDA_ARCH_LIST nor compiled with PTX, you will see an error like:

A100-PCIE-40Gb with CUDA capability sm_80 is not compatible with current PyTorch installation

The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70

Back to our scenario: this time, the compatibility test fails.

  • PyTorch 1.12.1 is not compatible with the NVIDIA A100-PCIE-40Gb

Conclusion

Now we can say for certain whether the service, built with PyTorch 1.12.1 on top of nvidia/cuda:10.2-base-ubuntu20.04, is compatible with an NVIDIA A100-PCIE-40Gb machine with CUDA 11.2 and Driver Version 460.32.03.

Compatibility       | Status
CUDA and Base Image | Pass
PyTorch and GPU     | Fail
PyTorch and CUDA    | Pass
CUDA and GPU        | Pass

The answer is NO. Then, how do we fix it?

Since the current PyTorch build is not compatible with the A100, we might want to upgrade to PyTorch 1.13.1 or an even later version. Besides, since PyTorch 1.13.1 needs a CUDA runtime API >= 11.6, we also need to upgrade the base image to one with a runtime >= 11.6. To stay compatible with the new CUDA runtime, you may also want to upgrade the host CUDA driver to the latest, like Driver Version 525.116.03 (which supports up to CUDA 12.0), but this is not strictly necessary, according to the NVIDIA compatibility document:

CUDA Toolkit | Linux x86_64 Minimum Required Driver Version | Note
CUDA 11.x    | >= 450.80.02*                                 | CUDA 11.0 was released with an earlier driver version, but by upgrading to Tesla Recommended Drivers 450.80.02 (Linux) / 452.39 (Windows) as indicated, minor version compatibility is possible across the CUDA 11.x family of toolkits.

One good recipe is as below:

host: NVIDIA A100-PCIE-40Gb; no driver update is necessary, but I prefer to keep it updated to the latest version.

service: PyTorch: 1.13.1, base-image: nvidia/cuda:11.7.1-base-ubuntu20.04

The compatibility matrix now passes all checks.

Compatibility       | Status
CUDA and Base Image | Pass
PyTorch and GPU     | Pass
PyTorch and CUDA    | Pass
CUDA and GPU        | Pass
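To confirm the recipe on the actual machine, a short end-to-end smoke test inside the deployed container goes a long way. A sketch, assuming the container was started with --gpus all and ships a CUDA-enabled PyTorch 1.13.1 build:

```python
# Sketch: end-to-end sanity check inside the deployed container.
import torch

assert torch.cuda.is_available(), "driver not visible or runtime/driver mismatch"
print("PyTorch:", torch.__version__)             # expected 1.13.1
print("Built with CUDA:", torch.version.cuda)    # expected 11.7
print("GPU:", torch.cuda.get_device_name(0))
print("Arch list:", torch.cuda.get_arch_list())  # should contain 'sm_80' for an A100

# A tiny kernel launch: fails fast if the build has no usable code for this GPU.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())
```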

One More Thing

Q: I start a container from an image without any CUDA runtime installed inside. After I execute docker run --gpus all <image_name>, I enter the container and find CUDA-related files from the host system, including the CUDA runtime API. My assumption was that --gpus all maps the whole CUDA Toolkit into the CPU-only image, thereby turning it into a CUDA runtime image. However, this assumption seems wrong for a container initialized in the same way from a CUDA 10.2 runtime base image, since all applications inside such a container still use the CUDA 10.2 runtime API, suggesting that the host system’s CUDA runtime isn’t being mapped into the container. What is going on here?

A: When you run a Docker container with the --gpus all flag, you enable that container to access the host’s GPUs. However, this does not mean that all CUDA-related files and libraries from the host are automatically mapped into the container. What happens under the hood may differ based on whether the Docker image itself contains CUDA runtime libraries or not.

Image without CUDA runtime

When you start a container based on an image that doesn’t contain any CUDA runtime libraries, and you use --gpus all, you might observe that certain CUDA functionalities are available in the container. This is often because NVIDIA’s Docker runtime (nvidia-docker) ensures that the minimum necessary libraries and binaries related to the GPU are mounted into the container, including the compatible CUDA driver libraries.

Image with CUDA runtime

If you start a container from an image that already has a specific CUDA runtime version (say, CUDA 10.2), the container will use that version for its operations. NVIDIA’s Docker runtime (nvidia-docker) generally won’t override the CUDA libraries in a container that already has them. The container is designed to be a standalone, consistent environment, and one of the benefits of using containers is that they package the application along with its dependencies, ensuring that it runs the same way regardless of where it’s deployed.
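One way to see this split from inside a container is to ask the dynamic loader what it can find: libcuda.so (the driver library) is injected by the NVIDIA container runtime, while libcudart.so (the CUDA runtime) exists only if the image ships it. A sketch, assuming a Linux container where ldconfig is available:

```python
# Sketch: distinguish the driver library (mounted from the host by the NVIDIA
# container runtime) from the CUDA runtime library (shipped inside the image).
from ctypes.util import find_library

print("libcuda   (driver, mounted by --gpus all):", find_library("cuda"))
print("libcudart (runtime, from the image)      :", find_library("cudart"))
# In an image without a CUDA runtime, the first is typically found and the
# second is None; in an nvidia/cuda runtime image, both are found.
```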

Reference