The Nautilus Cluster provides over 200 GPU nodes. In this section you will request GPUs. Make sure you don't waste them: delete your pods when you are no longer using the GPUs.
Use this definition to create your own pod and deploy it to Kubernetes (refer to Basic Kubernetes):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: gpu-container
    image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
This example requests 1 GPU device. You can request up to 2 GPUs per pod. When you request GPU devices in your pod, Kubernetes will automatically schedule it to an appropriate node; there's no need to specify a node manually.
You should always delete your pod when your computation is done to let other users use the GPUs.
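As a quick sketch of the full cycle (assuming the manifest above is saved as `gpu-pod-example.yaml`, a hypothetical filename):

```bash
# Create the pod from the manifest above
kubectl create -f gpu-pod-example.yaml

# Verify the GPU is visible inside the container (the image ships NVIDIA utilities)
kubectl exec -it gpu-pod-example -- nvidia-smi

# Delete the pod as soon as you are done to free the GPU
kubectl delete pod gpu-pod-example
```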
Consider using Jobs with an actual script instead of sleep whenever possible, to ensure your pod is not wasting GPU time; see the sketch below.
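As a minimal sketch of a GPU Job running a real workload (the image is reused from the pod example above; the `train.py` script and its path are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-example
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-container
        image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
        # Run your actual workload instead of sleeping; the script name is hypothetical
        command: ["python", "/home/jovyan/train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
```

Once the script exits, the Job completes and the GPU is released without anyone having to remember to delete a sleeping pod.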
If you have never used Kubernetes before, see the tutorial.
Certain kinds of GPUs have much higher specs than the others. To avoid wasting them on regular jobs, your pods will only be scheduled on these nodes if you request the GPU type explicitly.
Currently those include:
* A100 GPUs running in MIG mode are not considered high-demand.
Since 1- and 2-GPU jobs can block nodes from receiving 4- and 8-GPU jobs, some nodes are reserved for the larger jobs. Once you submit a job requesting 4 or 8 GPUs, a controller will automatically add a toleration that allows your pod to use those reserved nodes. You don't need to do anything manually.
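No special manifest changes are needed; as a sketch, the container's resources block simply asks for the larger count (everything else stays as in the pod example above):

```yaml
    resources:
      limits:
        # The controller adds the toleration for the reserved nodes automatically
        nvidia.com/gpu: 8
```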
We have a variety of GPU flavors attached to Nautilus. You can get a list of GPU models from the actual cluster information (e.g. `kubectl get nodes -L nvidia.com/gpu.product`).
Credit: GPU types by NRP Nautilus
If you need more graphical memory, use the official specs to choose the type. The table below is an example of the GPU types in the Nautilus Cluster and their memory sizes:
GPU Type | Memory size (GB) |
---|---|
NVIDIA-GeForce-GTX-1070 | 8G |
NVIDIA-GeForce-GTX-1080 | 8G |
Quadro-M4000 | 8G |
NVIDIA-A100-PCIE-40GB-MIG-2g.10gb | 10G |
NVIDIA-GeForce-GTX-1080-Ti | 12G |
NVIDIA-GeForce-RTX-2080-Ti | 12G |
NVIDIA-TITAN-Xp | 12G |
Tesla-T4 | 16G |
NVIDIA-A10 | 24G |
NVIDIA-GeForce-RTX-3090 | 24G |
NVIDIA-TITAN-RTX | 24G |
NVIDIA-RTX-A5000 | 24G |
Quadro-RTX-6000 | 24G |
Tesla-V100-SXM2-32GB | 32G |
NVIDIA-A40 | 48G |
NVIDIA-RTX-A6000 | 48G |
Quadro-RTX-8000 | 48G |
NOTE: Not all nodes are available to all users. You can ask about your available resources in Matrix and check the resources page. Labs connecting their hardware to our cluster have preferential access to all our resources.
To use a specific type of GPU, add an affinity definition to your pod yaml file. The example below requests an NVIDIA GeForce GTX 1080 Ti GPU:
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-GeForce-GTX-1080-Ti
```
To make sure you did everything correctly, after you've submitted the job look at the corresponding pod yaml (`kubectl get pod ... -o yaml`) and check that the resulting nodeAffinity is as expected.
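As a quick sketch (assuming your pod is named `gpu-pod-example` as in the earlier manifest), you can pull out just the relevant parts like this:

```bash
# Print the effective nodeAffinity of the pod
kubectl get pod gpu-pod-example -o yaml | grep -A 12 "nodeAffinity:"

# Show which node the pod was actually scheduled on
kubectl get pod gpu-pod-example -o wide
```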
In general, nodes with higher driver versions support images built with the same or lower CUDA versions. The nodes are labelled with the major and minor CUDA and driver versions. You can check those on the resources page or list them with this command (it also selects only GPU nodes):
```bash
kubectl get nodes -L nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor -l nvidia.com/gpu.product
```
If you're using a container image with a higher CUDA version, you have to pick nodes that support it. Example:
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/cuda.runtime.major
            operator: In
            values:
            - "12"
          - key: nvidia.com/cuda.runtime.minor
            operator: In
            values:
            - "2"
```
You can also require a driver version above a certain level if you know which one you need (the Gt operator compares the label value as an integer; this example picks drivers above 535):
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/cuda.driver.major
            operator: Gt
            values:
            - "535"
```
A100 GPUs can be sliced into several logical GPUs (MIG mode), and this mode is enabled in our cluster. Things can change, but currently we plan to slice those in halves. The current MIG mode can be obtained from the nodes via the `nvidia.com/gpu.product` label: `NVIDIA-A100-PCIE-40GB-MIG-2g.10gb` means 2 compute instances (out of 7 total) and 10 GB of memory per virtual GPU.
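A hedged sketch of requesting such a slice, assuming MIG slices are exposed through the usual `nvidia.com/gpu` resource as the product label above suggests (the same affinity pattern as the earlier examples):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-PCIE-40GB-MIG-2g.10gb
  containers:
  - name: gpu-container
    image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Assumption: one unit of nvidia.com/gpu corresponds to one 2g.10gb MIG slice on these nodes
        nvidia.com/gpu: 1
```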