This document provides a comprehensive guide to monitoring GPU usage and optimizing GPU performance on the HCC. Its goal is to help you identify GPU bottlenecks in your jobs and offer instructions for optimizing GPU resource utilization.
You can use the nvidia-smi command to monitor GPU usage in real time. This tool provides details on GPU memory usage and utilization. To monitor a job, you need access to the same node where the job is running.
If the job being monitored is using all of the available resources on its node, you will not be able to obtain a simultaneous interactive job on that node.
Once the job has been submitted and is running, you can request an interactive session on the same node using the following srun command:
srun --jobid=<JOB_ID> --pty bash
where <JOB_ID> is replaced by the job ID for the monitored job as assigned by SLURM.
After getting access to the node, use the following command to monitor GPU performance in real time:
watch -n 1 nvidia-smi
Note that nvidia-smi only shows the process ID (PID) of the running GPU jobs. If multiple jobs are running on the same node, you'll need to match the PID to your job using the top command. Start the top command as follows:
top
In top, the PID appears in the first column, and your login ID is shown in the USER column. Use this to identify the process corresponding to your job.
You can use nvidia-smi to periodically log GPU usage to CSV files for later analysis. Rather than running the monitoring interactively as shown above, it is convenient to add it to your SLURM submit script. To do this, wrap your job command with the lines below in your SLURM submission script. This will generate three files in your $WORK directory:
gpu_usage_log.csv: contains overall GPU performance data, including GPU utilization, memory utilization, and total GPU memory.
pid_gpu_usage_log.csv: logs GPU usage for each process, including the process ID (PID) and the GPU memory used by each process. Note that, to match a specific PID with the overall GPU performance in the generated files, use the GPU bus ID.
pid_lookup.txt: provides the process IDs to help identify which one corresponds to your job in the GPU records.
Note that the job ID will be appended to the file names to help match the logs with your specific job.
curpath=`pwd`
# write the monitoring logs to $WORK
cd $WORK
# log overall GPU utilization and memory once per second
nohup nvidia-smi --query-gpu=timestamp,index,gpu_bus_id,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -f gpu_usage_log-$SLURM_JOB_ID.csv -l 1 > /dev/null 2>&1 &
gpumonpid=$!
# log per-process GPU memory usage once per second
nohup nvidia-smi --query-compute-apps=timestamp,gpu_bus_id,pid,used_memory --format=csv -f pid_gpu_usage_log-$SLURM_JOB_ID.csv -l 1 > /dev/null 2>&1 &
gpumonprocpid=$!
# record your processes so PIDs in the GPU logs can be matched to your job
nohup top -u <LOGIN-ID> -d 10 -c -b -n 2 > pid_lookup-$SLURM_JOB_ID.txt 2>&1 &
cd $curpath
# run the actual job
<YOUR_JOB_COMMAND>
# stop the background monitors once the job finishes
kill $gpumonpid
kill $gpumonprocpid
where <LOGIN-ID> is replaced by your HCC login ID and <YOUR_JOB_COMMAND> is replaced by your job command. A complete example SLURM submit script that utilizes this approach can be found here.
If your deep learning job utilizes libraries such as TensorFlow or PyTorch, you can use TensorBoard to monitor and visualize GPU usage metrics, including GPU utilization, memory consumption, and model performance. TensorBoard provides real-time insights into how your job interacts with the GPU, helping you optimize performance and identify bottlenecks.
To monitor GPU usage with TensorBoard, refer to the specific instructions for TensorFlow or PyTorch to enable logging with TensorBoard in your job code:
TensorFlow - TensorFlow Profiler Guide
PyTorch - PyTorch Profiler with TensorBoard
On Swan, TensorBoard is available as an Open OnDemand App.
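As an illustration, here is a minimal sketch of writing PyTorch profiler traces that TensorBoard can display. The model, the training loop, and the ./tb_logs output directory are placeholders for your own job, and viewing the traces in TensorBoard typically requires the torch-tb-profiler plugin:
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)              # stand-in for your model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Collect CPU and GPU activity for a few steps and write TensorBoard traces
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=tensorboard_trace_handler("./tb_logs"),
             record_shapes=True) as prof:
    for step in range(20):
        x = torch.randn(64, 1024, device=device)      # stand-in for a real batch
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()                                    # mark a step boundary for the profiler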
Improving GPU utilization means maximizing both the computational and memory usage of the GPU to ensure your program fully utilizes the GPU's processing power. Low utilization can result from various bottlenecks, including insufficient parallelism, inefficient memory management, or CPU-GPU communication overhead.
The GPU is powerful because of its parallel processing capabilities. Your job should leverage parallelism effectively, for example by operating on large arrays with vectorized GPU operations rather than element-by-element loops on the host (see the sketch below).
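A minimal sketch of this idea, assuming CuPy is available in your Python environment (CuPy is not mentioned above and is used here only for illustration): one vectorized call launches a GPU kernel that processes millions of elements in parallel.
import cupy as cp

x = cp.random.random(10_000_000).astype(cp.float32)  # data lives on the GPU
y = cp.sin(x) * 2.0 + 1.0         # one kernel launch, millions of elements in parallel
cp.cuda.Device().synchronize()    # wait for the GPU before inspecting results
print(float(y[:5].sum()))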
Shared memory is a small, high-speed memory located on the GPU. It can be used to reduce global memory access latency by storing frequently used data. Use shared memory to cache data that is repeatedly accessed by multiple threads.
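A minimal sketch of this technique, assuming Numba's CUDA support is available (Numba is an illustrative choice, not something the text above prescribes): each block stages its tile of the input in shared memory once, and every thread in the block then reads the tile from shared memory instead of global memory.
from numba import cuda, float32
import numpy as np

TPB = 128  # threads per block

@cuda.jit
def block_sum_shared(x, out):
    # Stage this block's tile of x in fast on-chip shared memory
    tile = cuda.shared.array(TPB, dtype=float32)
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    if i < x.shape[0]:
        tile[t] = x[i]
    else:
        tile[t] = 0.0
    cuda.syncthreads()             # wait until the whole tile is loaded
    # Every thread reads the whole tile from shared memory, not global memory
    s = 0.0
    for j in range(TPB):
        s += tile[j]
    if i < out.shape[0]:
        out[i] = s                 # each element receives its block's sum

x = np.arange(1024, dtype=np.float32)
out = np.zeros_like(x)
blocks = (x.shape[0] + TPB - 1) // TPB
block_sum_shared[blocks, TPB](x, out)
print(out[:4])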
Memory divergence occurs when threads in a warp access non-contiguous memory locations, resulting in multiple memory transactions. To minimize divergence, arrange your data so that consecutive threads access consecutive memory addresses (coalesced access), as in the sketch below.
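A minimal sketch of the difference, again assuming Numba for illustration: in the first kernel consecutive threads read consecutive elements (coalesced), while in the second they read elements that are far apart (strided), which forces extra memory transactions.
from numba import cuda
import numpy as np

@cuda.jit
def copy_coalesced(src, dst):
    i = cuda.grid(1)
    if i < src.shape[0]:
        dst[i] = src[i]            # neighboring threads touch neighboring addresses

@cuda.jit
def copy_strided(src, dst, stride):
    i = cuda.grid(1)
    if i < src.shape[0]:
        j = (i * stride) % src.shape[0]
        dst[i] = src[j]            # neighboring threads touch addresses far apart

src = np.random.rand(1 << 20).astype(np.float32)
dst = np.empty_like(src)
threads = 256
blocks = (src.shape[0] + threads - 1) // threads
copy_coalesced[blocks, threads](src, dst)
copy_strided[blocks, threads](src, dst, 32)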
Excessive memory use can lead to spills into slower global memory. Minimize your program's memory footprint, for example by using smaller data types where precision allows and by reusing buffers instead of repeatedly allocating new ones.
Data transfer between the CPU and GPU is often a bottleneck in scientific programs, so it is essential to minimize these transfers to improve overall GPU performance. Where possible, keep data resident on the GPU between computational steps instead of copying it back and forth; a sketch of this idea follows.
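A minimal sketch, again assuming CuPy for illustration: the intermediate results stay on the GPU between operations, and only the final scalar is copied back to the host.
import cupy as cp
import numpy as np

host_data = np.random.rand(1_000_000).astype(np.float32)

x = cp.asarray(host_data)        # one host-to-device transfer
y = cp.fft.fft(x)                # result stays on the GPU
power = cp.abs(y) ** 2           # still on the GPU; no round trip to the host
total = float(power.sum())       # a single scalar is copied back to the host
print(total)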
In deep learning, GPUs are a key component for accelerating model training and inference due to their ability to handle large-scale matrix operations and parallelism. Below are tips to maximize GPU utilization in deep learning jobs.
Batch size refers to the number of training samples processed simultaneously. Larger batch sizes improve GPU utilization by increasing the workload per step, but the batch size must still fit within the GPU's memory; a sketch of one way to pick it follows.
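A hypothetical PyTorch-based sketch of that approach (the model and feature size are placeholders): start from a large candidate batch size and halve it until one training step fits in GPU memory.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(4096, 4096).to(device)   # stand-in for your model

def fits(batch_size):
    # Return True if one training step with this batch size fits in GPU memory
    try:
        x = torch.randn(batch_size, 4096, device=device)
        model(x).sum().backward()
        return True
    except RuntimeError:                    # CUDA out-of-memory arrives as RuntimeError
        return False
    finally:
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()

batch_size = 65536
while batch_size > 1 and not fits(batch_size):
    batch_size //= 2
print("largest batch size that fits:", batch_size)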
Data loading can become a bottleneck, causing the GPU to idle while waiting for data. Use an efficient, parallel data-loading pipeline (e.g., PyTorch's DataLoader or TensorFlow's tf.data pipeline) so that input preparation keeps pace with the GPU; a sketch follows.
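A minimal sketch, assuming PyTorch (the dataset here is a stand-in for your own): multiple worker processes prepare batches in parallel, and tensors are placed in pinned memory so they can be copied to the GPU quickly.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with your own Dataset implementation
dataset = TensorDataset(torch.randn(10_000, 224), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,       # larger batches increase the work done per step
    num_workers=4,        # CPU processes that load/preprocess data in parallel
    pin_memory=True,      # page-locked host memory speeds up host-to-GPU copies
    shuffle=True,
)

device = "cuda"
for x, y in loader:
    x = x.to(device, non_blocking=True)   # copy can overlap with computation
    y = y.to(device, non_blocking=True)
    # ... forward/backward pass here ...
    break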
Model architecture also impacts GPU utilization and is worth reviewing when optimizing your job.
Memory transfers between the CPU and GPU can be expensive, and excessive data movement can negate the performance gains offered by GPU parallelism; a sketch of reducing this overhead follows.
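A minimal sketch, assuming PyTorch: accumulating the loss on the GPU and transferring a single value at the end avoids a device-to-host copy (and the associated synchronization) on every step.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(512, 10).to(device)
loss_fn = nn.CrossEntropyLoss()

running_loss = torch.zeros((), device=device)     # accumulator lives on the GPU
for step in range(100):
    x = torch.randn(64, 512, device=device)       # stand-in batch
    y = torch.randint(0, 10, (64,), device=device)
    loss = loss_fn(model(x), y)
    running_loss += loss.detach()                  # no .item() here: value stays on the GPU
print("mean loss:", (running_loss / 100).item())   # one transfer at the very end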
There are highly optimized libraries available for GPU-accelerated algorithms, such as linear algebra and FFTs. Always check for these libraries before implementing your own solution, as they are often more efficient and reliable.
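For example (a sketch assuming CuPy, used here only for illustration): a matrix multiplication written as cp.matmul is dispatched to NVIDIA's vendor-tuned cuBLAS library rather than a hand-written kernel.
import cupy as cp

a = cp.random.random((4096, 4096)).astype(cp.float32)
b = cp.random.random((4096, 4096)).astype(cp.float32)

c = cp.matmul(a, b)              # dispatched to the cuBLAS library
cp.cuda.Device().synchronize()   # wait for the GPU to finish
print(c.shape)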
GPU computation errors can lead to silent failures, making debugging extremely difficult. For example, insufficient memory on the GPU or illegal memory access can go undetected without proper error handling.
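A minimal sketch, assuming PyTorch: because GPU kernels run asynchronously, an explicit synchronization inside a try/except surfaces errors (such as out-of-memory or illegal access) near the operation that caused them instead of at some later, unrelated call. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, which can also help pinpoint the failing operation during debugging.
import torch

device = "cuda"
try:
    x = torch.randn(1 << 14, 1 << 14, device=device)   # large allocation; may exhaust GPU memory
    y = x @ x
    torch.cuda.synchronize()        # force pending kernels to finish so errors surface here
except RuntimeError as err:         # CUDA errors (including out-of-memory) arrive as RuntimeError
    print(f"GPU error: {err}")
    raise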
Many programs are initially designed for single-GPU execution and lack support for multiple GPUs. Make sure your program is optimized for multi-GPU execution before scaling up to request multiple GPU resources.
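A minimal sketch, assuming PyTorch: a quick way to check whether a job can use more than one GPU is to wrap the model in nn.DataParallel; for production-scale training, DistributedDataParallel launched with torchrun is generally recommended instead.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
if torch.cuda.device_count() > 1:            # only wrap if the job was allocated multiple GPUs
    model = nn.DataParallel(model)           # replicates the model and splits each batch
model = model.to("cuda")

x = torch.randn(256, 1024, device="cuda")
out = model(x)                               # the batch is split across the visible GPUs
print(out.shape, "computed on", torch.cuda.device_count(), "GPU(s)")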