# Using LLMs on HCC resources
Large language models (LLMs) are models pre-trained on vast amounts of data. LLMs can be used in several ways on HCC resources.
## Open OnDemand Apps
- LM Studio is available as an Open OnDemand App. While LM Studio can utilize both CPU and GPU nodes, it is recommended to request a GPU so that models run faster.
Note
When using LM Studio, please make sure that the `CPU Thread Pool Size` value matches the number of requested cores, that the `GPU Offload` checkbox is selected, and that the `GPU Offload` slider is set to the maximum number of layers that can be loaded into GPU memory. These settings can be found in the `Settings` tab of the LM Studio GUI.
## Benchmarking
This section summarizes benchmark tests of OpenAI's gpt-oss 120B model across various Swan HPC nodes with different GPU configurations. The tests used a consistent query ("a recipe of Thanksgiving turkey.") with fixed settings (context length 4096, batch size 512, reasoning effort Medium, 4 experts, flash attention disabled).
### Performance by GPU Setup
| GPU Setup | Thinking Time (s) | Throughput (tok/s) | Time to First Token (s) | Notes |
|---|---|---|---|---|
| 2×V100S (32GB) | 5.43 | 10.44 | 33.9 | Usable but slow |
| 4×V100S (16GB) | 9.80 | 9.66 | 32.7 | Usable but slow |
| 2×A30 (24GB) | 23.85–34.64 | 1.97–2.93 | 2.9–4.3 | Severe bottleneck |
| 4×A30 (24GB) | 1.95 | 43.16 | 0.82 | Good performance |
### Quick Takeaways
- Good Choice: 4×A30 (24GB) ~43 tok/s.
- Legacy GPUs: V100S usable but ~5-10× slower.
## GPU Offloading Formula
The benchmark above is based on OpenAI's gpt-oss 120B model. For other models, you can check their corresponding VRAM requirements when using the GGUF format. The optimal GPU setup is one that provides enough memory to load the entire model into GPU memory.
If loading the full model is not practical—or if you prefer shorter queue times, since larger GPU requests often lead to longer waits—you can instead ensure that the combined GPU and CPU memory exceeds the model's VRAM requirement. In LM Studio, the number of model layers that can be offloaded to GPU memory can then be estimated using the following formula:
Max Layers Offloaded = ⌊ nb_LLM_layer × GPU_RAM / model_size ⌋
Where:
- nb_LLM_layer is the total number of layers in the model
- GPU_RAM is the available GPU memory (in GB)
- model_size is the total size of the model (in GB)
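As a quick sanity check, the formula can be evaluated directly on the command line. The numbers below (36 layers, 24 GB of GPU memory, a 65 GB model) are placeholder values for illustration, not measurements from the benchmark above:

```bash
# Hypothetical example values: replace with your model's layer count and size
nb_LLM_layer=36   # total number of layers in the model
GPU_RAM=24        # available GPU memory in GB
model_size=65     # total size of the model in GB

# Floor of nb_LLM_layer * GPU_RAM / model_size
awk -v n="$nb_LLM_layer" -v g="$GPU_RAM" -v m="$model_size" \
    'BEGIN { print int(n * g / m) }'
# prints 13, i.e. offload at most 13 layers to the GPU in LM Studio
```

Setting the `GPU Offload` slider higher than this estimate may cause the model to fail to load or to spill into system memory, which slows inference.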
## System-wide modules
- We currently provide Ollama as a system-wide module on Swan. This module can be loaded with:
```bash
module purge
module load ollama/0.11
```

Examples of Ollama SLURM submit scripts can be found here.
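For reference, a minimal Ollama batch job might look like the sketch below. The partition name, resource requests, and model name are placeholders that you would adapt to your own allocation; the linked examples remain the authoritative versions:

```bash
#!/bin/bash
#SBATCH --job-name=ollama-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --gres=gpu:1            # request one GPU (adjust as needed)
#SBATCH --time=01:00:00
#SBATCH --partition=gpu         # placeholder partition name
#SBATCH --output=ollama.%J.out
#SBATCH --error=ollama.%J.err

module purge
module load ollama/0.11

# Optionally store models outside $HOME (see the note on OLLAMA_MODELS below)
export OLLAMA_MODELS=$NRDSTOR/OLLAMA_MODELS

# Start the Ollama server in the background and give it time to initialize
ollama serve &
sleep 10

# Pull a model and run a single prompt (model name is only an example)
ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize what a SLURM submit script does."
```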
## Downloading models
You can download various models with both LM Studio and Ollama. The available models can be found on their respective websites:
Note
By default, the download location for Ollama models is `$HOME/.ollama`. This location can be changed with the `OLLAMA_MODELS` environment variable. For example, to use `$NRDSTOR` for the Ollama models, please use:

```bash
export OLLAMA_MODELS=$NRDSTOR/OLLAMA_MODELS
```
Note
By default, the download location for LM Studio models is `$HOME/.lm_studio`. This location can be changed within the LM Studio OOD App by navigating to the `My Models` tab and choosing a new directory with `Models Directory`.
Please note that the $WORK filesystem has a purge policy, and neither $WORK nor $NRDSTOR is backed up.