# Using LLMs on HCC resources
Large language models (LLMs) are models pre-trained on vast amounts of data. LLMs can be used in several ways on HCC resources.
## Open OnDemand Apps
- LM Studio is available as an Open OnDemand App. While LM Studio can utilize both CPU and GPU nodes, it is recommended to request a GPU so that models run faster.
Note
When using LM Studio, please make sure that the `CPU Thread Pool Size` value matches the number of requested cores, that the `GPU Offload` checkbox is selected, and that the `GPU Offload` slider is set to the maximum number of layers that can be loaded into GPU memory. These settings can be found in the `Settings` tab of the LM Studio GUI.
## Benchmarking
This section summarizes benchmark tests of OpenAI's gpt-oss 120B model across various Swan HPC nodes with different GPU configurations. The tests used a consistent query ("a recipe of Thanksgiving turkey.") with fixed settings (context length 4096, batch size 512, reasoning effort Medium, 4 experts, flash attention disabled).
### Performance by GPU Setup
| GPU Setup | Thinking Time (s) | Throughput (tok/s) | Time to First Token (s) | Notes |
|---|---|---|---|---|
| 2×V100S (32GB) | 5.43 | 10.44 | 33.9 | Usable but slow |
| 4×V100S (16GB) | 9.80 | 9.66 | 32.7 | Usable but slow |
| 2×A30 (24GB) | 23.85–34.64 | 1.97–2.93 | 2.9–4.3 | Severe bottleneck |
| 4×A30 (24GB) | 1.95 | 43.16 | 0.82 | Good performance |
### Quick Takeaways
- Good Choice: 4×A30 (24GB) ~43 tok/s.
- Legacy GPUs: V100S usable but ~5-10× slower.
## GPU Offloading Formula
The benchmark above is based on OpenAI's gpt-oss 120B model. For other models, you can check their corresponding VRAM requirements when using the GGUF format. The optimal GPU setup is one that provides enough memory to load the entire model into GPU memory.
If loading the full model is not practical—or if you prefer shorter queue times, since larger GPU requests often lead to longer waits—you can instead ensure that the combined GPU and CPU memory exceeds the model's VRAM requirement. In LM Studio, the number of model layers that can be offloaded to GPU memory can then be estimated using the following formula:
Max Layers Offloaded = ⌊ nb_LLM_layer × GPU_RAM / model_size ⌋
Where:
- nb_LLM_layer is the total number of layers in the model
- GPU_RAM is the available GPU memory (in GB)
- model_size is the total size of the model (in GB)
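As a quick sanity check, the formula can be evaluated directly on the command line. The numbers below (36 layers, 24 GB of GPU memory, a 65 GB model) are placeholder values for illustration, not measurements from the benchmark above:

```bash
# Hypothetical example values: replace with your model's layer count and size
nb_LLM_layer=36   # total number of layers in the model
GPU_RAM=24        # available GPU memory in GB
model_size=65     # total size of the model in GB

# Floor of nb_LLM_layer * GPU_RAM / model_size
awk -v n="$nb_LLM_layer" -v g="$GPU_RAM" -v m="$model_size" \
    'BEGIN { print int(n * g / m) }'
# prints 13, i.e. offload at most 13 layers to the GPU in LM Studio
```

Setting the `GPU Offload` slider higher than this estimate may cause the model to fail to load or to spill into system memory, which slows inference.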
## System-wide modules
- We currently provide Ollama as a system-wide module on Swan. This module can be loaded with:
```bash
module purge
module load ollama/0.11
```

Examples of Ollama SLURM submit scripts can be found here.
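For reference, a minimal Ollama batch job might look like the sketch below. The partition name, resource requests, and model name are placeholders that you would adapt to your own allocation; the linked examples remain the authoritative versions:

```bash
#!/bin/bash
#SBATCH --job-name=ollama-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --gres=gpu:1            # request one GPU (adjust as needed)
#SBATCH --time=01:00:00
#SBATCH --partition=gpu         # placeholder partition name
#SBATCH --output=ollama.%J.out
#SBATCH --error=ollama.%J.err

module purge
module load ollama/0.11

# Optionally store models outside $HOME (see the note on OLLAMA_MODELS below)
export OLLAMA_MODELS=$NRDSTOR/OLLAMA_MODELS

# Start the Ollama server in the background and give it time to initialize
ollama serve &
sleep 10

# Pull a model and run a single prompt (model name is only an example)
ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize what a SLURM submit script does."
```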
## Downloading models
You can download various models with both LM Studio and Ollama. The available models can be found on their respective websites:
Note
By default, the download location for Ollama models is `$HOME/.ollama`. This location can be changed with the `OLLAMA_MODELS` environment variable. For example, to use `$NRDSTOR` for the Ollama models, please use:

```bash
export OLLAMA_MODELS=$NRDSTOR/OLLAMA_MODELS
```
Note
By default, the download location for LM Studio models is `$HOME/.lm_studio`. This location can be changed within the LM Studio OOD App by navigating to the `My Models` tab and choosing a new directory with `Models Directory`.
Please note that the $WORK filesystem has a purge policy, and neither $WORK nor $NRDSTOR is backed up.