Good HCC Practices
Crane and Rhino, our two high-performance clusters, are shared among all our users.
Occasionally, one user's activities can degrade the clusters' performance for everyone else.
To avoid this, we provide the following guidelines for good HCC practices.
- Be kind to the login node. The login node is shared among all users and it
should be used only for light tasks, such as moving and editing files, compiling programs,
and submitting and monitoring jobs. If a researcher runs a computationally intensive task
on the login node, that will negatively impact the performance for other users. Moreover, the
resources on the login node are limited, so any lengthy or intensive task will very likely
exceed these resources and be terminated. For any CPU- or memory-intensive
operations, such as testing and running applications, one should use an
interactive session, or
submit a job to the batch queue.
- Avoid launching multiple simultaneous processes on the login node. Examples include compiling
applications with many threads, or checking the job status multiple times a minute.
- Some I/O intensive jobs may benefit from copying the data to the fast, temporary /scratch
file system local to each worker node. The /scratch directories are unique per job, and
are deleted when the job finishes. Thus, the last step of the batch script should copy the
needed output files from /scratch to either /work or /common. Please see the
Running BLAST Alignment page for an example.
- /work has two quotas: one for file count and one for disk space.
Reaching these quotas may additionally stress the file system. Therefore, please
monitor these quotas regularly, and delete files that are no longer needed or copy them to a more permanent location.
- /work is intended to be a temporary location for storing job outputs and files. Once the jobs are done,
all the necessary files need to be either moved to permanent storage or deleted.
- Avoid rapidly opening and closing many files, as well as frequently reading and writing to
disk, in your program. This approach stresses the file system and may cause problems for all users.
Instead, consider buffering data in memory and reading and writing it in large blocks, or
using more advanced parallel I/O libraries, such as parallel HDF5 and parallel NetCDF.
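The /scratch guideline above can be sketched as a batch script. This is a minimal illustration, not the HCC-recommended script: the file names (input.txt, results), the resource values, and the use of `sort` as a stand-in for the real application are all assumptions, and the script writes directly into /scratch only because the text describes it as unique per job.

```shell
#!/bin/bash
#SBATCH --job-name=scratch-staging
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:30:00

# Assumed locations (hypothetical names); on the cluster these would
# normally live under /work or /common.
input_file="${INPUT_FILE:-input.txt}"
output_dir="${OUTPUT_DIR:-results}"

# Node-local scratch space. The fallback to a temporary directory is only
# there so the sketch can be tried on a machine without /scratch.
scratch_dir="/scratch"
if [ ! -d "$scratch_dir" ] || [ ! -w "$scratch_dir" ]; then
    scratch_dir="$(mktemp -d)"
fi

# Create a tiny demo input if none exists (illustration only).
[ -f "$input_file" ] || printf 'b\na\nc\n' > "$input_file"

# 1. Stage the input onto the fast local file system.
cp "$input_file" "$scratch_dir/input.txt"

# 2. Run the I/O-heavy step against scratch.
#    `sort` is a placeholder for the real application.
sort "$scratch_dir/input.txt" > "$scratch_dir/output.txt"

# 3. Last step: copy the needed output back, because /scratch is
#    deleted when the job finishes.
mkdir -p "$output_dir"
cp "$scratch_dir/output.txt" "$output_dir/"
```

Only the files that are actually needed should be copied back in the last step, so that the file-count and disk-space quotas on /work are not consumed by intermediate data.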
Internal and External Networks
- Use archives to transfer a large number of files. If you are transferring
many small files, please put them in an archive so that the many files are
replaced by a single file. We recommend using zip as the archive format, as zip files keep an
index of the files. Moreover, zip files can be quickly indexed by the various zip tools, and allow
extraction of all files or a subset of files. The tar formats are stream-oriented, so a full decompression
is required before the tools can tell whether the requested files have been found.
- Before you request multiple nodes and cores in your submit script, make sure that the application you are
using supports that. MPI applications can utilize multiple nodes and cores, while threaded or OpenMP applications are
limited to a single node. Requesting resources the application cannot use may increase the job's waiting time in the queue and hurt the application's performance.
- Threaded and OpenMP applications can utilize multiple cores within a node. However, most applications do not
perform significantly better when more than 16 cores are used. On the other hand, requesting more cores increases the
waiting time for resources in the queue, so please make sure you request a reasonable number of cores.
- If an application uses multiple threads or cores, that number needs to be specified with the "--ntasks-per-node"
or "--ntasks" options of SLURM. If you use multiple threads or cores with your application, but you don’t specify
the respective SLURM options, your application will use only 1 core by default.
- Avoid submitting a large number of short (less than half an hour of running time) SLURM jobs. The scheduler spends more
time and memory processing those jobs, which may cause problems and reduce the scheduler’s responsiveness for everyone.
Instead, group the short tasks into jobs that will run longer.
- The maximum running time on our clusters is 7 days. If your job needs more time than that, please consider
improving the code, splitting the job into smaller tasks, or using checkpointing tools such as DMTCP.
- Before submitting a job, make sure that you are executing the application correctly, that you are
passing the right arguments, and that there are no typos. You can do this using an interactive session.
Otherwise, your job may wait for resources only to fail immediately because of a typo or a missing argument.
- If no memory, time, and core requirements are specified in your SLURM submit script, the default resources allocated are
1GB of RAM, 1 hour of running time, and a single CPU core, respectively. Oftentimes, these resources are not enough. If the job
is terminated, there is a high chance that it exceeded these resources, so please make sure you set
the memory and time requirements appropriately.
- The run time and memory usage depend heavily on the application and the data used. You can monitor your application’s needs with
tools such as Allinea Performance Reports
and mem_report. While these tools cannot predict the needed resources, they can provide
useful information the researcher can use the next time that particular application is run.
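Several of the job guidelines above (stating memory, time, and core requirements explicitly, and grouping short tasks into one longer job) can be combined in a single submit script. The following is a sketch under stated assumptions: the resource values are placeholders, and `run_task` and grouped_output.txt are hypothetical names standing in for a real short-running command and its output.

```shell
#!/bin/bash
#SBATCH --job-name=grouped-tasks
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=2G
#SBATCH --time=02:00:00

# The #SBATCH lines above state the requirements explicitly instead of
# relying on the defaults (1GB of RAM, 1 hour, a single core).
# The values themselves are placeholders; size them for your application.

# Hypothetical short task: stands in for a real command that would
# otherwise be submitted as its own short job.
run_task() {
    echo "processed sample $1"
}

# Start from an empty output file so reruns do not append to old results.
: > grouped_output.txt

# Run the short tasks one after another inside a single job,
# rather than submitting one job per task.
for i in 1 2 3 4 5; do
    run_task "$i" >> grouped_output.txt
done
```

Running the same commands by hand in an interactive session first is a cheap way to catch typos and missing arguments before the job spends time waiting in the queue.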
We strongly recommend that you read and follow this guidance. If you have any concerns about your workflows or need any
assistance, please contact HCC Support at firstname.lastname@example.org.