Submitting Jobs

Swan is managed by the SLURM resource manager.
To run a job on Swan, you must create a SLURM submit script that describes and runs your processing. After you submit the job, SLURM schedules it on an available worker node.

Before writing a submit file, you may need to compile your application.

Ensure proper working directory for job output
Creating a SLURM Submit File
Submitting the job
Checking Job Status
Checking Job Start
Removing the Job
Next Steps

Ensure proper working directory for job output

Manual specification of /work path

$ cd /work/[groupname]/[username]

The environment variable $WORK can also be used.

Using environment variable for /work path

$ cd $WORK
$ pwd
/work/[groupname]/[username]

For details, review the documentation on how /work differs from /home.
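As a minimal sketch of the check above, the following snippet tests whether a directory is under /work before you submit from it. The function name in_work_dir is hypothetical and used here only for illustration; it assumes, as this section recommends, that job output should land under /work.

```shell
#!/bin/bash
# in_work_dir [DIR]: succeed only if DIR (default: the current directory)
# is located under /work. The function name is illustrative, not an HCC tool.
in_work_dir() {
    local dir="${1:-$PWD}"
    case "$dir" in
        /work/*) return 0 ;;
        *)       return 1 ;;
    esac
}

if in_work_dir; then
    echo "OK: $PWD is under /work"
else
    echo "Warning: $PWD is not under /work; run 'cd \$WORK' first" >&2
fi
```

Running it from your home directory prints the warning on stderr; running it after cd $WORK prints the OK line.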

Creating a SLURM Submit File

Note

The example below is for a serial job. To submit MPI jobs, see the MPI Submission Guide.

A SLURM submit file has two sections: the job description and the processing. Each line of the job description is prefixed with #SBATCH in the submit file.

SLURM Submit File

#!/bin/bash
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024       # Maximum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out

module load example/test

hostname
sleep 60

time
Maximum walltime the job can run. After this time has expired, the job will be stopped.
mem-per-cpu
Memory that is allocated per core for the job. If you exceed this memory limit, your job will be stopped.
mem
Real memory required per node, in megabytes. If you exceed this limit, your job will be stopped. Note that you should ask for less memory than each node actually has. For Swan, the maximum is 2000GB.
job-name
The name of the job. It will be reported in the job listing.
partition
The partition the job should run in. Partitions determine the job's priority and which nodes the job can run on. See the Partitions page for a list of possible partitions.
error
Location where stderr will be written for the job. [groupname] and [username] should be replaced with your group name and username. Your username can be retrieved with the command id -un and your group with id -ng.
output
Location where stdout will be written for the job.
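Because #SBATCH lines are comments to the shell, variables such as $WORK are not expanded inside them, so the error and output paths have to be written out in full. One way to avoid typing them by hand is to generate the submit file. The sketch below does this with a hypothetical helper, write_submit_file, that fills in the group and user returned by id -ng and id -un; the file name example.slurm and the helper itself are assumptions for illustration.

```shell
#!/bin/bash
# write_submit_file GROUP USER: print a serial-job submit file to stdout
# with the /work output paths filled in. The helper name is illustrative.
write_submit_file() {
    local group="$1" user="$2"
    local workdir="/work/${group}/${user}"
    # The heredoc delimiter is unquoted, so ${workdir} is expanded when the
    # file is written; %J is left as-is for SLURM to fill in.
    cat <<EOF
#!/bin/bash
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024       # Maximum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
#SBATCH --error=${workdir}/job.%J.err
#SBATCH --output=${workdir}/job.%J.out

module load example/test

hostname
sleep 60
EOF
}

# Look up the primary group and username, then write the submit file.
write_submit_file "$(id -ng)" "$(id -un)" > example.slurm
```

The generated file matches the example above, with [groupname] and [username] replaced by your own values.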

More advanced submit options are described in the SLURM documentation. You can also find an example of an MPI submission on the Submitting an MPI Job page.

Submitting the job

SLURM jobs are submitted with the sbatch command. SLURM reads the submit file and schedules the job according to the description in it.

To submit the job described above, run:

SLURM Submission

$ sbatch example.slurm
Submitted batch job 24605

The job was successfully submitted.
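In scripts it is often useful to keep the job ID that sbatch reports, for example to pass it to squeue or scancel later. The following is a minimal sketch that extracts the ID from the confirmation line; the function name job_id_from_sbatch is hypothetical, and the sbatch call is skipped when the command is not available.

```shell
#!/bin/bash
# job_id_from_sbatch: read sbatch's confirmation text on stdin and print
# the numeric job ID. The function name is illustrative.
job_id_from_sbatch() {
    # Matches lines of the form "Submitted batch job 12345".
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Only call sbatch where it is installed (for example, on a login node).
if command -v sbatch >/dev/null 2>&1; then
    jobid="$(sbatch example.slurm | job_id_from_sbatch)"
    echo "Submitted job ${jobid}"
fi
```

As an alternative, sbatch has a --parsable option that prints the job ID without the surrounding text (followed by a semicolon and the cluster name on multi-cluster setups).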

Checking Job Status

Use the squeue command to check job status. It provides information such as:

- The state of the job:
  - R - Running
  - PD - Pending - Job is awaiting resource allocation.
  - Additional codes are available on the squeue page.
- Job name
- Run time
- Nodes running the job

The easiest way to check the status of your jobs is to filter by your username with the -u option to squeue.

$ squeue -u <username>
  JOBID PARTITION     NAME       USER  ST       TIME  NODES NODELIST(REASON)
  24605     batch hello-wo <username>   R       0:56      1 b01

Additionally, if you want to see the status of a specific partition, for example one that your group has access to, you can use the -p option to squeue:

$ squeue -p guest
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  73435     guest MyRandom   demo01   R   10:35:20      1 ri19n10
  73436     guest MyRandom   demo01   R   10:35:20      1 ri19n12
  73735     guest SW2_driv   demo02   R   10:14:11      1 ri20n07
  73736     guest SW2_driv   demo02   R   10:14:11      1 ri20n07
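When many jobs are listed, a per-state summary can be easier to read than the full table. The sketch below counts jobs by the ST column of squeue's default output; the function name count_states is hypothetical, and the parsing assumes the default column layout shown above with job names that contain no spaces.

```shell
#!/bin/bash
# count_states: read squeue output on stdin and print "<state> <count>"
# for each state code. The function name is illustrative.
count_states() {
    # Column 5 is ST in the default squeue layout; NR > 1 skips the header.
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Only call squeue where it is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$(id -un)" | count_states
fi
```

Each output line pairs a state code such as R or PD with the number of your jobs currently in that state.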

Checking Job Start

You can view the expected start time of your pending jobs with the command squeue --start.

$ squeue --start --user demo03
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)
   5822     batch   python   demo03  PD  2013-06-08T00:05:09      3 (Priority)
   5823     batch   python   demo03  PD  2013-06-08T00:07:39      3 (Priority)
   5824     batch   python   demo03  PD  2013-06-08T00:09:09      3 (Priority)
   5825     batch   python   demo03  PD  2013-06-08T00:12:09      3 (Priority)
   5826     batch   python   demo03  PD  2013-06-08T00:12:39      3 (Priority)
   5827     batch   python   demo03  PD  2013-06-08T00:12:39      3 (Priority)
   5828     batch   python   demo03  PD  2013-06-08T00:12:39      3 (Priority)
   5829     batch   python   demo03  PD  2013-06-08T00:13:09      3 (Priority)
   5830     batch   python   demo03  PD  2013-06-08T00:13:09      3 (Priority)
   5831     batch   python   demo03  PD  2013-06-08T00:14:09      3 (Priority)
   5832     batch   python   demo03  PD                  N/A      3 (Priority)

The output shows the expected start time of each job, as well as the reason the job is still pending (in this case, the user's priority is low because they are already running numerous jobs). A start time of N/A means SLURM has not yet estimated when the job will start.
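To pull just the job IDs and expected start times out of this listing, the output can be filtered by column. The following is a minimal sketch; the function name start_times is hypothetical, and it assumes the column layout shown above with job names that contain no spaces.

```shell
#!/bin/bash
# start_times: read "squeue --start" output on stdin and print
# "<jobid> <start_time>" for each job. The function name is illustrative.
start_times() {
    # Column 1 is JOBID and column 6 is START_TIME; NR > 1 skips the header.
    awk 'NR > 1 && NF >= 6 { print $1, $6 }'
}

# Only call squeue where it is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue --start -u "$(id -un)" | start_times
fi
```

Jobs without an estimate yet appear with N/A in the second column.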

Removing the Job

Jobs are removed with the scancel command. In its simplest form, scancel takes the job ID as its only argument. For the job above, the command is:

$ scancel 24605
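Since a cancelled job cannot be recovered, it can help to check the job IDs before handing them to scancel. The sketch below wraps scancel in a hypothetical function, cancel_jobs, that accepts only plain numeric IDs and supports a dry run through a DRY_RUN variable; both the function and the variable are assumptions made for illustration, not part of SLURM.

```shell
#!/bin/bash
# cancel_jobs ID...: cancel the given jobs with scancel.
# With DRY_RUN=1, only print the command that would be run.
cancel_jobs() {
    local id
    [ "$#" -gt 0 ] || { echo "usage: cancel_jobs ID..." >&2; return 1; }
    for id in "$@"; do
        # Accept plain numeric job IDs only; array and step suffixes
        # are rejected by this sketch.
        case "$id" in
            ''|*[!0-9]*) echo "not a numeric job ID: $id" >&2; return 1 ;;
        esac
    done
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scancel $*"
    else
        scancel "$@"
    fi
}

# Dry run: shows the command without cancelling anything.
DRY_RUN=1 cancel_jobs 24605
```

Dropping DRY_RUN=1 runs the real scancel command with the same arguments.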

Next Steps

Looking to reduce your wait time on Swan?

HCC wants to hear more about your research! If you acknowledge HCC in your publications, posters, or journal articles, you can receive a boost in priority on Swan!

Details on the process and requirements are available in the HCC Acknowledgement Credit documentation page.