Jobs can be submitted to the OSG from Crane, so there is no need to log on to a different submit host or get a grid certificate!
The HTCondor project provides software that schedules individual applications and workflows and lets sites manage their resources. It is designed to enable High Throughput Computing (HTC) on large collections of distributed resources and serves as the job scheduler used on the OSG. Jobs are submitted from the Crane login node to the OSG using an HTCondor submission script. For those who are used to submitting jobs with Slurm, there are a few key differences to be aware of:
- Only the executable (named on the Executable line of the submit script) is transferred automatically with the job. All other files need to be listed on the transfer_input_files line (see example below).
- Files that the job must return are listed with the transfer_output_files option. If these files do not exist on the remote host when the job exits, then the job will not complete successfully (it will be placed in the held state).
- A single submit script can queue several jobs, and each Queue statement runs with whichever Arguments are currently set. The submit script in the example below queues 5 jobs with the first set of specified arguments, and 1 job with the second set of arguments. By default, Queue submits 1 job when it is not followed by a number.

For more information and advanced usage, see the HTCondor Manual.
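The file-transfer and queueing rules above can be sketched as a short shell snippet that writes a submit script. This is only an illustration under stated assumptions: the submit file name transfer_example.submit, the data file names (input1.dat, input2.dat, result.$(Process).dat), and the idea that a.out takes an input file and an output file as its arguments are all hypothetical. The should_transfer_files and when_to_transfer_output lines are standard HTCondor submit commands that are commonly set alongside the transfer lists.

```shell
#!/bin/sh
# Write a hypothetical submit script. The quoted 'EOF' stops the shell from
# expanding $(Process); HTCondor itself substitutes it at submit time.
cat > transfer_example.submit <<'EOF'
# Assumed example: a.out reads one input file and writes one result file
Universe   = vanilla
Executable = a.out

# Only a.out is shipped automatically; list everything else explicitly
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input1.dat, input2.dat
# The job is held if this file is missing on the remote host at exit
transfer_output_files   = result.$(Process).dat

Output = job.$(Process).out
Error  = job.$(Process).err
Log    = job.log

# 5 jobs with the first set of arguments
Arguments = input1.dat result.$(Process).dat
Queue 5

# 1 job with the second set of arguments
Arguments = input2.dat result.$(Process).dat
Queue
EOF

# Count the Queue statements that were written
grep -c '^Queue' transfer_example.submit
```

Using $(Process) in the result file name is a design choice for this sketch: it keeps the six jobs from overwriting each other's output when the files are transferred back to the same directory.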
HTCondor, much like Slurm, needs a script that tells it what the user wants to run and how. The example below is a basic script, saved in a file named, say, ‘applejob.txt’, that can be used to handle most jobs submitted to HTCondor.
# with executable, stdout, stderr and log
Universe = vanilla
Executable = a.out
Arguments = file_name 12
Output = a.out.out
Error = a.out.err
Log = a.out.log
Queue
The table below explains the various attributes/keywords used in the above script.
Attribute/Keyword | Explanation |
---|---|
# | Lines starting with ‘#’ are treated as comments by HTCondor. |
Universe | selects the runtime environment (called a universe in the HTCondor documentation) in which the job runs. The vanilla universe is where most jobs should be run. |
Executable | is the name of the executable you want to run on HTCondor. |
Arguments | are the command line arguments for your program. For example, to run ls -l / on HTCondor, the Executable would be ls and the Arguments would be -l / . |
Output | is the file where the information printed to stdout will be sent. |
Error | is the file where the information printed to stderr will be sent. |
Log | is the file where information about your HTCondor job will be sent. This includes whether the job is running, whether it was halted and, when running in the standard universe, whether it was checkpointed or moved. |
Queue | is the command to send the job to HTCondor’s scheduler. |
Suppose you would like to submit a job, e.g. a Monte Carlo simulation, where the same program needs to be run several times with the same parameters. The script above can be used with the following modification.
Modify the Queue command by giving it the number of times the job must be run (and hence queued in HTCondor). Thus if the Queue command is changed to Queue 5, a.out will be run 5 times with the exact same parameters.
In another scenario, if you would like to submit the same job with different parameters, HTCondor accepts files with multiple Queue statements. Change only the parameters that differ before each Queue statement.
Please see “A simple example” in the next chapter for details on using $(Process).
# with executable, stdout, stderr and log
# and multiple Arguments parameters
Universe = vanilla
Executable = a.out
Arguments = file_name 10
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.$(Process).log
Queue
Arguments = file_name 20
Queue
Arguments = file_name 30
Queue
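To make the role of $(Process) concrete, the shell function below prints the file names the script above would produce. It assumes standard HTCondor numbering, where the process number starts at 0 and increases by one for each job queued in a cluster. The function name expand_process_names is hypothetical, and nothing here contacts HTCondor; it only mimics the substitution.

```shell
#!/bin/sh
# Mimic how $(Process) expands in the Output, Error and Log lines.
# Assumption: process numbers run from 0 to N-1 for N queued jobs.
expand_process_names() {
  count=$1
  process=0
  while [ "$process" -lt "$count" ]; do
    echo "a.out.${process}.out a.out.${process}.err a.out.${process}.log"
    process=$((process + 1))
  done
}

# Three Queue statements in the script above give three jobs
expand_process_names 3
```

Because each job gets its own process number, the three runs with arguments 10, 20 and 30 write to separate files instead of overwriting one another.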
The steps below describe how to submit a job and other important job management tasks that you may need in order to monitor and/or control the submitted job:
How to submit a job to OSG - assuming that your HTCondor script is saved in a file named applejob.txt
[apple@login.crane ~] $ condor_submit applejob.txt
You will see the following output after submitting the job
Submitting job(s)
......
6 job(s) submitted to cluster 1013038
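The cluster number in that last line is what identifies the submission later on. As a rough sketch, it can be pulled out of the message with sed. The sample text is copied from the output above, and the variable names are hypothetical.

```shell
#!/bin/sh
# Sample text copied from the output above; in practice you could capture it
# with something like: submit_output=$(condor_submit applejob.txt)
submit_output='Submitting job(s)
......
6 job(s) submitted to cluster 1013038'

# Pull out the number that follows "submitted to cluster"
cluster_id=$(printf '%s\n' "$submit_output" |
  sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p')

echo "cluster id: ${cluster_id}"
```

The extracted number can then be given to condor_q, whose synopsis accepts a cluster, a cluster.process pair, or an owner, to list only the jobs from this submission.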
How to view your job status - to view the status of your submitted jobs, use the following shell command
Please note that by providing a user name as an argument to the condor_q command, you can limit the list of submitted jobs to the ones that are owned by the named user
[apple@login.crane ~] $ condor_q apple
The code section below shows a typical output. You may notice that the column ST represents the status of the job (H: Held and I: Idle or waiting)
-- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1013034.4 apple 3/26 16:34 0+00:21:00 H 0 0.0 sjrun.py INPUT/INP
1013038.0 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.1 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.2 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.3 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
...
16 jobs; 0 completed, 0 removed, 12 idle, 0 running, 4 held, 0 suspended
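When the list is long, a per-status tally can be easier to read than the individual rows. The sketch below counts jobs by the ST column with awk. It assumes the per-job column layout shown above, where ST is the sixth whitespace-separated field, and it runs on sample rows copied from that listing instead of live condor_q output.

```shell
#!/bin/sh
# Sample rows copied from the listing above; in practice you could pipe the
# output of condor_q into the same awk program instead.
queue_listing='1013034.4 apple 3/26 16:34 0+00:21:00 H 0 0.0 sjrun.py INPUT/INP
1013038.0 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.1 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.2 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.3 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP'

# Rows whose first field looks like cluster.process are jobs; field 6 is ST
status_summary=$(printf '%s\n' "$queue_listing" | awk '
  $1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
  END { printf "held=%d idle=%d running=%d\n", count["H"], count["I"], count["R"] }
')

echo "$status_summary"
```

Matching on the cluster.process pattern in the first field skips the header and the summary line, so only job rows are counted.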
How to release a job - in a few cases a job may be held for reasons such as an authentication failure or other non-fatal errors. In those cases you may use the shell commands below to release the job from the held state so that it can be rescheduled by HTCondor.
Release one job:
[apple@login.crane ~] $ condor_release 1013034.4
Release all jobs of the user apple:
[apple@login.crane ~] $ condor_release apple
How to delete a submitted job - if you want to delete a submitted job, you may use the shell commands listed below
Delete one job:
[apple@login.crane ~] $ condor_rm 1013034.4
Delete all jobs of the user apple:
[apple@login.crane ~] $ condor_rm apple
How to get help for an HTCondor command
You can use man to get a detailed explanation of an HTCondor command
[apple@glidein ~] $ man condor_q
just-man-pages/condor_q(1) just-man-pages/condor_q(1)
Name
condor_q Display information about jobs in queue
Synopsis
condor_q [ -help ]
condor_q [ -debug ] [ -global ] [ -submitter submitter ] [ -name name ] [ -pool centralmanagerhost-
name[:portnumber] ] [ -analyze ] [ -run ] [ -hold ] [ -globus ] [ -goodput ] [ -io ] [ -dag ] [ -long ]
[ -xml ] [ -attributes Attr1 [,Attr2 ... ] ] [ -format fmt attr ] [ -autoformat[:tn,lVh] attr1 [attr2
...] ] [ -cputime ] [ -currentrun ] [ -avgqueuetime ] [ -jobads file ] [ -machineads file ] [ -stream-
results ] [ -wide ] [ {cluster | cluster.process | owner | -constraint expression ... } ]
Description
condor_q displays information about jobs in the Condor job queue. By default, condor_q queries the local
job queue but this behavior may be modified by specifying:
* the -global option, which queries all job queues in the pool
* a schedd name with the -name option, which causes the queue of the named schedd to be queried