How to submit an OSG job with HTCondor

Jobs can be submitted to the OSG from Swan, so there is no need to log on to a different submit host or get a grid certificate!

What is HTCondor?

The HTCondor project provides software for scheduling individual applications and workflows, and for sites to manage their resources.  It is designed to enable High Throughput Computing (HTC) on large collections of distributed resources and serves as the job scheduler used on the OSG.  Jobs are submitted from the Swan login node to the OSG using an HTCondor submit script.  For those who are used to submitting jobs with Slurm, there are a few key differences to be aware of:

When using HTCondor

  • All files (scripts, code, executables, libraries, etc.) that are needed by the job are transferred to the remote compute site when the job is scheduled.  Therefore, all of the files required by the job must be specified in the HTCondor submit script.  Paths can be absolute or relative to the local directory from which the job is submitted.  The main executable (specified on the Executable line of the submit script) is transferred automatically with the job.  All other files need to be listed on the transfer_input_files line (see example below).
  • All files that are created by the job on the remote host will be transferred automatically back to the submit host when the job has completed.  This includes temporary/scratch and intermediate files that are not removed by your job.  If you do not want to keep these files, clean up the work space on the remote host by removing them before the job exits (this can be done with a wrapper script, for example).  Specific output file names can be specified with the transfer_output_files option.  If these files do not exist on the remote host when the job exits, the job will not complete successfully (it will be placed in the held state).
  • HTCondor scripts can queue (submit) as many jobs as you like.  All jobs queued from a single submit script will be identical except for the Arguments used.  For example, a submit script can queue 5 jobs with a first set of arguments and 1 job with a second set of arguments.  A Queue statement that is not followed by a number submits 1 job.
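
The three points above can be combined in a single submit script.  The following is a minimal sketch rather than a tested recipe: the script name run_sim.sh and the data files params.dat, helper_lib.so and result.<N>.dat are hypothetical, and it is written as a shell snippet only so that the submit file can be created by copy and paste.

```shell
#!/bin/bash
# Write a hypothetical submit script; every file name in it is illustrative.
cat > transfer_example.txt <<'EOF'
# Hypothetical job: run_sim.sh reads params.dat and writes result.<N>.dat
Universe   = vanilla
Executable = run_sim.sh
# The executable is transferred automatically; list everything else it needs.
transfer_input_files    = params.dat, helper_lib.so
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Bring back only this file; the job is held if it is missing at exit.
transfer_output_files   = result.$(Process).dat
Output = run_sim.$(Process).out
Error  = run_sim.$(Process).err
Log    = run_sim.log
# First set of arguments: 5 jobs, $(Process) = 0..4
Arguments = params.dat $(Process) 10
Queue 5
# Second set of arguments: 1 job, $(Process) = 5
Arguments = params.dat $(Process) 20
Queue
EOF
echo "Wrote transfer_example.txt"
```

Because all jobs queued from one submit script share a cluster, $(Process) runs from 0 to 5 here, so each job returns a differently named result file.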

For more information and advanced usage, see the HTCondor Manual.
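
The clean-up of scratch files mentioned above could be handled by a small wrapper that is used as the Executable.  This is a sketch under assumptions: the name wrapper.sh and the *.tmp pattern are hypothetical, and the real program and its arguments are assumed to be passed to the wrapper through the Arguments line.

```shell
#!/bin/bash
# wrapper.sh (hypothetical name): run the real program, then remove scratch
# files so that HTCondor does not transfer them back to the submit host.

run_and_clean() {
    # Run the real program with all of its arguments.
    "$@"
    local status=$?
    # Remove scratch files left in the job's working directory.
    # The *.tmp pattern is an assumption; adjust it to your program.
    rm -f -- ./*.tmp
    # Report the real program's exit status, not the status of rm.
    return "$status"
}

# Run only when arguments are given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    run_and_clean "$@"
    exit $?
fi
```

With a wrapper like this, the real program is no longer the Executable, so it would have to be listed on the transfer_input_files line along with its other input files.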

Creating an HTCondor Script

HTCondor, much like Slurm, needs a script that tells it what the user wants done. The example below is a basic script, saved in a file named, say, ‘applejob.txt’, that can be used to handle most jobs submitted to HTCondor.

Example of an HTCondor script
#with executable, stdout, stderr and log
Universe = vanilla
Executable = a.out
Arguments = file_name 12
Output = a.out.out
Error = a.out.err
Log = a.out.log
Queue

The table below explains the various attributes/keywords used in the above script.

Attribute/Keyword   Explanation
#                   Lines starting with ‘#’ are treated as comments by HTCondor.
Universe            is the way HTCondor manages the different ways it can run a job, called a runtime environment in the HTCondor documentation. The vanilla universe is where most jobs should be run.
Executable          is the name of the executable you want to run on HTCondor.
Arguments           are the command line arguments for your program. For example, to run ls -l / on HTCondor, the Executable would be ls and the Arguments would be -l /.
Output              is the file where the information printed to stdout will be sent.
Error               is the file where the information printed to stderr will be sent.
Log                 is the file where information about your HTCondor job will be sent, such as whether the job is running, whether it was halted or, if running in the standard universe, whether it was check-pointed or moved.
Queue               is the command to send the job to HTCondor’s scheduler.
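
As an illustration of the Executable and Arguments rows, the ls -l / case could be written as the submit script below.  It is only a sketch: the file names ls_job.txt, ls.out, ls.err and ls.log are made up for the example, and /bin/ls is an assumed location for ls on the submit host.

```shell
#!/bin/bash
# Create a hypothetical submit script that runs "ls -l /" as a job.
cat > ls_job.txt <<'EOF'
# Run "ls -l /" as an HTCondor job
Universe = vanilla
# Path to the program; /bin/ls is an assumption about the submit host
Executable = /bin/ls
# Everything after the program name on the command line
Arguments = -l /
Output = ls.out
Error = ls.err
Log = ls.log
Queue
EOF
echo "Wrote ls_job.txt"
```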

Suppose you would like to submit a job, e.g. a Monte-Carlo simulation, where the same program needs to be run several times with the same parameters. The script above can be used with the following modification.

Modify the Queue command by giving it the number of times the job must be run (and hence queued in HTCondor). Thus if the Queue command is changed to Queue 5, a.out will be run 5 times with the exact same parameters.
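
A possible version of that modified script is sketched below, written as a shell snippet that creates the file.  The file name montecarlo.txt is hypothetical.  The sketch also uses the $(Process) macro in the Output and Error names, which goes beyond the modification described above: with fixed names, all five runs would write to the same files.

```shell
#!/bin/bash
# Create a hypothetical submit script that runs a.out five times.
cat > montecarlo.txt <<'EOF'
# Run a.out five times with identical arguments
Universe = vanilla
Executable = a.out
Arguments = file_name 12
# $(Process) expands to 0, 1, 2, 3, 4 so each run gets its own files
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.log
Queue 5
EOF
echo "Wrote montecarlo.txt"
```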

In another scenario, if you would like to submit the same job but with different parameters, HTCondor accepts files with multiple Queue statements. Only the parameters that need to change should be set again in the HTCondor script before each Queue statement.

Please see “A simple example” in the next chapter for the detailed use of $(Process).

Another Example of an HTCondor script
#with executable, stdout, stderr and log
#and multiple Argument parameters
Universe = vanilla
Executable = a.out
Arguments = file_name 10
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.$(Process).log
Queue
Arguments = file_name 20
Queue
Arguments = file_name 30
Queue
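
Since process numbers start at 0, the three Queue statements above would be expected to produce a.out.0.out, a.out.1.out and a.out.2.out.  A small helper along the following lines could report which of these files are still missing once the jobs have finished; the function name missing_outputs is hypothetical and the file naming is taken from the example above.

```shell
#!/bin/bash
# Print the name of every expected per-job output file that does not exist.
# $1 is the number of jobs queued; process numbers run from 0 to $1 - 1.
missing_outputs() {
    local njobs=$1 p
    for ((p = 0; p < njobs; p++)); do
        [ -e "a.out.$p.out" ] || echo "a.out.$p.out"
    done
}

# Three Queue statements in the example above, so check processes 0, 1, 2.
missing_outputs 3
```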

How to Submit and View Your Job

The steps below describe how to submit a job and other important job management tasks that you may need in order to monitor and/or control the submitted job:

  1. How to submit a job to OSG - assuming that you saved your HTCondor script in a file named applejob.txt

    [apple@login.swan ~] $ condor_submit applejob.txt

    You will see the following output after submitting the job:

    Example of condor_submit
    Submitting job(s)
    ......
    6 job(s) submitted to cluster 1013038
    

  2. How to view your job status - to view the status of your submitted jobs, use the following shell command. Please note that by providing a user name as an argument to the condor_q command, you can limit the list of submitted jobs to those owned by the named user.

    [apple@login.swan ~] $ condor_q apple

    The code section below shows typical output. The ST column gives the status of each job (H: held, I: idle or waiting).

    Example of condor_q
    -- Schedd: login.swan.hcc.unl.edu : <129.93.227.113:9619?...
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    1013034.4   apple       3/26 16:34   0+00:21:00 H  0   0.0  sjrun.py INPUT/INP
    1013038.0   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    1013038.1   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    1013038.2   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    1013038.3   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    ...
    16 jobs; 0 completed, 0 removed, 12 idle, 0 running, 4 held, 0 suspended
    
  3. How to release a job - in a few cases a job may be held for reasons such as an authentication failure or other non-fatal errors. In those cases you may use the shell command below to release the job from the held status so that it can be rescheduled by HTCondor.

    Release one job:

    [apple@login.swan ~] $ condor_release 1013034.4

    Release all jobs of a user apple:

    [apple@login.swan ~] $ condor_release apple

  4. How to delete a submitted job - if you want to delete a submitted job, you may use the shell commands listed below.

    Delete one job:

    [apple@login.swan ~] $ condor_rm 1013034.4

    Delete all jobs of a user apple:

    [apple@login.swan ~] $ condor_rm apple

  5. How to get help on an HTCondor command

    You can use man to get a detailed explanation of an HTCondor command

    Example of help of condor_q
    [apple@glidein ~] $ man condor_q
    
    Output of man condor_q
    just-man-pages/condor_q(1)                          just-man-pages/condor_q(1)
    Name
           condor_q Display information about jobs in queue
    Synopsis
           condor_q [ -help ]
           condor_q  [  -debug  ]  [  -global ] [ -submitter submitter ] [ -name name ] [ -pool centralmanagerhost-
           name[:portnumber] ] [ -analyze ] [ -run ] [ -hold ] [ -globus ] [ -goodput ] [ -io ] [ -dag ] [ -long  ]
           [  -xml  ]  [ -attributes Attr1 [,Attr2 ... ] ] [ -format fmt attr ] [ -autoformat[:tn,lVh] attr1 [attr2
           ...]  ] [ -cputime ] [ -currentrun ] [ -avgqueuetime ] [ -jobads file ] [ -machineads file ] [  -stream-
           results ] [ -wide ] [ {cluster | cluster.process | owner | -constraint expression ... } ]
    Description
           condor_q displays information about jobs in the Condor job queue. By default, condor_q queries the local
           job queue but this behavior may be modified by specifying:
              * the -global option, which queries all job queues in the pool
              * a schedd name with the -name option, which causes the queue of the named schedd to be queried
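
The submit and status steps above can be scripted by parsing the text that condor_submit and condor_q print.  The helpers below are a sketch that assumes the output layouts shown in this document; the function names cluster_id and count_by_status are hypothetical, and a different HTCondor version may print condor_q output in another layout, in which case the second helper would need adjusting.

```shell
#!/bin/bash
# Extract the cluster number from condor_submit output read on stdin,
# e.g. from the line "6 job(s) submitted to cluster 1013038".
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p'
}

# Count jobs per status (ST column) in condor_q output read on stdin.
# Assumes the layout shown above: job lines start with a cluster.process
# ID and the status letter is the sixth whitespace-separated field.
count_by_status() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on a submit host (requires HTCondor, so shown as comments only):
#   condor_submit applejob.txt | cluster_id
#   condor_q apple | count_by_status
```

Reading from stdin keeps both helpers independent of HTCondor itself, so they can be tried on saved output before being used on a live queue.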
    

    Next: A simple example of submitting an HTCondor job