Condor's primary queue is oriented towards cycle harvesting. This means that jobs running in the vanilla universe on Wren are subject to preemption, for example when a machine's owner resumes using it or a higher-priority job claims the slot.

Some users need to run long jobs without checkpointing, where preemption means restarting from the beginning and losing days of computing time. In practice it can mean never finishing the job at all.

To alleviate this problem we have provided a dedicated queue. This queue runs from a different submission machine, hooke. The queue does have a disadvantage in that it has fewer assigned machines, so you may have to wait longer for a slot to open up. However, if you have just a few long-running jobs, it's the way to go.

Submitting from hooke requires only a couple of changes.

First, obviously, you must log into hooke to submit the job: simply run "ssh hooke" from wren whenever your code is ready to go. You don't need to do any of the development on hooke; do that on wren as usual, then log into hooke when you're ready to submit. Once you've submitted the job, you will need to monitor it from hooke as well, since wren's queue and history will not contain the details of the job.
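A typical session might look like the following. The commands are the standard Condor tools; the exact output will depend on your job.

    # From wren, log into the dedicated-queue submission machine
    ssh hooke

    # After submitting the job, monitor it from hooke:
    condor_q            # jobs still in hooke's queue
    condor_history      # jobs that have already left the queue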

Finally, the submit file requires a couple of changes. The universe should be changed to "parallel", and the line "machine_count = 1" needs to be added before the "queue" statement.

So an example submit file for the dedicated queue from hooke looks like this:

universe   = parallel
executable = mycode
output     = mycode.out
error      = mycode.err
arguments  = 10 20
machine_count = 1
queue

Once you have this in place you can submit the job as usual via "condor_submit", and the job should run to completion rather than being subject to preemption.
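For example, assuming the submit description above was saved as mycode.sub (the filename is arbitrary), submitting from hooke and checking on the job looks like:

    condor_submit mycode.sub   # submit to the dedicated queue
    condor_q                   # verify the job is idle or running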