MPI Jobs | H. Milton Stewart School of Industrial and Systems Engineering

Condor supports high-speed distributed parallel computing using OpenMPI. MPI jobs require use of the dedicated queue on hooke, and can't be submitted from wren.

MPI jobs run on the Boyle nodes: 16 eight-core machines joined by a low-latency InfiniBand interconnect. OpenMPI jobs automatically use the fastest available transport for message passing between processes.

Condor handles MPI jobs through a front-end wrapper script called ompiscript (full path: /usr/local/bin/ompiscript). The wrapper is given as the executable in the submit file, and the submitter's MPI executable becomes its first command-line argument. The wrapper automates the usual tasks associated with starting an MPI job: it creates a hosts file, runs the "mpirun" command, and takes care of passwordless ssh access.

Users must still verify that their MPI code runs correctly from the shell before submitting it to Condor.

Submit file examples follow.

Example 1: Running on any available nodes

The generic form of MPI submission runs on any available CPU (currently only the Boyle nodes). Contact IT first if you intend to submit an MPI job that uses more than 32 CPUs.

universe = parallel
executable = /usr/local/bin/ompiscript
arguments = exe1 argv1 argv2 ....
## getenv is needed for your env to be present on execute nodes
getenv = True
## stdout
output = exe1.out
## stderr
error = exe1.err
machine_count = 32
queue

If you want separate output files for each process, you can use a macro like this:

output = out/exe1.out.$(NODE)
error = out/exe1.err.$(NODE)
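
For reference, here is a minimal sketch of Example 1 with the per-node macro substituted in. It assumes that a directory named out already exists in the directory you submit from, since Condor is not expected to create it for you. The log line is an addition for illustration, not part of the original example; exe1.log is a hypothetical file name.

```
universe = parallel
executable = /usr/local/bin/ompiscript
arguments = exe1 argv1 argv2
getenv = True
## one stdout and one stderr file per node; the out directory must already exist
output = out/exe1.out.$(NODE)
error = out/exe1.err.$(NODE)
## assumption: optional Condor event log, hypothetical file name
log = exe1.log
machine_count = 32
queue
```

Comments are kept on their own lines here so that they cannot be read as part of a value.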

Example 2: Running a job on a specific node

A job that runs on 8 CPUs can be kept on a single node, which gives slightly faster interprocess communication. To do this, find a node with no jobs currently running on it and specify that node in the submit file.

universe = parallel
executable = /usr/local/bin/ompiscript
arguments = exe2 argv1 argv2 ....
getenv = True
output = exe2.out
error = exe2.err
machine_count = 8
queue
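
One common way to name the node is a requirements expression that matches on the Machine attribute. The sketch below is an assumption rather than this site's confirmed recipe: boyle1.example.edu is a hypothetical hostname, to be replaced with the full hostname of the idle node you picked (for example, as listed by condor_status).

```
universe = parallel
executable = /usr/local/bin/ompiscript
arguments = exe2 argv1 argv2
getenv = True
output = exe2.out
error = exe2.err
machine_count = 8
## hypothetical hostname: replace with the idle node you selected
requirements = (Machine == "boyle1.example.edu")
queue
```

If the chosen node picks up other work before your job is matched, the job waits in the queue until 8 CPUs on that node are free.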
