Automation script for VACC jobs
I wrote a script called vaccjob to automate the transfer, submission, monitoring, and output retrieval of jobs on UVM's in-house supercomputing cluster.
Password authentication
The VACC does not support public key authentication, only password authentication. One way to avoid being prompted for the UVM NetID password mid-script is to put the password in a local file and use the sshpass utility to feed the password to ssh or scp. Install ssh and sshpass, if they are not installed already, using the appropriate process for your operating system.
# Optionally create a directory for passwords
mkdir ~/.sshpasswds
# Put password in a file
vim ~/.sshpasswds/uvm
# Optionally lock down access somewhat
chmod 400 ~/.sshpasswds/uvm
# Install utilities (Arch Linux)
sudo pacman -S openssh sshpass
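To check that the password file works before running anything longer, a small helper along these lines may be useful. This is a minimal sketch: the helper name vacc_ssh and the variable names are hypothetical, the host name is the one used by the script later in this post, and the NetID in the example is a placeholder.

```shell
# Hypothetical helper: run a command on the VACC without a password prompt.
# Assumes the password was saved in ~/.sshpasswds/uvm as shown above.
VACC_HOST=vacc-user1.uvm.edu
VACC_PWFILE="$HOME/.sshpasswds/uvm"

vacc_ssh() {
    # $1 is the UVM NetID; the remaining arguments form the remote command.
    netid="$1"
    shift
    sshpass -f "$VACC_PWFILE" ssh "$netid@$VACC_HOST" "$@"
}

# Example (requires network access and a valid NetID):
# vacc_ssh yournetid hostname
```

If the example prints the login node's host name without asking for a password, sshpass is reading the file correctly.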
Workflow
Bob the developer has the UVM NetID bobsnetid and a password saved in ~/.sshpasswds/uvm. Bob writes a program he wants to run on bluemoon on the VACC. The program and associated files are in the ~/src/vj directory locally. The job script, including any necessary SLURM directives, is ~/src/vj/job.sh locally. Bob designed the program to save output files in the ~/remoteout directory on the VACC. Bob wants to have any output in the ~/localout directory locally when the job finishes. Bob runs
vaccjob bobsnetid ~/.sshpasswds/uvm ~/src/vj vj/job.sh remoteout ~/localout
Due to its length and cumbersome parameters, Bob may choose to save this line as a script of its own for easy reuse. In the happy path, vaccjob will
- Recursively copy ~/src/vj to Bob's home directory on the VACC, where it becomes ~/vj
- Submit the job by passing ~/vj/job.sh to sbatch
- Create a file ~/.vaccjobid locally containing the job ID
- Poll the job status every 10 minutes until the job completes
- Delete ~/.vaccjobid
- Recursively copy ~/remoteout from the VACC to ~/localout on Bob's computer
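Because the full invocation is long, Bob might save it in a small wrapper. The sketch below assumes vaccjob is on the PATH; the function name run_vj is made up for illustration, and the arguments are the ones from Bob's example.

```shell
#!/bin/bash
# Hypothetical wrapper around Bob's invocation; run_vj is an illustrative name.
# Assumes vaccjob is on the PATH.
run_vj() {
    vaccjob bobsnetid "$HOME/.sshpasswds/uvm" "$HOME/src/vj" vj/job.sh remoteout "$HOME/localout"
}

# Uncomment to submit (or resume) the job when this file is executed:
# run_vj
```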
Bob may need to turn his computer off before his job finishes. Bob can then run vaccjob again later, whereupon vaccjob will notice that ~/.vaccjobid already exists and skip straight to polling. If at any point an error occurs, vaccjob terminates immediately with a message.
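To see whether a later run would resume or start fresh, Bob could inspect the job ID file directly. This is only a sketch: the helper name vaccjob_pending is hypothetical, and the default path matches the JIDF variable in the script below.

```shell
# Hypothetical helper: report whether vaccjob would resume an earlier job.
# JIDF defaults to the job ID file that vaccjob uses.
JIDF="${JIDF:-$HOME/.vaccjobid}"

vaccjob_pending() {
    if [ -f "$JIDF" ]; then
        echo "Would resume job id $(cat "$JIDF")"
    else
        echo "No saved job id; would submit a new job"
    fi
}

vaccjob_pending
```

Deleting the job ID file by hand makes the next run submit a fresh job instead of resuming.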
Script
Here is the full vaccjob script in its current form. As always, please exercise caution with code that interacts with third-party systems, and with code that can potentially modify or delete data.
#!/bin/bash
# Usage: vaccjob NETID PWFILE PAYLOADDIR JOBSCRIPT REMOTEOUT LOCALOUT
set -e

HOST=vacc-user1.uvm.edu
JIDF=~/.vaccjobid
DELAY=600

if ! [ $# -eq 6 ]; then
    echo "Requires 6 parameters."
    exit 1
fi

if [ -f "$JIDF" ]; then
    # A job ID file from an earlier run exists, so skip straight to polling
    jid=$(cat "$JIDF")
    echo "Resuming job id $jid"
else
    echo "Clearing payload directory on $HOST"
    payloaddir=$(basename "$3")
    sshpass -f "$2" ssh "$1"@"$HOST" "rm -rf $payloaddir"

    echo "Copying payload $3"
    sshpass -f "$2" scp -rC "$3" "$1"@"$HOST":

    echo "Clearing output directory on $HOST"
    clrcmd="mkdir -p $5; rm -rf $5/*"
    sshpass -f "$2" ssh "$1"@"$HOST" "$clrcmd"

    echo "Submitting job $4"
    jobcmd="sbatch $4"
    resp=$(sshpass -f "$2" ssh "$1"@"$HOST" "$jobcmd")
    # sbatch replies "Submitted batch job <id>"; keep the last word
    jid="${resp##* }"
    echo "Job id is $jid"
    echo "$jid" > "$JIDF"
    sleep "$DELAY"
fi

while true; do
    pollcmd="squeue -h -j $jid -o %T"
    # squeue may fail once SLURM has forgotten the job, so tolerate errors
    set +e
    resp=$(sshpass -f "$2" ssh "$1"@"$HOST" "$pollcmd")
    set -e
    # Empty output means the job is no longer queued or running
    if [ -z "$resp" ]; then
        break
    fi
    echo "$resp"
    sleep "$DELAY"
done

rm "$JIDF"

echo "Clearing local output directory $6"
# The glob must stay outside the quotes, otherwise it is not expanded
rm -rf "${6:?}"/*

echo "Copying output to $6"
sshpass -f "$2" scp -rC "$1"@"$HOST":"$5" "$6"
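The job ID extraction in the script relies on a shell parameter expansion. sbatch normally replies with a line such as "Submitted batch job 1234567", and ${resp##* } removes everything up to and including the last space. A small illustration, using a made-up job ID:

```shell
# Sample sbatch response; the job ID here is made up for illustration.
resp="Submitted batch job 1234567"

# Strip the longest prefix that ends in a space, leaving the last word.
jid="${resp##* }"
echo "$jid"
```

sbatch also has a --parsable option that prints only the job ID (followed by a semicolon and the cluster name, if one is present), which could be used in place of the expansion.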
Shortcomings
Sometimes a harmless error message about an undefined job id appears at the end. I think this happens when the job has been terminated long enough for SLURM to forget what its id was. Also, vaccjob is unaware of whether the job finished successfully or not. It tries to retrieve the output either way.
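One possible way to address the second shortcoming is to ask SLURM's accounting records for the job's final state before copying output. This is an untested sketch, not part of vaccjob: it assumes job accounting is enabled on the VACC and that sacct is available on the login node, and both function names are hypothetical.

```shell
# Hypothetical helper: succeed only if the given SLURM state means success.
job_state_ok() {
    case "$1" in
        COMPLETED) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: fetch the final state of job $3 for NetID $1,
# using password file $2. -X limits output to the job allocation itself,
# -n drops the header, and -o State selects the state column.
final_state() {
    sshpass -f "$2" ssh "$1@vacc-user1.uvm.edu" "sacct -X -n -j $3 -o State" | tr -d ' \n'
}

# Possible use inside vaccjob, after the polling loop:
# state=$(final_state "$1" "$2" "$jid")
# job_state_ok "$state" || { echo "Job ended in state $state"; exit 1; }
```

Treating only COMPLETED as success is a deliberately strict choice; states such as FAILED, CANCELLED, or TIMEOUT would stop the script before it retrieves output.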