bitflippin.com
Steven's personal website and bit flipping laboratory
2024-07-12

Automation script for VACC jobs

I wrote a script called vaccjob to automate the transfer, submission, monitoring, and output retrieval of jobs on UVM's in-house supercomputing cluster.

Password authentication

The VACC does not support public key authentication, only password authentication. To avoid being prompted for the UVM NetID password mid-script, one workaround is to put the password in a local file and use the sshpass utility to feed it to ssh or scp. If ssh and sshpass are not already installed, install them with your operating system's package manager.

# Optionally create a directory for passwords
mkdir ~/.sshpasswds

# Put password in a file
vim ~/.sshpasswds/uvm

# Optionally lock down access somewhat
chmod 400 ~/.sshpasswds/uvm

# Install utilities (Arch Linux)
sudo pacman -S openssh sshpass
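
As an alternative to editing the password file by hand, here is a minimal sketch of a helper that reads the password without echoing it and creates the file under a restrictive umask. The name save_secret is hypothetical and not part of vaccjob, and the sketch assumes bash. sshpass -f takes the password from the first line of the file, which is what this writes.

```shell
# Hypothetical helper (not part of vaccjob): store one line read from
# standard input in a file that only the owner can read.
save_secret() {
   local file=$1 secret
   mkdir -p "$(dirname "$file")"
   rm -f "$file"                     # an old read-only copy would block the write
   (
      umask 077                      # new files get no group/other access
      IFS= read -rs secret           # -s: do not echo what is typed
      printf '%s\n' "$secret" > "$file"
   )
   chmod 400 "$file"                 # owner read-only, as in the steps above
}
```

Usage would be save_secret ~/.sshpasswds/uvm, followed by typing the password and pressing Enter.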

Workflow

Bob the developer has UVM NetID bobsnetid and a password saved in ~/.sshpasswds/uvm. Bob writes a program he wants to run on bluemoon on the VACC. The program and associated files are in the local ~/src/vj directory. The job script, including any necessary SLURM directives, is ~/src/vj/job.sh locally. Bob designed the program to save output files in the ~/remoteout directory on the VACC. Bob wants any output to end up in the local ~/localout directory when the job finishes. Bob runs

vaccjob bobsnetid ~/.sshpasswds/uvm ~/src/vj vj/job.sh remoteout ~/localout

Because this line is long and its parameters are cumbersome, Bob may choose to save it as a script of its own for easy reuse. In the happy path, vaccjob will

  1. Recursively copy ~/src/vj to Bob's home directory on the VACC, where it becomes ~/vj
  2. Submit the job by passing ~/vj/job.sh to sbatch
  3. Create a file ~/.vaccjobid locally containing the job ID
  4. Poll the job status every 10 minutes until the job completes
  5. Recursively copy ~/remoteout from the VACC to ~/localout on Bob's computer
  6. Delete ~/.vaccjobid

Bob may need to turn his computer off before his job finishes. Bob can then run vaccjob again later, whereupon vaccjob will notice that ~/.vaccjobid already exists and skip straight to polling. If at any point an error occurs, vaccjob terminates immediately with a message.
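
For illustration, a job script like Bob's ~/src/vj/job.sh might look roughly like the sketch below. Everything in it is an assumption rather than something taken from a real VACC job: the job name, the time limit, the log file name, and the placeholder "program" that just records the host name and date. Real jobs on bluemoon may need further directives, such as a partition.

```shell
#!/bin/bash
#SBATCH --job-name=vj
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
# A relative --output path is resolved from the directory sbatch was run in.
# vaccjob runs sbatch from the home directory, so this log lands in
# ~/remoteout and is retrieved with the rest of the output. %j is the job ID.
#SBATCH --output=remoteout/vj-%j.log

# vaccjob creates ~/remoteout before submitting; mkdir -p keeps the script
# usable when it is submitted by hand as well.
mkdir -p "$HOME/remoteout"

# Placeholder for the real program: record where and when the job ran.
{
   echo "host: $(uname -n)"
   echo "finished: $(date)"
} > "$HOME/remoteout/result.txt"
```

The only requirement vaccjob places on the job is that whatever should come back to Bob's computer is written under the remote output directory given as the fifth parameter.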

Script

Here is the full vaccjob script in its current form. As always, please exercise caution with code that interacts with third-party systems and with code that can modify or delete data.

vaccjob
#!/bin/bash
set -e

HOST=vacc-user1.uvm.edu
JIDF=~/.vaccjobid
DELAY=600

if ! [ $# -eq 6 ]; then
   echo "Requires 6 parameters."
   exit 1
fi

if [ -f "$JIDF" ]; then
   jid=$(cat "$JIDF")
   echo "Resuming job id $jid"
else
   echo "Clearing payload directory on $HOST"
   payloaddir=$(basename "$3")
   sshpass -f "$2" ssh "$1"@"$HOST" "rm -rf $payloaddir"
   echo "Copying payload $3"
   sshpass -f "$2" scp -rC "$3" "$1"@"$HOST":
   echo "Clearing output directory on $HOST"
   clrcmd="mkdir -p $5; rm -rf $5/*"
   sshpass -f "$2" ssh "$1"@"$HOST" "$clrcmd"
   echo "Submitting job $4"
   jobcmd="sbatch $4"
   resp=$(sshpass -f "$2" ssh "$1"@"$HOST" "$jobcmd")
   jid="${resp##* }"
   echo "Job id is $jid"
   echo "$jid" > "$JIDF"
   sleep "$DELAY"
fi

while true; do
   pollcmd="squeue -h -j $jid -o %T"
   set +e
   resp=$(sshpass -f "$2" ssh "$1"@"$HOST" "$pollcmd")
   set -e
   if [ -z "$resp" ]; then
      break
   fi
   echo "$resp"
   sleep "$DELAY"
done

rm "$JIDF"
echo "Clearing local output directory $6"
# The glob must sit outside the quotes to expand; :? aborts if $6 is empty.
rm -rf "${6:?}"/*
echo "Copying output to $6"
sshpass -f "$2" scp -rC "$1"@"$HOST":"$5" "$6"
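
The job ID extraction in the script relies on sbatch replying with a line of the form "Submitted batch job <id>". The expansion ${resp##* } removes the longest prefix ending in a space, leaving only the last word. A small standalone sketch of that step, with extract_jid as a hypothetical name used only for this illustration:

```shell
# Keep only the text after the last space, as vaccjob does with ${resp##* }.
extract_jid() {
   local resp=$1
   printf '%s\n' "${resp##* }"
}

extract_jid "Submitted batch job 123456"   # prints 123456
```

sbatch also has a --parsable option that prints the job ID on its own, followed by a semicolon and the cluster name on multi-cluster setups, which would avoid depending on the wording of the message.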

Shortcomings

Sometimes a harmless error message about an invalid job id appears at the end. I think this happens when the job has been finished long enough for SLURM to forget its id. Also, vaccjob does not know whether the job finished successfully. It tries to retrieve the output either way.
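
One possible way to address the second shortcoming is to ask SLURM's accounting tool, sacct, for the job's final state once squeue no longer lists it. This is an untested sketch, not part of vaccjob: it assumes accounting is enabled on the cluster and that sacct is available on the login node, and job_succeeded is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the first line of sacct's State
# column is COMPLETED. Other states include FAILED, TIMEOUT, and
# "CANCELLED by <uid>".
job_succeeded() {
   local state
   state=$(printf '%s\n' "$1" | head -n 1)
   state=${state%% *}     # drop anything after the first space, e.g. "by <uid>"
   [ "$state" = "COMPLETED" ]
}

# On the cluster, the state would come from something like:
#    state=$(sacct -j "$jid" -n -X -P -o State)
# Here two sample values stand in for that output.
if job_succeeded "COMPLETED"; then echo "job succeeded"; fi
if ! job_succeeded "CANCELLED by 1000"; then echo "job did not succeed"; fi
```

In vaccjob the sacct command would be run over ssh in the same way as the squeue poll, and the script could then warn the user or skip the final copy when the job did not complete.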
