2025-03-11

Grid search implementation on the University of Vermont's supercomputing cluster. Overview, code snippets, design decisions, and open questions. I don't think we're in AWS land anymore. Thanks to the VACC team for their help with the project.

VACC grid search scripting

The VACC is a supercomputing cluster at UVM. Authorized students, faculty, and other affiliates can use it to run computational tasks that would be too expensive for a laptop. Access requires a UVM-issued identity called a NetID, and an account for billing purposes. In my experience the account usually comes from an advisor or from the instructor of a class, and every student should already have a NetID.

The VACC has a graphical user interface called OnDemand which may better suit simple manual tasks, but here I focus on building blocks for automation using the Linux command line. I run a Debian-based Linux distribution on my laptop, but macOS, Windows, or other varieties of Linux should suffice with minimal adaptation; most of the code just runs on the VACC anyway. The idea is to upload a python project to the VACC, queue jobs using the SLURM scheduling system, and download the results when the jobs finish.

Connecting

Interacting with the VACC requires a secure channel of communication in the form of an ssh connection to one of its login nodes behind login.vacc.uvm.edu. The login nodes unfortunately do not support public key authentication, only authentication by username and password. The username is a valid NetID, and the password is the password associated with that NetID. I run this code on my laptop. I use environment variables, but they aren't necessary.

# Host is the VACC login node
HOST=login.vacc.uvm.edu

# Assuming the username is bob
NETID=bob

# Install ssh and scp clients
sudo apt install openssh-client

# Test the connection
ssh $NETID@$HOST

# If working then exit when done
exit

The first time I connect, ssh is unable to establish the host's authenticity and prompts me to confirm the connection by entering yes. This prompt causes the sshpass utility I introduce next to fail, so I get it out of the way now; the prompt will not reappear for subsequent connection attempts. Or better yet, perhaps there is a switch to circumvent the authenticity check altogether? Also, the VACC documentation suggests I must be on campus or on the university's VPN to connect, but so far I have had no issues.
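One possible answer to that question, offered as a sketch rather than something I have verified on the VACC: OpenSSH 7.6 and later accept the option StrictHostKeyChecking=accept-new. It records the key of a host seen for the first time without prompting, but still refuses to connect if the key of an already known host changes, so it is milder than switching the check off. The SSH_OPTS variable name is my own invention, and the fallback values for NETID and HOST are only there so the snippet runs on its own.

```shell
# Hypothetical variable holding options shared by later ssh and scp calls.
# accept-new trusts a host key on first contact but rejects a changed key.
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

# Print the command rather than run it; remove "echo" to connect for real.
# SSH_OPTS is left unquoted on purpose so it splits into two arguments.
echo ssh $SSH_OPTS "${NETID:-bob}@${HOST:-login.vacc.uvm.edu}" pwd
```

The same option string can go between ssh (or scp) and the destination in any of the later commands.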

Authentication by username and password has the drawback that any scripts I eventually write will prompt me for my password every time they need to connect. One workaround, not without security risks, is to put the password in a local file. I can then use sshpass to feed the password to ssh or scp.

# Create a directory for password file
mkdir $HOME/.sshpasswds

# Path to file containing password
PWPATH=$HOME/.sshpasswds/uvm

# Save password in a file
vim $PWPATH

# Optionally lock down access somewhat
chmod 400 $PWPATH

# Install sshpass if not yet installed
sudo apt install sshpass

# Optionally test ssh with sshpass
sshpass -f $PWPATH ssh $NETID@$HOST pwd

The optional test on the last line outputs my home directory on the login node. Then ssh lands me back on my laptop right away when the command finishes. See also this tutorial for more about sshpass and its switches.

An inconvenience I have not solved is that the VACC sometimes forces a cooldown period if I connect too many times. It doesn't really happen during normal use, but it can happen during the iterative process of debugging a script. I recall reading somewhere about a way to bundle many ssh and scp operations into a single remote session, which may help avoid this.
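The feature I half remember is probably OpenSSH connection multiplexing. The ControlMaster, ControlPath, and ControlPersist options let the first ssh call open a master connection, which later ssh and scp calls reuse through a local socket instead of logging in again. The sketch below only prints the command it would run. Whether the VACC login nodes permit multiplexing is an assumption I have not tested, and the MUX_OPTS name, the socket path, and the ten minute lifetime are arbitrary choices.

```shell
# Hypothetical option string for connection sharing.
# ControlMaster=auto : reuse a master connection if one exists, else create it
# ControlPath        : local socket; %r, %h, %p expand to user, host, port
# ControlPersist=10m : keep the master open 10 minutes after last use
MUX_OPTS="-o ControlMaster=auto -o ControlPath=$HOME/.ssh/vacc-%r@%h:%p -o ControlPersist=10m"

# Print the command rather than run it; remove "echo" to connect for real.
# scp accepts the same -o options, so later copies can share the session.
echo sshpass -f "${PWPATH:-$HOME/.sshpasswds/uvm}" \
  ssh $MUX_OPTS "${NETID:-bob}@${HOST:-login.vacc.uvm.edu}" pwd
```

If this works, only the first call in a burst of script runs would count as a new login.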

Project organization

The python community maintains an enormous number of complicated tools to manage python installations, project dependencies, and version conflicts. Different projects use different tools. The VACC strongly encourages projects to use conda, so I add an environment.yaml file to my project, defining the conda environment; but to support the use of my project without conda, I list all my python package dependencies in a requirements.txt file like so.

mpyc==0.10
numpy==2.0.1
pandas==2.2.2

Then the environment.yaml file for conda essentially becomes a thin wrapper around pip. The environment.yaml file tells conda to install pip and then use it to install the package dependencies from requirements.txt. Another benefit of this approach is the certainty of having access to all the packages. Not all packages are available in conda natively.

name: myproject
channels:
  - defaults
dependencies:
  - python=3.11
  - pip
  - pip:
    - -r requirements.txt

I also like to install the project itself as a package. Then if the project had multiple source files, I could freely import them from each other without worrying about their relative paths. Installing the project as a package also helps with unit testing. Different approaches are possible. I use a pyproject.toml file. This configuration works because all my source code is in a src directory in my project, and I do not have any nested source directories.

[project]
name = "src"
version = "0.1"

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
packages = ["src"]

I install the project as a package in editable mode. Editable mode means that changes to the source code get immediately reflected in the installed package: a very useful feature for iterative development. I do this step locally on my laptop, as well as on the VACC.

python -m pip install -e .

At risk of further complicating an already complicated process, I use pyenv on my laptop to isolate from each other all the versions of python, build tools, and package dependencies which the projects I work on require. I create a new environment with pyenv, and add a .python-version file to my project containing the name of the environment. My bash shell detects this file when I enter the project directory, and has pyenv activate the appropriate environment automatically. Conversely when I leave the project directory, pyenv deactivates the environment again.

The scripts

The remaining code snippets fit nicely into scripts. I keep the scripts related to my project in the script directory of my project. Some scripts are for running on my laptop, and others are for running on the VACC. Some scripts are specific to projects like mine that are based on a grid search.

Script	Purpose	Target
`copy_project.sh`	Copies the project from my laptop to the VACC. Recreates a conda environment on the VACC with the project's stated python package dependencies as needed.	Laptop
`copy_output.sh`	Optionally verifies none of the project's jobs are running on the VACC. Deletes and recreates the output directory on my laptop. Recursively copies the results directory from VACC to my laptop.	Laptop
`vacc_init_env.sh`	Recreates a conda environment on the VACC with the project's stated python package dependencies as needed.	VACC
`set_vars.sh`	Sets common project related environment variables for use in other scripts.	Both
`vacc_batch.sh` Grid search only	Creates many small bash scripts, each of which submits a SLURM job using particular experimental parameters when run. Projects searching over multiple grids may have multiple batch scripts.	VACC
`vacc_job.sh` Grid search only	Loads the project's conda environment. Calls the main python script, supplying appropriate command line arguments from environment variables.	VACC
`vacc_run_all.sh` Grid search only	Runs all the bash scripts created by `vacc_batch.sh` at once, thereby submitting many jobs to SLURM.	VACC

A number of environment variables are shared across the scripts, both the scripts that run on my laptop and those that run on the VACC. Many of these we have seen and used already. I add them to set_vars.sh and use source at the beginning of the other scripts to make them available.

# Host is the VACC login node
HOST=login.vacc.uvm.edu

# Assuming the username is bob
NETID=bob

# Name of the project
PRJNAME=project

# Where project lives locally
PRJPATH=$HOME/src/$PRJNAME

# Path to password (whatever you choose)
PWPATH=$HOME/.sshpasswds/uvm

# Where output lives on VACC
OUTPATH=out

# Where output will be analyzed on laptop
ANPATH=$HOME/analysis

# Conda module
CONMOD=python3.11-anaconda/2023.09-0

The last environment variable determines what version of python the project will run under. To see what options are available on the VACC, I open an ssh session there, run module avail, note the modules that start with python, and press q to exit.

Uploading the project

The operating system package that installs ssh also comes with a utility called scp for copying files securely between computers. Its syntax is similar to cp, but with either the source or destination prefixed by the username and host. The main responsibility of copy_project.sh is to upload the project to my home directory on the VACC using scp. The -r switch recursively uploads any subdirectories the project has.

# Delete any older version of the project on the VACC
CMD="rm -rf $PRJNAME"
sshpass -f $PWPATH ssh $NETID@$HOST "$CMD"

# Copy project to VACC
sshpass -f $PWPATH scp -rC $PRJPATH $NETID@$HOST:

On the VACC, I use module to load conda, and then use conda to install all the project's dependencies. I add all these installation steps to vacc_init_env.sh.

# Clear out any loaded modules
module purge

# Load python and anaconda
module load ${CONMOD}

# Put anaconda binaries on the path
source ${ANACONDA_ROOT}/etc/profile.d/conda.sh

The vacc_init_env.sh script goes on to configure the conda environment. I recreate the environment only if necessary, and install without prompting for confirmation. The --prune switch removes any dependencies I may have removed from my project.

# Create the environment if needed
if ! conda env list | grep -q "^$PRJNAME "; then
    conda create --yes --name "$PRJNAME" python=3.11
fi

# Update the environment with latest project dependencies
# (--name overrides the name given inside environment.yaml)
conda env update --name "$PRJNAME" --file environment.yaml --prune
conda activate "$PRJNAME"

# Install ourself locally
python -m pip install -e .

One option is to run vacc_init_env.sh just once from an ssh session at the outset, or to perform the steps manually; but I prefer to call vacc_init_env.sh over ssh from copy_project.sh. In this way I recreate the environment every time I copy a new version of the project to the VACC and never worry about whether the environment under which the project will run reflects any changes I recently made to its dependencies.

CMD="bash $PRJNAME/script/vacc_init_env.sh"
sshpass -f $PWPATH ssh $NETID@$HOST "$CMD"

This works because when I run copy_project.sh on my laptop, the vacc_init_env.sh script gets copied to the VACC along with the rest of the project. That is what makes it available to call there. Of course it is important to remember to source the set_vars.sh script at the beginning of all these other scripts.
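Since a forgotten source line leaves variables silently empty, a small guard can help. This is a minimal sketch under my own assumptions: the load_vars helper is hypothetical, and the list of variables it checks is just a sample of those defined in set_vars.sh.

```shell
# Hypothetical helper: source set_vars.sh from a given directory and
# fail loudly if an expected variable is still empty afterwards.
load_vars() {
  vars_file="$1/set_vars.sh"
  [ -f "$vars_file" ] || { echo "missing $vars_file" >&2; return 1; }
  . "$vars_file"
  for name in HOST NETID PRJNAME; do
    # Indirectly read the variable whose name is stored in $name
    eval "value=\${$name:-}"
    [ -n "$value" ] || { echo "$name is not set" >&2; return 1; }
  done
}

# In a real script, point it at the directory the script lives in:
#   load_vars "$(dirname "$(realpath "$0")")" || exit 1
```

Resolving the directory from the script's own path means the call works the same whether the script is started from my laptop, over ssh, or from inside a job.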

Starting jobs

Different projects use the VACC in different ways. Some projects only start a single job that runs within the maximum allotted time. Other projects are best conceptualized as a single job, but if a single job cannot finish within the maximum allotted time, it must save its progress and resume in a subsequent job, and so on. The project I describe is unlike these. It is a grid search: many jobs run independently of each other, and ideally simultaneously, all using the same code but parameterized in slightly different ways. The grid is like a high dimensional rectangle of all the combinations of parameters under which the job should run.

The VACC uses the SLURM job scheduler to manage jobs. This means that submitted jobs do not necessarily start right away, but rather enter a job queue. SLURM accepts job submissions using the sbatch utility, and later starts them once the resources they need are available on the supercomputing cluster.

While submitting jobs from my laptop is possible by sending one-off commands across individual ssh sessions as shown previously, I prefer doing it directly on the VACC from an interactive ssh session. I also prefer not submitting jobs directly from vacc_batch.sh, but rather having vacc_batch.sh create many small bash scripts, each of which submits a job when run. This has several advantages:

I can carefully review the actual sbatch commands that will submit my jobs. I can see how many jobs there are. I can see what the environment variables in each job are.
I have the full flexibility of Linux filesystem commands to control what jobs get run, and which will not. I can easily rerun individual failed jobs.
I can still run all the job submission scripts at once with a simple script like vacc_run_all.sh, so this extra step of creating individual submission scripts adds no real inconvenience.
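To make the second advantage concrete, here is a self-contained sketch that uses a temporary directory and stand-in scripts, so nothing is submitted to SLURM. The STAGE variable, the held subdirectory, and the echo-only scripts are all illustrative assumptions; the real submission scripts would contain sbatch commands.

```shell
# Stand-in stage directory with three fake submission scripts (demo only)
STAGE=$(mktemp -d)
for n in 1 2 3; do
  echo "echo would submit example_batch_$n" > "$STAGE/example_batch_$n.sh"
done

# Review what each submission would do before running anything
cat "$STAGE"/example_batch_*.sh

# Hold back job 2 by moving its script aside, then run the rest
mkdir "$STAGE/held"
mv "$STAGE/example_batch_2.sh" "$STAGE/held/"
for script in "$STAGE"/*.sh; do
  bash "$script"
done
```

Rerunning a single failed job is then just a matter of running its one script again.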

The vacc_batch.sh script starts by getting the directory it lives in. This is important so it knows the correct fully qualified path to use for the actual script that runs the job, which lives in the same directory.

SCRIPT_DIR=$(dirname "$(realpath "$0")")

I maintain variables for the name of the batch and a number I assign sequentially to the jobs within that batch. I include the batch name and job number in the filename of the job's submission script, in the filename of the job's output log, and in the filename of whatever result the job produces. This allows easy cross-referencing between these three files.

batch="example_batch"
job=0

This version of my script assumes all the jobs in the batch require the same resources to run. The script will need to be more sophisticated if this is not the case. Another way to solve that problem is to use multiple batches for the different resource requirements.

The nested loops iterate over all the combinations of parameter values. They export the parameter values for use by the job submission script. The submission scripts get created in a stage directory in my home directory on the VACC.

# Parameter ranges
p1s=("1" "2" "3" "4")
p2s=("a" "b" "c")

# Make sure the stage directory exists
mkdir -p "$HOME/stage"

for p1 in "${p1s[@]}"
do
  for p2 in "${p2s[@]}"
  do
    job=$((job + 1))
    outfile="$HOME/stage/${batch}_$job.sh"
    # Escaped backslashes become line continuations in the generated script
    echo "sbatch \\
      --nodes=1 \\
      --time=12:00:00 \\
      --ntasks=1 \\
      --cpus-per-task=1 \\
      --mem=16G \\
      --job-name=${batch}_$job \\
      --output=${batch}_$job.out \\
      \"--export=batch=$batch,job=$job,p1=$p1,p2=$p2\" \\
      \"$SCRIPT_DIR/vacc_job.sh\"" > "$outfile"
  done
done
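Before generating any submission scripts, a dry run of the same nested loops can confirm how large the grid is and which job number maps to which parameters. This sketch uses plain space-separated lists instead of arrays to stay portable; it prints only, and creates no files.

```shell
# Same parameter ranges as the batch script, as space-separated lists
p1s="1 2 3 4"
p2s="a b c"

job=0
for p1 in $p1s; do
  for p2 in $p2s; do
    job=$((job + 1))
    echo "job $job: p1=$p1 p2=$p2"
  done
done

# 4 values of p1 times 3 values of p2
echo "total jobs: $job"   # prints "total jobs: 12"
```

The count is the product of the lengths of the parameter lists, so adding a third parameter with five values would multiply it to 60.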

The submission scripts all call vacc_job.sh, which assumes the batch environment variable is set to the name of the batch, that job is set to the job number, and that the experimental parameters p1 and p2 are set to their desired values. With that, vacc_job.sh loads the project's conda environment and runs main.py with the appropriate command line arguments. I do not show the python script here, but it should use argparse or something similar to read the command line arguments.

# Load conda environment (CONMOD and PRJNAME come from set_vars.sh)
module load "${CONMOD}"
source "${ANACONDA_ROOT}/etc/profile.d/conda.sh"
conda activate "$PRJNAME"

cd "$HOME/$PRJNAME"
python src/main.py \
  --batch="$batch" \
  --job="$job" \
  --p1="$p1" \
  --p2="$p2"

As mentioned, it helps to have a convenient way to run all the generated submission scripts at once, such as vacc_run_all.sh.

for script in $HOME/stage/*.sh; do
  bash "$script"
done

This script runs all the bash scripts it finds in the stage directory.

Downloading the output

The last step is to copy the batch's output back to my laptop with copy_output.sh for analysis, but how can I know if the batch finished? SLURM can send email notifications when individual jobs finish, but getting alerted when a whole batch finishes would require a long-running script, as far as I can tell. I instead determine whether the batch finished with a simple manual check.

sshpass -f $PWPATH ssh $NETID@$HOST "squeue --me"

The check shows any SLURM jobs associated with my user. If the list is empty, I can go ahead and copy the batch output. The check is good enough for my purposes, but it has weaknesses that could hinder automation:

I may be running other jobs on the VACC besides those in the batch. Then the list would contain jobs even though the batch finished. Luckily grep can help here because I prefixed the job names with the batch name. Perhaps one of the many available squeue switches could also help.
The check cannot distinguish between jobs that finished because they succeeded, and jobs that finished because they failed. Even though SLURM has job statuses for Success and Failure, it forgets the jobs shortly after the jobs finish. Possibly this has more to do with the configuration of SLURM on the VACC than with SLURM generally. My jobs only write output when they succeed, so instead I identify job failures by the conjunction of an empty queue and missing output files.
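Both workarounds can be sketched as small shell functions. The function names are hypothetical, and so is the .csv extension for result files: all I have said is that result filenames include the batch name and job number. The first function filters job names by batch prefix, and the second lists job numbers whose result file is missing.

```shell
# Count jobs whose name starts with the batch prefix.
# Reads job names on stdin, one per line.
count_batch_jobs() {
  # grep -c exits nonzero when the count is 0, so mask that status
  grep -c "^${1}_" || true
}

# List job numbers 1..total whose result file is missing from a directory.
# Assumes results are named <batch>_<job>.csv (a hypothetical convention).
missing_results() {
  batch=$1
  total=$2
  dir=$3
  n=1
  while [ "$n" -le "$total" ]; do
    [ -f "$dir/${batch}_$n.csv" ] || echo "$n"
    n=$((n + 1))
  done
}

# Possible use on the VACC, where %j is the squeue job name field:
#   squeue --me --noheader --format=%j | count_batch_jobs example_batch
#   missing_results example_batch 12 "$HOME/out"
```

A count of zero from the first function, combined with empty output from the second, would mean the batch finished and every job wrote its result.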

Beyond the question of whether the batch finished (and finished successfully), copy_output.sh is a straightforward process of preparing space for the output files on my laptop, and copying them over with the scp utility.

# Remove any old results (:? aborts if ANPATH is unset or empty)
rm -rf "${ANPATH:?}"

# Recreate results
mkdir -p "$ANPATH"

# Download results to directory
sshpass -f "$PWPATH" scp -rC "$NETID@$HOST:$OUTPATH" "$ANPATH"

And with the experimental results on my laptop, the joy of data analysis can begin!