pytorch: Open-source machine learning library

Overview of package
General usage
Availability of package by cluster
Notes on using Singularity based pytorch images
GPU support
Distributed pytorch

Overview of package

General information about package
Package:	pytorch
Description:	Open-source machine learning library
For more information:	https://pytorch.org/
Categories:	ComputerVision MachineLearning NumericalMethods PythonModule
License:	OpenSource (BSD)

General usage information

PyTorch is an open-source machine learning library that evolved from the (no longer supported) Lua-based Torch library. It is commonly used for computer vision and natural language processing tasks. Among its high-level features are:

Tensor computing (like NumPy) with strong GPU acceleration
Deep neural networks built on a tape-based automatic differentiation system

This module will add the torch, torchvision, and torchsummary modules to your PYTHONPATH.

Available versions of the package pytorch, by cluster

This section lists the available versions of the package pytorch on the different clusters.

Available versions of pytorch on the Zaratan cluster

Version	Module tags	CPU(s) optimized for	GPU ready?
2.0.1	pytorch/2.0.1	icelake, zen2	Y
1.11.0	pytorch/1.11.0	zen2	Y

Please note that despite the module being named 'pytorch', you should use 'import torch' or similar in your Python code.
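
The naming mismatch can be checked directly from Python. The following is a minimal sketch, not a site-provided tool: the helper name importable is hypothetical, and it relies only on the standard-library importlib.util.find_spec call to report whether each name can be imported by the interpreter that runs it.

```python
import importlib.util

def importable(names):
    """Return {name: True/False} for whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# torch, torchvision and torchsummary are the Python modules provided by the
# pytorch environment module; "pytorch" is the name of the environment module,
# not the name to use in an import statement.
report = importable(["torch", "torchvision", "torchsummary", "pytorch"])
for name, found in report.items():
    print(f"{name:12s} {'found' if found else 'NOT found'}")
```

If torch is reported as not found, the interpreter you are running was most likely started without the pytorch module loaded.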

NOTE for RHEL6 users: The installations of pytorch on the RHEL6 nodes of Deepthought2 are not native installations, but based on Singularity containers. This is necessitated by the complexity of installing pytorch natively on RHEL6. A section of this document is focused on using Singularity based pytorch images.

Notes on using Singularity based pytorch images

The PyTorch package is NOT natively installed on the RHEL6 nodes of the Deepthought2 cluster for various technical reasons. What is provided instead are Singularity containers which have versions of both python2 and python3 installed with support for PyTorch and related python packages.

To use the PyTorch python package, you must load the appropriate environment module (e.g. module load pytorch) and then launch the python interpreter inside the Singularity container. Note: you cannot access the torch/pytorch python packages from the native python installations (e.g. module load python); you must use the python installation in the container for PyTorch.

To assist with this, the following wrapper scripts have been provided:

pytorch: Launches the python2 interpreter within the container, with support for the torch/pytorch package as well as various other packages. Any arguments given are passed to the python interpreter, so you can do something like pytorch myscript.py.
pytorch-python2: The same as pytorch; provided for completeness and symmetry.
pytorch-python3: Like pytorch, except that a python3 interpreter with support for the torch/pytorch package is invoked.

Please note that in all cases, the name of the module to import is torch, not pytorch.

In all cases, any arguments given to the wrapper scripts are passed directly to the python interpreter running within the container. E.g., you can provide the name of a python script, and that script will run in the python interpreter inside the container. Your home and lustre directories are accessible from within the container, so you can read and write files in those directories as usual.

Note that if you load the pytorch environment module (e.g. module load pytorch) and then issue the python command, you will start up a natively installed python interpreter which does NOT have the pytorch/torch python package installed. You need to start one of the python interpreters inside the container to get these packages; you can either do that using the correct singularity command, or use the friendlier wrapper scripts described above.
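
When an import of torch fails, it can help to confirm which interpreter is actually running. The sketch below is illustrative only. It assumes that the container runtime exports at least one of the SINGULARITY_CONTAINER, SINGULARITY_NAME, APPTAINER_CONTAINER or APPTAINER_NAME environment variables inside a running container (verify this on your system), and the helper name interpreter_report is hypothetical.

```python
import importlib.util
import os
import sys

# Assumption: the container runtime sets at least one of these variables
# for processes running inside a container.
CONTAINER_VARS = ("SINGULARITY_CONTAINER", "SINGULARITY_NAME",
                  "APPTAINER_CONTAINER", "APPTAINER_NAME")

def interpreter_report(env=None):
    """Summarize which interpreter is running and whether it can see torch."""
    env = os.environ if env is None else env
    return {
        "executable": sys.executable,
        "in_container": any(env.get(var) for var in CONTAINER_VARS),
        "torch_found": importlib.util.find_spec("torch") is not None,
    }

info = interpreter_report()
print(info)
if not info["torch_found"]:
    print("torch is not visible here; start python via the pytorch wrapper scripts")
```

Saving this as a script and running it once with python and once with the pytorch wrapper should show the difference between the native and containerized interpreters.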

For most users, the "containerization" of this package should not cause any real issues, and may not even be noticed. However, there are some limitations to the use of containers:

In general, you will not have access to natively installed software, just the software included in the container. So even if some package foo is installed natively on Deepthought2, it is likely not accessible from within the container (unless there is a version of it also installed inside the container).
You will likely not be able to use the python virtualenv scripts to install new python packages for use within the container, as the virtualenv command installs packages natively, and those packages would not then be available inside the container.

However, you are permitted to create your own Singularity containers and to use them on the Deepthought2 cluster. You will need to have root access on some system (e.g. your workstation or desktop) with Singularity installed to build your own containers (we cannot provide you root access on the Deepthought2 login or compute nodes). You can also copy system provided containers and edit them. More details can be found under the software page for Singularity.

GPU support

The PyTorch package can make use of GPUs on nodes with GPUs. There is nothing special that needs to be done in the module load or the various pytorch* commands, but you will need to instruct the package to use the GPUs within your python code. This is typically done by replacing a line like

device = torch.device("cpu")

with something like

device = torch.device("cuda:0")
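
Hard-coding "cuda:0" raises an error on nodes without a GPU. A common pattern, sketched here under the assumption that falling back to the CPU is acceptable for your job, is to ask torch.cuda.is_available() first. The helper choose_device_name is hypothetical and not part of PyTorch; the guarded import only exists so the snippet also runs with an interpreter that lacks torch.

```python
def choose_device_name(cuda_available, device_count, index=0):
    """Return "cuda:<index>" when that GPU exists, otherwise "cpu"."""
    if cuda_available and 0 <= index < device_count:
        return f"cuda:{index}"
    return "cpu"

try:
    import torch
except ImportError:
    # Happens when run with an interpreter that lacks torch (e.g. native python).
    torch = None

if torch is not None:
    name = choose_device_name(torch.cuda.is_available(), torch.cuda.device_count())
    device = torch.device(name)
    # Tensors and models must be created on, or moved to, the chosen device.
    x = torch.ones(4, 3, device=device)
    model = torch.nn.Linear(3, 1).to(device)
    print(device, model(x).shape)
```

The same script can then be tested on a CPU-only node and run unchanged on a GPU node.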

Distributed pytorch

Although the Singularity containers with pytorch do not have MPI support, pytorch has its own distributed package (torch.distributed) which can handle parallelizing your computations across multiple nodes. More information on using torch.distributed in your Python codes can be found at the PyTorch Distributed Tutorial and the Distributed Communication Documentation.

We recommend using the TCP based initialization, using something like the example script below:

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

#Define module command, etc
. ~/.profile
#Load the pytorch module
module load pytorch/0.4.1

#Number of processes per node to launch (20 for CPU, 2 for GPU)
NPROC_PER_NODE=20

#This is the command to run your pytorch script
#You will want to replace this
COMMAND="YOUR_TRAINING_SCRIPT.py --arg1 --arg2 ..."

#We want names of master and slave nodes
MASTER=`/bin/hostname -s`
SLAVES=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $MASTER`
#Make sure this node (MASTER) comes first
HOSTLIST="$MASTER $SLAVES"

#Get a random unused port on this host (MASTER) between 2000 and 9999
#seq lists the candidate ports between 2000 and 9999
#The ss pipeline lists the local TCP ports currently in use
#comm -23 keeps the candidates not in use; shuf picks one at random
MPORT=`comm -23 <(seq 2000 9999 | sort) \
        <(ss -tan | awk 'NR>1 {print $4}' | sed 's/.*://' | sort -u) | \
        shuf -n 1`



#Launch the pytorch processes, first on master (first in $HOSTLIST) then
#on the slaves
RANK=0
for node in $HOSTLIST; do
        ssh -q $node \
                pytorch -m torch.distributed.launch \
                --nproc_per_node=$NPROC_PER_NODE \
                --nnodes=$SLURM_JOB_NUM_NODES \
                --node_rank=$RANK \
                --master_addr="$MASTER" --master_port="$MPORT" \
                $COMMAND &
        RANK=$((RANK+1))
done
wait
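
To make the roles of --nnodes, --node_rank and the per-node process count concrete: the launcher starts the requested number of processes on each node and assigns each one a global rank equal to node_rank times the processes per node, plus its local rank on that node. The world size is the number of nodes times the processes per node. The short sketch below only reproduces that arithmetic; the function name rank_layout is illustrative and not part of PyTorch.

```python
def rank_layout(nnodes, nproc_per_node):
    """Return (world_size, [(node_rank, local_rank, global_rank), ...])."""
    world_size = nnodes * nproc_per_node
    layout = [
        (node_rank, local_rank, node_rank * nproc_per_node + local_rank)
        for node_rank in range(nnodes)
        for local_rank in range(nproc_per_node)
    ]
    return world_size, layout

# Example: two nodes with 20 processes each, i.e. 40 tasks in total.
world_size, layout = rank_layout(nnodes=2, nproc_per_node=20)
print("world size:", world_size)
print("first process on second node:", layout[20])
```

For a GPU job with 2 processes per node, the same function shows how the ranks would be spread across nodes.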

The python code should have a structure looking something like:

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--distributed', action='store_true', help='enables distributed processes')
# --local_rank is passed to each process by torch.distributed.launch
parser.add_argument('--local_rank', default=0, type=int, help='local rank of this process on its node')
parser.add_argument('--dist_backend', default='gloo', type=str, help='distributed backend')

def main():
    opt = parser.parse_args()
    if opt.distributed:
        # env:// reads the master address/port, rank and world size from
        # environment variables set by the launcher
        dist.init_process_group(backend=opt.dist_backend, init_method='env://')
        print("Initialized Rank:", dist.get_rank())

if __name__ == '__main__':
    main()
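
The env:// initialization method reads its rendezvous settings from environment variables that the launcher sets for each process: MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE. When a job hangs or fails in init_process_group, inspecting these values is a quick first check. The following is a minimal sketch for that purpose; read_rendezvous_env is a hypothetical helper, and the values in the example dictionary are made up.

```python
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def read_rendezvous_env(env=None):
    """Return the env:// settings as a dict; raise RuntimeError if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }

# Made-up values for illustration; inside a launched process, call
# read_rendezvous_env() with no argument to read the real environment.
example = {"MASTER_ADDR": "node001", "MASTER_PORT": "4321",
           "RANK": "3", "WORLD_SIZE": "40"}
print(read_rendezvous_env(example))
```

Calling the helper just before dist.init_process_group makes a missing or mistyped setting show up as a clear error message rather than a stalled job.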