Package | pytorch |
---|---|
Description | Open-source machine learning library |
For more information | https://pytorch.org/ |
License | OpenSource (BSD) |
PyTorch is an open-source machine learning library that evolved from the (no longer supported) Lua-based Torch library. It is commonly used for computer vision and natural language processing tasks. Among its high-level features are:

* Tensor computation (similar to NumPy) with strong GPU acceleration
* Deep neural networks built on a tape-based automatic differentiation (autograd) system
This module will add the torch, torchvision, and torchsummary Python packages to your PYTHONPATH.
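For example, with the module loaded, all three packages can be imported in one session. The sketch below also uses torchsummary's summary() helper on a stock torchvision model (run on the CPU here for portability; exact torchsummary options may vary by version):

```python
import torch
import torchvision
from torchsummary import summary

# Summarize the layers and parameter counts of a stock torchvision model
model = torchvision.models.resnet18()
summary(model, (3, 224, 224), device="cpu")

print("torch", torch.__version__, "/ torchvision", torchvision.__version__)
```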
This section lists the available versions of the package pytorch on the different clusters.
Version | Module tags | CPU(s) optimized for | GPU ready? |
---|---|---|---|
2.0.1 | pytorch/2.0.1 | icelake, zen2 | Y |
1.11.0 | pytorch/1.11.0 | zen2 | Y |
Please note that despite the module being named 'pytorch', you should use 'import torch' or similar in your Python code.
NOTE for RHEL6 users: The installations of pytorch on the RHEL6 nodes of Deepthought2 are not native installations, but are based on Singularity containers. This is necessitated by the complexity of installing pytorch natively on RHEL6. A section of this document is focused on using Singularity-based pytorch images.
The PyTorch package is NOT natively installed on the RHEL6 nodes of the Deepthought2 cluster for various technical reasons. What is provided instead are Singularity containers which have versions of both python2 and python3 installed with support for PyTorch and related python packages.
To use the PyTorch python package, you must load the appropriate environmental module (e.g. module load pytorch) and then launch the python interpreter inside the Singularity container. Note: you cannot access the torch/pytorch python packages from the native python installations (e.g. module load python); you must use the python installation inside the container for PyTorch.
To assist with this, the following wrapper scripts have been provided:
* pytorch: Launches the python2 interpreter within the container, with support for the torch/pytorch package as well as various other packages. Any arguments given will be passed to the python interpreter, so you can do something like pytorch myscript.py.
* pytorch-python2: The same as pytorch, provided for completeness and symmetry.
* pytorch-python3: Like pytorch, except that a python3 interpreter with support for the torch/pytorch package will be invoked.

Remember that within python, the package is named torch, not pytorch.
In all cases, any arguments given to the wrapper scripts are passed directly to the python interpreter running within the container. E.g., you can provide the name of a python script, and that script will be run by the python interpreter inside the container. Your home and lustre directories are accessible from within the container, so you can read and write files in those directories as usual.
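As a quick way to verify that you are in fact running the container's interpreter, you can run a small script such as the following (a sketch; the file name check_torch.py is arbitrary) via pytorch check_torch.py:

```python
# check_torch.py -- minimal sanity check (the name is arbitrary)
import sys
import torch

# sys.executable shows which python is running; it should be the
# interpreter inside the container, not the native system python.
print("Interpreter:", sys.executable)
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```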
Note that if you load the pytorch environmental module (e.g. module load pytorch) and then issue the python command, you will start up a natively installed python interpreter which does NOT have the pytorch/torch python package installed. You need to start one of the python interpreters inside the container to get these packages --- you can either do that using the correct singularity command, or use the friendlier wrapper scripts described above.
For most users, the "containerization" of this package should not cause any real issues, and may not even be noticed. However, there are some limitations to the use of containers:

* If a software package foo is installed natively on Deepthought2, it is likely not accessible from within the container (unless there is a version of it also installed inside the container).
* You cannot use virtualenv scripts to install new python packages for use within the container, as the virtualenv command will install packages natively, and these would not then be available inside the container.
However, you are permitted to create your own Singularity containers and to use them on the Deepthought2 cluster. You will need to have root access on some system (e.g. your workstation or desktop) with Singularity installed to build your own containers (we cannot provide you root access on the Deepthought2 login or compute nodes). You can also copy system provided containers and edit them. More details can be found under the software page for Singularity.
The PyTorch package can make use of GPUs on nodes with GPUs. Nothing special needs to be done in the module load or the various pytorch* commands, but you will need to instruct the package to use the GPUs within your python code. This is typically done by replacing a line like

device = torch.device("cpu")

with

device = torch.device("cuda:0")
Although the Singularity containers with pytorch do not have MPI support, pytorch has its own distributed package (torch.distributed) which can handle parallelizing your computations across multiple nodes. More information on using torch.distributed in your Python code can be found in the PyTorch Distributed Tutorial and the Distributed Communication Documentation.
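With TCP-based initialization, every process connects to a master address and port to form the process group. For reference, this can be done explicitly with a tcp:// URL; a minimal sketch (the host name, port, rank, and world size are placeholders and must match the job's actual configuration):

```python
import torch.distributed as dist

# Every process connects to the master's address and port and supplies
# its own global rank plus the total number of processes (world size).
dist.init_process_group(
    backend="gloo",
    init_method="tcp://master-node:23456",  # placeholder host:port
    rank=0,          # this process's global rank (unique per process)
    world_size=40,   # total number of processes across all nodes
)
```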
We recommend using TCP-based initialization, with a job script something like the example below:
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive
#Define module command, etc
. ~/.profile
#Load the pytorch module
module load pytorch
#Number of processes per node to launch (20 for CPU, 2 for GPU)
NPROC_PER_NODE=20
#The command to run your pytorch script
#You will want to replace this
COMMAND="YOUR_TRAINING_SCRIPT.py --arg1 --arg2 ..."
#We want names of master and slave nodes
MASTER=`/bin/hostname -s`
SLAVES=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $MASTER`
#Make sure this node (MASTER) comes first
HOSTLIST="$MASTER $SLAVES"
#Get a random unused port on this host (MASTER) between 2000 and 9999
#Draw random candidates and keep the first one not already in use
while :; do
    MPORT=$(shuf -i 2000-9999 -n 1)
    ss -tan | awk '{print $4}' | grep -q ":${MPORT}\$" || break
done
#Launch the pytorch processes, first on master (first in $HOSTLIST) then
#on the slaves
RANK=0
for node in $HOSTLIST; do
ssh -q $node \
pytorch -m torch.distributed.launch \
--nproc_per_node=$NPROC_PER_NODE \
--nnodes=$SLURM_JOB_NUM_NODES \
--node_rank=$RANK \
--master_addr="$MASTER" --master_port="$MPORT" \
$COMMAND &
RANK=$((RANK+1))
done
wait
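For reference, torch.distributed.launch sets environment variables (MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE) for each process it spawns; these are what the env:// initialization method in the python code below reads. A quick sketch to inspect them from inside a launched script:

```python
import os

# Environment variables set by torch.distributed.launch for each process;
# env:// initialization reads these to form the process group.
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
    print(var, "=", os.environ.get(var))
```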
The python code should have a structure looking something like:
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--distributed', action='store_true', help='enables distributed processing')
parser.add_argument('--local_rank', default=0, type=int, help='local rank of this process on its node (supplied by the launcher)')
parser.add_argument('--dist_backend', default='gloo', type=str, help='distributed backend')

def main():
    opt = parser.parse_args()
    if opt.distributed:
        # env:// reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from
        # the environment variables set by torch.distributed.launch
        dist.init_process_group(backend=opt.dist_backend, init_method='env://')
        print("Initialized Rank:", dist.get_rank())

if __name__ == '__main__':
    main()
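Once the process group has been initialized, the collective operations in torch.distributed can be used across ranks. For example, a minimal all-reduce (a sketch, assuming init_process_group has already been called as above):

```python
import torch
import torch.distributed as dist

# Each rank contributes a tensor holding its own rank number; after
# all_reduce with SUM, every rank holds the sum over all ranks.
t = torch.ones(1) * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print("Rank", dist.get_rank(), "sees sum:", t.item())
```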