Modern deep-learning (DL) challenges leverage increasingly large datasets and more complex models, so significant computational power is required to train models effectively and efficiently. Learning to distribute data across multiple GPUs during model training opens up a wealth of new DL applications.
This workshop is targeted at members of the UMD community interested in accelerating DL training in multi-GPU environments, for instance on UMD's Zaratan cluster. It discusses the effect of batch size, along with other considerations of training performance and accuracy, for single- and multi-GPU workloads using PyTorch and PyTorch Distributed Data Parallel. We plan to offer the workshop several times throughout the year.
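For a flavor of what the hands-on material covers, the following is a minimal sketch of single-node multi-GPU training with PyTorch Distributed Data Parallel (DDP). The toy model, dataset, and hyperparameters are placeholders for illustration only; the workshop's own examples may differ.

```python
# Minimal DDP sketch: launch with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`
# (script name and toy model/dataset are hypothetical, for illustration only).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset; the DistributedSampler shards it across ranks.
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    # Wrap the model in DDP so gradients are synchronized across GPUs.
    model = nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()       # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Note that the per-GPU batch size (64 above) multiplies by the number of processes to give the effective global batch size, one of the performance and accuracy considerations discussed in the workshop.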
This workshop will cover the following topics:
Basic knowledge of the Python programming language and the use of Jupyter Notebook is assumed;
previous experience with deep learning training using PyTorch is beneficial.
Please ensure that your laptop (or desktop) has the latest version of Firefox or Chrome installed.
Each participant will be provided with dedicated access to GPU-accelerated servers.
More details (and a link to the registration form) can be found on the page for each specific offering of this workshop. The next scheduled workshops are listed below (or the most recently offered workshop, if none are currently scheduled):