Policies relating to the Zaratan High-Performance Computing Cluster

Please note that this page is still under construction. Therefore not all policies related to the Zaratan cluster are currently listed here.

General Policies
1. Access for non-UMD persons
2. Access requires a valid HPC allocation
Policies on Usage of Login Nodes
Policies on Usage of Disk Space

General Policies on Usage of High Performance Computing Clusters

The High Performance Computing (HPC) Clusters are part of the information technology resources that the Division of Information Technology makes available to the university community, and as such are covered by the campus Acceptable Use Policy (AUP). All users of the HPC clusters are required to adhere to the campus AUP in addition to the policies specific to the HPC clusters.

You should read and familiarize yourself with the Acceptable Use Policy. The AUP includes the following provisions which might be particularly applicable to users of the HPC clusters, but the list below is NOT complete and you are bound by all of the policies in the AUP.

"Those using university IT resources [...] are responsible for [...] safeguarding identification codes and passwords". DO NOT SHARE YOUR PASSWORD with anyone. If you have a student, colleague, etc. who needs access to your HPC allocation, request that they be added to your allocation.
"Engaging in conduct that interferes with others' use of shared IT resources" is prohibited. The HPC cluster is a shared resource. The specific policies below about use of login nodes and disk space are related to this point.
"Using university IT resources for commercial or profit-making purposes" is prohibited without written authorization from the university.

In addition to the AUP, the HPC clusters have their own policies enumerated in this document. Among these are:

You are required to promptly comply with all direct requests from HPC systems staff regarding the use of the clusters. This includes requests to reduce disk space consumption or refrain from particular actions. We purposefully are trying to keep the list of rigid policy rules as short as possible in order to facilitate the use of this research tool in novel and creative ways. However, when we encounter behavior or practices which are interfering with the ability of others to use this shared resource, we will step in and require your prompt compliance with such requests.
You are required to monitor your USERNAME@umd.edu (or USERNAME@terpmail.umd.edu) email address. You can read it on a campus mail system, or forward it to another address or email system which you do read, but that is the address at which system staff will contact you if we need to, and you are expected to be monitoring it.
You are required to subscribe to the HPCC-ANNOUNCE mailing list. You should not need to do anything with regard to this requirement: users will be automatically subscribed to the list when access to one of the Deepthought* clusters is granted, and will be removed when they no longer have access. This is a low-frequency mailing list and is used to inform users of issues, planned maintenance, and other important matters related to the clusters.

Access for non-UMD persons

The various HPC systems provided by the University are for the use of UMD faculty, students, and staff. If you do not have a current posting with the University and are not currently registered for classes at the University, you are in general not eligible to have an account on any of the UMD provided HPC clusters. This includes researchers who have moved to another university and students who have graduated and are not continuing on at UMD.

Because it is recognized that there are research and academic collaborations between people at UMD and people at other institutions, there is some provision for granting access to UMD resources to persons not formally associated with the University of Maryland when they are working with researchers at UMD. This is through the affiliate process; more information regarding the affiliate process can be found here.

People who once were associated with the University but are not currently associated with UMD (e.g. researchers who have moved on from UMD, students who have graduated from UMD, affiliates who were not renewed) will have their access to the HPC clusters revoked. The exact timing depends on the nature of the former association. For example, student accounts will be disabled after two consecutive semesters for which they are not enrolled (i.e. about one year from graduation), while accounts for non-student researchers will typically expire between 1 and 6 months after the appointment is terminated, depending on the status of the appointment. Once the account is disabled, access to the clusters will be disabled. In such cases, we ask that you delete any unneeded data from your home and lustre directories, and transfer any data worth saving off the system before your account expires; any remaining data will be disposed of pursuant to HPC policies.

If you are continuing to work with researchers at UMD and need to retain access to the clusters, you will need to have your UMD colleagues request affiliate status for you.

Access requires a valid HPC allocation

Access to the various HPC clusters requires a valid allocation to charge jobs against. You will be automatically granted access to the cluster when the designated point-of-contact for a valid allocation on the cluster requests that you be granted access to the allocation. Your access to the cluster will automatically be revoked when you are no longer associated with any valid allocations on the cluster. Your association with an allocation will terminate when any of the following occur:

The point-of-contact for the allocation requests that you no longer be allowed to charge against the allocation.
The allocation in question expires.
Your association with UMD ceases.

If the allocation expires, you can try to renew it. Users with allocations from Engineering should talk to Jim Zahniser, users with allocations from CMNS should talk to Mike Landavere, and users with allocations from the Allocations and Advisory Committee (AAC) should follow the instructions for applying for AAC allocations.

In all cases, we ask that you delete any unneeded files from the cluster, and move all files off the cluster before your access is disabled as a courtesy to other users of the clusters. Although any remaining data will be disposed of pursuant to HPC policies, removing the data yourself will free up space on the cluster sooner.

Policies on Usage of Login Nodes

The login nodes are provided for people to access the HPC clusters. They are intended for people to set up and submit jobs, access results from jobs, transfer data to/from the cluster, compile code, install software, edit and manage files, etc. As a courtesy to your colleagues, you should refrain from doing anything long running or computationally intensive on these nodes, as it will interfere with the ability of others to use the HPC resources. Computationally intensive tasks should be submitted as jobs to the compute nodes (e.g. using sbatch or sinteractive), as that is what compute nodes are for.

Most compilations of code are short and are permissible. If you are doing a very parallel or long compilation, you should consider requesting an interactive job and doing your compilation there as a courtesy to your colleagues.

Compute intensive calculations, etc. are NOT allowed on the login nodes. If system staff find such jobs running, we will kill them without prior notification. Users found in violation of this policy will be warned, and continued violation may result in suspension of access to the cluster.

Do NOT run compute intensive calculations on the login nodes

Policies on Usage of Disk Space

The Division of Information Technology and the various contributing research groups have provided large amounts of disk space for the support of jobs using the Zaratan HPC Cluster. The following policies discuss the use of this space. In general, the disk space is intended for support of research using the cluster, and as a courtesy to other users of the cluster you should try to delete any files that are no longer needed or being used.

All data on the HPC clusters, including home, scratch, and SHELL filesystems, are considered to be related to your research and not to be of a personal nature. As such, all data is considered to be owned by the principal investigator(s) for the allocation(s) through which you have access to the cluster.

All Division of Information Technology provided scratch filesystems are for the support of active research using the clusters. You must remove your data files, etc. from the cluster promptly when you no longer have jobs on the clusters requiring them. This is to ensure that all users can avail themselves of these resources.

The ONLY filesystems backed up by the Division of Information Technology on the HPC clusters are the homespaces. Everything else might be irrecoverably lost if there is a hardware failure. So copy your precious files (e.g. custom codes, summarized data) to your home directory for safety.

For the purposes of HPCC documentation and policies, the disk space available to users of the cluster is categorized as indicated below.

home space: This is the directory which you see when you log into the systems in the clusters. This home directory is distinct from your normal Glue/TerpConnect home directory, and is distinct between the different HPC clusters. It is visible to all nodes within the specific HPC cluster, but is not visible anywhere else, including other HPC clusters. Home space is provided by the Division to all HPCC users, and is backed up to tape nightly. This is intended for relatively small amounts of valuable information: codes, scripts, configuration files, etc. It is not as highly optimized for performance as the scratch volumes, and so you should avoid doing heavy I/O to your home space in your jobs. Policies related to homespace
Division of Information Technology provided data space: This includes BeeGFS scratch space (e.g. /scratch/zt1 on Zaratan). It is provided by the Division of IT and is visible to all nodes in the cluster. All HPCC users can access it (although if your research group has its own data volumes, we request that you use those preferentially). Research-owned scratch space is just a reservation of the total scratch space for that research group, so there is no user-visible difference between storage owned by research groups and DIT-owned scratch storage. Scratch space is much better optimized for performance than the home space volumes, but jobs doing heavy I/O should still seriously investigate using local temporary space instead. Scratch storage is not backed up to tape, but allows for more storage than home space volumes. Still, remember to store critical data on the home space, which is backed up. Policies related to DIT provided data space.
Research group provided data space: Some research groups have purchased additional data space for use by their members. This is generally part of the BeeGFS scratch filesystem. Research groups can buy additional scratch storage, which is added into the total scratch pool and then an amount equal to the contribution is reserved for that group's use. Policies related to research group provided data space.
local temporary space: Each compute node has local temporary space available as /tmp. For Zaratan nodes, this amount is about 1.5 TB. This space is available for use by your job while it is running. It is not backed up, and any files left there are deleted without notice when the job ends. This space is only visible to the node it is attached to; each node of a multinode job will see its own copy of /tmp, which will differ from /tmp on the other nodes. However, being directly attached, this space will have significantly better performance than network mounted volumes. Policies related to local temporary space.
DIT-provided longer term storage:

On the Zaratan cluster, there is about 10 PB of medium-term storage available for storing data which although important is not being actively used by jobs. Please see the section on SHELL storage on the Zaratan cluster for more information.
Archival data can also be stored on Google's G Suite drive. This can hold large amounts of data, although transfer times can be less than optimal. See the section on archival storage using G drive for more information.

These options are available for the storage of files and data not associated with active research on the cluster (such files should not be stored in scratch filesystems). This is useful for data which needs to be kept but rarely accessed, e.g. after a paper is published, etc. While there is no time limit on how long data can stay in these locations, it is still requested that you delete items after they are no longer needed. Policies related to longer term storage

The SHELL filesystem is the ONLY place provided by the Division of Information Technology for the storage of data not being actively used by computations on the cluster.

A list of all data volumes

Policies on Usage of Home Space

Do NOT start jobs from your home directory or subdirectories underneath it. Run the jobs from the scratch filesystem.
Jobs should not perform significant I/O to/from homespace volumes. Use the scratch filesystem, or the locally attached temporary space (/tmp).
Delete or move off the HPCC any files which are no longer needed or used.
There is a 10 GB soft quota on home directories. This soft quota will not prevent you from storing more than 10 GB in your home directory. However, a daily check of disk usage will be performed, and if you are above the quota you will receive an email requesting that you reduce disk usage within a grace period of 7 days. The email reminders will continue until usage is reduced or the grace period is over. If you are still over quota at that time, system staff will be notified and more severe emails will be sent, and unless the situation is remedied promptly system staff may be forced to take action, which could involve relocating or deleting your files. This soft quota approach is being taken to ensure all HPCC users get fair access to this critical resource without unduly impacting performance on the cluster, while allowing you some flexibility if you need to exceed the 10 GB limit for a few days.
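To see where you stand relative to the soft quota before the daily check does, something like the following sketch may help. It is only an illustration: the function names are hypothetical, "10 GB" is assumed here to mean 10 * 2^30 bytes, and summing apparent file sizes may not match exactly how the cluster's own check accounts for usage.

```python
import os

# Assumption: the 10 GB soft quota is interpreted as 10 * 2**30 bytes.
SOFT_QUOTA_BYTES = 10 * 1024**3


def directory_usage_bytes(root):
    """Sum the apparent sizes of all non-directory entries under root.

    Symbolic links are not followed; each link contributes only its own size.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return total


def over_soft_quota(root, quota_bytes=SOFT_QUOTA_BYTES):
    """Return True if the tree under root uses more than quota_bytes."""
    return directory_usage_bytes(root) > quota_bytes
```

For example, `over_soft_quota(os.path.expanduser("~"))` would report whether your home directory appears to exceed the limit under these assumptions.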

Policies on Usage of Division of Information Technology Provided Data Space

Delete or move off the HPCC any files which are no longer needed or used. This space is intended to provide temporary storage for the support of jobs running on the system; it is not for archival purposes. Files which are not actively being used by computations on the cluster must be removed promptly to ensure these resources are available for other users.
Scratch filesystems are subject to a 90-day purge policy. Any files older than 90 days will be automatically removed without warning.
Files in scratch or SHELL storage are not backed up.

The DIT provided scratch space is NOT for archival storage. It is ONLY for the temporary storage of files supporting active research on the clusters. You must remove any data which is no longer needed for jobs you are running on the cluster promptly.

Scratch and SHELL spaces are NOT backed up.

Files in the scratch filesystem are subject to a 90-day purge policy. This means that files older than 90 days will be automatically removed without warning. If you need to keep data longer than this, consider moving it to your SHELL space, or off of the cluster entirely.
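As a rough way to find files at risk from the purge, the sketch below lists files under a directory that have not been modified for more than 90 days. It rests on an assumption: this page does not say which timestamp the purge uses, so modification time is only a guess, and the function name is hypothetical.

```python
import os
import time

PURGE_AGE_DAYS = 90  # from the purge policy described above


def files_older_than(root, days=PURGE_AGE_DAYS, now=None):
    """Return a sorted list of files under root whose mtime is more than `days` old.

    Assumption: age is measured by modification time; the actual purge
    may use a different timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Passing a smaller value for `days` (say 75) gives advance notice of files that are approaching the limit, so they can be moved to SHELL space or off the cluster in time.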

Policies on Usage of Locally Attached Temporary Space

Please have your jobs use locally attached temporary space (/tmp) wherever it is feasible. This generally offers the best disk I/O performance. Contact hpcc-help if you have questions or need assistance with that.
Files in locally attached temporary space are not backed up.
Files in locally attached temporary space are deleted upon termination of the job.
Although all files in /tmp that belong to you will be deleted when you no longer have any jobs running on the node, it is good practice to delete files yourself at the end of the job where possible, especially if you run many small jobs that can share a node; otherwise it can take some time for the automatic deletion to occur, which can reduce the available space in /tmp for other jobs.

Any files you own under /tmp on a compute node will be deleted once the last job of yours running on the node terminates (i.e. when you no longer have any jobs running on the node).
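One way a job might follow the advice about cleaning up after itself is sketched below: work in a private directory under /tmp, copy the results somewhere persistent, and remove the temporary files before exiting. The helper name `run_with_local_scratch` and its parameters are hypothetical, and the destination directory is whatever scratch location your job normally writes to.

```python
import os
import shutil
import tempfile


def run_with_local_scratch(work, final_dir, base="/tmp"):
    """Run work(scratch_dir) in a private directory under base.

    Everything the work function leaves in the scratch directory is copied
    to final_dir, and the scratch directory is removed afterwards, even if
    work raises an exception.
    """
    os.makedirs(final_dir, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=base) as scratch:
        work(scratch)
        for name in os.listdir(scratch):
            src = os.path.join(scratch, name)
            dst = os.path.join(final_dir, name)
            if os.path.isdir(src):
                shutil.copytree(src, dst, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(src, dst)
    return final_dir
```

Because the temporary directory is created per call, several small jobs of yours sharing a node would not collide with each other, and each frees its space as soon as it finishes rather than waiting for the automatic deletion.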

DIT-provided longer term storage

The SHELL volumes and Google's G drive are the ONLY DIT-provided storage where it is permissible to store files and data not associated with active research on the cluster. It can be used to archive data e.g. that needs to be kept for a while after a paper is published.
The SHELL volumes are only available from the login nodes of the Zaratan cluster, or to external clients. They are NOT available from the compute nodes.
Do not use this storage for active jobs.
These volumes are NOT backed up.
Google's G drive storage is NOT on campus, and as such there may be restrictions on what types of data are allowed to be stored there (from a security perspective). Please see the Google drive service catalog entry for more information regarding this.

The SHELL volumes are NOT backed up.

Policies Regarding Data of Former Users

Over time, users on the cluster will come and go. Because of the large volume of data that some users have, it is necessary to have policies regarding the disposal of this data when users leave the university or otherwise lose access to the cluster in order to protect potentially valuable data but also prevent valuable cluster resources from being needlessly tied up due to files from users no longer on the cluster.

Disposal of Former User Data on the Zaratan Cluster

All active users on the Zaratan cluster belong to one or more allocations, and lose access to the cluster when they no longer are associated with any allocations, be it because they ceased being associated with the University or the research group owning the allocation, or the allocation expired. When this happens:

All of the data owned by the user (both in their home directory and/or in their scratch or SHELL directories) is "quarantined". I.e. it is relocated and access to this data is disabled for all users, but it is still consuming space on the filesystem. This is to ensure anyone who is using this data, whether cognizant of their use of it or not, should quickly notice that it is gone, and so hopefully things can be resolved before the data is permanently deleted. If you need (or think you need) access to data from someone whose access was recently disabled,
For every allocation the user who previously owned the data was in just before the user was disabled, the points of contact (PoCs) of those allocations will receive email informing them that the data previously owned by that user is slated for deletion. Emails will be repeated monthly as long as the data remains "quarantined" (or until the PoC gives approval for early deletion of the data).
NOTE: Only the PoCs for allocations that the user belonged to just before being disabled will receive these notifications. For example, if a user is a member of allocations AllocA and AllocB, and then is removed from AllocA (typically at the request of a PoC for AllocA), and then some weeks or months later is removed from AllocB (either due to expiration of account or at request of a PoC), only PoCs for AllocB will receive recurring emails about the "quarantined" data. PoCs for AllocA will receive email at the time that the user was removed from AllocA, and this email will mention that they should make arrangements regarding transfer of ownership of any data belonging to AllocA, but no data is "quarantined" at that time (because the user still has access to the cluster).
The data will remain in "quarantine" until one of the following conditions occurs:
- The expiration date for the data has passed. The expiration date is set to 6 months after the account was originally disabled. At this point, the data has been "quarantined" for (and therefore not accessed for at least) six months, and so is beyond the age at which DIT staff reserve the right to delete anyway.
- PoCs representing all of the allocations with which the user had been associated just before being disabled have given approval for the early deletion of the data. This is to allow freeing up of resources ahead of the normal 6-month policy, but is only done if representatives of ALL allocations agree to it.
- If any PoC from an allocation the user had been associated with just before the account was disabled requests that the data be transferred to another user, the data will be transferred (as per HPC policy, all data belongs to the allocations of which the user is a member). This does not require consent of all PoCs involved, and it is assumed that should multiple PoCs need different parts of the data, things can be worked out in a friendly manner.
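For a PoC who wants to know roughly when quarantined data would expire, the date arithmetic can be sketched as follows. This assumes "6 months" means six calendar months from the date the account was disabled, with the day clamped to the end of shorter months; the exact rule staff apply is not stated here, and the function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date `months` calendar months after d.

    If the target month is shorter, the day is clamped to its last day.
    """
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def quarantine_expiration(disabled_on):
    """Estimated expiration date: six calendar months after the account was disabled."""
    return add_months(disabled_on, 6)
```

Under these assumptions, an account disabled on 15 January 2024 would give an estimated expiration of 15 July 2024.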