Policies relating to the Deepthought High-Performance Computing Clusters
Please note that this page is still under construction. Therefore not all policies related to the Deepthought HPCCs are currently listed here.
Table of Contents
- General Policies
- Policies on Usage of Login Nodes
- Policies on Usage of Disk Space
General Policies on Usage of High Performance Computing Clusters
The High Performance Computing (HPC) Clusters are part of the information technology resources that the Division of Information Technology makes available to the university community, and as such are covered by the campus Acceptable Use Policy (AUP). All users of the HPC clusters are required to adhere to the campus AUP in addition to the policies specific to the HPC clusters. The AUP applies to all HPC clusters made available by the Division of Information Technology, not just the Deepthought clusters. E.g., the MARCC/Bluecrab cluster is still a university IT resource governed by the AUP even though it is housed off-campus.
You should read and familiarize yourself with the Acceptable Use Policy. The AUP includes the following provisions which might be particularly applicable to users of the HPC clusters, but the list below is NOT complete and you are bound by all of the policies in the AUP.
- "Those using university IT resources [...] are responsible for [...] safeguarding identification codes and passwords" DO NOT SHARE YOUR PASSWORD with anyone. If you have a student, colleague, etc. who needs access to your HPC allocation, request they get added to your allocation.
- "Engaging in conduct that interferes with others' use of shared IT resources" is prohibited. The HPC cluster is a shared resource. The specific policies below about use of login nodes and disk space are related to this point.
- "Using university IT resources for commercial or profit-making purposes" is prohibited without written authorization from the university.
In addition to the AUP, the HPC clusters have their own policies, enumerated in this document. Among these are:
- You are required to promptly comply with all direct requests from HPC systems staff regarding the use of the clusters. This includes requests to reduce disk space consumption or refrain from particular actions. We purposefully are trying to keep the list of rigid policy rules as short as possible in order to facilitate the use of this research tool in novel and creative ways. However, when we encounter behavior or practices which are interfering with the ability of others to use this shared resource, we will step in and require your prompt compliance with such requests.
- You are required to monitor your
USERNAME@terpmail.umd.edu) email address. You can read it on a campus mail system, or forward it to another address or email system which you do read, but that is the address at which system staff will contact you if we need to, and you are expected to be monitoring it.
- You are required to subscribe to the HPCC-ANNOUNCE mailing list. You should not need to do anything with regard to this requirement: users will be automatically subscribed to the list when access to one of the Deepthought* clusters is granted, and will be removed when they no longer have access. This is a low frequency mailing list and is used to inform users of issues, planned maintenance, and other important matters related to the clusters.
Access for non-UMD persons
The various HPC systems provided by the University are for the use of UMD faculty, students, and staff. If you do not have a current posting with the University and are not currently registered for classes at the University, you are in general not eligible to have an account on any of the UMD provided HPC clusters. This includes researchers who have moved to another university and students who have graduated and are not continuing on at UMD.
Because it is recognized that there are research and academic collaborations between people at UMD and people at other institutions, there is some provision for granting access to UMD resources to persons not formally associated with the University of Maryland when they are working with researchers at UMD. This is through the affiliate process; more information regarding the affiliate process can be found here.
People who once were associated with the University but are not currently associated with UMD (e.g. researchers who have moved on from UMD, students who have graduated from UMD, affiliates who were not renewed) will have their access to the HPC clusters revoked. The exact timing depends on the nature of the former association. E.g., student accounts will be disabled after two consecutive semesters for which they are not enrolled (i.e. about one year from graduation), while accounts for non-student researchers will typically expire between 1 and 6 months after the appointment is terminated, depending on the status of the appointment. Once the account is disabled, you will no longer be able to access the clusters. We therefore ask that you delete any unneeded data from your home and lustre directories, and transfer any data worth saving off the system, before your account expires; any remaining data will be disposed of pursuant to HPC policies.
If you are continuing to work with researchers at UMD and need to retain access to the clusters, you will need to have your UMD colleagues request affiliate status for you.
Access requires a valid HPC allocation
Access to the various HPC clusters requires a valid allocation to charge jobs against. You will be automatically granted access to the cluster when the designated point-of-contact for a valid allocation on the cluster requests that you be granted access to the allocation. Your access to the cluster will automatically be revoked when you are no longer associated with any valid allocations on the cluster. Your association with an allocation will terminate when any of the following occur:
- The point-of-contact for the allocation requests that you no longer be allowed to charge against the allocation.
- The allocation in question expires.
- Your association with UMD ceases.
If the allocation expires, you can try to renew it. For allocations from Engineering, talk to Jim Zahniser; for allocations from CMNS, talk to Mike Landavere; for allocations from the Allocations and Advisory Committee (AAC), follow the instructions for applying for AAC allocations.
In all cases, we ask that you delete any unneeded files from the cluster, and move all files off the cluster before your access is disabled as a courtesy to other users of the clusters. Although any remaining data will be disposed of pursuant to HPC policies, removing the data yourself will free up space on the cluster sooner.
Policies on Usage of Login Nodes
The login nodes are provided for people to access the HPC clusters. They are intended for people to set up and submit jobs, access results from jobs, transfer data to/from the cluster, compile code, install software, edit and manage files, etc. As a courtesy to your colleagues, you should refrain from doing anything long running or computationally intensive on these nodes, as it will interfere with the ability of others to use the HPC resources. Computationally intensive tasks should be submitted as jobs to the compute nodes (e.g. using sbatch or sinteractive), as that is what compute nodes are for.
Most compilations of code are short and are permissible. If you are doing a very parallel or long compilation, you should consider requesting an interactive job and doing your compilation there as a courtesy to your colleagues.
Compute intensive calculations, etc. are NOT allowed on the login nodes. If system staff find such jobs running, we will kill them without prior notification. Users found in violation of this policy will be warned, and continued violation may result in suspension of access to the cluster.
Do NOT run compute intensive calculations on the login nodes
Policies on Usage of Disk Space
The Division of Information Technology and the various contributing research groups have provided large amounts of disk space for the support of jobs using the Deepthought HPC Clusters. The following policies discuss the use of this space. In general, the disk space is intended for support of research using the cluster, and as a courtesy to other users of the cluster you should try to delete any files that are no longer needed or being used.
All data on the HPC clusters, including home, data, and lustre filesystems, are considered to be related to your research and not to be of a personal nature. As such, all data is considered to be owned by the principal investigator(s) for the allocation(s) through which you have access to the cluster.
All Division of Information Technology provided lustre and
The ONLY filesystems backed up by the Division of Information Technology on the HPC clusters are the homespaces. Everything else might be irrecoverably lost if there is a hardware failure. So copy your precious files (e.g. custom codes, summarized data) to your home directory for safety.
For the purposes of HPCC documentation and policies, the disk space available to users of the cluster is categorized as indicated below.
- home space:
This is the directory which you see when you log into the systems in the clusters.
This home directory is distinct from your normal Glue/TerpConnect home directory,
and is distinct between the different HPC clusters.
It is visible to all nodes within the specific HPC cluster, but is not visible
anywhere else, including other HPC clusters. Home space is
provided by the Division to all HPCC users, and is backed up to tape nightly.
This is intended for relatively small amounts of valuable information: codes,
scripts, configuration files, etc. It is not as highly optimized for
performance as the /data/... volumes, and so you should avoid doing heavy I/O to your home space in your jobs. Policies related to homespace
- Division of Information Technology provided data space:
This includes lustre (e.g. /export/lustre_1 on Deepthought and /lustre on Deepthought2) and NFS mounted data storage (/data/dt-*). It is provided by the Division of IT and is visible to all nodes in the cluster. All HPCC users can access it (although if your research group has its own data volumes, we request that you use those preferentially). Research-owned lustre space is just a reservation of the total lustre space for that research group, so there is no user-visible difference between storage owned by research groups and DIT-owned lustre storage. The NFS data volumes are better optimized for performance than the home space volumes, and the lustre filesystem is still better optimized. But jobs doing heavy I/O should still seriously investigate using local scratch space instead. Neither lustre nor NFS mounted data volumes are backed up to tape, but they allow for more storage than home space volumes. Still, remember to store critical data on the home space, which is backed up. Policies related to DIT provided data space.
- Research group provided data space: Some research groups have purchased additional data space for use by their members. This can be separate NFS mounted data volumes, or part of the lustre filesystem. In the former case, these are special data volumes and access is limited to members of the groups contributing to their purchase. Research groups can also buy lustre storage, which is added into the total lustre pool, and then an amount equal to the contribution is reserved for that group's use. Policies related to research group provided data space.
- local scratch space:
Each compute node has local scratch space available as /tmp. For the original Deepthought cluster, this varies significantly depending on the node, but is at least 30 GB. For Deepthought2 nodes, this amount is about 750 GB. This space is available for use by your job while it is running. It is not backed up, and any files left there are deleted without notice when the job ends. This space is only visible to the node it is attached to; each node of a multinode job will see its own copy of /tmp, which will differ from /tmp on the other nodes. However, being directly attached, this space will have significantly better performance than NFS mounted volumes. Policies related to local scratch space.
- DIT-provided longer term storage: Unfortunately, the options for archival storage are rather limited at this time. However,
- On the original Deepthought cluster, about 60 TB of iSCSI storage is available for storing data which, although important, is not being actively used by jobs. Please see the section on archival storage on the Deepthought cluster for more information.
- Archival data can also be stored on Google's G Suite drive. This can hold large amounts of data, although transfer times can be less than optimal. See the section on archival storage using G drive for more information.
These options are available for the storage of files and data not associated with active research on the cluster (such files should not be stored in lustre or the /data volumes). This is useful for data which needs to be kept but rarely accessed, e.g. after a paper is published, etc. While there is no time limit on how long data can stay in these locations, it is still requested (especially on the iSCSI storage on DT1) that you delete items after they are no longer needed. Policies related to longer term storage
Policies on Usage of Home Space
- Do NOT start jobs from your home directory or subdirectories underneath it.
Run the jobs from lustre or from
- Jobs should not perform significant I/O to/from homespace volumes. Use
/data/...volume or the locally attached scratch space(
- Delete or move off the HPCC any files which are no longer needed or used.
- There is a 10 GB soft quota on home directories. This soft quota will not prevent you from storing more than 10 GB in your home directory. However, a daily check of disk usage is performed, and if you are above the quota you will receive an email requesting that you reduce disk usage within a grace period of 7 days. The email reminders will continue until usage is reduced or the grace period is over. If you are still over quota at that time, system staff will be notified and more severe emails will be sent; unless the situation is remedied promptly, system staff may be forced to take action, which could involve relocating or deleting your files. This soft quota approach is being taken to ensure all HPCC users get fair access to this critical resource without unduly impacting performance on the cluster, while allowing you some flexibility if you need to exceed the 10 GB limit for a few days.
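To see roughly where you stand against the 10 GB soft quota before the daily check does, something like the following Python sketch may help. It is only an illustration: the names `QUOTA_BYTES`, `directory_usage_bytes`, and `over_quota` are made up for this example, it assumes "10 GB" means 10 * 1024^3 bytes, and it sums apparent file sizes, which may differ from the way the actual daily check measures usage.

```python
import os

# Assumption: the 10 GB soft quota is interpreted as 10 * 1024**3 bytes.
QUOTA_BYTES = 10 * 1024**3


def directory_usage_bytes(root):
    """Sum the apparent sizes of all files under root.

    Symbolic links are not followed; a link is counted as the size of the
    link itself. Files that vanish or cannot be read are skipped.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                continue
    return total


def over_quota(usage_bytes, quota_bytes=QUOTA_BYTES):
    """Return True if usage is strictly above the soft quota."""
    return usage_bytes > quota_bytes


# Example use (walking a large home directory can take a while):
#   used = directory_usage_bytes(os.path.expanduser("~"))
#   print(f"{used / 1024**3:.2f} GB used; over quota: {over_quota(used)}")
```

Since walking a large directory tree is itself I/O on the home filesystem, run a check like this sparingly.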
Policies on Usage of Division of Information Technology Provided Data Space
- Jobs should avoid doing extensive I/O to/from /data/... volumes, as NFS performance will degrade, affecting both your jobs and other users of the system. Please look into using lustre or the locally attached scratch space (/tmp) if at all possible; contact DCS if you need assistance with that. This is especially a concern if you have a lot of processes (either a lot of small jobs, or big jobs where each task is doing I/O) accessing these volumes heavily.
- Delete or move off the HPCC any files which are no longer needed or used. This space is intended to provide temporary storage for the support of jobs running on the system; it is not for archival purposes. Files which are not actively being used by computations on the cluster must be removed promptly to ensure these resources are available for other users.
- When the filesystems are filling up, systems staff will send out emails to the largest consumers of space on the affected filesystems requesting that you reduce your footprint. You are required to comply with these requests and promptly reduce your disk usage on the specified filesystems.
- Systems staff reserve the right to delete files on the lustre and DIT provided data volumes that are more than 6 months old without notice. While we hope to not need to invoke this option often, this is needed and will be done if users are not complying with the previous policy items. So delete files when they are not needed by jobs, and when you receive requests to do so from system staff to avoid having us delete files for you.
- Files in lustre or in the /data/... volumes are not backed up.
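Because files more than 6 months old on these volumes may be deleted without notice, it can be useful to list your own old files ahead of time. The Python sketch below is a hypothetical helper, not a DIT tool: the name `files_older_than` is invented here, "6 months" is approximated as 183 days, and age is judged by modification time, although the policy does not say which timestamp staff would use.

```python
import os
import time

# Assumption: "6 months" is approximated as 183 days.
SIX_MONTHS_SECONDS = 183 * 24 * 3600


def files_older_than(root, max_age_seconds=SIX_MONTHS_SECONDS, now=None):
    """Return a sorted list of files under root last modified before the cutoff.

    Age is measured by modification time (an assumption; the policy does not
    specify the timestamp). Symbolic links are not followed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if mtime < cutoff:
                old.append(path)
    return sorted(old)


# Example use, with a placeholder path:
#   for path in files_older_than("/path/to/your/data/directory"):
#       print(path)
```

Review the resulting list by hand before deleting or moving anything.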
The DIT provided data space, both lustre and
Files older than 6 months on the lustre and
Policies on Usage of Research Group Provided Data Space
- Jobs should avoid doing extensive I/O to/from /data/... volumes, as NFS performance will degrade, affecting both your jobs and other users of the system. Please look into using lustre or the locally attached scratch space (/tmp) if at all possible; contact DCS if you need assistance with that.
- You should still delete or move off the HPCC any files which are no longer needed or used, so as not to adversely impact other users of your research group.
- Files are not backed up.
The research group provided data stores are NOT backed up by DIT.
Policies on Usage of Locally Attached Scratch Space
- Please have your jobs use locally attached scratch space (/tmp) wherever it is feasible. This generally offers the best disk I/O performance. Contact DCS if you have questions or need assistance with that.
- Files in locally attached scratch space are not backed up.
- Files in locally attached scratch space are deleted upon termination of the job.
- Although all files in /tmp that belong to you will be deleted when you no longer have any jobs running on the node, it is good practice to delete files yourself at the end of the job where possible. This matters especially if you run many small jobs that can share a node, as otherwise it can take some time for the automatic deletion to occur, and that can reduce the available space in /tmp for other jobs.
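One way to follow the scratch space policies above from inside a job is to work in a private directory under /tmp, copy the results you need somewhere persistent, and remove the directory before the job exits. The Python sketch below illustrates that pattern under stated assumptions: `run_in_scratch` and its parameters are hypothetical names for this example, and `work` stands for whatever your job actually does.

```python
import os
import shutil
import tempfile


def run_in_scratch(work, scratch_root="/tmp", results_dir=None):
    """Run work(scratch_dir) in a private scratch directory, then clean up.

    work must return a list of file paths (inside the scratch directory)
    that should be kept. If results_dir is given, those files are copied
    there before the scratch directory is removed. Returns the base names
    of the kept files.
    """
    scratch = tempfile.mkdtemp(prefix="job_", dir=scratch_root)
    try:
        outputs = work(scratch)
        if results_dir is not None:
            os.makedirs(results_dir, exist_ok=True)
            for path in outputs:
                shutil.copy2(path, results_dir)
        return [os.path.basename(p) for p in outputs]
    finally:
        # Remove the scratch directory even if work() raised an exception,
        # so space on the node is freed for other jobs right away.
        shutil.rmtree(scratch, ignore_errors=True)
```

Remember that each node of a multinode job has its own /tmp, so a pattern like this only manages the scratch directory on the node where it runs.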
Any files you own under
DIT-provided longer term storage
- The /data/dt-archiveN volumes and Google's G drive are the ONLY DIT-provided storage where it is permissible to store files and data not associated with active research on the cluster. They can be used to archive data that, e.g., needs to be kept for a while after a paper is published.
- The /data/dt-archiveN volumes are only available from the login nodes of the original Deepthought cluster. They are NOT available from the compute nodes.
- Do not use this storage for active jobs.
- These volumes are NOT backed up.
- Google's G drive storage is NOT on campus, and as such there may be restrictions on what types of data are allowed to be stored there (from a security perspective). Please see the Google drive service catalog entry for more information regarding this.
Policies Regarding Data of Former Users
Over time, users on the cluster will come and go. Because of the large volume of data that some users have, it is necessary to have policies regarding the disposal of this data when users leave the university or otherwise lose access to the cluster, in order to protect potentially valuable data but also prevent valuable cluster resources from being needlessly tied up by files from users no longer on the cluster.
The exact processes vary depending on the cluster:
Disposal of Former User Data on the Deepthought2 Cluster
All active users on the DT2 cluster belong to one or more allocations, and lose access to the cluster when they are no longer associated with any allocations, be it because they ceased being associated with the University or the research group owning the allocation, or because the allocation expired. When this happens:
- All of the data owned by the user (both in their home directory and/or in their lustre directories) is "quarantined". I.e. it is relocated and access to this data is disabled for all users, but it is still consuming space on the filesystem. This is to ensure that anyone who is using this data, whether cognizant of their use of it or not, will quickly notice that it is gone, so that hopefully things can be resolved before the data is permanently deleted. If you need (or think you need) access to data from someone whose access was recently disabled,
- For every allocation that the user who previously owned the data belonged to just before being disabled, the points of contact (PoCs) of that allocation will receive email informing them that the data previously owned by that user is slated for deletion. Emails will be repeated monthly as long as the data remains "quarantined" (or until the PoCs give approval for early deletion of the data).
- NOTE: Only the PoCs for allocations that the user belonged to just before being disabled will receive these notifications. E.g., if a user is a member of allocations AllocA and AllocB, and then is removed from AllocA (typically at the request of a PoC for AllocA), and then some weeks or months later is removed from AllocB (either due to expiration of account or at request of a PoC), only PoCs for AllocB will receive recurring emails about the "quarantined" data. PoCs for AllocA will receive email at the time that the user was removed from AllocA, and this email will mention that they should make arrangements regarding transferal of ownership of any data belonging to AllocA, but no data is "quarantined" at that time (because the user still has access to the cluster).
- The data will remain in "quarantine" until one of the following occurs:
- The expiration date for the data has passed. The expiration date is set to 6 months after the account was originally disabled. At this point, the data has been "quarantined" for (and therefore not accessed for at least) six months, and so is beyond the age at which DIT staff reserve the right to delete anyway.
- PoCs representing all of the allocations with which the user had been associated just before being disabled have given approval for the early deletion of the data. This is to allow freeing up of resources ahead of the normal 6 month policy, but is only done if representatives of ALL allocations agree to it.
- If any PoC from an allocation the user had been associated with just before the account was disabled requests that the data be transferred to another user, the data will be transferred (as per HPC policy, all data belongs to the allocations of which the user is a member). This does not require consent of all PoCs involved, and it is assumed that should multiple PoCs need different parts of the data, things can be worked out in a friendly manner.
Disposal of Former User Data on the MARCC/Bluecrab Cluster
Users on the Bluecrab cluster are typically only associated with a single allocation, and Bluecrab provides allocation-centric storage for much of the storage resources. Data in the allocation-centric directories will remain accessible to the allocation for the lifetime of the allocation, even if the account of the original user/owner has been disabled.
Data in the user-centric spaces (e.g. home directory and personal lustre workspaces) for users whose accounts have been disabled will be handled as per MARCC policies. Contact MARCC staff for details if needed.
When an allocation on MARCC expires:
- The SU limit for the allocation is reduced to a trivial amount, effectively preventing the allocation from being used to submit more jobs. It will remain at this level for one quarter (three months) to allow users to continue to have access to the cluster for the purpose of moving data off the cluster.
- The PoC for the allocation receives email stating that the allocation has expired, and to apply to renew the allocation if it is still needed or to use the one quarter grace period to move any needed data off the cluster.
- If the allocation is not renewed, at the end of the one quarter grace period MARCC staff will be instructed to delete the allocation as per their standard policies. I do not know exactly what such policies are (contact MARCC staff if you need details), and these probably will give you some additional time to access the cluster to transfer data. However, as it is not clear what the MARCC policies are, I strongly suggest you use the one quarter grace time from UMD to transfer all data that you need.