The Division of Information Technology (DIT) at the University of Maryland (UMD) maintains a number of High Performance Computing (HPC) clusters for use by researchers at UMD. These include the Zaratan and Juggernaut clusters.
The Zaratan cluster is the University's flagship cluster, replacing the previous Deepthought2 cluster, and the initial hardware purchases were funded by the Office of the Provost at UMD, with contributions from the Engineering and CMNS colleges. Access to the Zaratan cluster is available to all faculty at the University, along with their students, post-docs, etc., through a proposal process.
The Juggernaut cluster consists primarily of hardware which was contributed by various research units --- access to this cluster is generally restricted to members of the contributing units. It was built primarily to support researchers who needed HPC resources beyond what was available on the Deepthought2 cluster and which, because of sundry data center issues, could not be added to that cluster. We are not currently planning to expand this cluster, and instead will be redirecting any expansion requests to the new Zaratan cluster. Indeed, the Juggernaut hardware will likely be merged into the Zaratan cluster in the near future.
To get access to one of the HPC clusters, you need to be granted access to a project and an associated allocation on that cluster. If you were granted an allocation on one of the HPC clusters, you should already have access to the cluster, and should have received email welcoming you to the cluster and giving basic instructions. For more detailed instructions you can view the instructions on using the web portal or instructions on using the command line interface.
If you do not have an allocation of your own, but are working with a faculty member who has an allocation, any manager for that allocation (e.g. the allocation owner or someone they delegated management rights to) can grant you access to their allocation. After that is done, you should be able to log into the cluster either via the web portal or via the command line.
Allocations on the Juggernaut cluster are basically only for those units/research groups which have contributed hardware to the cluster.
Allocations on the Zaratan cluster are available to all faculty at UMD. If you (or your faculty advisor, for students, etc.) do not have an allocation on the cluster, the rest of this section will explain how to obtain one. For students, post-docs, and other non-faculty members, please have the faculty member you are working with apply for the allocation and then grant you access to it.
Allocations on the HPC clusters consist of allotments of compute time and storage on the cluster. Compute time is measured in Service Units (SU). Essentially, one SU is the use of one CPU core for one hour, with some additional factors applied to account for differing CPU speeds, excessive memory use, and/or the use of GPUs. Typically, we use units of kSU, where 1 kSU = 1000 SU. So a job running on 1 CPU core for 4 days would usually consume 1 core * 4 days * 24 hr/day = 96 SU. Another job running on 16 cores for 6 hours would also consume 96 SU.
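As a rough sketch, the SU arithmetic above can be written as a short calculation. The multiplier arguments below are illustrative placeholders; consult the current Zaratan charging policy for the actual memory and GPU factors.

```python
def su_cost(cores, hours, memory_multiplier=1.0, gpu_multiplier=1.0):
    """Estimate the SU cost of a job: one SU is one CPU core for one hour,
    scaled by any memory or GPU charging factors (values here are illustrative)."""
    return cores * hours * memory_multiplier * gpu_multiplier

# The two examples from the text above:
print(su_cost(cores=1, hours=4 * 24))   # 96.0 SU -- 1 core for 4 days
print(su_cost(cores=16, hours=6))       # 96.0 SU -- 16 cores for 6 hours
```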
The total number of SUs available on the cluster in a given time period is limited; e.g. the total number of SUs per quarter can basically be computed by multiplying the total number of CPU cores by the number of hours in a quarter. Any time a CPU core sits idle for an hour represents an SU which is forever lost. We limit the number of SUs allotted to allocations in order to try to keep wait times reasonable while still keeping the cluster well utilized. The large number of users on the clusters and the law of large numbers tend to make the distribution of usage somewhat uniform over time, but for larger allocations we dole out the SUs quarterly to further encourage this.
Allocations also get an allotment of storage. All of the HPC clusters have a high performance file system (HPFS) or scratch tier, which is designed for the temporary storage of data being actively used by jobs. This storage tier is highly optimized so that it can, when used properly, support thousands of processes doing heavy I/O against it. The Zaratan cluster also supports a second, larger storage tier, the SHELL medium term storage tier, which allows for the storage of large data files which are inputs to or outputs from jobs but are not needed by actively running jobs --- e.g. so that you do not need to spend days downloading large files just before submitting a job.
All faculty at UMD are eligible for a basic allocation from the Allocations and Advisory Committee (AAC) consisting of 50 kSU per year, with 0.1 TB of storage on the HPFS/scratch tier, and 1 TB of storage on the SHELL medium term storage tier. This basic allocation is available at no cost to the faculty member. All one has to do to obtain such an allocation is fill out an application. Because we use a single application for both this basic application level and for requesting additional resources from the AAC, there are a fair number of questions on the application. For the basic application, you can leave many of the questions blank (or put "N/A" in the answer) --- faculty members will be awarded the base allocation just by requesting it. However, it is helpful if you take a few moments and try to answer these questions to the best of your ability --- your answers will provide us with some insight into what you are trying to do and we might be able to offer useful suggestions. In addition, looking at the questions will be helpful if and when you need to request additional resources from the AAC; once you go beyond the "basic" allocation, we will require satisfactory answers to all of the questions in the form, and the answers for some fields require information about your jobs and their performance that you should be collecting while using your basic allocation. So being at least aware of the questions that you will need to answer when applying for more resources is useful. As always, if you need assistance with either the basic allocation or when requesting additional resources, please do not hesitate to contact the HPC team.
Although this basic allocation might suffice for some small projects, and is useful if you wish to explore whether high performance computing would benefit your research, we expect most users will need more resources for their work. There are several ways to obtain additional resources.
Generally, the next step is to apply for additional compute and storage resources from the campus Allocations and Advisory Committee (AAC). This committee consists of a number of UMD faculty members with extensive experience in research involving the use of high performance computing, who will evaluate such requests to ensure proper and efficient use of the university's valuable HPC resources.
The AAC can authorize additional resources, up to 500 kSU of compute time, 10 TB of high performance/scratch storage, and 50 TB of SHELL/medium term storage, at no cost to the faculty member.
If even more resources are needed, there are basically two options, which we elaborate on below:
The units which have pools of HPC resources to allocate, along with their contacts and which clusters they have pools on, are as follows:
Unit | Contact Person | Zaratan? | Notes
---|---|---|---
A. James Clark School of Engineering | Jim Zahniser | X | ENGR is doing some cost recovery
College of Computer, Mathematical and Natural Sciences (CMNS) | Mike Landavere | X | Delegated to departmental level (see below)
CMNS: Atmospheric and Oceanic Science | Kayo Ide? | X |
CMNS: Astronomy | Benedikt Diemer | X |
CMNS: Biology | Wan Chan | X |
CMNS: CBMG | Wan Chan | X |
CMNS: Chemistry | Caedmon Walters | X |
CMNS: Computer Science | Jeanine Worden | X |
CMNS: Entomology | Greg Hess | X |
CMNS: Geology | Phil Piccoli | X |
CMNS: IPST | Alfredo Nava-Tudela | X |
CMNS: Joint Quantum Institute | Jay Sau | X |
CMNS: Physics | Jay Sau | X |
Please note that all policies/procedures/etc related to the
allocation of these resource pools by the units above are completely
up to the unit; the Division of IT is not involved in the policies or
decision-making process. Also note that while we try to present accurate
and up-to-date information regarding these matters, the units
are not required to inform us before making changes, and so for the
most accurate and definitive information we suggest you contact the relevant
people in the unit. To our knowledge, Engineering is doing some cost
recovery on allocations from their pool, but all of the other units
above are not directly charging faculty members for allocations
granted from their resource pools.
All faculty are eligible to purchase additional HPC resources. Please see the cost sheet for pricing. The revenue from these charges will be used to maintain, enhance, and expand the cluster.
You are not limited to just one of the above options. Indeed, the same application form is used for all allocation types managed by the Division of IT (i.e. everything but the allocations from college/departmental pools), and you only need to apply once for the full amount of resources you require: we will automatically grant the base allocation, submit any additional resources requested (up to the cap) for review by the AAC, and provide a quote for the remainder. Compute time from a purchased allocation will not be available until arrangements for payment have been made; if you mistakenly request an amount of resources which would require payment, it is not a problem, as it will be corrected when we contact you to arrange for payment.
All allocations have an expiration date, which is at most one year from the date of approval. The allocations from DIT/AAC can be renewed, but this requires the submission of a renewal application. For "base" allocations, renewals are essentially automatic; for AAC allocations the AAC will want to see more detail, including a summary of what was accomplished with the previously awarded resources. We also request that all PIs update the list of publications in ColdFront.
The following table summarizes the various options for obtaining allocations and the limits which apply:
Allocation Class | From | Compute Time | Scratch/HPFS | SHELL/MTS | URL for applying
---|---|---|---|---|---
"Free" Base allocation | DIT | 50 kSU/year | 0.1 TB | 1 TB | Campus AAC application form
"Free" AAC allocation | AAC | up to 500 kSU/year | up to 10 TB | up to 50 TB | Campus AAC application form
College/Departmental allocations | College/Department (e.g. ENGR, CMNS) | Up to the college/department | Up to the college/department | Up to the college/department | See the College/Departmental pool table for a list of units and contacts
In addition to the "free" allocations from the Allocations and Advisory Committee (AAC), it is also possible to purchase additional resources from the Division of Information Technology. The pricing model depends on a number of factors, including the amount of resources being requested, the amount of excess capacity currently on the cluster, and the time frame for the request.
We strive to keep the capacity of the cluster at or above the total resource commitment as otherwise there could be serious issues with shortages of resources (e.g. long wait times in the queues for jobs, disks running out of space, etc). If your requested resources can be met from existing excess capacity (i.e. capacity on the cluster which has not been allocated to other users via their purchases or any of the allocation avenues described above), we can typically grant the request rather quickly.
Currently (Jan 2023) we have excess capacity of about 15,000 kSU/quarter, about 200 TB of scratch, and 1 PB of SHELL storage. For requests that can be satisfied from our excess capacity, the pricing model for additional resources is as follows:
Resource | Unit | Price |
---|---|---|
Compute | 1 kSU for 1 quarter | $2.32 |
Scratch storage | 1 TB for 1 quarter | $8.28 |
SHELL storage | 1 TB for 1 quarter | $5.57 |
These resources are all tied to a specific quarter in which they are valid (this determination will be made at the time of purchase). Any resources which are not used in the quarter specified do not roll over; they simply vanish. You can make a purchase of e.g. 1000 kSU for a year, but this will be broken down into a set number of kSU for each quarter in that year; by default we will allocate 250 kSU/quarter for each of the four quarters, but you can request a different quarterly allotment.
Similarly, you can purchase storage resources for a year or longer, and specify how you wish the storage to be allocated quarter by quarter (although since files tend to be more permanent than jobs, we strongly encourage either dividing it equally among the quarters or having the allotted amount increase with each successive quarter). Thus to get an additional 1 TB of scratch space for 3 years (or 3 years * 4 quarters/year = 12 quarters), you would need to pay 12 times the $8.28 quarterly price. At the end of the contracted period, unless you purchased additional space in another contract, the added space will go away (i.e., your project's quota on the relevant storage tier will return to the value it was prior to the purchase); this will likely result in your project being over quota unless you have deleted data or transferred data elsewhere. As per HPC policy, in such cases you will be warned of the overage and asked to resolve the matter in a timely fashion (typically a week or so) --- failure to do so may result in your project being charged for additional quarters of storage use (at whatever the going rate at the time is, which might be more than the rates in the initial contract).
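As a sketch of how these line items add up, the following uses the quarterly list prices from the table above (Jan 2023; treat them as subject to change):

```python
# Quarterly list prices (USD), as quoted in the table above (Jan 2023).
PRICE_PER_QUARTER = {
    "compute_kSU": 2.32,   # 1 kSU for 1 quarter
    "scratch_TB": 8.28,    # 1 TB of scratch for 1 quarter
    "shell_TB": 5.57,      # 1 TB of SHELL for 1 quarter
}

def purchase_cost(quarters, compute_kSU_per_q=0, scratch_TB_per_q=0, shell_TB_per_q=0):
    """Cost of a purchase held at a constant level for the given number of quarters."""
    per_quarter = (compute_kSU_per_q * PRICE_PER_QUARTER["compute_kSU"]
                   + scratch_TB_per_q * PRICE_PER_QUARTER["scratch_TB"]
                   + shell_TB_per_q * PRICE_PER_QUARTER["shell_TB"])
    return quarters * per_quarter

# 1 TB of scratch for 3 years (12 quarters), as in the example above:
print(round(purchase_cost(quarters=12, scratch_TB_per_q=1), 2))    # 99.36
# 1000 kSU for a year, allotted evenly as 250 kSU per quarter:
print(round(purchase_cost(quarters=4, compute_kSU_per_q=250), 2))  # 2320.0
```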
Note: Data stored on the various storage tiers is still subject to the HPC policies on the use of the respective storage tier, even if your project has purchased additional storage. For example, the policy that all data on the scratch tier must be in support of "active" jobs on the cluster (i.e. jobs that just finished, are actually running or in the queue, or jobs to be submitted in the near future) still applies to all data on the cluster, even if your project purchased additional scratch space.
In order to use the cluster, all allocations need some mix of compute and storage. Since many users are a bit unsure as to the relative amounts, we offer a "balanced" package of compute and storage resources, for $2.86 per "balanced unit", with a balanced unit consisting of:
Again, this is just a convenient ratio in which to buy compute and storage, reflecting the ratio of those resources in the initial cluster configuration. The terms are the same as previously mentioned, and there is no "discount" for buying in "balanced units" (the prices listed above are actually our fully discounted prices, so we cannot go lower). This is just a recommended ratio for would-be purchasers who know they need a bit of all resources but are not sure how much of each they should purchase --- if you happen to know that your intended research will need more of one resource and less of another, you should adjust accordingly.
If you request more resources than what we have in our current excess capacity, additional hardware will need to be purchased to accommodate your request. We will need to obtain quotes from the vendor in order to work out what the costs will be. The need to purchase hardware will also mean that there will be some delay before the hardware actually arrives and can be integrated into the cluster. Furthermore, if you are only requesting the additional resources for a short time (compared to the estimated usable lifespan of the hardware involved), there will be surcharges applied to cover the estimated overhead before we can sell the resources as excess capacity. This is standard industry practice; even large cloud providers like Google and AWS charge substantially more for short term purchases compared to purchases with a longer commitment, and we are small compared to them.
You can submit a single application for a base allocation, an AAC allocation, and a paid allocation, and if you are ready to request all of them at the same time, that would be preferred. Of course, if your needs (or your awareness of your needs) change over time, you can submit multiple applications to adjust the allocation sizes. Please note that although the form does not enforce limits, we do track and enforce the annual limits on compute time, etc.
Although the compute time requested in the application and as awarded is on an annual basis, the actual compute time might be meted out either annually or quarterly. This decision is made by DIT based on the size of the allocation; smaller allocations will have compute time doled out annually, and larger ones quarterly --- this is to encourage the use of allocations to be more spread out over time. E.g., a 50 kSU "base" allocation will typically be meted out annually; when it is awarded you will receive 50 kSU to use within 365 days from the date of award. But if you purchase 1000 kSU (or receive such from a departmental/college pool), typically you will get 250 kSU/quarter for the 4 quarters following the date of the award. SUs that are not used at the end of a quarter (or the end of the year for annually meted out allocations) will simply disappear; they will not carry over into the next time period.
If you have a "base" and an "AAC" allocation, we will typically try to consolidate these into a single allocation, representing a single Slurm allocation account; this will generally be more useful than multiple Slurm allocation accounts. We cannot consolidate allocations with different sources, schedules (quarterly or annual), or expirations, so generally college, departmental, and paid allocations will remain as distinct allocations.
Unlike CPU resources, disk space does not "regenerate" with time. Once a file is placed on the file system, it remains there, consuming disk space, until someone removes it. Allocations come with a limited amount of storage. Typically, for each storage resource on a given cluster we sum up the allowances for each allocation into a single limit for the project. E.g., if your AAC allocation granted you 2 TB of HPFS storage, your college allocation granted you an additional 1 TB, and you purchased an allocation granting another 1 TB, normally this would be combined to give a 4 TB storage limit for all members of any of those allocations. The storage allotment remains until the allocation expires (or the storage allotment for the allocation changes); if such events reduce your storage allotment, causing your usage to exceed your allotment, the PIs and managers for the project will be contacted to inform them of the issue and request that it be rectified in a timely fashion (typically a week or so). If you receive such a notification and need assistance in rectifying the matter, please contact the HPC team.
For the scratch tier, a true quota system is imposed, which limits the amount of data which members of the allocation can store on the file system. By default, we do not impose quotas on individual users, although such can be done upon request.
The SHELL storage tier is volume based. Typically one volume will be created for the root of the SHELL project directory for the project, along with a volume for each member of the project. Additional volumes may be created on request. Each volume will have a limit on the amount of data that can be stored in it; this has a default value, but within reason the default can be changed, as can the limit for specific volumes. Since often the amount of data on a volume will be only a fraction of its limit, we allow for oversubscription within reason (i.e. the total of the limits on all of the volumes can exceed the limit for the project); this is fine as long as the total amount of space used on all the volumes fits within the project's limit.
Resources, paid or otherwise, are allocated ahead of time, and so for the most part you will not get billed for resources previously consumed. The exceptions are for storage. If an allocation expires and is not renewed, or has its storage allotment adjusted downward in a renewal or otherwise, then it is possible that the total amount of data for the project stored on the resource can exceed what was allotted to the project. Also, the volume design of the SHELL storage tier, combined with oversubscription, means that it is possible to store more data under the project's SHELL directory than was allocated to the project. Whenever the storage resources used by a project exceed what was allocated to the project, warning emails will be sent to the PI and all members of the project informing them of the quota violation and requesting that it be resolved within one week. Resolution can be made by reducing the amount of data stored, renewing the expired allocation, purchasing additional storage, etc. If you need help figuring out how to rectify the situation, or if there are extenuating circumstances which might warrant an extension of the time limit to resolve the matter, please let us know. While our responsibilities towards other users on the cluster will not allow us to ignore such overages, we are willing to work with you to find a mutually acceptable solution. If you are unable to resolve the matter in a reasonable time and are not working with us towards finding an acceptable resolution, we will be forced to bill you for the additional storage used.
This section is for managers of projects on one of the DIT maintained HPC clusters. If you are NOT a designated manager, i.e. if you are only a member, or not even a member of the allocation, do NOT follow these steps. We will not honor requests made from people who are not managers for the project the allocation is in. If you are not the manager for the project, find the manager and have them make the request.
This can be done by designated allocation/project managers in either of two ways:
Either way, if there are multiple allocations within the project, it is strongly recommended that the membership lists for all such allocations be the same. Membership in any of the allocations for a project grants the user full access to all of the scratch and SHELL storage allotted to the project, and generally it is recommended that users have access to the compute time for all of the allocations belonging to a project as well.
PIs and managers of projects are now able to view and modify the membership lists of their allocations directly using the ColdFront web-based allocation management GUI.
To add a user to your allocation, there are only a few steps:
Note: You must add the user to both the project and at least one allocation for them to get access to the HPC cluster. Adding the user to the project does not do much, it basically only makes them eligible to be added to allocations for the project. It is adding the user to the allocation which actually grants them access to the cluster and allocation resources.
To delete users from your allocation(s), the process is basically the reverse of the add user process:
Please note that it takes an hour or two for the provisioning process to complete. The deprovisioning process is currently somewhat manual, so that might normally take a couple of days. Please submit a ticket to HPC staff if the removal of user access is more urgent.
Basically, one of the points of contact for the allocation just needs to send email to hpcc-help@umd.edu requesting that the user be added to the allocation. The email should come from the point-of-contact's official @umd.edu email address, and should also specify:
Note that certain subdomains of umd.edu (e.g. cs.umd.edu, astro.umd.edu) are NOT part of the unified campus username space, and as those subdomains are NOT maintained by DIT, are not usable by us to uniquely identify people. E.g., jsmith@cs.umd.edu might or might not be jsmith@umd.edu, so we cannot reliably map jsmith@cs.umd.edu to a specific person.
The DIT maintained HPC clusters currently require all users on the cluster to have active Glue/TerpConnect accounts. This condition should generally be true for most if not all users automatically, but if you are unsure or need to manually activate your Glue/TerpConnect account, please see this Knowledge Base article (KN0010073). If you submit a request for users without a TerpConnect account, you will just get email back telling you they need to get a TerpConnect account first.
Requests to delete users from the allocation can be handled similarly. Here it does not matter whether the user's TerpConnect account is still active. If the user is not associated with any allocations other than yours, their access to the cluster (as well as to charge against your allocation(s)) will be revoked, and all access to their HPC home directory and any directories on lustre or data volumes will be revoked and those directories slated for deletion. If there is data which should be retained, you should mention that in the email so we can look into reassigning ownership. If the user has access to other allocations, only their ability to charge against your allocation will be revoked, and we will by default not do anything with respect to their home or data files. You should contact the user about any transfer of data that is required (and you can contact us if assistance is needed).
Certain contributors (e.g. Engineering, CMNS and some of its departments) have not allocated all of the resources they are entitled to from their contribution to the Zaratan cluster, and are instead periodically creating suballocations carved from these unallocated resources.
To create new suballocations, or modify the resources granted to existing suballocations, the points of contact for the contributions with unassigned allocations should send email (from their official @umd.edu email address) to hpcc-help@umd.edu including the following information:
Again, all points of contact and members of the suballocation MUST already have active Glue/TerpConnect accounts before submitting the request. See here for information and instructions on activating TerpConnect accounts.
Also, all such requests MUST come from a designated point of contact for the parent contribution.
All applications for allocations of HPC resources from the Allocations and Advisory Committee (AAC) are made via this form. Do not use that form for:
All requests for allocations from the AAC must come from faculty members, although we allow students, etc. to submit the form on behalf of their faculty advisor as long as the advisor updates the request to indicate their approval. Note: while most students will have access to use this form by default, some students, especially those without graduate or research assistantships, might not. If you fall into this category, please open a ticket with HPC staff indicating such, and we will update records to provide you access. Faculty members are eligible for a single allocation from the AAC, although this allocation may be renewed.
Note: Faculty members are eligible for at most a single allocation from the AAC (excluding allocations for classes they are teaching). Students, etc. should communicate with their advisor and coordinate with their colleagues before applying to the AAC for an allocation for their advisor. If there are multiple projects, perhaps one for each student, these should still be consolidated into a single proposal to the AAC. If the AAC receives multiple applications from different students of the same advisor, it will appear as a lack of coordination and planning and will not leave a favorable impression with the AAC.
Allocation requests are reviewed, first by HPC staff and then by the AAC, with the level of scrutiny increasing as the cumulative amount requested by (or on behalf of) the faculty member (for the year) increases. If there is information missing or insufficient detail, the application will be sent back to the applicant for more information. For the first 50 kSU requested by a faculty member in a given year, the application is almost automatically approved. Requests for resources beyond that will generally require performance benchmarks based on actual runs on the cluster. Therefore, new users to the cluster should start with a 50 kSU developmental allocation request, and then use the awarded resources to start their research and simultaneously start collecting the performance benchmarks for any future request for additional resources.
Allocations from the AAC are one-time grants of resources, valid for at most one year. They do not get replenished automatically. If needed, you can submit a "renewal" application requesting additional resources before the expiration of your current allocation; generally in such cases the amount of the new request will be added to the previous request(s) for the cumulative amount requested from the AAC for the year, and the expiration date will remain unchanged (i.e. one year after the initial request was granted). The AAC tries to not oversubscribe the computational resources of the cluster; because of this, any award made to one research group correspondingly decreases the amount of resources available for the AAC to award to other research groups, so it is incumbent on the AAC to ensure that resources are being used well and efficiently, hence the application process and the scrutiny on renewal applications. There are limits on the cumulative amount of resources a faculty member is able to receive from the AAC in a given year; if resources beyond those limits are required you must either:
If this is the first time you (or your advisor if you are requesting an allocation on behalf of your advisor) is requesting an allocation from the AAC, please see the quickstart section for developmental allocations. You should also go there if you are requesting a renewal of an expired allocation at the 50 kSU/developmental level (but not if you are requesting an additional 50 kSU for a non-expired allocation). If you are requesting additional resources for a non-expired allocation, or seeking to renew an expired allocation for more than 50 kSU, please see the quickstart section for renewal applications.
There also is a section providing detailed descriptions of each field in the form and what is expected to go in it which you can refer to if you need more information than is provided in the "quickstart" sections. And as always, if you need assistance with the form, either for the basic allocation or when requesting additional resources, please do not hesitate to open a ticket with the HPC team requesting assistance. We would be happy to assist you.
All faculty members are eligible to receive a free "developmental" allocation from the AAC essentially just for the asking. These developmental allocations consist of 50 kSU , 100 GB of scratch storage , and 1 TB of SHELL storage . Faculty members are eligible to receive a developmental allocation once, and can renew it annually.
These developmental allocations are intended for users to explore if HPC techniques are useful for their research, and might even suffice for research with small computational needs. For research with larger computational needs, the AAC will usually first award a developmental allocation to allow the research to start, and to collect data showing that the research program is well-considered and using HPC resources efficiently; such data can be used in a follow-up "renewal" application for additional resources.
Because we use a single application for both this basic application level and for requesting additional resources from the AAC, there are a fair number of questions on the application. For the basic application, you can leave many of the questions blank (or put "N/A" in the answer) --- faculty members will be awarded the base allocation just by requesting it. However, it is helpful if you take a few moments and try to answer these questions to the best of your ability --- your answers will provide us with some insight into what you are trying to do and we might be able to offer useful suggestions. In addition, looking at the questions will be helpful if and when you need to request additional resources from the AAC; once you go beyond the "basic" allocation, we will require satisfactory answers to all of the questions in the form, and the answers for some fields require information about your jobs and their performance that you should be collecting while using your basic allocation. So being at least aware of the questions that you will need to answer when applying for more resources is useful.
For a developmental allocation, we ask that you please focus on the following fields of the form:
Select "New Allocation". If you are requesting a renewal of or additional resources for an existing allocation, select "Renewal Allocation".
Again, you are encouraged to fill out any other fields to the best of your ability. This is useful to give you a better idea of what will be required in any applications for additional resources, as some fields (especially the Code Use and Scalability and SU Justification sections) require some performance benchmarks that the AAC assumes you will be collecting with this initial award.
As always, if you need assistance with the form, either for the basic allocation or when requesting additional resources, please do not hesitate to open a ticket with the HPC team requesting assistance.
If you have an existing allocation and are requesting to extend it for another year, or are requesting additional resources for that allocation, you need to submit a "renewal" application to the AAC. This section gives a quick overview of the fields you need to pay attention to.
NOTE: if you are simply seeking to renew an expired (or about to expire) developmental (50 kSU) allocation for another year at the same developmental (50 kSU) level, please see the AAC Application Quickstart: Developmental Allocations section as it is more relevant.
If the allocation being renewed is expired or about to expire shortly, then (if approved) the renewal will cause the expiration date to be set to a year in the future, and cumulative resource count for the allocation for the year will be set to whatever was granted for the renewal.
If the allocation is not expired/about to expire, then the expiration date is not changed (i.e. it will still be one year from when the allocation was first granted or the last time it was renewed after it expired or was just about to expire), and any resources awarded are added to the cumulative amount awarded for the year. This will impact the amount that can be requested in future renewal requests until the allocation expires; see above for the limits on the cumulative amount of resources a faculty member can receive in a given year. I.e., if a faculty member makes an initial request for 50 kSU in January, then a renewal request for 250 kSU in March, and another request for 250 kSU in August of the same year, this will be a cumulative request of 550 kSU, causing the faculty member to have "maxed out" the amount available from the AAC until the following January.
For renewal applications, you must fill out all of the fields in the form. A discussion of all of the fields can be found below, but we suggest you focus your efforts on the following fields, in order:
Select "Renewal Allocation". This will add the Past Results and Publications fields to the form, which you will need to address.
The AAC will also wish to see that resources are being used efficiently. Parallel codes typically show close to linear improvements to performance as more resources are made available to the job, but only up to some threshold. Adding resources beyond that threshold yields little if any performance gains. This threshold is best determined empirically. The AAC will wish to see that you have run tests to determine the scalability of the codes you are using by running performance benchmarks, and that your proposal will be for jobs using an optimal amount of resources.
Again, you are encouraged to fill out any other fields to the best of your ability. This is useful to give you a better idea of what will be required in any applications for additional resources, as some fields (especially the Code Scalability and SU Justification sections) require some performance benchmarks that the AAC assumes you will be collecting with this initial award.
As always, if you need assistance with the form, either for the basic allocation or when requesting additional resources, please do not hesitate to open a ticket with the HPC team requesting assistance.
The AAC is comprised of faculty members from various disciplines with extensive experience in the use of HPC for advancing research, but do not assume that they have extensive experience in your particular discipline. Please address your comments here accordingly.
Development is for "base" allocations; Small and Medium are for allocations which can be awarded by the AAC; Large is for allocations which require payment.
Whatever you set here will determine the default value of the Requested kSU field.
Your answers here help us to better evaluate your application as well as improve our understanding of how people are using the clusters, so accurate answers are appreciated. However, we do not restrict your access to software based on this answer; if there is an application "foo" on the list that you did not select when filling out the application, but after the allocation is awarded you discover that it would be helpful to your research, the fact that you did not select it will not prevent you from using it. We do ask that if you continue using it that you include it when you renew the application.
Please note that the presence of a package in either this or the Software Requested fields does not constitute a promise on the part of the HPC staff to install said software, even if the application is approved. The HPC team strives to make a large library of software packages available to our users, and will make reasonable attempts to install packages on request, but not all packages install nicely, or even are suitable for system wide installations.
Also note that the AAC and DIT do not generally provide licenses for licensed software. The HPC team will attempt to install licensed packages on request, assuming the requester can provide proof of license (and likely they will need to provide installation media). Some of the packages in the drop down list are proprietarily licensed --- in a few such cases they are covered by a campus wide site license, but many such cases are only covered by licenses granted to certain departments and/or research groups. Listing such a package in your application, even an approved application, does not grant you access under these licenses --- we will open a discussion with you about licensing in such cases.
Do NOT purchase licenses for software you intend to use on one of the UMD maintained HPC clusters without consulting with the HPC team first. Not all licenses are suitable for use on an HPC cluster, and we do not wish you to spend money on a license you cannot use on the cluster. Please contact us before making any such purchases.
This section is where you describe your computational strategy, contrasting with the Research (Lay) Abstract section which is more about the science. In particular, for allocations from the AAC, the AAC wishes to see a detailed quantitative justification of the compute time requested.
In many cases, this can be as simple as an estimate of the number of jobs that will be required to achieve your stated research Milestones, and an estimate of the SU cost for each job. For the estimate of the number of jobs, please give a sentence or two describing how you arrived at that number. For the SU cost of each job, please give the specifications of a typical job, including the number of CPU cores requested, the amount of CPU memory requested, the number and type of GPUs requested, and the average walltime and SU cost for the job. The last two should be based on actual runs of similar jobs performed on Zaratan. If there are several different types of jobs to be run, break this down by job type. Remember to include any multipliers if you are using GPUs or more than the average memory per core. When renewing an application or requesting additional time, the AAC typically would like to see estimates of SU consumption by job based on actual runs on the cluster where possible.
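As a minimal sketch of this kind of quantitative estimate, consider the calculation below; the job counts, core counts, walltimes, and multipliers are purely illustrative placeholders to be replaced with measurements from your own runs on Zaratan:

```python
# Hypothetical job mix -- every number here is a placeholder for illustration.
# (description, number of jobs, CPU cores, avg walltime in hours, charging multiplier)
job_types = [
    ("exploratory runs", 20,  64, 12.0, 1.0),
    ("production runs", 100, 128, 24.0, 1.0),
]

total_su = sum(n_jobs * cores * hours * mult
               for _, n_jobs, cores, hours, mult in job_types)
print(f"Estimated request: {total_su / 1000:.1f} kSU")   # 322.6 kSU with these numbers
```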
If you are requesting an increase to an existing allocation, it would be helpful to mention such, along with the existing allotment of compute time and the amount of additional compute time being requested. Generally, you do not need to re-justify the SUs already allotted to you; i.e. if the additional compute time is being requested to explore areas not included in the original request, you only need to discuss the new computations. If however the additional time is needed because you need to revise your previous estimates of compute time needed (e.g. you underestimated the memory consumption so a larger memory factor for CPU time is needed, or you discovered that you need to increase the detail in the calculations), it is probably best to justify the whole amount. It would also be useful to explain why the previous estimates were off --- while habitually underestimating the resources needed will likely not be viewed favorably, the AAC is comprised of experienced researchers who understand that the unexpected sometimes occurs while performing advanced research.
If you are only requesting an increase in disk space, not compute time, you can just enter "No change to SUs requested". Be sure to complete the Disk space Justification section.
If you are not requesting any additional scratch or SHELL storage, you do not need to provide much here, although even in that case any information you do wish to volunteer is useful in helping us to understand how the cluster is being used. You likely cannot leave the field blank, but if you want to be minimalist you can just enter "base".
If you are requesting additional scratch and/or SHELL space, this field must be properly filled out. For each storage tier (scratch/SHELL) for which you are requesting additional space, give a quantitative justification for this request. This should be a quantitative estimate of the amount of storage needed to complete the research goals listed in the Milestones section, based on the actual number and sizes of files that will need to be stored. Note that scratch space is only intended for the storage of data pertaining to "active" jobs; you are expected to delete data on scratch which is no longer needed, and move to SHELL data which should be retained but is no longer needed for active jobs. Your discussion of the storage needed should take this into account.
If you have reason to prefer a specific HPC cluster over another, it is recommended that you include the reasons for your preference here. Although DIT and the AAC have the final say as to which cluster your allocation is granted on, your preference (as indicated in the "Requested Cluster" or "Renewal Cluster" fields), together with your justification for that request here, will be considered and honored if feasible and reasonable.
If you are only requesting the base amounts (e.g. 0.1 TB of scratch and 1 TB of SHELL storage), or you are just requesting additional compute time (but no additional storage) for an existing allocation, you can just enter "No additional storage".
Otherwise, please state how you arrived at the amount of disk space that you are requesting. For scratch space, this should be related to your estimate of the amount of scratch space required for running a single job multiplied by the number of jobs you expect to be running more or less at the same time (remember that you are expected to delete unneeded output files, etc. after the job finishes, and move precious input/output files elsewhere (e.g. to SHELL storage) when they are no longer needed for running or soon-to-run jobs). For SHELL storage, remember that SHELL storage is not intended as a long term archive.
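A minimal sketch of this kind of scratch estimate follows; the per-job footprint, job count, and safety margin are placeholders to be replaced with your own measured values:

```python
# Placeholder numbers -- substitute your own measured per-job scratch footprint.
scratch_per_job_gb = 50      # input + intermediate + output files for one job
concurrent_jobs = 20         # jobs running (or queued) at roughly the same time
safety_margin = 1.25         # headroom for stragglers and cleanup lag

scratch_needed_tb = scratch_per_job_gb * concurrent_jobs * safety_margin / 1000
print(f"Requested scratch: ~{scratch_needed_tb:.2f} TB")   # ~1.25 TB with these numbers
```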
If you have an existing allocation and are requesting additional storage, please explain why the additional storage is needed, and try to estimate the total storage going forward. Remember that after some threshold, we will need to start charging for disk space (in order that we can grow the total amount of disk storage on the cluster).
For "base" level allocations, you need not put much here, e.g. "unknown" or "base" or similar. However, any information you do volunteer will help us better understand how you are using the cluster. If this application is just to increase the storage allotment for an existing allocation, you can just enter "no additional CPU time" or similar. However, for all other applications (e.g. renewal applications or other applications requesting more than the 50 kSU/year base level), this section is required and will be looked at closely. The larger the amount being requested, the more detail will be required.
In particular, for renewal applications the AAC will want to see a discussion of how the performance of the codes being used scale with the amount of resources being allocated to the job in order to show to the AAC that you are using the cluster resources efficiently. Typically, the performance of jobs will increase significantly as additional resources are made available to the job, up to some threshold value. Increasing the resources beyond that value yields little if any increase in performance, and indeed in some cases degrades performance. This threshold value is dependent on many factors, including details of the problem being solved, details of the algorithm and the specific coding of the algorithm being used, as well as details of the cluster it is being run on. Because of this, this threshold is generally best determined empirically. The AAC will want to see that you can show that you are running your jobs in the optimal range.
For CPU-only jobs, this is generally a matter of running jobs (either production jobs or test jobs which are expected to behave similarly to production jobs for this purpose) with different numbers of cores and looking at the observed parallel speedup. You should be collecting some data about this while using your original award of compute time. Basically, compare the runtimes of one of your jobs as you vary the number of CPU cores available to the code. Traditionally this is compared against the runtime when using only a single core. (Ideally you would be running the same job over and over again, but often you can get decent results running comparable jobs. In that case, you might wish to take more than one data point for each number of cores to minimize effects due to differences in the jobs.) Typically, the code will speed up a bit less than linearly as the number of cores increases, to a point. After that, the code still speeds up, but with diminishing returns, and at some point the performance either levels off or possibly even degrades as more cores are added. Generally, you can stop running tests with more cores once you detect significant levelling off. The goal of these tests is to determine what the "sweet spot" is, i.e. the ideal number of cores to use to maximize efficiency and throughput.
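A minimal sketch of this speedup bookkeeping is shown below; the runtimes are invented placeholders, to be replaced with your own measured wall-clock times:

```python
# Measured wall-clock times (seconds) for the same job at different core counts.
# These numbers are invented for illustration only.
runtimes = {1: 3600, 2: 1900, 4: 1000, 8: 560, 16: 340, 32: 300}

t_single_core = runtimes[1]
for cores in sorted(runtimes):
    speedup = t_single_core / runtimes[cores]   # relative to the single-core run
    efficiency = speedup / cores                # 1.0 would be perfect linear scaling
    print(f"{cores:3d} cores: speedup {speedup:5.2f}, parallel efficiency {efficiency:.2f}")
```

With these made-up numbers the gains have largely levelled off by 32 cores, so the "sweet spot" would be somewhere around 8 to 16 cores --- exactly the kind of determination the AAC expects to see.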
The above discussion applies to both pure multithreaded jobs (i.e. the job only uses multithreading for parallelization) and pure MPI jobs (i.e., each MPI task is single threaded). For hybrid jobs using both multithreading and MPI, you should modify the above to do a two parameter search for the "sweet spot" by varying both the number of cores used per MPI task and the number of MPI tasks. (This is assuming that neither is constrained by the nature of the problem being solved; if such constraints come into play, then briefly discuss them.)
If your CPU jobs are requesting more than the default amount of memory (usually 4 GB per CPU core), the AAC wants to see that you have looked at the jobs to make sure they really require the requested memory. This is usually best determined by examining the MaxRSS value from jobs that successfully completed using the sacct command. In cases where there are significant and unpredictable variations in the amount of memory used by jobs, you might wish to look for a value of the memory setting which satisfies almost all of the jobs, and then rerun the few cases which need more with a higher memory setting, as this might be more efficient even after accounting for the rerunning of some jobs. E.g., let's assume you need to run 100 jobs, all single core, CPU only jobs which run for 20 hours, and 90 of these jobs require more than 10 GB but less than 12 GB of RAM, but the other 10 jobs require between 22 and 24 GB of RAM, and you cannot predict beforehand how much memory a job will need. If you were to run all 100 jobs requesting 24 GB of RAM (the amount needed by the most memory-hungry job), the hourly SU cost for each job would be 0.25 SU/GB/hour * 24 GB = 6 SU/hour, so the cost of all 100 jobs would be 100 jobs * 6 SU/hour * 20 hour/job = 12 kSU. But if instead we ran all jobs requesting only 12 GB of RAM, this would cost 0.25 SU/GB/hour * 12 GB = 3 SU/hour, and running all 100 jobs would cost 100 jobs * 3 SU/hour * 20 hour/job = 6 kSU. Of course, we would still have 10 jobs that failed and would need to be rerun with 24 GB, costing 10 jobs * 6 SU/hour * 20 hour/job = 1.2 kSU, for a total of 7.2 kSU, or only 60% of the SU cost compared to running all jobs requesting 24 GB.
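The arithmetic in that example can be checked with a short script. The 0.25 SU/GB/hour memory rate and the job mix come straight from the example above (the actual charging rate on the cluster may differ), and the `sacct` query mentioned in the comment is one standard Slurm way to collect the MaxRSS data:

```python
# Memory-based charging rate and job mix from the example above; verify the
# actual rate for the cluster before relying on these numbers.
SU_PER_GB_HOUR = 0.25
HOURS = 20          # walltime per job
N_JOBS = 100        # total jobs
N_BIG = 10          # jobs that actually need up to 24 GB
# (MaxRSS for completed jobs can be inspected with, e.g.,
#  `sacct --format=JobID,MaxRSS,Elapsed` on the login nodes.)

def job_cost_su(n_jobs, mem_gb, hours=HOURS):
    return n_jobs * SU_PER_GB_HOUR * mem_gb * hours

everything_at_24gb = job_cost_su(N_JOBS, 24)                    # 12000 SU = 12 kSU
two_pass = job_cost_su(N_JOBS, 12) + job_cost_su(N_BIG, 24)     # 6000 + 1200 = 7200 SU
print(everything_at_24gb, two_pass, two_pass / everything_at_24gb)  # 12000.0 7200.0 0.6
```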
For jobs that rely primarily on GPUs for computation, you should discuss whether the code can use multiple GPUs or not, and if it can use multiple GPUs whether it is restricted to GPUs on the same node. If multiple GPUs can be used, you should do a similar process of determining the performance of the code as a function of the number of GPUs being used in order to find the "sweet spot".
Also for GPU jobs, you should discuss which of the different types of GPUs available on the system can be used for your job, with a brief comment on why any GPU types are unacceptable (e.g. insufficient GPU memory, etc). If your job will run on multiple types of GPUs, you should show the performance as a function of GPU type. If using a more powerful (and more expensive) GPU does not provide a significant gain in performance, then using the more powerful GPU would be wasteful.
This field is just to help us collect some data about who is using the campus HPC resources, i.e. how many novice vs. experienced users. Answering "No" will not cause your application to be rejected (although it will raise some questions if this is a renewal application). We encourage researchers who have not used HPC resources in the past to explore whether HPC techniques could benefit their research.
The UMD HPC clusters, like most HPC clusters, run on Unix-like (specifically Linux at UMD) systems, and some advanced functionality requires (or is at least facilitated by) some fluency with Unix. However, we have added an OnDemand Portal for interfacing with the cluster which greatly reduces the amount of Unix familiarity needed to use the cluster. We hope that this can help make HPC techniques more accessible to researchers at UMD.
Allocations of compute time are provided as service units (SUs), each of which represents one hour of wall clock time on one CPU core. Different categories of allocations provide cycles for newcomers (development: 20K SUs), for moderately demanding jobs (small: 60K SUs), and for compute-intensive research (large: 100K SUs). The larger allocations are naturally scrutinized more, and generally require the applicant to have shown reasonable knowledge of HPC and its issues, either from previous development grants or other experience on this or other clusters.
AAC allocations on the Zaratan cluster are one time grants of SUs with an one year (by default) expiration. SUs can be used as needed over the course of that year. You can apply to the AAC to renew your AAC allocation to extend the expiration another year (this can be done each year).
If an application is approved by the AAC, the allocation will be created, by default, shortly after approval - typically within about one business day. If you would prefer a later starting date (e.g. you will not be able to start using the cluster immediately due to other priorities or because you are awaiting data), please specify such in the proposal, especially if there will be a significant delay. The time between submission and approval can vary; if the application has sufficient detail that the AAC has no follow up questions, approval is typically within one or two business days. If there are follow up questions, an HPC administrator will contact you (typically via email) with the questions, and forward your replies back to the committee. Again, you should usually receive notification of approval or follow up questions within about one or two business days after a submission.
Students are only allowed ONE allocation from the AAC for their duration with the university, and that will be a developmental allocation and not renewable. If more CPU cycles are required, their faculty advisor must apply for the allocation.
Criteria used in making such determinations include appropriateness of the clusters for the intended computation, the specific hardware and/or software requested, a researcher's prior experience with high-performance computing, the track record of a requestor who has received HPCC allocations in the past, and the overall merits of the research itself.
The AAC will determine which of the HPC clusters is most appropriate for the request. If the requestor has a specific cluster in mind, that should be explicitly mentioned in the proposal. In addition, the proposal should provide enough information to justify the use of a specific cluster (e.g. the need for Matlab DCS or other cluster specific licenses, or GPUs, or large memory nodes). While the committee will consider requests for a specific cluster, the committee will decide which cluster to grant for a proposal based on which cluster is most appropriate for the request.
To submit an application, go to the HPC AAC application page, and select the desired form under the "Forms" menu item on the top menu bar. If you already have an allocation and you need to request additional time (either because more time is needed to complete the research than originally thought, or because the scope of research is expanding), then please select "Renew an allocation". Otherwise, select "New Allocation" for a new allocation.
When applying to the AAC for an allocation, remember that the AAC generally prefers to award allocations for specific projects. It is best to make a proposal for specific projects, with milestones that can be achieved within one year (or whatever time frame of requested allocation is), and if needed make a renewal request for more time for a second set of goals. In addition, it is useful to include the following in your proposal:
The AAC is unlikely to grant large allocations without a good discussion of most of the above points. However, it is recognized that not all applicants are experienced High Performance Computing (HPC) experts. Indeed, one of the aims of the AAC with this allocation process is to allow researchers who are not even sure if HPC techniques will work for their research an opportunity to try HPC out without a monetary investment. So if you are unable to address all of the above points, you can still apply for an allocation. It is likely that the AAC will, at least initially, only approve your application for a developmental (20 kSU) allocation, but that should not be viewed as a setback. The 20 kSU allocation might even be enough for some small projects, but at minimum it should allow you to collect the information regarding SUs required per job, scalability of code, etc. to address the above points when you request additional time in a renewal application.
If you have questions regarding the application process, or what information is requested or how to obtain such, or any other issues, please feel free to contact us.
Several samples of approved applications have been made available with the kind consent of the applicant to assist others who wish to apply.
As befits an institute of learning, DIT is willing to make a reasonable amount of HPC cluster resources available to classes in most cases. If you are teaching a course and wish to use HPC resources, please submit a request to HPC admins for class access. Such requests should come from the instructor of record for the class, and should include the following information:
Please provide the above information to the best of your ability when making a request. If you are unsure what is meant or how to answer something, let us know and we will try to clarify. The more completely you answer, the faster the process will go.