
Allocations and Job Accounting

  1. The Basics
  2. Choosing the Account to Use
  3. Managing Allocations
    1. For PIs
    2. For College/Dept Pool Managers
  4. Monitoring Usage
    1. How many SUs are left?
    2. General information about an allocation
    3. Seeing job history
    4. Monitoring for excessive usage
    5. Usage reports

The Basics of Allocations and Job Accounting

As a user of the cluster you belong to at least one project, and each project contains one or more allocations, at least one of which you must also be a member of. Each allocation represents an allotment of resources on an HPC cluster. These resources include compute time, measured in SUs (or more commonly kSUs), as well as storage space, measured in TB, on both the scratch and the SHELL tiers.

The reason that there may be multiple allocations within a project is that the allocations can come from different resource pools and have different expiration dates and replenishment schedules. Allocations from the Allocations and Advisory Committee (AAC) typically have a duration of one year, although they may be renewed via an application showing reasonable past use. In most cases such allocations are awarded a fixed amount of resources for the duration of the allocation, i.e. for the year. Allocations coming from college or departmental pools are subject to the policies of the college/department granting the allocation; usually these are also for one-year terms, but allocated and replenished quarterly. Allocations purchased from DIT are governed by the MOU signed at the time of purchase, but typically are also for one year, allocated and replenished quarterly.

The storage allotments for all of the different allocations within a project are typically summed, individually for each tier, to get the effective storage limits for the entire project (the group of members in the project) on that storage tier. I.e., typically the storage limits apply across all allocations, so you do not have to assign specific files to specific allocations. Storage allotments are for the duration of the allocation; they do not increase automatically with time. Note that if an allocation expires, the effective storage limits for the members of the project may be reduced, which could potentially lead to the disk usage exceeding the limit. In this case, members will be notified of the issue and given a week to resolve the situation, either by reducing the amount of disk storage used (deleting unneeded files, moving files off the cluster, etc.) or by increasing the storage limit (by renewing the expired allocation, obtaining an increase in the storage allotment on a remaining allocation, or obtaining (perhaps purchasing) a new allocation with additional storage).

Compute allotments are distinct across allocations. Each allocation with a compute allotment has its own Slurm allocation account, and when submitting a job you can specify which allocation account the job should charge against. Allocations awarded by the AAC typically award a fixed amount of compute resources for the duration of the allocation (e.g. one year). Allocations from college or departmental pools will typically have their SUs allotted quarterly; e.g. if an allocation is granted 800 kSU/year, this will be meted out at 200 kSU/quarter for each of four quarters. All jobs that are submitted are associated with an account; this can be specified with the -A flag when the job is submitted, or the submitter's default account will be used. See the section on choosing the account to use for more information on specifying the account or changing your default account.
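For illustration, a minimal job script specifying the account via an #SBATCH directive. The account name smith-prj-aac here is purely an example; substitute one of your own accounts (as listed by sbalance):

```shell
#!/bin/bash
# Hypothetical account name; replace with one of your own allocation accounts.
#SBATCH --account=smith-prj-aac    # same as -A smith-prj-aac
#SBATCH --time=02:00:00            # requested walltime
#SBATCH --ntasks=1

echo "Job $SLURM_JOB_ID charging against account $SLURM_JOB_ACCOUNT"
```

The account can also be chosen at submission time, e.g. `sbatch -A smith-prj-eng myjob.sh`; a command-line flag overrides both your default account and any --account directive in the script.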

Generally, the allocation account will be charged a number of SUs for the resources used while the job is running. This SU cost is based on the amount of time the job ran in hours (the walltime of the job) multiplied by an hourly rate. The hourly SU cost for a job is the maximum of:

NOTE: The SU costs above are based on the amount of resources allocated to a job, not what is actually used by the job, as the requested and allocated resources are not available to any other jobs while your job is running. So if you request 100 GB of RAM for a job that only used 40 GB of RAM, you are wasting both cluster resources and the SUs in your allocation. Similarly, requesting nodes in exclusive mode will cause you to be charged for all of the resources on the allocated nodes.

However, you are charged based on the actual time the job ran, not the requested walltime. So if you submit a job with a requested walltime of 1 day and it terminates after only 30 minutes, your job is only charged for 0.5 hours. You should therefore set the walltime to the longest time you expect the job to run, perhaps with a little padding. But you should not set the requested walltime excessively long, as that could penalize your job in scheduling (plus there can be situations wherein the job stops running usefully but does not terminate --- in those situations, you will be charged for the full walltime until the job terminates).

Jobs in the scavenger partition are an exception --- jobs in this low-priority partition are not charged, but are subject to preemption.

You are charged for cores allocated, not cores used. I.e., if you request 1 core on a node but also request that no other jobs be run on the node, you will be charged for ALL cores on the assigned node, since no one else can use them while your job is running. See the discussion of exclusive mode above for more information.

The scheduler keeps track of all jobs running against a given account, and keeps track of how many SUs are required to complete these jobs (using the walltime requirements requested when the job was submitted). Before a new job charging against that account is started, the scheduler makes sure that there are sufficient funds to complete it AND all currently running jobs charging against that account. If there are, the job can be started; otherwise, it is left in the pending state with a reason code AssociationJobLimit.

Research groups can get allocations in one of several ways:

Paid and Departmental Allocations

This section discusses some general concepts related to allocations purchased from DIT, including allocations awarded by departments and/or colleges from pools of HPC resources which they have purchased from DIT. Allocations purchased directly from DIT are governed by the MOU between the purchaser and DIT signed at the time of purchase. Allocations granted from departmental and/or college pools are subject to whatever policies the department and/or college wish to impose. The information below generally holds for such allocations, but the aforementioned MOUs and departmental/college policies take precedence.

These allocations have an expiration date, typically one year from the start date, but this is negotiable in the MOU, and shorter terms (down to even a single quarter) are available on request. For departmental/college allocations, the expiration date is still nominally set to one year, but the allocation will persist until the contact for the department/college tells us otherwise.

SUs are meted out quarterly; if your allocation is for 800 kSU per year, you will normally get 200 kSU per quarter for 4 quarters. This can be modified beforehand if needed; if you know you will need more compute time in the first two quarters you could set up the same 800 kSU per year as e.g. 250 kSU/quarter for the first two quarters and 150 kSU/quarter for the next two quarters. SUs which are not used in one quarter do not roll over to future quarters --- at the end of a quarter, all unused SUs simply disappear, and (if the allocation did not expire), the allocation will be replenished with the SUs for the next quarter.
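The no-rollover rule above can be sketched as follows (the leftover amount is an assumed example):

```shell
# Hypothetical quarterly replenishment: unused SUs do not roll over.
QUARTERLY_KSU=200   # nominal quarterly allotment (800 kSU/year)
UNUSED_KSU=45       # assumed leftover at the end of the quarter

# At the quarter boundary the leftover vanishes and the balance resets;
# the new balance is NOT $((QUARTERLY_KSU + UNUSED_KSU)).
NEW_BALANCE=$QUARTERLY_KSU
echo "new quarter balance: ${NEW_BALANCE} kSU (the ${UNUSED_KSU} kSU leftover is lost)"
```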

Storage (on both the scratch and SHELL tiers) is also allocated quarterly, but as files generally persist across quarters, this is usually not as noticeable. If the storage quota for your project decreases at some point (e.g. an allocation expires or is revoked, causing the loss of that allocation's contribution to your storage quotas), resulting in your project being over quota on one or more storage tiers, the PI of the allocation will receive email from HPC staff informing them of the issue and requesting that the storage be brought under quota in a timely fashion. This can be done either by reducing the storage footprint or by increasing the storage quota (e.g. getting more quota from the AAC or department/college pools, and/or purchasing from DIT). If this is not done in a reasonable time period, we may be forced to bill the PI for the excess storage used.

AAC-Granted Allocations

The HPC Allocations and Advisory Committee (AAC) can grant one-time unpaid allocations to faculty and students for small projects, classes, feasibility tests, etc. These allocations are granted out of computing resources purchased by funds from the Provost's Office and the Division of Information Technology.

Faculty members can apply for such allocations by submitting an application to the AAC. This can be submitted by the faculty member themselves, or by a post-doc or student on behalf of the faculty member (in the latter case, the faculty member will be required to consent to assuming "ownership" of any resulting allocation). The review of the application by the AAC becomes more rigorous as more resources are requested; faculty members are eligible for up to 50 kSU per year with very little information required. With proper justification and benchmarking, and approval by the AAC, faculty members can get up to 550 kSU per year (including the aforementioned 50 kSU) for free from the AAC.

Generally, the AAC will not grant more than 50 kSU/year to a faculty member unless the faculty member has run jobs on the cluster to (and has summarized such in the application):

If the above criteria have not been met, the AAC will generally limit a researcher to an initial grant of 50 kSU/year. That allocation can be used to start the research, and while doing so collect the aforementioned benchmarks which can be used in a renewal request for additional resources (which can be submitted anytime; you do not have to wait for the initial allocation to expire).

The 50 kSU/year allocation size is also useful if you are new to HPC or the cluster. While the form is the same for all allocation sizes, the review of the application for the first 50 kSU/year for a faculty member is relaxed.

In general, when applying, answer all fields to the best of your ability, and HPC staff will get back to you with questions if more information is needed.

Choosing the Account to Use

If you only have a single account (check with the sbalance command), you can skip this section. You only have the one account, so there is nothing to choose.

If you have multiple accounts due to your membership in multiple research groups and/or projects, you may wish to choose which account to use based on the job. I.e., if the job is doing something for group A, you should probably submit it using one of the group A accounts, even if you also have access to group B accounts. If the research areas of the two groups overlap, you will need to follow whatever group-specific policies may exist (contact your colleagues).

If you have access to multiple allocation accounts within the same research group/project, then there is a choice to be made. If your research group has group-specific policies about which allocation to use, follow those. Otherwise, you will normally see an allocation from the Allocations and Advisory Committee (AAC) plus one or more allocations from college and/or departmental resource pools, and maybe an allocation purchased from the Division of IT. Again, the sbalance command is an easy way to see this, e.g.

login.zaratan.umd.edu:~$ sbalance
Account: smith-prj-aac (DEFAULT)
Limit: 	   250.00 kSU
Unused:    126.50 kSU
Used:  	   123.50 (49.4% of limit)

Account: smith-prj-eng
Limit: 	   200.00 kSU
Unused:    110.00 kSU 
Used:  	   90.00 kSU (45.0 % of limit)

Account: smith-prj-ipst
Limit: 	   175.00 kSU
Unused:    160.00 kSU 
Used:  	   15.00 kSU (8.7 % of limit)

Account: smith-prj-paid
Limit: 	   275.00 kSU
Unused:    180.40 kSU 
Used:  	   94.60 kSU (34.4 % of limit)


In the above example, the user is a member of 4 allocations in the smith-prj research group/project: the first awarded by the AAC, the second from the School of Engineering, the third from the IPST department, and the last purchased from DIT.

In such a case, we generally encourage users to treat the AAC allocation as a last resort. Paid and college/departmental allocations are typically awarded quarterly, meaning that at the start of each new quarter (1 Jan, 1 Apr, 1 Jul, 1 Oct) any unused SUs in the allocation disappear, and the allocation is replenished at its nominal quarterly level. Since the SUs in these allocations typically have the shortest lifespan, you generally want to use those SUs first.

Allocations granted by the AAC, on the other hand, represent a one-time grant of resources, and although these SUs will also expire and vanish at the end of the term of the AAC allocation, this is typically on a timescale of about one year. Also, AAC allocations do not automatically renew --- you (or your advisor) must apply to the AAC for any renewals.

Thus, we normally recommend that you set your default allocation to a paid and/or departmental allocation, and normally charge jobs against those allocations. If and when you encounter a situation wherein your workload for a given quarter is exceeding the quarterly allotment from your paid and/or college/departmental allocations, then you can dip into the AAC allocations to make up the difference. Although exceptions can arise, we find that this type of arrangement is likely to maximize your benefit from the allocations.

Another consideration in this decision is the number of SUs left in the allocation and the number of running and pending jobs charging against it. The scheduler will not start a job unless it determines that there are sufficient SUs in the allocation to complete the job in question along with all currently running jobs charging against that allocation. To estimate the number of SUs needed to complete a job, the scheduler uses the maximum walltime requested for the job.

For example, if you have 5 kSU unused in your allocA allocation, and the job you wish to submit has a walltime of 50 hours and an SU billing rate of 96 SU/hour, the scheduler will assume the job needs 4.8 kSU to complete. If there are no other running jobs charging against the allocA allocation (this includes jobs by other people in your group), the scheduler will consider the job able to start if sufficient compute resources are available. However, if there are 5 jobs already running and charging against the allocA allocation which have an SU billing rate of 50 SU/hour and are halfway through their requested 6 hours of walltime, the scheduler will estimate that each job will run for 3 more hours and so consume 150 SU each, or 0.75 kSU for all 5 such jobs. In that case, the new job will not start because 0.75 kSU for the running jobs plus 4.8 kSU for the new job will exceed the 5 kSU unused in the allocation. This calculation will change over time; e.g. if 4 of those jobs finish right after this calculation (so effectively do not consume any of the 5 kSU unused), the next time the scheduler looks at the job it will see 0.15 kSU needed to finish the remaining job, and 4.8 kSU to finish the new job, or 4.95 kSU total, and the job will be able to start.
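The arithmetic of the worked example above can be re-computed in a few lines (all numbers taken from the text; this is a sketch of the scheduler's check, not its actual implementation):

```shell
# Numbers from the example above.
UNUSED=5000                  # 5 kSU unused in allocA, expressed in SU
NEW_JOB=$((50 * 96))         # new job: 50 h walltime at 96 SU/hour = 4800 SU
RUNNING=$((5 * 3 * 50))      # 5 running jobs, ~3 h left each, 50 SU/hour = 750 SU

# The scheduler only starts the new job if the allocation can cover it
# AND everything already running against the same account:
if [ $((NEW_JOB + RUNNING)) -le "$UNUSED" ]; then
  echo "job may start"
else
  echo "job pends with reason AssociationJobLimit"
fi
```

Here 4800 + 750 = 5550 SU exceeds the 5000 SU available, so the new job pends; once most of the running jobs finish, the committed total drops below the balance and the job becomes eligible to start.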

Unfortunately, the scheduler cannot handle instructions like "charge this job to allocA, unless there are not enough SUs, in which case charge it against allocB". So once an allocation is nearing depletion, you will need to monitor usage more closely and decide which allocation to charge each job against. But this is generally only an issue when the allocation is close to being depleted.

Note that the queuing system will NOT automatically select another account if there are insufficient funds in the account specified for the job. E.g., if you have access to both allocA and allocB and you specify that a job should charge against allocA (either explicitly or via the default account), the scheduler will not change that to allocB if allocA is depleted. The job will just wait in the queue until such time that additional SUs are available in allocA.

Note also that others in your group may have access to the same account, so just because funds were there when you submitted a job, someone else's jobs may have started since then and may reduce the funds in the account.

See here for more information about specifying the account to be charged when submitting a job.

It is recommended that you generally use up your high-priority funds first, instead of using normal-priority funds. If you do not use them, they go away (or effectively get converted to normal priority) at the end of the month.

Managing Allocations

This section discusses various topics related to the management of allocations. It is broken into two subsections depending on the level of management:

  1. For PIs managing their own research projects
  2. For managers of College or Departmental pools of HPC Resources

Allocation Management for PIs

Still under construction.

Allocation Management for College/Department Pool Managers

Certain units may have pools of resources on the cluster that they can allocate to researchers within their units. These pools are typically granted in return for contributions of hardware to the cluster. Whereas on the Deepthought2 cluster these pools were often arranged as one large allocation account, that proved to have several problems. Such an arrangement made it difficult for different members of the same research group to share files without sharing them with the entire department or college. It also made it difficult for system administrators to contact faculty members regarding student accounts, forcing the departmental/college contact to act as the "middleman" in all such communications. It also meant that the departmental contacts had to handle all requests from users in the department wishing to get access to the cluster, and conversely to handle the removal of all users who should no longer have access (and unfortunately, the latter was typically ignored). This is all made more complicated because the departmental contact usually needs to contact the actual PI/faculty advisor to determine eligibility.

On Zaratan, the addition of storage tiers as allocated resources makes that even more problematic, especially when usage exceeds the quota. We are hoping to convert such large departmental/college allocations into pools of resources which can be suballocated to researchers in the unit. Individual PIs in the unit can be granted allocations of resources from the pool. The PI can then manage which users have access to the allocation, which removes some of the burden from the departmental manager and places it with the research group, which has more knowledge of the situation.

From the PI's perspective, they will typically already have a project containing allocations for the research group. This will typically include an allocation from the campus Allocations and Advisory Committee, and if you grant them an allocation from your pool, that will appear as another allocation under the project. There could be additional allocations as well, if the PI has allocations from another department or unit, or if they have purchased resources. The PI could also have multiple projects --- this could be because they have multiple research groups, or more typically because they have a project for a class they are teaching. If they are also a pool manager, the pool will also appear as a separate project. Generally, all users belonging to allocations within the same project belong to the same group, and scratch and SHELL storage are organized by project.

The compute resources for each allocation in the project will appear as distinct Slurm allocation accounts , and when a job is submitted it will need to specify which account to charge against (or charge against the default allocation account). Storage resources are handled differently --- because it is difficult to classify a file as belonging to one allocation or another, and even messier to have to assign it in such a fashion, we sum up the storage allotments (separately for each storage tier) for all allocations in a project, and use that to set the storage limit for the project's storage directory on that tier.

Pool managers are responsible for allocating the resources in the pool to the individual researchers. Unfortunately, system administrators are not at this time (or in the foreseeable future) able to delegate the actual ability to create/modify allocations to the pool managers, so pool managers will need to send email to hpcc-help@umd.edu to request changes. These requests are handled by people, so you can send multiple requests in a single email.

You are allowed to oversubscribe the compute and/or storage resources in your pool; that means it is permissible for the sum of the compute and/or storage limits (separately for compute and for each tier of storage) allotted to the suballocations from this pool to exceed the size of those resources in the pool. This was not initially allowed on Deepthought2, which is one reason many units adopted the large departmental/etc. pools --- some units had a fair number of modest HPC users with a modest average quarterly usage of compute resources who might need double their average usage in occasional quarters. Without oversubscription, one would need to set each suballocation size to double the average usage, which means on average the allocation would only be half utilized, and (since there is no oversubscription) the other half of the allocation could not be used by anyone else. With oversubscription, the unused half of that suballocation can be doubly (or more) allocated, increasing the effective utilization.

While this is certainly advantageous, it needs to be used carefully, because, to be fair to other users on the cluster, the total usage from suballocations from your pool is restricted to the available compute resources in the pool (on a quarterly basis). E.g., if you have a pool of 100 kSU/quarter and two suballocations A and B to which you assign 75 kSU/quarter each, then if A uses 75 kSU in a given quarter, B is limited to 25 kSU. This will work if both A and B only use 50 kSU/quarter on average, and when one uses more than average, the other uses correspondingly less. But clearly there will be complaints if there is a quarter in which users of both the A and B suballocations want/need to use more than 50 kSU.
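The pool arithmetic above, re-computed with the numbers from the text:

```shell
# Numbers from the example above (kSU per quarter).
POOL=100       # total pool
SUB_A=75       # assigned to suballocation A
SUB_B=75       # assigned to suballocation B
A_USED=75      # A uses its full assignment this quarter

OVERSUBSCRIBED=$((SUB_A + SUB_B - POOL))   # amount by which the pool is oversubscribed
B_EFFECTIVE=$((POOL - A_USED))             # what B can actually use this quarter
echo "oversubscribed by ${OVERSUBSCRIBED} kSU; B effectively limited to ${B_EFFECTIVE} kSU"
```

Although B's nominal limit is 75 kSU, its effective limit this quarter is only 25 kSU, which is exactly the kind of surprise that careful pool management is meant to avoid.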

You can also oversubscribe storage. This is more problematic: while SUs are somewhat ephemeral (every quarter the quarterly SU usage is reset), files tend to be more permanent --- once created, they remain until someone deletes them. Furthermore, there are various mechanisms by which individual projects can end up going over their filesystem limits. For instance, the project limit is contributed from various allocations, which can expire. E.g., consider a project with a 3 TB limit, with 1 TB coming from an AAC allocation, one from your pool, and one from a project-level purchase agreement with DIT, and assume that the project is using 2.9 TB of storage. If one of these allocations expires, suddenly the project is consuming 2.9 TB but only has a 2 TB limit. Assuming the allocation from your pool has not expired, your pool will now be considered to be consuming 1.45 TB from that suballocation (despite your only authorizing 1 TB to that suballocation).

It is even more complicated than that, as the enforcement of storage limits has some technical limitations. The Zaratan scratch storage has some delays in quota enforcement, so users can in some cases continue to write data over the limit for a fraction of an hour after exceeding it. And the SHELL storage for each project is divided into multiple volumes (at least one per user, and perhaps more), each of which has a size limit but does not have quotas as such; the sum of these maximum volume sizes can exceed the SHELL storage limit for the project. So it is very possible for individual projects to exceed their storage limits, at which point they will be notified and instructed to rectify the situation, but such situations can also impact the usage of your pool.

It is recommended that pool managers use care if/when oversubscribing, especially for storage. Ideally, you should avoid oversubscribing initially if possible, and wait until you have several quarters worth of data showing actual utilization of the resources in the pool. If there is a consistent history of under-utilization, then it might be reasonable to allow for some oversubscription if needed, but even then you probably should be a bit conservative and allow some room for fluctuations in the average usage. For storage, you should also remember that storage usage, especially on the SHELL tier, is likely to monotonically increase.

There are several things that pool managers can do in ColdFront.

Monitoring Allocations

You and your research group are responsible for ensuring proper rationing of the funds in your account(s). Excessive use of funds for a "co-op" type of project in the first month of a quarter could result in no funds at all for the next two months in either the high-priority or standard-priority allocation.

This can be deliberate and beneficial, e.g. if you have important deadlines at the end of the first month of the quarter and are willing to "borrow ahead" to complete computations for that before the deadlines. This is an advantage of the model used by the Deepthought HPC clusters; you can use nearly 3 times the power of the computers you purchased in a single month to rush out computations, at the cost of having very limited usage in the following two months (but since that is after the deadlines, it might not be important).

But if this occurs because some junior member of the group is sending an excessive number of very expensive jobs, this can be quite problematic, especially as you might not notice the impact of the errant user until too late.

The Division of Information Technology cannot tell which jobs are important and which are not, nor what is good usage of your allocation funds and what is not. If we notice seriously problematic usage (e.g. a job reserving 10 nodes but only running processes on 1 node), we will do our best to notify and instruct the relevant users. But you are responsible for monitoring your own jobs, and it behooves you to monitor the jobs of other users of your allocations. We provide the necessary tools to do so, and we strongly advise all research groups to have at least one person regularly monitor the usage of their allocations' funds to ensure there are no problems, or at least to catch any problems early.

How many SUs are left in my allocation?

The first level of monitoring of your allocations is with the sbalance command. E.g.

Account: test-hi (dt)
Limit:     163.52 kSU
Available: 163.47 kSU 
Used:      0.05 kSU (0.0 % of limit)

Account: test (dt)
Limit:     327.04 kSU
Available: 325.33 kSU 
Used:      1.71 kSU (0.5 % of limit)

Without any arguments, it will list usage metrics for all accounts to which you have access. The above listing is from early in the quarter for a co-op type project; note that both accounts are nearly full, and that the test account has nearly double the limit of the test-hi account. The line starting with "Used" not only gives the number of kSU used, but also the usage as a percentage of the limit. If this percentage is significantly higher than the percentage of the month that has elapsed (for your high-priority account), or of the quarter (for normal-priority accounts), you might need to be concerned. I.e., if at one week into the month you see that the usage on your high-priority account is over 30% of the limit, your group is burning SUs faster than they will be renewed, and you might have some time at the end of the month with nothing left in your high-priority account.
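The burn-rate comparison above can be sketched as a simple check (the percentages are the hypothetical numbers from the text, assuming a 30-day month):

```shell
# Hypothetical burn-rate check: one week into a 30-day month.
USED_PCT=30                      # percent of the limit used (from sbalance)
ELAPSED_PCT=$((7 * 100 / 30))    # percent of the month elapsed (integer math: 23)

if [ "$USED_PCT" -gt "$ELAPSED_PCT" ]; then
  echo "usage (${USED_PCT}%) is ahead of the month (${ELAPSED_PCT}%): burning SUs faster than they replenish"
fi
```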

For AAC grant type accounts, there is no monthly or quarterly replenishment. The "Limit" should reflect the amount of compute time the AAC granted you, and the percentage is how much of that you have used. If the percentage used is significantly greater than the percent of your work which is complete, you should consider working on an update to your proposal to request more time.

If you are tasked with monitoring the usage of the accounts by your colleagues in the project (or have taken said task upon yourself), you can use the -all flag to sbalance to see who is using the funds in the account. You might also wish to use the -account flag to limit the output to a single account, e.g.:

login-1: sbalance -account test-hi -all
Account: test-hi (dt)
Limit:     163.52 kSU
Available: 102.07 kSU 
Used:      61.45 kSU (37.6 % of limit)
        User jtl used 17.6044 kSU (28.6 % of total usage)
        User kevin used 13.3456 kSU (21.7 % of total usage)
        User payerle used 30.5000 kSU (49.6 % of total usage)

This lists the same information as before, with the addition of showing every user who has used the account in the time period, showing not only the number of kSU they consumed, but what percentage of the total usage for the account. E.g., in the example above, you can see that user payerle is using almost as much as users kevin and jtl combined. You can add the flag --nosuppress0 if you want to also see lines for everyone with access to the allocation but who did not consume any time since the last refresh.

The --help option to sbalance will display usage options, most of which were discussed above.

The time period for the usage statistics depends on the type of account and project. For co-op (replenishing) projects, it is from the start of the month. For AAC grant accounts: from the start of the project/grant.

General information about an allocation

General information about allocations you belong to can be obtained with the my_projects command. This command can only be run from the login nodes (i.e. it will not work on the compute nodes), and provides basic information regarding allocations you belong to.

Usage is basically my_projects to display information for all allocations that you are a member of, or my_projects ALLOCATION_NAME to display information for a specific allocation (you can give multiple ALLOCATION_NAMEs to list information for multiple allocations). You may also wish to include one or two --verbose (or -v for short) flags to include more information. You can also give the --help flag for a full description of all the flags the command accepts.

Without any verbose flags, it will display the name of the allocation project, the name of the parent project (if any), and the department and college associated with the project.

With one verbose flag, it will also display the "points-of-contact" for the project and the members of the project. The points-of-contact are the people who are authorized to add/remove members from the allocation. It will also display the base kSU level and indicate whether the project auto-replenishes each quarter.

The information with two verbose flags is probably not very useful; it is basically a description of the project (which is usually not informative) and the over/under-usage alert thresholds which determine if/when the points-of-contact are emailed regarding excessive (or other anomalous) usage of their allocation (if no value is listed, a global default is used). The over/under-usage thresholds are explained a bit more in the section on checking for excessive usage.

NOTE: the names listed are allocation project names. Some projects have both a standard and a high-priority allocation account; however, they are still one project, and only one listing will be shown by the my_projects command. The base kSU level is the total of the standard and high-priority kSUs at the start of the quarter.

Seeing job history

The sacct command can be used to view the accounting records of jobs, both past and currently running. It takes some time to run, and can display a fair amount of information (documented in its man page). You will almost always wish to restrict it to a time range; for example, to see the usage of account foo for the month of November 2014, one could use

login-1> sacct --format=JobID,User,Account,ReqCPUs,AllocCPUS,Elapsed,CPUTime \
	-a  -X  -S  2014-11-01 -E 2014-11-30 -A foo

       JobID      User    Account  ReqCPUS  AllocCPUS    Elapsed    CPUTime 
------------ --------- ---------- -------- ---------- ---------- ---------- 

2717747       payerle  foo             16         20 1-00:00:09 20-00:03:00 
2717748       payerle  foo             16         20 1-00:00:09 20-00:03:00 
2717749       payerle  foo             16         20 1-00:00:09 20-00:03:00 
2717750       payerle  foo             16         20 1-00:00:08 20-00:02:40 
2717751       payerle  foo             16         20 1-00:00:08 20-00:02:40 
2717752       payerle  foo             16         20 1-00:00:08 20-00:02:40 
2717753       payerle  foo             16         20 1-00:00:17 20-00:05:40 
2717754       payerle  foo             16         20 1-00:00:17 20-00:05:40 
2717755       payerle  foo             16         20 1-00:00:17 20-00:05:40 
2717756       payerle  foo             16         20 1-00:00:12 20-00:04:00 
2718384       payerle  foo             10          0   00:00:00   00:00:00 
2718385       payerle  foo             10          0   00:00:00   00:00:00 
2718386       payerle  foo             10          0   00:00:00   00:00:00 
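
As a sanity check on reading this output, the CPUTime column is simply Elapsed multiplied by AllocCPUS. A minimal Python sketch reproducing the arithmetic of the first row above (the parsing helpers here are our own illustration, not part of Slurm):

```python
from datetime import timedelta

def parse_elapsed(s):
    """Parse Slurm's [D-]HH:MM:SS elapsed format into a timedelta."""
    days = 0
    if "-" in s:
        d, s = s.split("-")
        days = int(d)
    h, m, sec = (int(x) for x in s.split(":"))
    return timedelta(days=days, hours=h, minutes=m, seconds=sec)

def format_elapsed(td):
    """Format a timedelta back into Slurm's D-HH:MM:SS style."""
    total = int(td.total_seconds())
    days, rem = divmod(total, 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{days}-{h:02d}:{m:02d}:{s:02d}"

# First row of the sacct output above: 20 allocated CPUs for 1-00:00:09
cputime = parse_elapsed("1-00:00:09") * 20
print(format_elapsed(cputime))  # 20-00:03:00
```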


Monitoring for excessive usage

An important aspect of managing the usage of an allocation is ensuring that SUs are being consumed at a reasonable rate. The system intentionally allows flexibility in the rate at which SUs are consumed; e.g. if you have a major conference in the middle of a quarter, you might wish to (and can) use up most or all of your allocated funds for a quarterly replenishing allocation in the first month of the quarter, leaving (almost) nothing for the remaining two months. If that is your intent and desire (and the rest of the users of the allocation agree with you), all is well. However, if a few profligate users consume most of the quarterly allocation in the first month without the consent of the rest of the users of the allocation, there is a major problem.

From the system's point of view, the two examples above look the same --- the SUs were consumed at an excessive rate in the first month of the quarter. We cannot tell whether that was done for a good reason or by mistake by inexperienced users --- that is a judgement call which the points-of-contact (PoCs) of the allocation will need to make. What we can do is try to alert the PoCs when something like that appears to be happening, hopefully early enough that, if it is happening improperly, behaviors can be adjusted before it leads to serious problems.

NOTE: the following only applies to quarterly auto-replenishing allocations. Non-replenishing allocations (e.g. allocations granted by the AAC on the Deepthought HPC clusters and Engineering Startup Allocations (i.e. allocations whose names start with "esu-")) are not currently supported by the tools described below. Since they do not auto-replenish, you can use the sbalance command described previously to see how much of the total allocation has been consumed, and compare that to your estimates of the amount of work needed to complete the project.

The command check_project_usage compares the fraction of the allocation's quarterly allotment that has been consumed in the current quarter to the fraction of the quarter that has gone by. If the fraction of SUs used exceeds the point in the quarter by more than a certain threshold, it flags that allocation as having unsustainable usage. (It similarly checks for significant underusage, but the default threshold for that is set so as to never flag underusage.) The global default overusage threshold is 15 percentage points; PoCs can request different default thresholds for a specific allocation (just send email to hpcc-help@umd.edu requesting such; this will change the defaults used in the automated mail as well), and anyone can specify thresholds on the command line as well. E.g., if we are one third of the way into the quarter (i.e. one month in) and 50% of the allocation has been used, an alert will be raised using the global default threshold (as 33% + 15% = 48% < 50%). If a threshold of 20% were used, no alert would be raised (as 33% + 20% = 53% > 50%).
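
The arithmetic behind this check is simple enough to sketch. The following is our own illustration of the logic described above, not the actual check_project_usage source:

```python
def over_usage_alert(pct_used, pct_into_period, threshold=15.0):
    """Flag unsustainable usage: alert when the percentage of SUs consumed
    exceeds the percentage of the period elapsed by more than the threshold
    (all values in percentage points; 15 is the global default)."""
    return pct_used > pct_into_period + threshold

# One month (one third) into the quarter, 50% of the allocation used:
print(over_usage_alert(50.0, 33.3))                  # True:  33.3 + 15 = 48.3 < 50
print(over_usage_alert(50.0, 33.3, threshold=20.0))  # False: 33.3 + 20 = 53.3 > 50
```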

By default, the check_project_usage command will check all allocations of which you are a member for excessive usage. If one or more allocations appear to be consumed at an unsustainable rate, it will print usage information and warnings for those allocations. If no excessive usage is detected, it normally prints nothing. (NOTE: if you are a member of non-replenishing allocations as well, you will get a brief warning stating that the code is skipping the non-replenishing allocation.)

You can provide the --help or -h flags to get full usage information. You can specify allocation project names on the command line to only check the named allocations (NOTE: these are allocation project names, so should not include the -hi suffix; because the standard and high-priority balances are linked, it checks both simultaneously.) You can also give the --verbose or -v flag, which will cause usage information to be displayed even if no over/underusage condition was flagged.

login-1> check_project_usage
Project: testproj1
Time: 2016 Oct 14
Overquota Threshold: 15.0%
Underquota Threshold: 100.0%
------          TimePeriod (percent into) Allocation   Available    PctUsed
HiPriority      month      (  43.7% into) 67.500 kSU   15.858 kSU   76.5%  
Total           quarter    (  14.7% into) 202.500 kSU  125.970 kSU  37.8%  

*** Excessive rate of consumption for HiPriority!
*** Excessive rate of consumption for Total!
login-1> check_project_usage -v 
Project: testproj1
Time: 2016 Oct 14
Overquota Threshold: 15.0%
Underquota Threshold: 100.0%
------          TimePeriod (percent into) Allocation   Available    PctUsed
HiPriority      month      (  43.7% into) 67.500 kSU   15.858 kSU   76.5%  
Total           quarter    (  14.7% into) 202.500 kSU  125.970 kSU  37.8%  

*** Excessive rate of consumption for HiPriority!
*** Excessive rate of consumption for Total!
Project: testproj2
Time: 2016 Oct 14
Overquota Threshold: 15.0%
Underquota Threshold: 100.0%
------          TimePeriod (percent into) Allocation   Available    PctUsed
HiPriority      month      (  43.7% into) 60.181 kSU   53.543 kSU   11.0%  
Total           quarter    (  14.7% into) 180.544 kSU  162.452 kSU  10.0%  
login-1> check_project_usage testproj2

The first time we execute check_project_usage above, it displays the usage for testproj1 with warnings of excessive usage for both the high-priority allocation account (as 76% > 43% + 15%) and the total allocation (as 37% > 14% + 15% ). The second run has the verbose flag, and so in addition to showing the excessive usage for testproj1, it also displays the usage for testproj2 even though it is not problematic (PctUsed is less than the "percent into" the month/quarter, respectively). The final invocation does not have the verbose flag, but specifies to only check testproj2; this produces no output as there is no excessive usage condition.

If you wish to include this command in your dot files to alert you to overusage issues whenever you log in, be sure to run it only for interactive sessions --- not only will it needlessly slow down non-interactive shells, but if it produces output it can mess up file transfers with scp, etc. E.g., for csh or tcsh users, something like:

if ( $?prompt ) then
	check_project_usage
	... other interactive only commands if desired ...
endif

If your default shell is sh or bash, something like:

if [ ! "x$PS1" = "x" ]; then
	check_project_usage
	... other interactive only commands if desired ...
fi

The Division of Information Technology actually runs a similar script every few hours on every auto-replenishing allocation, and will send email to the points-of-contact for the allocation if it is flagged as being consumed at an unsustainable rate. To avoid "spamming" the points-of-contact, we will not send email to a given user more than once every three days. In this automated case, the project-specific overusage threshold is used (or the global default if no project-specific threshold was set). A point-of-contact can request a change to the threshold for any of their allocations by sending an email request to hpcc-help@umd.edu. They can similarly request a change in the minimum number of days between emails sent to them. NOTE: the thresholds are per-project/allocation, and affect alerts to all points-of-contact for that allocation. The minimum number of days between emails is per person, and affects alerts for all allocation projects that person is a point-of-contact for. Also note that the limiting of the frequency of emails is applied separately to each project you are a point-of-contact for, so if you receive an alert about allocation A today, you may still receive an alert about allocation B tomorrow, but should not receive another alert about allocation A for several days.
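
The per-person, per-project rate limiting described above can be sketched as follows. This is an illustration of the policy, not DIT's actual implementation:

```python
from datetime import date, timedelta

def should_email(last_alerted, today, min_days=3):
    """Send another alert to this point-of-contact for this allocation only
    if at least min_days have passed since the last alert for it (3 is the
    default minimum described above)."""
    if last_alerted is None:
        return True  # never alerted before
    return (today - last_alerted) >= timedelta(days=min_days)

print(should_email(date(2016, 10, 12), date(2016, 10, 14)))  # False: only 2 days
print(should_email(date(2016, 10, 11), date(2016, 10, 14)))  # True: 3 days elapsed
```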

Usage reports

For non-replenishing allocations, the sbalance command returns information about the usage of the allocation over the allocation's lifetime. For replenishing allocations, however, most of the tools mentioned above only return data about usage for the current quarter. While this is probably what most users are concerned with most of the time (e.g., to figure out whether there are enough kSUs to run a job now, usage from previous quarters is irrelevant), sometimes one needs information about usage over longer time scales. This is especially useful for people who manage "super-allocations".

There are a couple of tools available to get more historic information regarding allocation use:

The Deepthought XDMoD website is a web page running the Open XDMoD (Open XD Metrics on Demand) web application. It can present in graphical form many metrics pertaining to the Deepthought clusters. One can see, for example, how many kSUs were consumed by a given allocation as a function of time, or what the average job length was for an allocation over the past year. Some of the more advanced filtering and reporting features require you to register for a "login account" on the XDMoD website (unfortunately, there is no easy way to tie this into our existing authentication system); you can do so from the website.

The slurm-usage-report command runs from the login nodes of either Deepthought cluster. This command examines all the job records related to the allocation account(s) specified and provides summaries (as opposed to the sacct command, which lists details for each job but does not summarize). Because it has to go through all the job records, it tends to be a bit slow.

We only discuss some of the more commonly used options below; the command supports a --help or -h option which provides more information on its usage (including some options to provide even more usage information). The commonly used options are:

The slurm_jobstats_for_alloc command also prints information about the usage of allocations on the Deepthought clusters, but is generally geared toward helping managers of "super-allocations" determine which sub-allocations have and have not been using the cluster. By default, it will print the following information for each allocation account (for the specified time period):

We only discuss some of the more commonly used options below; the command supports a --help or -h option which provides more information on its usage (including some options to provide even more usage information). The commonly used options are:
