- 2016-08-03: Sandhills: Check job status
Category: System Failure
System issues today affected jobs on Sandhills. If you had pending jobs on Sandhills since the upgrade, please check their state, as it may be necessary to resubmit these jobs.
A software update was made to the Sandhills configuration management service this morning. It later became apparent that the service was giving incorrect data to the rest of the cluster, causing communication problems for the Slurm resource manager. This caused startup failures for jobs that began during this window. It also made running jobs appear to the controller to have exited when in fact they had not. To clear up the inconsistent job state between the controller and the workers, all jobs still running on the workers that the Slurm controller no longer knew about were killed. This was unfortunately a necessary step to return the cluster to normal operation.
- 2016-08-02: Sandhills: Maintenance complete
Category: General Announcement
Maintenance is complete on Sandhills. Please let us know of any troubles using the cluster by sending email to hcc-support@unl.edu
- 2016-08-01: /work file purge date delayed to Sept. 15, 2016
Category: General Announcement
An earlier email warned that a purge of all files older than 6 months from /work would begin today (August 1, 2016). The HCC community has responded and removed enough data that all /work systems are currently at 80% of capacity or less. We are optimistic this will buy us a few more weeks of normal usage until September 15, 2016. Thank you to all of you who have cleared space in the last few weeks!
We were specifically asked by several HCC groups to wait until a few weeks into the semester to make certain all are aware of this new policy (details are repeated below via the link). We are happy to be able to do so, barring an overrun before Sept. 15. Scripts will continue to run so users will be warned if they have files that have not been accessed for over 6 months which are subject to removal, but no files will be moved or removed until Sept. 15.
Best regards,
David Swanson
This notice concerns a policy that affects all HCC machines and potentially all HCC users.
SUMMARY:
HCC is implementing a new automated file purge policy on the /work filesystem for all HCC machines. Starting August 1, 2016 (note: now Sept. 15, 2016) we will remove any files on /work which have not been accessed for at least 6 months. This will not affect the /home filesystems or the Attic storage system.
EXPLANATION:
The /work filesystem exists on each HCC machine for working files. It is not designed, or intended, for long term storage. The /work filesystem periodically is filled near capacity and this requires files to be deleted to keep the system as a whole available for ongoing use. To date, we have used a somewhat manual process of warning the user community and relying upon voluntary file removal. This is no longer sufficient due to the number of users and the number of accumulated files (e.g. Tusker is currently precariously close to going off-line due to /work being filled). The prior method will be augmented going forward with the automatic removal of all files that have not been accessed for over 6 months. Artificial activity to circumvent this policy will be considered misuse of the system. Longer term file storage is offered by HCC on Attic for an annual fee. This year, that fee has dropped from $100/TB/year to $60/TB/year.
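As a rough illustration of the access-time criterion described above, the following shell sketch lists files under a directory whose last access is older than a cutoff. It is not the script HCC runs: the function name list_stale_files, the 180-day approximation of "6 months", and the use of find -atime are all assumptions made for this example.

```shell
# Hypothetical helper (not an HCC tool): list regular files under DIR whose
# last access time is older than DAYS days. The default of 180 days is an
# assumed stand-in for the "6 months" named in the policy.
list_stale_files() {
    if [ -z "$1" ] || [ ! -d "$1" ]; then
        echo "usage: list_stale_files DIR [DAYS]" >&2
        return 1
    fi
    dir="$1"
    days="${2:-180}"
    # find measures age in whole 24-hour periods, so "-atime +180" matches
    # files last accessed at least 181 days ago.
    find "$dir" -type f -atime +"$days" -print
}
```

For example, `list_stale_files "$HOME/scratch" 180` would print candidate paths one per line (the directory name here is only a placeholder). The authoritative list of files subject to purge is whatever HCC's own tooling reports.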
TRANSITION:
This policy will be implemented first on Tusker since /work there is almost out of space. Crane and Sandhills will follow soon thereafter. For the near future, while possible, the file purging will be done reversibly. Files will be moved from the user's /work directory, but will be held temporarily in a weekly purge directory.
Users may see if they have any files scheduled to be deleted by logging in and using the commands
hcc-purge
to see a summary and
hcc-purge -l (l as in list)
to list the files scheduled to be purged.
hcc-purge
Prints the disk usage and file count, for the calling user and the associated groups, of files that match the HCC purge policy.
hcc-purge -l
Uses the less pager utility to list the file paths for candidate purge files for the user.
The list can also be accessed at the following path:
/lustre/purge/current/${USER}.list
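The list file at the path above can also be inspected directly. The sketch below counts its entries, assuming the file holds one file path per line; that format is not stated in this notice, and purge_list_count is a hypothetical helper name, not an HCC command.

```shell
# Hypothetical helper: count the entries in a purge list file.
# Assumes one file path per line (an assumption; the format is not documented here).
purge_list_count() {
    list_file="${1:-/lustre/purge/current/${USER}.list}"
    if [ ! -r "$list_file" ]; then
        echo "no readable purge list at $list_file" >&2
        return 1
    fi
    # Count non-blank lines; prints 0 for an empty file.
    awk 'NF { n++ } END { print n + 0 }' "$list_file"
}
```

Called with no argument, the function reads the default path shown above; a different file can be passed as the first argument.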
This is not an academic exercise. /work on Tusker is over 90% full; attempts to clear the filesystem with the former method have been unsuccessful.
SUPPORT:
With any change, some challenges will be encountered. Please contact hcc-support@unl.edu if you have concerns or need assistance moving your data to long term storage.
For details on how to check for files scheduled to be purged please see
https://hcc-docs.unl.edu/display/HCCDOC/Handling+Data .
For details concerning Attic storage please see
http://hcc.unl.edu/attic .
For details concerning transferring files please see
https://hcc-docs.unl.edu/display/HCCDOC/High-Speed+Data+Transfers .
Our intent is to remove the least valuable files from /work filesystems while enabling HCC systems to continue to be used by as many NU researchers as possible.
- 2016-07-25: HCC Sandhills Downtime Planned - August 1, 2016
Category: Maintenance
This maintenance outage for software updates affects Sandhills only. Jobs which cannot complete before August 1st at 6:00am will be held in queue until the maintenance is complete. A follow-up announcement will be posted when the system is ready for production use.
To minimize the impact to running jobs we are declaring a downtime for Sandhills to complete this work. We will use this maintenance window to update various software components across the cluster. The Sandhills login node will also be updated and will require users to log off. Users will be denied access to the Sandhills login node until the maintenance is completed.
- 2016-02-17: Sandhills: Maintenance complete
Category: General Announcement
Maintenance is complete on Sandhills. Please let us know of any troubles using the cluster by sending email to hcc-support@unl.edu