Messages & Announcements

  • 2016-08-04:  HCC Tusker Downtime Planned - August 15, 2016
    Category:  Maintenance

    This maintenance outage for software updates affects Tusker only. Jobs that cannot complete before 8:00am on August 15th will be held in the queue until the maintenance is complete. A follow-up announcement will be posted when the system is ready for production use.


    To minimize the impact on running jobs, we are declaring a downtime for Tusker to complete this work. We will use this maintenance window to update various software components across the cluster. The Tusker login node will also be updated, so users will be required to log off. Users will be denied access to the Tusker login node until the maintenance is complete.

  • 2016-08-04:  HCC Crane Downtime Planned - August 10, 2016
    Category:  Maintenance

    This maintenance outage for software updates affects Crane only. Jobs that cannot complete before 8:00am on August 10th will be held in the queue until the maintenance is complete. A follow-up announcement will be posted when the system is ready for production use.


    To minimize the impact on running jobs, we are declaring a downtime for Crane to complete this work. We will use this maintenance window to update various software components across the cluster. The Crane login node will also be updated, so users will be required to log off. Users will be denied access to the Crane login node until the maintenance is complete.

  • 2016-08-03:  Sandhills: Check job status
    Category:  System Failure

    System issues today affected jobs on Sandhills. If you have had jobs pending on Sandhills since the upgrade, please check their state, as they may need to be resubmitted.


    A software update was made to the Sandhills configuration management service this morning. It later became apparent that the service was giving incorrect data to the rest of the cluster, causing communication problems for the Slurm resource manager. As a result, jobs that started during this window failed at startup. It also made running jobs appear to the controller to have exited when in fact they had not. To clear up the inconsistent state of jobs between the controller and the workers, all jobs still running on the workers but no longer known to the Slurm controller were killed. This was unfortunately a necessary step to return the cluster to normal operation.

  • 2016-08-02:  Sandhills: Maintenance complete
    Category:  General Announcement

    Maintenance is complete on Sandhills. Please report any problems using the cluster by emailing hcc-support@unl.edu.

  • 2016-08-01:  /work file purge date delayed to Sept. 15, 2016
    Category:  General Announcement

    An earlier email warned that purging of all files on /work not accessed for over 6 months would begin today (August 1, 2016). The HCC community has responded and removed enough data that all /work filesystems are currently at or below 80% of capacity. We are optimistic that this will allow normal usage to continue until September 15, 2016. Thank you to all of you who have cleared space in the last few weeks!

    Several HCC groups specifically asked us to wait until a few weeks into the semester to make certain everyone is aware of this new policy (details are repeated below). We are happy to be able to do so, barring /work filling up before Sept. 15. Scripts will continue to run, so users will still be warned about files that have not been accessed for over 6 months and are subject to removal, but no files will be moved or removed until Sept. 15.

    Best regards,
    David Swanson


    This notice concerns a policy that affects all HCC machines and potentially all HCC users.

    SUMMARY:

    HCC is implementing a new automated file purge policy on the /work filesystem for all HCC machines. Starting August 1, 2016 (note: now Sept. 15, 2016) we will remove any files on /work which have not been accessed for at least 6 months. This will not affect the /home filesystems or the Attic storage system.

    EXPLANATION:

    The /work filesystem exists on each HCC machine for working files. It is not designed or intended for long-term storage. /work periodically fills to near capacity, and files must then be deleted to keep the system available for ongoing use. To date, we have relied on a largely manual process of warning the user community and depending on voluntary file removal. This is no longer sufficient given the number of users and the number of accumulated files (e.g., Tusker is currently precariously close to going offline because /work is nearly full). Going forward, the prior method will be augmented with the automatic removal of all files that have not been accessed for over 6 months. Artificial activity to circumvent this policy will be considered misuse of the system. HCC offers longer-term file storage on Attic for an annual fee, which this year has dropped from $100/TB/year to $60/TB/year.
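    To estimate ahead of time which of your own files the "not accessed for over 6 months" rule would catch, the criterion can be approximated with a short Python sketch that walks a directory and reports regular files whose last-access time (st_atime) is older than a cutoff. This is not HCC's purge script: the 180-day stand-in for "6 months" and the function names purge_candidates and summarize are assumptions for illustration. Filesystems may also update access times lazily (for example under the relatime mount option), so results can differ from what hcc-purge reports; hcc-purge remains the authoritative source.

```python
import os
import stat
import time

# Assumption: "6 months" is approximated as 180 days; HCC's own purge
# scripts may use a different cutoff or additional rules.
PURGE_AGE_DAYS = 180
SECONDS_PER_DAY = 86400


def purge_candidates(root, age_days=PURGE_AGE_DAYS, now=None):
    """Yield (path, size_in_bytes) for regular files under `root` whose
    last access time (st_atime) is older than `age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - age_days * SECONDS_PER_DAY
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reads metadata only; it never opens the file,
                # so it does not change the file's access time.
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Skip symlinks, sockets, etc.; only regular files count here.
            if stat.S_ISREG(st.st_mode) and st.st_atime < cutoff:
                yield path, st.st_size


def summarize(root, age_days=PURGE_AGE_DAYS):
    """Return (file_count, total_bytes) for the candidates under `root`."""
    count = 0
    total_bytes = 0
    for _path, size in purge_candidates(root, age_days):
        count += 1
        total_bytes += size
    return count, total_bytes
```

    For example, summarize("/path/to/your/work/directory") returns the number and total size of files that would match this approximation of the rule. Because the sketch only reads metadata, running it does not itself count as accessing the files.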

    TRANSITION:

    This policy will be implemented first on Tusker, since /work there is almost out of space, and on Crane and Sandhills soon thereafter. For the near future, while it remains possible, purging will be reversible: files will be moved out of the user's /work directory and held temporarily in a weekly purge directory.

    Users can check whether they have any files scheduled for deletion by logging in and using the commands
    hcc-purge
    to see a summary and
    hcc-purge -l (l as in list)
    to list the files scheduled to be purged.

    hcc-purge
    Prints the disk usage and file count of files that match the HCC purge policy, for the calling user and the user's associated groups.

    hcc-purge -l
    Lists the paths of the user's purge candidate files, displayed with the less pager.

    The list can also be accessed at the following path:
    /lustre/purge/current/${USER}.list
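    Because the candidate list is an ordinary file at the path above, it can also be processed with a script, for example to see which directories hold the most candidates. The Python sketch below assumes the file contains one file path per line (the announcement does not specify the format); the helper names purge_list_path, read_purge_list and count_by_directory are hypothetical.

```python
import getpass
import os
from collections import Counter
from pathlib import Path

# Location given in the announcement: /lustre/purge/current/${USER}.list
LIST_DIR = "/lustre/purge/current"


def purge_list_path(user=None, list_dir=LIST_DIR):
    """Build the path of a user's purge candidate list."""
    user = user or os.environ.get("USER") or getpass.getuser()
    return Path(list_dir) / f"{user}.list"


def read_purge_list(path):
    """Return the non-blank lines of the list file.
    Assumption: the file holds one file path per line."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def count_by_directory(paths):
    """Count candidate files per parent directory."""
    return Counter(os.path.dirname(p) for p in paths)
```

    For example, count_by_directory(read_purge_list(purge_list_path())).most_common(10) would show the ten directories with the most candidates, which is a quick way to decide what to move to Attic first.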

    This is not an academic exercise. /work on Tusker is over 90% full; attempts to clear the filesystem with the former method have been unsuccessful.

    SUPPORT:

    With any change, some challenges will be encountered. Please contact hcc-support@unl.edu if you have concerns or need assistance moving your data to long term storage.

    For details on how to check for files scheduled to be purged please see
    https://hcc-docs.unl.edu/display/HCCDOC/Handling+Data .
    For details concerning Attic storage please see
    http://hcc.unl.edu/attic .
    For details concerning transferring files please see
    https://hcc-docs.unl.edu/display/HCCDOC/High-Speed+Data+Transfers .

    Our intent is to remove the least valuable files from /work filesystems while enabling HCC systems to continue to be used by as many NU researchers as possible.
