Messages & Announcements

  • 2015-08-14:  Tusker: Data recovery update for /work filesystem
    Category:  General Announcement

    After much time and effort, the /work filesystem of Tusker has been mounted read-only and is, by our best estimate, in a fragile but complete state. While this appears to be good news, there is equally bad news as well.

    The filesystem metadata corruption is such that we simply cannot mount the filesystem read/write and continue using it with the data as-is. The filesystem will require a wipe and reformat before /work is usable again and Tusker can be reopened for job submission. Herein lies the 'bad' of the situation.

    We are opening up the login node and the tusker-xfer transfer nodes with /work mounted read-only for the time being. This should allow you to log in and retrieve any critical data you might have on /work. At some point, currently planned for roughly one week from now, we will have to move ahead with the reformat so that Tusker is once again a usable cluster. We will revisit this plan next week after we have had a chance to receive feedback from users.

    For users who have treated /work as the scratch filesystem it is intended to be, this will hopefully not be a major issue. However, for those who have treated it as a long-term storage solution (which it is not), and for those with millions of small files, this may prove difficult. Loss of data is undesirable, but so is an unused HPC cluster. Experience has shown that transferring files always takes longer than one would hope, and at some point we will have to move on with the reformat or risk a never-ending downtime. We also know from past experience that copying everything is simply not possible in a reasonable timeframe.

    Unfortunately, in the read-only state it is impossible to let you delete or move files, which might otherwise aid in your decision of what is important and what isn't. While HCC staff can offer assistance on how to transfer data to and from our resources, we do not intend to do this copying for you automatically. No guarantees are ever made on /work as a scratch filesystem, and we simply cannot determine what is and is not important. For any assistance, please contact hcc-support@unl.edu.
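
    For those needing to pull data off now, a plain copy over SSH from one of the transfer nodes is usually the simplest route. The sketch below is an illustration only: the transfer-node hostname, username, group, and paths are placeholders, and the tar variant is just one common way to cope with very large numbers of small files. Please verify the correct hostnames in the HCC documentation or with hcc-support@unl.edu before relying on it.

        # Hypothetical example: pull one project directory from the read-only
        # /work filesystem to a local machine via a tusker-xfer transfer node.
        rsync -avP <username>@tusker-xfer.unl.edu:/work/<group>/<username>/my_project/ \
              /local/backup/my_project/

        # For directories containing millions of small files, streaming a tar
        # archive over SSH is often faster than copying file-by-file.
        ssh <username>@tusker-xfer.unl.edu \
            "tar cf - /work/<group>/<username>/my_project" > my_project.tar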

    Additional details and an update on recovery efforts or any changes to this plan will likely follow early next week.


  • 2015-08-13:  Sandhills: Back online, downtime has ended
    Category:  General Announcement

    Sandhills is now available for use after a much-needed downtime.

    The primary goal of the downtime was to bring the Sandhills cluster more in line with how Tusker and Crane operate, both from an administrative view and from a user/software experience view.

    As part of the downtime, Sandhills was switched to the Lmod environment modules package. Some module names may have changed slightly to match the conventions used on Tusker and Crane. Also, some modules that were believed to be deprecated were removed. If you require a module that is no longer available, please let us know via hcc-support@unl.edu.
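
    If you are unsure what a module is now called, Lmod's own search commands are the quickest way to find it; the package name and version below are examples only.

        # List all modules available under the new Lmod setup
        module avail

        # Search every module name and version for a package, even if the
        # name or default version changed during the downtime (example name)
        module spider gcc

        # Load a specific version once the new name is known
        module load gcc/4.9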

    The system-wide Python versions provided via the python 2.7, 3.3, and 3.4 modules were changed to the Anaconda Python distribution from Continuum Analytics. For more information on Anaconda, please see https://hcc-docs.unl.edu/x/kImx.
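
    A quick way to confirm that a session or job script is picking up the Anaconda-based interpreter is shown below; the module version string is an example and may differ from what is installed.

        # Load one of the system-wide Python modules (version is an example)
        module load python/3.4

        # Verify which interpreter is on PATH and its reported version
        which python
        python --version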

    Other important but less visible changes include bringing the SLURM scheduler and /home filesystems in line with their counterparts on our other clusters.

    If you encounter any issues or notice an undesired change in the behavior of the cluster, please let us know via hcc-support@unl.edu.


  • 2015-08-11:  Sandhills: Downtime extended until August 12th
    Category:  Maintenance

    Due to the unexpected issues with Tusker's /work filesystem, we must extend the Sandhills downtime until tomorrow. The admins involved have spent much of last night and today working to repair the /work filesystem on Tusker, in addition to performing the planned Sandhills maintenance. They are also (surprisingly) human and require some rest at this point.

    At this time we expect Sandhills to be opened back up and operational by close of business tomorrow (August 12th) at the latest.


  • 2015-08-11:  Tusker: Update on /work filesystem outage
    Category:  System Failure

    The /work filesystem outage for Tusker continues, with the prognosis looking less than optimal.

    Filesystem consistency checks run throughout the day have unfortunately not yielded a stable, mountable filesystem that we can use, even in a read-only state. There are a few more options we wish to try in an effort to allow recovery of data from /work, but it will most certainly be tomorrow (August 12th) at the earliest before we are able to make a statement one way or the other.

    The root cause of the issue was a hardware failure in which the primary controller hosting the Lustre metadata failed over to its "redundant" backup controller, an undesired but normally non-destructive operation. We have since discovered that these controllers were configured by the vendor in a non-standard, non-default way in which no cache mirroring was done, causing the loss of all cached data on the controller when the failover occurred. Why anyone would configure the system this way remains a mystery, and unfortunately we trusted the vendor-provided solution to be correct when in reality it was not.

    This data loss resulted in the corruption of the Lustre metadata, which is essential to the operation of the filesystem. Unless it can be repaired or recovered in any capacity, there is no reasonable way to mount and recover files from /work.

    We will send additional announcements as we have more information. Until then, Tusker will remain offline.


  • 2015-08-10:  Tusker unplanned downtime
    Category:  System Failure

    UPDATE: Tusker emergency system maintenance starting Monday morning

    Emergency system maintenance is necessary to correct the issues present on Tusker. The corrective steps will include a Lustre filesystem consistency scan, which will require taking the system offline. It is anticipated the scan will take many hours to complete. Follow-up checks may be necessary after this initial scan, so no estimate of when the system will return to service will be made at this time. Jobs that are currently running will be re-queued but held in a pending state. The login and tusker-xfer systems will not be available during the filesystem scan, to minimize issues. We will send further announcements when circumstances warrant or when the system is made available again.
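
    Once the system is returned to service, the state of re-queued jobs can be checked with standard SLURM commands such as those below; the job ID is a placeholder.

        # Show your pending (re-queued/held) jobs
        squeue -u $USER --states=PENDING

        # Show the state and the reason a specific job is still pending
        squeue -j <jobid> -o "%.10i %.12T %.30r"

        # Full details for one job, including its hold and requeue status
        scontrol show job <jobid>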

