Messages & Announcements

  • 2015-08-21:  Going forward with Tusker
    Category:  System Failure

    This message directly affects only usage of Tusker.

    UPDATE: Many users have retrieved essential files from Tusker at this time. If you are not ready for the /work filesystem to be repaired (and in the process erased), please contact me or HCC support immediately at the addresses in the message appended below. We will do our best to assist you, but indefinitely prolonging the outage until every last file is recovered is not realistic. Thus, please recover whatever files you need as soon as possible. We plan to repair /work at 1 pm on Monday, August 24.

    UPDATED POLICY: The /work filesystem on Tusker (and every other HCC /work filesystem) is designed to store data temporarily; it supports the compute cores and large working memory (RAM) on the Tusker compute nodes. By design, it is not backed up and is thus not a good option for archival storage of any kind; in particular, it should never hold the only copy of data that cannot be recreated (e.g., by rerunning calculations). To better communicate this reality, when Tusker comes back up we will follow the lead of major supercomputing sites around the country and implement a maximum lifetime for files on /work: a file left dormant for over 6 months will be automatically deleted. This will keep the filesystem more available and healthy for immediate work, and is more consistent with a so-called scratch filesystem, which /work is in practice. Please see the references to Attic below or at http://unl.us8.list-manage2.com/track/click?u=ac50870de8549c469170bac61&id=f3100fa734&e=57a3ee57c5 if you would like a more appropriate longer-term storage solution.
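
    To get a sense of which of your files would fall under this policy, a minimal sketch along these lines can be run against your own /work area. It assumes the purge will key on modification time with a roughly 180-day threshold; the path and the exact criteria are illustrative assumptions, not the official implementation.

        import os
        import time

        WORK = "/work/yourgroup/yourusername"    # hypothetical path; substitute your own /work directory
        CUTOFF = time.time() - 180 * 24 * 3600   # roughly 6 months, assuming the policy keys on mtime

        for root, dirs, files in os.walk(WORK):
            for name in files:
                path = os.path.join(root, name)
                try:
                    if os.stat(path).st_mtime < CUTOFF:
                        print(path)              # dormant file: copy it elsewhere or plan to recreate it
                except OSError:
                    pass                         # unreadable or vanished file; skip it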

    In spite of this excitement, let me wish you a successful fall semester!

    Best regards,
    David

    David R. Swanson, Director
    Holland Computing Center
    University of Nebraska


    Dear HCC User Community,

    The /work filesystem on Tusker failed catastrophically last week, and after several difficult days, a fragile but reasonably complete version of it was recovered and mounted. While this was a technical accomplishment, it was admittedly a hollow victory: Tusker remains effectively down, and the filesystem remains less than robust. This letter is to update you on the current status of Tusker and our plans moving forward.

    To protect the data remaining on the Tusker /work filesystem, we mounted it read-only. This has allowed data to be retrieved for several days now, and the resulting outgoing traffic has slowed from a sustained 6 Gbps to closer to 1 Gbps over the last two days. This data exodus is unprecedented and, in some sense, a validation of this approach. Reports are that much data is being retrieved successfully. We plan to allow this to continue until at least Monday, August 24, when we currently intend to take Tusker down for a complete overhaul of the /work filesystem. This will erase all data currently stored there, but it will also yield a repaired and available system. Much of the data on /work can be regenerated by further computation, so it quickly becomes counterproductive to enforce an indefinite downtime to retrieve it. If that is not the case for your group, please let me know as soon as possible!

    *Please contact me at david.swanson@unl.edu if you have concerns. For technical questions your response time will be better if you contact hcc-support@unl.edu.*

    Since I was on vacation until this week, I have not yet sent my direct apologies for the current situation with Tusker. There are layers of reasons for it, and mitigating factors concerning it, that are largely irrelevant to you if research data was lost or a hard deadline was missed because of the recent failure. For those of you still reading: the failover mechanism was misconfigured by the original vendor, and while it is tempting to go after them, it is too late ... they are already out of business.

    If critical data was lost, the real failure was perhaps one of communication, and for that I do sincerely apologize. We designed the Tusker /work filesystem to (1) stand up to the pounding a shared system receives from our growing user base, (2) be large enough to accommodate the processing demands of that user base, and (3) be genuinely affordable. We did not design it to be replicated or backed up. For that purpose we have a system known as Attic, which stores files more reliably in Omaha and then fully replicates this more reliable storage in Lincoln. Attic is not free; it costs $100/TB/year. If you have data that is mission critical, it should not be stored exclusively on /work on any of our machines. That is not what /work is designed to do. Attic is a far safer choice.

    Again, please let me know if there is something I can do to make this situation less painful. A further, shorter letter will follow this Friday.

    Best regards,
    David

    David R. Swanson, Director
    Holland Computing Center
    University of Nebraska

  • 2015-08-14:  Tusker: Data recovery update for /work filesystem
    Category:  General Announcement

    After much time and effort, the /work filesystem of Tusker has been mounted read-only and is by best estimates in a fragile but complete state. While this would appear to be good news, there is an equally bad component as well.

    The filesystem metadata corruption is such that we simply cannot mount it read/write and continue using it with the data as is. The filesystem will require a wipe and reformat before /work is again usable and Tusker can be opened for job submission. Herein lies the 'bad' of the situation.

    We are opening up the login node and tusker-xfer transfer nodes with /work mounted read-only for the time being. This should allow you to log in and retrieve any critical data you might have on /work. At some point (the current plan is roughly one week from now), we will have to move ahead with the reformat so that Tusker is once again a usable cluster. We will revisit this plan next week, after we have had a chance to receive feedback from users.
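
    For retrieval, rsync against one of the transfer nodes is one reasonable approach, since it preserves timestamps and can resume an interrupted copy. The sketch below is only illustrative: the hostname, paths, and the Python wrapper are assumptions rather than an HCC-prescribed procedure, and the equivalent rsync command can of course be run directly from a shell.

        import subprocess

        # Hypothetical source and destination; substitute your own username,
        # transfer-node hostname, /work path, and local destination directory.
        SRC = "username@tusker-xfer.unl.edu:/work/yourgroup/yourusername/important_results/"
        DST = "/local/backup/important_results/"

        # -a preserves permissions and timestamps, -v reports progress,
        # --partial lets an interrupted transfer resume rather than restart.
        subprocess.check_call(["rsync", "-av", "--partial", SRC, DST])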

    For users who have treated /work as the scratch filesystem it is intended to be, this will hopefully not be a major issue. However, for those who have treated it as a long-term storage solution (which it is not), and for those with millions of small files, this may prove difficult. Loss of data is undesirable, but so is an unused HPC cluster. Experience has shown that it will always take longer to transfer files than one would hope, and at some point we will have to move on with the reformat or risk a never-ending downtime. We also know from past experience that copying everything is simply not possible in a reasonable timeframe.
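
    Before copying anything, it can help to see where the bulk of your files and bytes actually live, so you can prioritize what to retrieve in the time available. A minimal sketch along these lines, run against your own (hypothetical) /work directory, tallies file counts and total size per top-level subdirectory.

        import os
        from collections import defaultdict

        WORK = "/work/yourgroup/yourusername"   # hypothetical path; substitute your own /work directory
        counts = defaultdict(int)
        sizes = defaultdict(int)

        for root, dirs, files in os.walk(WORK):
            # attribute everything to the top-level subdirectory it lives under
            rel = os.path.relpath(root, WORK)
            top = rel.split(os.sep)[0] if rel != "." else "."
            for name in files:
                try:
                    sizes[top] += os.stat(os.path.join(root, name)).st_size
                    counts[top] += 1
                except OSError:
                    pass  # unreadable file; skip it

        for top in sorted(sizes, key=sizes.get, reverse=True):
            print("{}: {} files, {:.1f} GB".format(top, counts[top], sizes[top] / 1e9))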

    Unfortunately, in the read-only state it is impossible to let you delete or move files, which might otherwise help you decide what is important and what isn't. While HCC staff can offer assistance on how to transfer data to and from our resources, we do not intend to do this copying for you automatically. No guarantees are ever made on /work as a scratch filesystem, and we simply cannot determine what is and is not important. For any assistance, please contact hcc-support@unl.edu.

    Additional details and an update on recovery efforts or any changes to this plan will likely follow early next week.


  • 2015-08-13:  Sandhills: Back online, downtime has ended
    Category:  General Announcement

    Sandhills is now available for use after a much needed downtime.

    The primary goal of the downtime was to bring the Sandhills cluster more in line with how Tusker and Crane operate, both from an administrative perspective and from the user/software experience perspective.

    As part of the downtime, Sandhills was switched to the Lmod environment modules package. Some module names may have changed slightly to match the conventions used on Tusker and Crane. Also, some modules that were believed to be deprecated were removed. If you require a module that is no longer available, please let us know via hcc-support@unl.edu.

    The system-wide Python versions provided via the python 2.7, 3.3, and 3.4 modules were changed to the Anaconda Python distribution from Continuum Analytics. For more information on Anaconda, please see https://hcc-docs.unl.edu/x/kImx.
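
    As a quick sanity check after loading one of these python modules, a short snippet like the following shows which interpreter is actually in use; Anaconda builds typically identify themselves in the version string, though the exact wording varies by release.

        import sys

        print(sys.version)      # Anaconda builds usually mention "Anaconda" or "Continuum" here
        print(sys.executable)   # path to the interpreter provided by the loaded module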

    Other important but less visible changes include bringing the SLURM scheduler and /home filesystems in line with their counterparts on our other clusters.

    If you encounter any issues or notice an undesired change in behavior of the cluster, please let us know via hcc-support@unl.edu.


  • 2015-08-11:  Sandhills: Downtime extended until August 12th
    Category:  Maintenance

    Due to the unexpected filesystem issues with Tusker's /work filesystem we must extend the downtime of Sandhills until tomorrow. The admins involved have spent much of their time last night and today working to repair the /work filesystem on Tusker in addition to the planned Sandhills maintenance. They are also (surprisingly) human and require some rest at this point.

    At this time we expect Sandhills to be opened back up and operational by close of business tomorrow (August 12th) at the latest.


  • 2015-08-11:  Tusker: Update on /work filesystem outage
    Category:  System Failure

    The /work filesystem outage for Tusker continues, with the prognosis looking less than optimal.

    Filesystem consistency checks run throughout the day have unfortunately not yielded a stable, mountable filesystem that we can use, even in a read-only state. There are a few more options we wish to try in an effort to allow recovery of data from /work, but it will most certainly be tomorrow (August 12th) at the earliest before we are able to make a statement one way or another.

    The root cause of the issue was a hardware failure in which the primary controller hosting the Lustre metadata failed over to its "redundant" backup controller, an undesired but normally non-destructive operation. We have since discovered that these controllers were configured by the vendor in a non-standard, non-default way with no cache mirroring, causing the loss of all cached data on the controller when the failover occurred. Why anyone would configure the system this way remains a mystery, and unfortunately we trusted the vendor-provided solution to be correct when in reality it was not.

    This data loss resulted in the corruption of the Lustre metadata, which is essential to the operation of the filesystem. Unless that metadata can be repaired or recovered in some capacity, there is no reasonable way to mount and recover files from /work.

    We will send additional announcements as we have more information. Until then Tusker will remain offline.

