Messages & Announcements

  • 2016-01-02:  SANDHILLS outage - job scheduling unavailable
    Category:  System Failure

    The SANDHILLS cluster is currently unavailable for running jobs due to a hardware failure of the head node. The login node will remain up for the time being, and files should be accessible, but all scheduling functionality (sbatch, squeue, etc.) will almost certainly fail.

    We will work towards a solution first thing on Monday and send additional announcements as necessary.


  • 2015-11-28:  SANDHILLS available after power outage
    Category:  System Failure

    Partial power outage in SANDHILLS resolved.


    A power outage on Thanksgiving Day caused some worker nodes in SANDHILLS to become unavailable. Power has been restored and SANDHILLS is fully operational. Some jobs were likely killed by the outage, but we believe no files were affected. Please send an email to hcc-support@unl.edu if you find any problems.

  • 2015-09-22:  Resources available via Crane
    Category:  General Announcement

    Dear HCC community,

    A new partition, called tmp_anvil, with 50+ nodes has been added to Crane temporarily. Each of these nodes has two Intel E5-2650 v3 processors, for a total of 20 cores running at 2.3GHz, and 256GB of RAM. The only downsides are that the partition is temporary and that the nodes do not have InfiniBand. MPI jobs should be compiled as SMP if they are going to run on these nodes, and a job cannot span more than one node (but can run up to 20 tasks per node). SMP jobs needing a lot of memory are a good fit for these nodes. Per node, these machines are now the most capable Intel boxes at HCC.


    To submit jobs to this partition from Crane, please add the following line to your submit script:

        #SBATCH --partition=tmp_anvil
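
    For reference, a complete submit script for this partition might look like the sketch below. The job name, time limit, memory request, and program (./my_program) are placeholders for illustration, not an HCC-provided example; adjust them to your own workload.

        #!/bin/bash
        #SBATCH --partition=tmp_anvil      # temporary Anvil hardware on Crane
        #SBATCH --nodes=1                  # no InfiniBand, so single-node jobs only
        #SBATCH --ntasks-per-node=20       # up to 20 cores per node
        #SBATCH --mem=250G                 # leave a little headroom below 256GB of RAM
        #SBATCH --time=24:00:00            # example time limit
        #SBATCH --job-name=tmp_anvil_test  # placeholder job name

        ./my_program                       # placeholder for your SMP executable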

    Further questions should be directed to hcc-support@unl.edu.

    HCC has begun the process of building a local research cloud, which we will name Anvil, an obvious reference to the anvil clouds periodically seen on the Great Plains. The hardware for this machine has been purchased, but building the underlying infrastructure will be an ongoing, incremental process. In the meantime, we will provision some significant compute hardware as part of Crane instead of letting it sit idle. We will warn you when we are ready to repurpose these nodes back to Anvil; when that happens, jobs that have already started will be allowed to finish first.

  • 2015-08-24:  Tusker available for use
    Category:  General Announcement

    Tusker is back open for use. The work filesystem has been rebuilt and users may begin to write to the currently empty filesystem and start running jobs. Please contact hcc-support@unl.edu if you encounter any problems.


  • 2015-08-21:  Going forward with Tusker
    Category:  System Failure

    This message directly affects only users of Tusker.

    UPDATE: Many users have now retrieved essential files from Tusker. If you are not ready for the /work filesystem to be repaired (and in the process erased), please contact me or HCC support immediately at the addresses in the message appended below. We will do our best to assist you, but indefinitely prolonging the outage until every last file is recovered is not realistic, so please recover whatever files you need as soon as possible. We plan to repair /work at 1 pm on Monday, August 24.

    UPDATED POLICY: The /work filesystem on Tusker (and any other HCC /work filesystem) is designed to store data temporarily; it facilitates the use of the associated compute cores and the large working memory (RAM) on the Tusker compute nodes. By design it is not backed up, and it is thus not a good option for archival storage of any kind; in particular, it should never hold the only copy of data that cannot be recreated (e.g., by rerunning calculations). To better communicate this reality, when Tusker comes back up we will follow the lead of major supercomputing sites around the country and implement a maximum lifetime for files on /work: a file left dormant for over 6 months will be automatically deleted. This will keep the filesystem more available and healthy for immediate work, and is more consistent with a so-called scratch filesystem, which is what /work is in practice. Please see the references to Attic below or at http://unl.us8.list-manage2.com/track/click?u=ac50870de8549c469170bac61&id=f3100fa734&e=57a3ee57c5 if you would like a more appropriate longer-term storage solution.

    In spite of this excitement, let me wish you a successful fall semester!

    Best regards,
    David

    David R. Swanson, Director
    Holland Computing Center
    University of Nebraska


    Dear HCC User Community,

    The work filesystem on Tusker failed catastrophically last week, and after several difficult days, a fragile but reasonably complete version of it was recovered and mounted. While this was a technical accomplishment, it was admittedly a hollow victory: Tusker remains effectively down, and the filesystem remains less than robust. This letter is to update you on the current status of Tusker and our plans moving forward.

    To protect the data remaining on the Tusker work filesystem, we mounted it read-only. This has allowed data to be retrieved for several days now, and the resulting outgoing traffic has slowed from a sustained 6 Gbps to closer to 1 Gbps over the last two days. This data exodus is unprecedented and, in some sense, a validation of this approach. Reports are that much data is being retrieved successfully. We plan to allow this to continue until at least Monday, August 24, when we currently intend to take Tusker down for a complete overhaul of the work filesystem. This will erase all data currently stored there, but it will also result in a repaired and available system. Much of the data on /work can be regenerated by further computation, so it quickly becomes counterproductive to enforce an indefinite downtime to retrieve such data. If that is not the case for your group, please let me know as soon as possible!

    *Please contact me at david.swanson@unl.edu if you have concerns. For technical questions, you will get a faster response by contacting hcc-support@unl.edu.*

    Since I was on a vacation that ended this week, I have not yet sent my direct apologies for the current situation with Tusker. There are layers of reasons for it, and mitigating factors concerning it, that are largely irrelevant to you if research data was lost or a hard deadline was missed because of the recent failure. For those of you still reading, the failover mechanism was misconfigured by the original vendor, and while it is tempting to go after them, it is too late ... they are already out of business.

    If critical data was lost, the real failure was perhaps one of communication, and for that I do sincerely apologize. We designed the Tusker work filesystem to (1) stand up to the pounding a shared system receives from our growing user base, (2) be large enough to handle the processing demands of that user base, and (3) be affordable. We did not design it to be fully replicated or backed up. For that purpose we have a system known as Attic, which stores files more reliably in Omaha and then fully replicates that storage in Lincoln. Attic is not free; it costs $100/TB/year. If you have data that is mission critical, it should not be stored exclusively on /work on any of our machines; that is not what /work is designed to do. Attic is a far safer choice.

    Again, please let me know if there is something I can do to make this situation less painful. A further, shorter letter will follow this Friday.

    Best regards,
    David

    David R. Swanson, Director
    Holland Computing Center
    University of Nebraska
