Messages & Announcements

  • 2013-07-03:  Sandhills Downtime Announcement
    Category:  Maintenance

    Sandhills downtime is planned for next Tuesday, July 9, to install a significant hardware addition. The downtime is expected to last one day. Please contact hcc-support@unl.edu if you have any questions or concerns.


    Sandhills will be significantly expanded, adding a total of 44 servers, each with 64 cores (as in Tusker) and 192 GB RAM (3 GB/core). The InfiniBand fabric will need to be reconfigured. The new nodes will be added to the esquared and track2 partitions, but will also be available in the guest partition.

    Jobs which cannot complete before July 9 at 9:00am will be held until the maintenance is complete.

  • 2013-07-02:  Sandhills maintenance on July 9
    Category:  Maintenance

    On July 9th, Sandhills will be unavailable for hardware maintenance. Jobs which cannot complete before July 9 at 9:00am will be held until the maintenance is complete.

    As part of this maintenance, we are recabling the InfiniBand fabric to accommodate 44 newly purchased worker nodes. We anticipate this work will be finished by 5:00pm the same day.

  • 2013-06-06:  SLURM help at HCC Friday, June 7, 1-3pm
    Category:  General Announcement

    Want help with the new SLURM batch system (or anything else related to Tusker)? HCC will host a meet-and-greet open house tomorrow afternoon (Friday, June 7) at Schorr Center from 1-3pm. Simple snacks and beverages will be available. No appointment necessary.


    HCC will be hosting more of these sessions, at a variety of locations, over the rest of the summer. Please feel free to suggest locations, times, and topics.

  • 2013-06-01:  Work filesystem status on Tusker
    Category:  General Announcement

    Dear HCC users,

    As a result of the extensive re-working of the Tusker Lustre filesystem, some files on the work filesystem were lost due to what appears to be a bug in Lustre. This only affects /work on Tusker. If you have data on /home or another system, it is not affected.

    While /work is not promised to be preserved, we take very seriously the need to be careful with research data. Less than 1% of files were affected or lost, but I realize the wrong 0.7% of files could be very painful! If you have a small number of files, you might have escaped; if you had one million, you didn't. By design /work is not backed up; however, we want to do whatever we can to assist you in recovering. Please let us know if there is any way we can help -- perhaps some files can simply be regenerated. I'm terribly sorry for this -- it was not in any way expected, and this issue never arose in our pre-migration tests. That's not much consolation; we will do what we can to help make things as right as possible. Especially if you are under a time crunch, please contact me directly.

    In most cases where files were lost, the metadata was preserved, but the file data was not. Thus, we can at least detail which files may need to be replaced or recomputed (the affected files and their filenames are no longer present). We have added a file named 201305_missing_files.txt to your /work directory; it contains the list of files that were lost from your /work directory.

    I recommend a procedure like the following upon login to Tusker (a consolidated sketch of these steps follows the list):

    1) cd $WORK

    --> this moves you to your /work directory

    2) ls 201305_missing_files.txt

    --> if this file is not there, you were not affected; go about your business with a grateful heart

    3) wc 201305_missing_files.txt

    --> You'll see something like this: 597 597 44867 201305_missing_files.txt
    --> That first number is the number of lines in the file, which is the number of your files flagged as missing

    4) ask us for help sorting through this (potentially large) file -- for instance, we can tell you how many files are missing from a given directory. Of course, you may prefer to search through "201305_missing_files.txt" on your own.
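
    The checks above can be combined into a short shell snippet such as the one below. This is only a sketch of the steps already described; the per-directory tally at the end is one illustrative way to get the kind of summary mentioned in step 4, and it assumes the report lists one path per line.

        # Run from a Tusker login node; this only reads the report, it modifies nothing.
        cd $WORK
        if [ ! -f 201305_missing_files.txt ]; then
            echo "No report found -- you were not affected."
        else
            # Line count = number of your files flagged as missing
            echo "Files flagged as missing: $(wc -l < 201305_missing_files.txt)"
            # Tally missing files per directory (strip the filename, count repeats)
            sed 's|/[^/]*$||' 201305_missing_files.txt | sort | uniq -c | sort -rn | head
        fi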

    Unfortunately, this process is necessary. As always, please let us know via hcc-support@unl.edu if there is any sign of strange behavior with the system. We will do our best to get things back to a productive state for you as soon as possible. I will be contacting as many research groups as possible directly over the next several working days.

    Sincerely,

    David R. Swanson, Director
    Holland Computing Center


  • 2013-05-31:  Tusker Back Online
    Category:  Maintenance

    Dear HCC User,

    Tusker is back open and ready for use. We believe the changes implemented during this downtime will significantly improve Tusker. While /work has never been backed up and is designed to hold temporary working data sets, we made every effort to preserve as much of the data on /work as possible. Let me recount the main reasons we had to go through this process:

    <1> Lustre was previously failing routinely for some usage cases. Further, we had multiple close calls where one more disk failure would have resulted in dramatic data loss.
    <2> The version of Lustre had lagged far behind the current release for some time -- the vendor refused to accelerate an upgrade.
    <3> The overall demand for space was nearing capacity.

    All data has been moved and over 99% (by file count) has been preserved. The overall process did result in an unexplained loss of 0.7% of the files. We have found examples of others experiencing this, but a detailed explanation of the cause -- and, worse, a solution -- does not exist. Users are advised to check their files. HCC will be contacting users as possible lost files are catalogued. If this causes an inconvenience, please contact hcc-support@unl.edu ASAP -- we will do what we can to lessen the impact.

    In the near future, please notify hcc-support@unl.edu ASAP if you have any file-related questions or concerns. We have done extra checks to avoid a further downtime; while none is expected, we will, as always, notify HCC users if future conditions require another outage. We would not expect something like this again for a very long time.

    We now have 523 TB of space on a current Lustre implementation, completely protected by RAID6 at the storage target level. There is also an additional 1.6 TB of /scratch space on each node.

    We have installed SLURM as the new scheduler. Please see https://hcc-docs.unl.edu/display/HCCDOC/Submitting+Jobs for details. In many cases your existing PBS submission scripts will work as-is via "qsub". HCC staff will be ready to answer questions that may arise -- feel free to stop by if you have any trouble.
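
    For users writing native SLURM scripts rather than reusing PBS ones via qsub, a minimal submission script looks roughly like the sketch below. The resource values, file names, and the ./my_program command are placeholders for illustration; the documentation linked above has the authoritative examples. Submit the script with "sbatch scriptname".

        #!/bin/sh
        #SBATCH --time=01:00:00          # run time (HH:MM:SS)
        #SBATCH --ntasks=1               # number of tasks
        #SBATCH --mem-per-cpu=1024       # memory per allocated core, in MB
        #SBATCH --job-name=example
        #SBATCH --error=example.%J.err   # %J expands to the job ID
        #SBATCH --output=example.%J.out

        # commands to run go here, for example:
        ./my_program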

    Best regards,

    David R. Swanson, Director
    Holland Computing Center