Messages & Announcements

  • 2018-08-14:  Crane: /work filesystem downtime resolved
    Category:  General Announcement

    The /work filesystem for Crane is restored as of 11:30am. A filesystem check was completed with no errors found.

    Running jobs that were accessing /work stalled until the filesystem was restored, which may have caused some jobs to exceed their time limit. There was no data loss from this outage.


  • 2018-08-14:  Crane: /work filesystem unplanned downtime
    Category:  System Failure

    The /work filesystem for Crane is partially unavailable. One of the storage servers stopped responding around 7:00am today. Initial debugging suggests an issue with the Lustre server using excessive kernel memory. The storage server has been rebooted and we are currently running filesystem checks.

    Pending jobs will be held until the maintenance is complete.


  • 2018-08-05:  Tusker: /work temporary outage
    Category:  General Announcement

    The Tusker /work filesystem unexpectedly went offline at 12:25pm today. Maintenance was performed and the system was restored less than an hour later. Accesses to /work may have hung during the outage, so users should check that their jobs did not time out. No data loss or other consequences from this event are expected.


  • 2018-08-04:  Crane: /work filesystem restored
    Category:  General Announcement

    The /work filesystem for Crane is back in service. A filesystem check was completed with no errors found. Jobs that were running during the outage may have exceeded their time limit or encountered errors accessing data on /work. There was no data loss from this outage.

    One of the storage servers experienced a disk drive failure, leading to a RAID controller reset. This caused the filesystems to report corruption and switch to read-only mode. While recovering the system, the initial consistency checks showed significant numbers of errors. However, these were likely spurious errors related to the journal. After the journal was replayed, the repair process went smoothly.


  • 2018-08-04:  Crane: /work filesystem unplanned downtime
    Category:  System Failure

    The /work filesystem for Crane is partially unavailable. One of the storage servers experienced hardware issues leading to corruption on the Lustre /work filesystem. Filesystem consistency checks are currently running, and the output unfortunately suggests that data loss or corruption is likely.

    Pending jobs will be held until the maintenance is complete.

