Failure of MailHosting service and ECCS Staff Email on May 19th and 20th
Updated: May 25, 2018
Maintenance work on the file servers began around 9:00 AM on Saturday, May 19th, and was planned to finish before 1:00 PM. After the work, however, we were unable to resume the MailHosting service and ECCS Staff Email, and delivery of email to all users remained halted until around 5:00 PM on Sunday, May 20th. We apologize for the inconvenience caused by the delay.
Cause of the failure
During the maintenance work carried out on Saturday, May 19th, we updated the operating system (OneFS, DELL EMC) of a file server to the latest version. This file server is shared by the MailHosting service and various other ECCS services. After the update, an NFS file lock function used by the email server no longer worked properly. This was the cause of the failure.
Identifying the problem and finding a solution took a long time, because the issue was not documented in the administrator's manual.
To maintain functionality and apply security measures, we need to upgrade the software regularly. While it is difficult to completely eliminate failures caused by software upgrades, it is possible to reduce their probability. We are going to strengthen, as much as possible, (1) advance preparation (reviewing the work contents, checking operation in a test environment, etc.) and (2) backup plans (alternative plans for when something fails, or plans to reduce the negative impact).
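As an illustration of the kind of operation check that could be run in a test environment, the following is a minimal sketch that probes whether a POSIX advisory file lock can be taken on a directory, for example an NFS mount point. It is not the procedure we actually used; the function name `can_lock` and the default target directory are hypothetical. Note also that a broken NFS lock service may cause the call to hang rather than fail, so in practice such a check would need to be run with a timeout.

```python
import fcntl
import os
import sys
import tempfile


def can_lock(directory):
    """Return True if an exclusive advisory lock can be taken on a
    scratch file created in `directory` (hypothetical helper)."""
    # mkstemp opens the scratch file read/write, which lockf needs
    # for an exclusive lock.
    fd, path = tempfile.mkstemp(dir=directory, prefix="locktest-")
    try:
        try:
            # Non-blocking request: raises OSError if the lock is refused.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the scratch file.
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Assumption: the directory to test is passed as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print("lock OK" if can_lock(target) else "lock FAILED", "on", target)
```

Running this against the mounted directory before and after an upgrade in a test environment would give a simple indication of whether file locking still behaves as the email server expects.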
As a measure that users themselves can take, it is common to use multiple independent email services at the same time in order to be prepared for an email suspension due to a failure. This is especially effective when a suspension of email services would have a large impact. In addition to the MailHosting service and the ECCS Staff Email service, which were affected this time, the Information Technology Center also provides the ECCS Cloud Email service using Google G Suite (Gmail).