SAN Disk failure

Virtually all Informatics storage is provided by our redundant Storage Area Network (SAN). All our arrays are configured as RAID level 1, 5 or 10, meaning that if one of the physical hard disks fails, the data on the RAID array remains intact, allowing us to replace the failed disk without any interruption to service.
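For anyone curious how a RAID 5 array can lose a disk and keep its data, here is a minimal sketch in Python of the parity idea. It is only an illustration: the block contents and the `xor_blocks` helper are made up for this example, and a real SAN controller does this in hardware across rotating stripes, not like this.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks from one stripe, each on a different disk.
data = [b"grp1", b"grp2", b"grp3"]

# The parity block lives on a fourth disk and is the XOR of the data blocks.
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]: XOR-ing the surviving
# data blocks with the parity block recovers the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because the missing block can always be recomputed from the remaining ones, the array can carry on serving data with one disk out, and rebuild onto the replacement disk once it is fitted. A second failure in the same array before the rebuild finishes is not recoverable this way.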

Users normally never notice a single hard disk failure. However, last Thursday night (21/2/2013) one physical disk in a RAID 5 array failed and, unusually, this caused the array to go offline briefly. Unfortunately one of our servers was writing to the array at the time, so the kernel reported an error and took the mounted device offline. This affected about five group file space areas stored on that array. These group areas remained offline until the computing staff were able to investigate the problem, check for and repair any damage, and re-enable the group areas.

We’ve been in touch with the suppliers of this SAN unit, as this is not the expected behaviour. They’ve pointed out that the firmware on the unit is out of date, and we are therefore assuming that this was a bug in the old firmware which has since been fixed in newer versions.

We will be looking to update the firmware to the recommended version. Although it should be safe to apply the update to the running hardware, we will schedule some downtime to avoid the risk of any problems affecting the data on the array. Unfortunately this will mean disruption for any users with data on the array. We will notify users once we have a date and time in mind.

Neil

About neilb

Computing staff at the University of Edinburgh. Part of the Services Unit.