In the late afternoon of Wednesday 19th of June, around 4:50pm, there was an unexpected glitch on the Informatics SAN (Storage Area Network). This caused some of our AFS storage to go offline for several hours. What follows is a recap of the event, and what we did to recover from it.
First, a quick description of our setup. Our SAN consists of two “fabrics”, and each server is physically attached to both. Each of our disk arrays is also attached to both fabrics, which means that any single machine has two routes to the data on the disk arrays. If there’s a failure in one part of the route between disk array and server via one fabric, there will still be a second route to the array via the other fabric. Those familiar with SCSI, IDE or SATA cables can think of the SAN as a redundant network of cables between the computer and the disks. The physical connections of our SAN are “fibre channel” (FC for short) optical connections. If we want to make a chunk of data on a disk array appear as a new “disk” on a server, we just have to allocate the space on the array, and then use “LUN masking” to make sure only the correct server can access it. One array can host data for several servers, and similarly one server can access data (appearing as hard disk drives) from multiple arrays.
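From a Linux server’s point of view, a LUN exported over both fabrics appears twice (once per FC path), and the multipath layer merges the two into a single device. A rough sketch of how you’d see this, using standard Linux tools (device names and output details will of course vary with your setup):

```shell
# List the FC host adapters -- typically one per fabric connection:
ls /sys/class/fc_host

# Show the merged multipath devices; each should list two underlying
# paths, one via each fabric:
multipath -ll

# After the array's LUN masking has been updated to present a new
# "disk" to this server, rescan the SCSI bus to pick it up
# (rescan-scsi-bus.sh ships with the sg3_utils package):
rescan-scsi-bus.sh
```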
Provided there is no current association between an array and a server, it is normally safe to power down or detach that array or server. Indeed we do this fairly routinely without affecting access between the remaining servers and disk arrays.
So what happened on the 19th?
We have two main disk arrays in the Forum: ifevo1 and ifevo2, with about 16TB and 64TB of available storage respectively. The disk arrays are made up of a “controller” and several attached JBODs (Just a Bunch Of Disks) to expand the available storage. The controllers of both ifevo1 and ifevo2 are due to be retired, and we’ve two new arrays (imaginatively called) ifevo3 and ifevo4 to replace them. We planned to disconnect the JBODs attached to the existing ifevo1 and 2 and attach them to the new ifevo3 and 4 respectively.
In preparation for the controller replacement, and given the spare space on ifevo2, we were able (through the wonders of AFS) to move all the AFS space from ifevo1 onto ifevo2 without interrupting access to that AFS space. This left ifevo1 devoid of any data, so any servers that had been accessing data from it were configured to no longer do so.
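The live migration relies on AFS’s ability to move a volume between fileserver partitions while clients carry on using it. A sketch of the sort of commands involved, using OpenAFS’s `vos` tool (the server, partition and volume names here are illustrative, not our real ones):

```shell
# Move one AFS volume from a partition backed by ifevo1 to a
# partition backed by ifevo2; clients keep working throughout:
vos move -id user.example \
         -fromserver afsserver1 -frompartition /vicepa \
         -toserver afsserver1 -topartition /vicepb

# Once all volumes have been moved, confirm the old partition
# really is empty:
vos listvol afsserver1 /vicepa
```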
While ifevo1 was free of any data, and nothing depended on it, we took this opportunity to update firmware, which required reboots of the array, and as expected this didn’t cause any problems as nothing was accessing it.
We also installed the new ifevo3, which during its commissioning required some restarts and disconnections from the fabrics, but again this caused no problems, as none of the existing servers had been configured to access it.
The final stage was then to shut down both ifevo1 and the new ifevo3 so that the JBOD could be disconnected from one and reattached to the other. Prior to shutting down ifevo1, a test server (crocotta) was configured to access data on the JBOD while it was attached to ifevo1, so that we could see what would happen when the JBOD reappeared connected to ifevo3, since we plan to do this for real when replacing ifevo2 with ifevo4. Because ifevo1 now had “live” (albeit test) data on it, we shut down crocotta first, then shut down both ifevo1 and ifevo3 via their web management interfaces.
As expected, this was fine, as nothing active was running and accessing ifevo1 or 3 (crocotta had been shut down). If anything had been, we’d have noticed when the arrays were shut down. The normal AFS servers were all only accessing ifevo2, which was still up and running.
A short while later we then physically removed the power from ifevo1 and 3, and at this point something unexpected happened. All the other servers on the SAN, including those not accessing any of the ifevos, started reporting problems with their FC connections. Mostly machines just noted an error and carried on, but some (probably the ones that were actively using their FC connection at the time) issued “DEVICE RESET” commands, followed by “DEVICE RESET FAILED” and then “TARGET RESET” messages. These messages were for both FC connections (one for each fabric) at the same time, so there was no remaining good route via either fabric to their disk arrays. Eventually, once the resets had sorted themselves out, some connections (the active SAN ones) were left off-line. It was these off-line connections that held the now affected AFS areas.
The operating system on the servers had detected the problem accessing the attached storage and, to prevent further damage, remounted the filesystems in “read only” mode. To recover from this we unmounted the affected filesystems (AFS partitions) and ran “fsck” to check the underlying consistency of each filesystem. We then remounted the AFS partitions in the normal read-write mode, and any AFS volumes that didn’t reattach were salvaged (salvaging is AFS’s equivalent of fsck). All of this was carried out without affecting those whose AFS space had been unaffected (even though it may have been on the same server). It took about 2 hours to salvage all the affected volumes, so by about 8pm all the AFS space was available again. There may have been brief (less than a minute) breaks while AFS processes were restarted to reattach the AFS partitions, but generally these wouldn’t have been noticed.
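The recovery for one affected partition looked roughly like the following. This is a hedged sketch rather than a transcript of what we ran: the partition, device and volume names are made up, and the exact `fsck` and `bos` invocations depend on the filesystem and OpenAFS version in use:

```shell
# The kernel had remounted the filesystem read-only; detach it:
umount /vicepa

# Check and repair the underlying filesystem:
fsck -y /dev/mapper/evo2-vicepa

# Remount read-write and restart the fileserver processes so the
# partition is reattached (this is the sub-minute break mentioned):
mount /vicepa
bos restart afsserver1 -instance fs -localauth

# Salvage any volume that failed to reattach:
bos salvage -server afsserver1 -partition /vicepa \
            -volume user.example -localauth
```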
What can we learn?
There’s little doubt that turning off either or both of ifevo1 and ifevo3 caused the problem (unfortunately the power was removed from both simultaneously, so we can’t say which one, if either, was the specific cause). We have done this sort of thing before without incident. The only differences are that we’ve not previously turned off two disk arrays at the same time (this time we used a remotely controlled plug bar to cut power to all the plugs at once), and that in the past, when we’ve turned off one array, we’ve done it manually by pressing power buttons. Possibly the slight delay between powering down the individual controllers in an array allows the two fabrics to reconfigure at slightly different times, so that there is always at least one fabric responding, and so always one active path between the other servers and the disk arrays on the fabric.
So it would seem sensible that the next time we plan to turn off a disk array, we should physically remove its FC connections one at a time, allowing any fabric reconfiguration to complete before removing the next connection. Basically we want to avoid having both fabrics reconfiguring at the same time, in case that was what caused the other servers to lose contact with the other arrays on the SAN.
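One way that staged disconnection might look in practice, checking from the other servers between each step (a sketch only; the exact `multipath -ll` output format varies, so the grep pattern is an assumption):

```shell
# Step 1: pull the array's fabric-A connection only, then on each of
# the other servers confirm their paths have settled and are healthy:
multipath -ll | grep -c 'active ready'

# Step 2: only once every remaining path reports healthy, pull the
# fabric-B connection, re-check the other servers as above, and then
# power the array down.
```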
As mentioned earlier, we will have to carry out a similar procedure when replacing ifevo2 with ifevo4, so there is a risk that we could cause the same problem again. However, this time we will not have the luxury of being able to move all the data off ifevo2 prior to the move of its JBODs. This means we’ll have to schedule downtime for the data and servers using ifevo2. As we’d shut things down cleanly, there would be no delay for fsck or salvaging when things powered back up, so the downtime should be minimal. We will also try removing the FC connections one at a time, as suggested above, before turning things off, to limit the possibility of unrelated SAN traffic being affected.