Yesterday evening our SAN storage device, satabeast2, suffered two disk failures. We can cope with one disk failure (RAID5), but while redundancy is being restored after that first failure there is a window during which a second disk failure will cause data loss. This is what happened yesterday.
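RAID5's single-failure tolerance comes from a parity stripe: the parity block is the XOR of the data blocks, so any one missing block can be rebuilt from the survivors, but two missing blocks cannot. A minimal illustrative sketch (not how the satabeast firmware actually works):

```python
# Illustrative RAID5 parity: parity is the XOR of the data blocks,
# so any single lost block can be rebuilt from the remaining ones.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data = [b"disk", b"one!", b"two."]      # data blocks on three disks
parity = xor_blocks(data)               # parity block on a fourth disk

# Simulate losing disk 1: rebuild it from the other data blocks + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Lose TWO blocks (the situation described above) and reconstruction is
# impossible: the remaining XOR no longer determines either missing block.
```

This is why the rebuild window after the first failure matters: until parity is re-established on a spare disk, a second failure leaves the array with two unknowns and one equation.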
There is a recovery-mode feature of the satabeast that will attempt to recover data from marginal disk failures, but this requires rebooting the array, which will affect access to all data on that array (whether or not it is currently affected by the disk failures).
We plan to start this process at 11am today. It may take several hours to complete, but we will try to make it as short as possible.
Below is a list of the affected areas.
/group/project/statmt5                      currently affected by disk failures
/group/project/statmt6                      "
/group/project/statmt7                      "
/group/project/statmt8                      "
/group/project/statmt9                      "
/group/project/statmt10                     "
/group/project/statmt11                     "
/group/project/statmt12                     "
/group/project/statmt13                     "
/group/project/ami9                         currently working, but will be affected by the shutdown
/group/project/ami10                        "
/group/project/cstr1                        "
/group/project/cstr2                        "
/group/project/cstr3                        "
/afs/inf.ed.ac.uk/group/project/imr/data    "
We will update this post as work progresses.
12:30pm The rebuild is currently at 8% complete. It’s a 9TB array BTW.
2:15pm 24% complete
4:25pm 45% complete. The areas not affected by the disk failures should be back, e.g. /group/project/ami9.
5:30pm 55% complete. Will continue to monitor it over the weekend.
10:30pm 99% complete.
Saturday 8:15am: Unfortunately that last 1% was too much for it: one of the previously failed disks (that it was trying to recover from) failed again. So it’s looking unlikely we’ll be able to recover the data from statmt5 to statmt13.
Sunday: I have contacted the suppliers of the satabeast (by email) in case they have any other tricks up their sleeves.
Monday – 6/10/2014: Having analysed the logs and details that I sent them, the suppliers are planning on replacing the controller in the satabeast. This is scheduled for 11am onwards tomorrow – Tuesday. During this time the ami, cstr and imr-data areas will be unavailable.
Tuesday – Unfortunately the wrong part was shipped, so we’ll try the controller replacement again, but on Wednesday, same sort of time – 11am.
Wednesday – The controller has been replaced, and we are trying another recovery.
Thursday – The first attempt failed; we are trying a different one, which is currently 66% complete.
Friday 10th October
Rather surprisingly, that last recovery attempt seems to have worked so far, and at least a couple of the group areas seem to be intact. I am running fsck on them and making them available (read only) as they pass the checks. The statmt areas currently back are:
/group/project/statmt5
/group/project/statmt6
/group/project/statmt7
/group/project/statmt8
/group/project/statmt9
/group/project/statmt10
/group/project/statmt11
/group/project/statmt12
/group/project/statmt13
Though the file-systems pass consistency checks, it is possible that the files themselves contain corruption; we can’t know for sure one way or the other. Only the authors of those files would know whether their contents are as expected.
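One way for owners of these files to check for silent corruption is to compare checksums against any copies they hold elsewhere (a backup, a download source, a colleague's copy). A small sketch using Python's standard library — the paths in the usage comment are hypothetical examples, not real locations:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare a recovered file against a backup copy.
# recovered = sha256_of("/group/project/statmt5/corpus.txt")
# backup    = sha256_of("/backup/statmt5/corpus.txt")
# print("intact" if recovered == backup else "possibly corrupted")
```

If no second copy exists, only a manual inspection by the file's author can confirm the contents are as expected.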
5pm That’s all the affected group areas back now. I’ve left them as read only for now so that any possible corruption isn’t made worse by processes trying to open the files. If there’s general agreement that things look normal, we can make them writeable again.
PS If you are actually affected by the loss of data, or use any of the group areas listed above, please let me know (neilb@inf) or leave a comment below.