Dealing with a dropped disk in Transparent RAID

Prerequisite

Restoring a failed disk in Transparent RAID

Overview

A disk will be dropped from the Transparent RAID if it fails.
Nonetheless, a disk failure can sometimes be temporary or the result of a glitch.

The array drops a disk when, during a write attempt, it cannot communicate with the disk and all probing attempts have failed.
Sometimes the cause is a driver problem, or the disk locking up because of a firmware problem.

A dropped disk might come back online and function properly again, usually after a reboot.

Big fat warning

Once a disk is dropped from an array, the proper procedure is to recover that disk.
The next section discusses an alternative action, which is to unfail the disk, but make sure you fully understand the implications of doing so.

Once a disk is dropped, all changes (writes) happen through the live data reconstruction algorithm. The only way to capture those changes is by restoring the dropped disk.
Choosing to unfail a disk rather than recover it will discard all changes written to that disk while it was in the dropped state. The implications of discarding those changes depend entirely on how you use your data.
A failed read operation will never cause a disk to be dropped, as there is no integrity risk: the read is simply completed through live reconstruction. So, if you see a dropped disk, it means a write operation was executed through live reconstruction and the disk was dropped because of the data integrity implications.
So, make sure to fully validate the integrity of your data before choosing to unfail rather than to recover.
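
To make the warning concrete, here is a toy model of a write that lands while a disk is dropped. It assumes a single XOR parity block over two data disks, which is an illustration only: the text does not describe how Transparent RAID computes parity internally, and the names xor_blocks, disk_a and disk_b are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Toy array (assumption): two data disks and one XOR parity block.
disk_a = b"AAAA"
disk_b = b"BBBB"
parity = xor_blocks([disk_a, disk_b])

# Disk B is dropped. A write of b"bbbb" aimed at disk B cannot reach the
# disk itself, so the change is captured in the parity only.
new_b = b"bbbb"
parity = xor_blocks([disk_a, new_b])

# Option 1: recover. Disk B is rebuilt from the surviving disk and the
# parity, so the write made during the dropped state is preserved.
recovered_b = xor_blocks([disk_a, parity])

# Option 2: unfail, then Verify & Sync. The stale contents of disk B are
# trusted and the parity is recomputed from them, so the write is lost.
unfailed_b = disk_b
resynced_parity = xor_blocks([disk_a, unfailed_b])

print("recovered:", recovered_b)
print("unfailed: ", unfailed_b)
```

In this model the recovered disk holds the new data, while the unfailed disk still holds the old data and the resynced parity no longer carries any trace of the write.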

Dealing with a dropped disk

Note: in the instructions below, if the failed disk is your only parity disk, it is best to recreate the parity rather than restore it.
You can restore a PPU, but if it is your only PPU that has failed, recreating it is going to be faster.
You will need to override the configuration, swap the disk, and unfail it before you can recreate the parity.
When in multi-PPU mode, it is best to restore the failed PPU in offline mode. An online restore will work, but the disk will be overtaxed by the combined load of the recovery operation and the parity updates from regular data writes.

The way to deal with a dropped disk is as follows:

1. Stop the array.

2. Test the dropped disk to see whether it is still functioning by running SMART tests on it, using the tRAID UI or a third-party SMART program.

3. If the disk has failed, replace it and restore the data.

4. If the disk has not failed, the next step is to check the integrity of its data.
You do that by going to the Windows Disk Management console, bringing the source disk online, assigning it a drive letter, and checking the data by whichever means is appropriate.
If you have no reliable way to validate the integrity of your data, you should strongly consider recovering the data, preferably to another disk.
The reason this is being done with the array offline is so that the parity is not affected by these activities.
You should also check the Windows event viewer for possible clues as to why the disk was dropped.

5. If the data has issues, you should recover the disk. You can recover the data to the original disk or recover it to another disk.

6. If the data is fine, you should go back to the tRAID UI, override the configuration and unfail the disk, and then run the Verify & Sync task on the array.
Overriding configuration
After overriding the RAID configuration as above, right-click the failed disk and choose to unfail it.

After unfailing the disk, you can, if you wish, start the array before running the Verify & Sync task, so that you have immediate access to the data while the parity sync runs in the background.
After the Verify & Sync has completed, make a note of the blocks that were recomputed. Those are the blocks where data changes made while the disk was in a dropped state were discarded.
Note that by choosing not to recover the failed disk and merely reinstating it, you will lose any data that was written to the affected disk while it was in the dropped state.
You should consider recovering the disk instead of unfailing it if you care about any data that might have been written to the disk while it was in the dropped state.
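
Step 4 leaves the integrity check to whichever means is appropriate. One possible means, sketched below, is to compare SHA-256 checksums of the files on the mounted disk against a manifest taken while the data was known to be good. This assumes you already have such a manifest; the helper names (sha256_of, build_manifest, compare) and the drive letter in the usage note are hypothetical and are not part of tRAID.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare(known_good, current):
    """Return (missing, changed) file lists relative to the known-good manifest."""
    missing = sorted(set(known_good) - set(current))
    changed = sorted(
        name
        for name in known_good
        if name in current and known_good[name] != current[name]
    )
    return missing, changed
```

With the array stopped and the disk mounted at a hypothetical drive letter such as X:, you would call compare(saved_manifest, build_manifest("X:/")) and review both lists. A clean comparison only shows that the files in the manifest are intact; anything written while the disk was dropped never reached the disk and will not show up there.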


What’s next?

RAID Expansion in Transparent RAID
