Restoring a failed disk in Transparent RAID

Pre-requisite

Creating a Log RAID Configuration

Notice

When replacing a failed disk in your Transparent RAID array, you do not need to format the replacement disk, and you should not.
The engine will overwrite the replacement disk, so its existing content does not matter. However, it is best not to create a volume on the replacement disk, to avoid issues such as the operating system still holding a handle on that volume.
The replacement disk should ideally be a raw disk (a disk without a partition or volume).

Tip: making use of live data reconstruction and recovering by just copying…

One thing that is not always apparent is that tRAID reconstructs a failed disk live such that one is able to read/write the data that would have resided on the failed disk. If you don’t have a replacement disk handy but have free space on another disk, one recovery option would be to copy the data from the failed tRAID disk to the other tRAID disk(s) with free space.

To do this, you first need to disable Storage Pooling caching if not already done, add drive letters to the tRAID disks, and then copy the data from the failed disk to the other tRAID disks under a folder. Finally, you would delete the RAID configuration and re-create one using the surviving disks. Again, this option is only if you don’t have a replacement disk available.
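
The copy step in this tip can also be scripted. Below is a minimal Python sketch, not part of FlexRAID: the function names `tree_size` and `copy_if_space`, and the drive letters and folder name in the trailing comment, are illustrative assumptions. It only checks that the destination has enough free space and then copies the data into a new folder. It assumes Storage Pooling caching is already disabled and drive letters are already assigned, as described above.

```python
import shutil
from pathlib import Path


def tree_size(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def copy_if_space(source: Path, dest_parent: Path, folder_name: str) -> Path:
    """Copy source into dest_parent/folder_name if the destination has room."""
    needed = tree_size(source)
    free = shutil.disk_usage(dest_parent).free
    if needed > free:
        raise OSError(
            f"need {needed} bytes but only {free} bytes are free on {dest_parent}"
        )
    target = dest_parent / folder_name
    # copytree raises an error if the target folder already exists, which
    # guards against overwriting data already on the surviving disk.
    shutil.copytree(source, target)
    return target


# Hypothetical usage, assuming the failed (live-reconstructed) disk has a
# data folder at E:/data and a surviving disk with free space is mounted as F:
# copy_if_space(Path("E:/data"), Path("F:/"), "recovered_from_failed_disk")
```

Copying into a dedicated folder keeps the recovered data separate from what is already on the surviving disk, which makes it easier to sort out after the RAID configuration is re-created.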

Similarly, users trying to be extra careful can copy the data off the failed disk to another disk just in case something goes wrong during recovery.
Note that these are just tips. The proper recovery procedures for tRAID are explained further below.
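
For readers wondering how data on a failed disk can still be read, the sketch below illustrates the general idea behind single-parity reconstruction using XOR. This is a textbook illustration only: the source does not describe how the tRAID engine computes parity, so the XOR scheme, the function names, and the toy disk contents here are all assumptions.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


# Three toy "data disks" holding one 4-byte block each (made-up contents).
dru1 = b"\x01\x02\x03\x04"
dru2 = b"\x10\x20\x30\x40"
dru3 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([dru1, dru2, dru3])

# Pretend DRU2 failed: its block is recomputed from the other disks and parity.
rebuilt = reconstruct([dru1, dru3], parity)
assert rebuilt == dru2
```

Because XOR is its own inverse, any one missing block can be recomputed on demand from the remaining blocks, which is what makes serving reads for a failed disk possible in a scheme like this.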

The configuration override feature

It is highly recommended that users test the recovery features in Transparent RAID and get comfortable with the tasks involved.
You shouldn’t wait for a disk failure to put the system that is protecting your data to the test.
You can skip this section and go to the next if you already have a failed disk. Otherwise, we will now discuss how you can use the override feature to simulate disk failures.

To aid in simulating failure and recovery scenarios, the Web UI has an override feature that lets you fail and swap disks that are part of your array.

1. The override feature is engaged through the “Advanced Operations” menu as shown below.
1. Configuration Override

2. When engaged, the override feature will show a menu when you right-click on the grid with options to fail/unfail/swap a disk.
2. Fail a disk

3. A disk can be failed while the array is either offline or online. Both modes should be explored.
For this tutorial, we are failing the disk while the array is online.
If notifications are set up, you will be notified of the failure. In addition, log entries will be present in the Web UI log file, and the disk will show as failed in the UI.
3. Failed Disk

Restoring data while the array is online vs offline

The Transparent RAID platform is extremely powerful in that almost all operations can be executed while the array is either online or offline. Disks can be failed/unfailed/swapped/verified/restored while the array is live and being accessed by other applications.
Doing things while the array is online provides great convenience and eliminates downtime.

Nonetheless, it is best to execute the Restore operation while the array is offline to minimize interference and run the process at the greatest I/O speed possible. There is a performance penalty to executing the restore operation while the array is online. As the restore operation is of the greatest sensitivity, you want to execute it within the most sterile environment possible. You should do an online restore only if you really cannot afford any downtime.

Dealing with a disk failure

1. This tutorial will showcase an array that is online with a failed disk.
Because of the parity protection and live data reconstruction, your OS will be oblivious to the disk failure.
As shown in the screenshot below, DRU2 has failed, but it is being reconstructed live such that the system is not impacted whatsoever.
4. Live Reconstructed Disk

2. If the override feature is engaged, the “Un-fail” and “Swap” menu options become available for a failed disk.
That is, you can only swap a disk that has failed or has been marked as failed.
5. Swap or Un-fail in Override mode

3. Once a disk has failed, restoring it is very straightforward. Click on the “Restore” button and choose the replacement disk.
You also have the option of restoring to the disk already in the array. This choice applies when a disk dropped out of the array but was later proven to be in working condition, or when you have already swapped the failed disk with a working disk through the override menu.
6. Swap & Restore

4. As documented in http://wiki.flexraid.com/2013/06/22/creating-a-log-raid-configuration/, a Log RAID can optionally be specified, and doing so is strongly recommended. Please refer to the linked topic and tooltips as shown in the screenshot below for further details.
7. Restore options

5. For this tutorial, we are choosing to restore the failed disk (DRU2) to DRU3 and to use a Log RAID.
8. Restoring to DRU3 With RAID Log

6. The Restore task will launch and you will be able to track its progress through the status window.
9. Restore Results

7. After the restore operation has completed, it is a good idea to run the Verify+ operation to double-check the state of the array.
10. Restored

8. Below is the result of the Verify+ task on our demo RAID following an online disk restoration.
11. Verify+ Results
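
Users who want an extra check that is independent of Verify+ could record file checksums before the restore and compare them afterwards. The following Python sketch is one possible approach, not a FlexRAID feature: the function names `build_manifest` and `diff_manifests` are hypothetical, and it assumes the disk is reachable through a drive letter or folder path.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def diff_manifests(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files that went missing, appeared, or changed content."""
    common = before.keys() & after.keys()
    return {
        "missing": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(k for k in common if before[k] != after[k]),
    }
```

Building one manifest from the live-reconstructed disk before the restore and another from the restored disk afterwards, then passing both to `diff_manifests`, should yield three empty lists if the restore reproduced every file exactly.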


What’s next?

Dealing with a dropped disk in Transparent RAID
