AutoReseed magically repairs a FailedAndSuspended Mailbox database copy.

In my previous blog post on Implementing and Configuring AutoReseed I explained how to implement and configure AutoReseed in a Database Availability Group. In this blog post I will explain what happens when a disk fails and AutoReseed kicks in.

We have configured a DAG in which each Exchange 2013 server has 3 disks: 2 disks contain Mailbox databases and the third disk acts as a hot spare. If disk 1 fails the server should automatically reconfigure the spare disk and automatically reseed the Mailbox databases to this disk. The best way to test this is to set disk 1 offline in the Disk Management MMC snap-in.

When the disk is offline Exchange will notice almost immediately and activate the copy of the Mailbox databases on the 2nd Exchange server as expected. This is clearly visible when we execute a Get-MailboxDatabaseCopyStatus command. The Mailbox databases on AMS-EXCH01 are in a FailedAndSuspended state while they are mounted on server AMS-EXCH02.

image

What happens next is that a repair workflow is started. The workflow will try to resume the failed Mailbox database copy, and if this fails it will assign the spare volume in place of the failed disk. This is the exact workflow:

  1. The workflow detects a Mailbox database copy that has been in a Failed and Suspended state for 15 minutes.
  2. Exchange tries to resume the failed Mailbox database copy 3 times, at 15-minute intervals.
  3. If Exchange cannot resume the failed copy, it tries to assign a spare volume up to 5 times, at 1-hour intervals.
  4. Exchange tries an InPlaceSeed with the SafeDeleteExistingFiles option up to 5 times, at 1-hour intervals.
  5. If all retries are completed without success, the workflow stops. If one of them succeeds, Exchange finishes the reseeding.
  6. If everything fails, Exchange waits for 3 days, checks whether the Mailbox database copy is still in a Failed and Suspended state, and if so starts the workflow again from step 1.
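To get a feeling for how long this can take, the retry counts and intervals from the list can be added up. The following is only a rough sketch in Python, not something Exchange exposes: it assumes that the first attempt of each stage fires as soon as the stage begins, that the next stage begins one interval after the last failed attempt, and that the 3-day wait starts right after the last seed attempt. The names `STAGES` and `worst_case_timeline` are made up for this illustration.

```python
from datetime import timedelta

# Counts and intervals come from the numbered list above. How Exchange
# actually schedules the attempts is an ASSUMPTION of this sketch: the first
# attempt of a stage fires as soon as the stage begins, and the next stage
# begins one interval after the last failed attempt.
DETECTION_DELAY = timedelta(minutes=15)
STAGES = [
    ("resume copy", 3, timedelta(minutes=15)),
    ("assign spare volume", 5, timedelta(hours=1)),
    ("in-place seed", 5, timedelta(hours=1)),
]
RESTART_DELAY = timedelta(days=3)


def worst_case_timeline():
    """Return (label, offset from disk failure) pairs, assuming every attempt fails."""
    elapsed = DETECTION_DELAY
    events = [("copy detected as FailedAndSuspended", elapsed)]
    for name, attempts, interval in STAGES:
        for attempt in range(1, attempts + 1):
            events.append((f"{name}, attempt {attempt}", elapsed))
            elapsed += interval
    # Assumed: the 3-day wait is counted from the end of the last stage.
    events.append(("workflow restarts from step 1", elapsed + RESTART_DELAY))
    return events


if __name__ == "__main__":
    for label, offset in worst_case_timeline():
        print(f"{str(offset):>16}  {label}")
```

Under these assumptions the first attempt to assign a spare volume lands one hour after the disk failure, and the last in-place seed attempt ten hours after it. Real timings will differ depending on when Exchange actually notices the failure and how long each attempt runs.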

All events are logged in the event log. There’s a special crimson channel for this which you can find in Applications and Services Logs | Microsoft | Exchange | HighAvailability | Seeding.

image

The first event that’s logged is EventID 1109 from the Auto Reseed Manager, indicating that something is wrong and that no data can be written to location C:\ExDbs\AMS-MDB01\AMS-MDB01.log. This makes sense since the disk has actually ‘failed’ and is no longer available.

image

Subsequent events in the event log show the Auto Reseed Manager attempting to resume the copy of the Mailbox databases on the failed disk. As outlined earlier it will try this three times, once every 15 minutes, followed by an attempt to reassign a spare disk. Please note that it took almost an hour before Exchange moved on to this step.

image

When the disk is successfully reassigned Exchange will automatically start reseeding the replaced disk, indicated by EventID 1127 (still logged by the Auto Reseed Manager):

image

Depending on the size of your Mailbox databases it can take quite a long time for this step to finish.

You can use the mountvol utility again to check the new configuration. If all went well you’ll see the Mailbox databases now on Volume 3 as shown in the following figure:

image

From a Mailbox database point of view nothing has really changed. The first database is located at C:\ExDbs\AMS-MDB01\AMS-MDB01.db and this location has not changed. It is only the underlying volume (mount point) that has changed and this is fully transparent for the Mailbox database.

At this point it is up to the administrator to replace the faulty disk, format it and mount the disk in the C:\ExchVols directory. As you can see, there is no longer any need to perform manual reseeds, which lowers the administrative burden.

Statistically, if you have 10 disks in your Exchange 2013 Mailbox server and you have configured three spare disks, it should be sufficient to check once a year for failed disks and replace them accordingly. For a typical Exchange deployment one spare disk might be sufficient, but for large deployments (you should not be surprised that this originates from Exchange Online) with maybe thousands of volumes this becomes very interesting.
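To put a rough number on the "once a year" idea, here is a small back-of-the-envelope sketch in Python. It is my own illustration, not a Microsoft calculation: it assumes that disks fail independently of each other with the same yearly probability (a simple binomial model), and the 5% annual failure rate is purely a hypothetical input that you should replace with figures for your own hardware. The function name `prob_spares_suffice` is made up for this example.

```python
from math import comb


def prob_spares_suffice(disks, spares, annual_failure_rate):
    """Probability that at most `spares` of `disks` fail within one year.

    ASSUMPTION: failures are independent and every disk has the same yearly
    failure probability (binomial model). Disks in one chassis may in
    reality fail in correlated ways.
    """
    p = annual_failure_rate
    return sum(
        comb(disks, k) * p**k * (1 - p) ** (disks - k)
        for k in range(spares + 1)
    )


# HYPOTHETICAL input: a 5% annual failure rate is an illustrative guess,
# not a figure from Microsoft or from this post.
ASSUMED_AFR = 0.05

if __name__ == "__main__":
    print(f"{prob_spares_suffice(10, 3, ASSUMED_AFR):.4f}")
```

With these assumed inputs the probability that three spares cover a full year comes out well above 99%. The result is only as good as the failure rate you feed in, so treat it as a way of reasoning about spare counts rather than as a sizing rule.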
