Disable Automatic Chkdsk on Windows Failover Cluster

These days it is pretty rare to be prompted to run a chkdsk on a Microsoft system; I certainly remember it being a more frequent occurrence on older hardware and OS versions. That said, there are still occasions where Windows will prompt for a chkdsk, and this post deals with one particular case: automatic chkdsk on Microsoft failover cluster volumes.

At work we have many file shares of various sizes, and one of them, a 16TB volume, currently requires a chkdsk. I know that Microsoft have made a lot of improvements to chkdsk (e.g. /spotfix to repair in seconds), but this volume requires a ‘full’ chkdsk, so those improvements have no bearing on the matter. The drive was attached to a server which we had to ‘blue screen’ to create a full memory dump while troubleshooting an issue. As a result the volume’s dirty bit was set, flagging it for a chkdsk. We know from prior experience that the check will take about 30 hours to complete, resulting in a long outage for our users. Because the volume is a resource in a failover cluster role, the chkdsk starts automatically whenever the role moves to a different cluster node. It is possible to kill the task with something like Process Explorer by searching for the rundll process that is associated with chkdsk and killing it. That will bring the volume online; however, a neater solution is to change the automatic chkdsk parameter to prevent it from running and then schedule the work for an agreed outage.
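For reference, the dirty bit mentioned above can be inspected with fsutil from an elevated prompt. This is a minimal sketch, assuming the volume is mounted with a drive letter; E: is a placeholder and not one of our cluster volumes.

```powershell
# Query the NTFS dirty bit for a volume (run from an elevated prompt).
# E: is a hypothetical drive letter; substitute the volume you want to inspect.
fsutil dirty query E:
```

The command reports whether the volume is currently flagged as dirty, which is the flag that triggers the automatic chkdsk described in this post.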

 

Get Current Chkdsk Setting

First let us check the values on the cluster disks. To do this I ran the PowerShell code below, which queries the cluster for all physical disks and pipes them to the next cmdlet to retrieve the DiskRunChkDsk property and its value.

Get-ClusterResource | where {$_.ResourceType -eq "Physical Disk"} | Get-ClusterParameter -Name DiskRunChkDsk

 

PS C:\> Get-ClusterResource | where {$_.ResourceType -eq "Physical Disk"} | Get-ClusterParameter -Name DiskRunChkDsk

Object                        Name                          Value                         Type
------                        ----                          -----                         ----
DATIX Share                   DiskRunChkDsk                 0                             UInt32
Dental                        DiskRunChkDsk                 0                             UInt32
DML                           DiskRunChkDsk                 0                             UInt32
Images                        DiskRunChkDsk                 0                             UInt32
Fax Share                     DiskRunChkDsk                 0                             UInt32
File Services                 DiskRunChkDsk                 4                             UInt32
Lync Share                    DiskRunChkDsk                 0                             UInt32
Optomize                      DiskRunChkDsk                 0                             UInt32
SCCM Share                    DiskRunChkDsk                 0                             UInt32
SQL Share                     DiskRunChkDsk                 0                             UInt32
Users                         DiskRunChkDsk                 0                             UInt32

 

DiskRunChkDsk Parameter Values and Definitions

OK, so now we have the value for the DiskRunChkDsk property, but what do the values mean? The following list, taken from Microsoft documentation, gives the possible cluster disk parameter values along with a description of the impact of each value.

  • DiskRunChkDsk <0x0>:
    • This is the default setting for all Failover Clusters. This policy will check the volume to see if the dirty bit is set and it will perform a Normal check of the file system. The Normal check is similar to running the DIR command at the root. If the dirty bit is set or if the Normal check returns a STATUS_FILE_CORRUPT_ERROR or STATUS_DISK_CORRUPT_ERROR, CHKDSK will be started in Verbose mode (Chkdsk /x /f).
  • DiskRunChkDsk <0x1>:
    • This setting will check the volume to see if the dirty bit is set and it will perform a Verbose check (scan the volume by traversing from the volume root and checking all the files) of the file system. If the dirty bit is set or if the Verbose check returns a STATUS_FILE_CORRUPT_ERROR, CHKDSK will be started in normal mode (Chkdsk /x /f).
  • DiskRunChkDsk <0x2>:
    • This setting will run CHKDSK in Verbose mode (Chkdsk /x /f) on the volume every time it is mounted.
  • DiskRunChkDsk <0x3>:
    • This setting will check the volume to see if the dirty bit is set and it will perform a Normal check of the file system. The Normal check is similar to running the DIR command at the root. If the dirty bit is set or if the Normal check returns a STATUS_DISK_CORRUPT_ERROR, CHKDSK will be started in Verbose mode (Chkdsk /x /f), otherwise CHKDSK will be started in read only mode (Chkdsk without any switches).
  • DiskRunChkDsk <0x4>:
    • This setting doesn’t perform any checks at all.
  • DiskRunChkDsk <0x5>:
    • This setting will check the volume to see if the dirty bit is set and it will perform a Verbose check (scan the volume by traversing from the volume root and checking all the files) of the file system. If a problem is found, CHKDSK will not be started and the volume will not be brought online.
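To make the query output easier to read, the values can be shown alongside a short description. The following is a minimal sketch: the summaries are my own paraphrase of the list above, and it assumes the objects returned by Get-ClusterParameter expose the disk through a ClusterObject property (the column shown as "Object" in the output).

```powershell
# Short summaries of the DiskRunChkDsk values (paraphrased from the list above).
$chkdskPolicy = @{
    0 = 'Default: Normal check; chkdsk /x /f if dirty bit set or corruption found'
    1 = 'Verbose check; chkdsk /x /f if dirty bit set or corruption found'
    2 = 'Always run chkdsk /x /f when the volume is mounted'
    3 = 'Normal check; chkdsk /x /f on dirty bit or corruption, otherwise read-only chkdsk'
    4 = 'No checks at all'
    5 = 'Verbose check; if a problem is found, no chkdsk and the volume stays offline'
}

# List each physical disk with its current value and the matching summary.
# ClusterObject is assumed to be the property holding the disk resource.
Get-ClusterResource |
    Where-Object { $_.ResourceType -eq 'Physical Disk' } |
    Get-ClusterParameter -Name DiskRunChkDsk |
    Select-Object @{ Name = 'Disk';    Expression = { $_.ClusterObject.Name } },
                  Value,
                  @{ Name = 'Meaning'; Expression = { $chkdskPolicy[[int]$_.Value] } }
```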

 

Note that one of our file shares (File Services) had already been changed from the default value (0) to a value of 4. This means that if the disk moves between cluster nodes with the dirty bit set, it will not try to run a chkdsk when coming online.

 

Modify DiskRunChkDsk Parameter Value

I’m going to modify the DiskRunChkDsk parameter value for all of the volumes hosted by this failover cluster to a value of 4 (0x4). This will prevent them from trying to run a chkdsk when coming online. Again I will use PowerShell to achieve this:

Get-ClusterResource | where {$_.ResourceType -eq "Physical Disk"} | Set-ClusterParameter -Name DiskRunChkDsk -Value 4 -Verbose

 

PS C:\> Get-ClusterResource | where {$_.ResourceType -eq "Physical Disk"} | Set-ClusterParameter -Name DiskRunChkDsk -Value 4 -Verbose

WARNING: The properties were stored, but not all changes will take effect until DATIX Share is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until Dental is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until DML is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until Images is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until Fax Share is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until File Services is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until Lync Share is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until Optomize is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until SCCM Share is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until SQL Share is taken offline and then online again.
WARNING: The properties were stored, but not all changes will take effect until Users is taken offline and then online again.

PS C:\>
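If you would rather change a single disk than every physical disk in the cluster, the same cmdlets can be pointed at one resource by name. This is a sketch only; 'Users' is one of the disk names from the output above, used here as an example, so substitute the disk you actually need.

```powershell
# Change the policy for one named disk resource only.
# 'Users' is an example name; replace it with the disk you want to modify.
Get-ClusterResource -Name 'Users' |
    Set-ClusterParameter -Name DiskRunChkDsk -Value 4

# Read the value back for that one disk.
Get-ClusterResource -Name 'Users' |
    Get-ClusterParameter -Name DiskRunChkDsk
```

As with the cluster-wide change, the stored value only takes effect once the resource has been taken offline and brought online again.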

 

Confirm New Value

Finally we can check the DiskRunChkDsk property to confirm the change. Remember the warnings in the output above: the setting only comes into effect after the resource has gone offline and then online again. This can be achieved either by taking the role offline manually and bringing it back online, or by simply migrating the role to another cluster node. Using the same PowerShell as above we can get the new values:

Get-ClusterResource | where {$_.ResourceType -eq "Physical Disk"} | Get-ClusterParameter -Name DiskRunChkDsk

 

PS C:\> Get-ClusterResource | where {$_.ResourceType -eq "Physical Disk"} | Get-ClusterParameter -Name DiskRunChkDsk

Object                        Name                          Value                         Type
------                        ----                          -----                         ----
DATIX Share                   DiskRunChkDsk                 4                             UInt32
Dental                        DiskRunChkDsk                 4                             UInt32
DML                           DiskRunChkDsk                 4                             UInt32
Images                        DiskRunChkDsk                 4                             UInt32
Fax Share                     DiskRunChkDsk                 4                             UInt32
File Services                 DiskRunChkDsk                 4                             UInt32
Lync Share                    DiskRunChkDsk                 4                             UInt32
Optomize                      DiskRunChkDsk                 4                             UInt32
SCCM Share                    DiskRunChkDsk                 4                             UInt32
SQL Share                     DiskRunChkDsk                 4                             UInt32
Users                         DiskRunChkDsk                 4                             UInt32

PS C:\>
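The offline and online cycle described above can also be done from PowerShell. A minimal sketch, assuming a role called 'FileServerRole' and a second node called 'NODE2'; both names are hypothetical placeholders.

```powershell
# Hypothetical role and node names; substitute your own.

# Option 1: take the role offline and bring it back online.
Stop-ClusterGroup -Name 'FileServerRole'
Start-ClusterGroup -Name 'FileServerRole'

# Option 2: move the role to another node, which also cycles its resources.
Move-ClusterGroup -Name 'FileServerRole' -Node 'NODE2'
```

Either option interrupts access to the shares in that role for a short time, so it is worth doing in a quiet period.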

 

 


 

It goes without saying that one should not ignore a chkdsk requirement. We are disabling this option as the cluster roles will move around nodes and between datacenters and therefore without modifying the option we would have long outages while the check ran against the volume. We would prefer to schedule this downtime with the business so we can plan for the outage.
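When the agreed outage arrives, one possible way to run the check is to put the disk into cluster maintenance mode, run chkdsk, and then restore the default policy. Treat this as a sketch under stated assumptions rather than a tested procedure: 'Users' and E: are placeholder names, and whether to restore the value to 0 afterwards depends on your own policy.

```powershell
# Hypothetical names: 'Users' is the cluster disk resource, E: is its drive letter.
$disk = Get-ClusterResource -Name 'Users'

# Turn on maintenance for the disk so the cluster does not react
# while chkdsk has the volume locked.
$disk | Suspend-ClusterResource

# Full repair pass; this is the long-running step.
chkdsk E: /f

# Turn maintenance off again once chkdsk has finished.
$disk | Resume-ClusterResource

# Optionally return to the default policy (0) afterwards.
$disk | Set-ClusterParameter -Name DiskRunChkDsk -Value 0
```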

I am very much looking forward to the day when we can start to leverage Microsoft ReFS (Resilient File System) in place of NTFS for all services. It completely does away with chkdsk, which would make this issue a thing of the past. Unfortunately ReFS isn’t quite there yet, but I am keeping a close eye on it and hope to see it in use in the near future.

5 thoughts on “Disable Automatic Chkdsk on Windows Failover Cluster”

  1. Thanks for this post. It did help!

    A year ago my failover cluster “failed over” and 24 hours after a chkdsk got it back.

    Well It just did it again about two weeks ago and had to wait 30 hours.

    Just changed the settings and instead of draining roles it had to fail over ( I suspect a server issue) but no chkdsk!

    Thanks again.

    D~!
