As many of us know, Active Directory replication problems come in all shapes and sizes. It’s particularly confounding when a DC just won’t replicate... anything... whatsoever. No inbound replication, no outbound replication, not one naming context coming or going. One of the scenarios we’ve seen a fair amount of here in support is when this happens following the restoration of a physical or virtual DC, resulting in a state called USN rollback. If you aren’t familiar with USN rollback, this post will give you a better idea of what it is, how to rule it in or out, and how to fix it when it does happen.
What is USN rollback (at a high level)?
First let’s define a USN. USN stands for update sequence number. Put simply, USNs are how Active Directory keeps track of replication. Let’s use the example of two DCs that are replication partners. When an object is changed on a given domain controller – let’s say a server named DC1 – DC1 increments its USN and stamps the change with it, which marks the change as something its replication partners need to pick up. Put differently, AD says “Hey DC1, DC2 has a change it needs to pick up.” On the other end, DC2 keeps two records of what it already has. Its “up-to-dateness vector” holds the highest originating USN it has received from each DC whose changes have reached it, and its USN “high-watermark” for DC1 keeps track of the most recent change it has received directly from DC1. Together these help DC2 determine whether or not it needs to pull any updates from DC1. All of this is to say that DC1 keeps track of what has been picked up and what needs to be picked up, and so does DC2. Everyone is keeping track, which is a good thing with changes taking place everywhere all the time.
What happens then when DC1 and DC2 have a conversation and DC1 says “You need to pick up change 0004” and DC2 replies “That’s funny, because I’m showing your last change as 0005. Why would you send something you already sent? You know what, we’re breaking up!” Dramatic Active Directory lover’s quarrel aside, in this scenario DC1 is in USN rollback because it is telling DC2 to come get an update DC2 already has. Instead of enduring the havoc that could ensue, the Active Directory code wisely chooses to just shut down replication on DC1 until someone can intervene. For better or worse, if you’re reading this then that someone is you.
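To make the quarrel above a little more concrete, here is a minimal Python sketch of the comparison DC2 is effectively making. Treat it as a toy model rather than how the directory service is really implemented: the function name and the three result strings are invented for illustration, and real replication tracks a lot more state than a single number.

```python
def check_advertised_usn(high_watermark: int, advertised_usn: int) -> str:
    """Compare what a source DC offers with what its partner already holds.

    high_watermark -- highest USN the partner has already received from the source
    advertised_usn -- highest USN the source currently claims to have
    (Both names are hypothetical; they only mirror the story above.)
    """
    if advertised_usn < high_watermark:
        # The source is offering something older than what was already
        # received, which is the tell-tale sign of a rolled-back database.
        return "rollback-suspected"
    if advertised_usn == high_watermark:
        # Nothing new to fetch.
        return "up-to-date"
    # The source has moved ahead, so there are changes to pull.
    return "pull-changes"


# DC2 last received change 5 from DC1, but DC1 now offers change 4.
print(check_advertised_usn(high_watermark=5, advertised_usn=4))
```

The only point of the sketch is the first branch: a source that advertises less than its partner already holds cannot be trusted, so replication stops.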
How to tell if you have it.
There are a variety of indicators, all of which can let you know if the server is in a rollback state. The more of these you see, the more you can suspect it to be the case.
• The server (or the AD database) has been recently restored or a virtual DC reverted from snapshot – This doesn’t just happen on its own. An action on the part of an administrator is required for USN rollback to even be considered as a possible cause. Otherwise, it’s more likely some other AD replication issue.
• The Netlogon service is Paused – This is pretty rare with the exception of USN rollback.
• Inbound & Outbound replication disabled – Check this by running “repadmin /showrepl” from an elevated command prompt and looking for DISABLE_INBOUND_REPL and DISABLE_OUTBOUND_REPL on the “DSA Options” line.
• The registry value “DSA Not Writable” under HKLM\System\CurrentControlSet\Services\NTDS\Parameters is set to 4 – Also not likely to happen outside USN rollback scenarios.
• Directory Services Events – Look for the following events in the Directory Services log: 2095, 1113, 1115. Keep in mind that events like these can have other causes too, and that this log is a great way of tracking down replication problems as a whole.
• Repadmin showutdvec output – Run “repadmin /showutdvec DC1 dc=domain,dc=com” on DC1. Run “repadmin /showutdvec DC2 dc=domain,dc=com” on DC2. If the replication partner has a higher USN value than the DC has for itself, it could indicate a problem.
Here’s an example of the repadmin showutdvec command at work:
Output from server 2008DC
Output from server 2008DC2
Notice that 2008DC has a lower USN value for itself than its partner holds for it. In this instance, the server actually was not in USN rollback. Instead it was having issues with its secure channel. Once I fixed the issue, 2008DC had a higher USN for itself. This illustrates 2 important points about this particular test:
1. I don’t recommend using this command as the single litmus test for USN rollback. The bigger the gap, the better an indicator it is, but I’d still use it in combination with some of these other signs. As with the example above, if there are no other USN rollback symptoms then a replication partner holding a higher USN might simply indicate a garden-variety AD replication issue.
2. This command helps identify which DC might be the problem in a scenario when replication is broken.
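If you’d rather not eyeball the two outputs, the comparison can be sketched in a few lines of Python. A few assumptions to flag: the sample text is made up (it is not the output from the servers above), the line shape “Site\DC @ USN number @ Time timestamp” is my assumption about what showutdvec prints, and parse_utdvec and usn_gap are hypothetical helper names.

```python
import re

# Assumed line shape: "<Site>\<DC> @ USN <number> @ Time <timestamp>"
UTD_LINE = re.compile(r"^\s*[^\\]+\\(?P<dc>\S+)\s+@ USN\s+(?P<usn>\d+)\s+@ Time")


def parse_utdvec(text: str) -> dict:
    """Return {dc_name: usn} for every line that matches the assumed shape."""
    vector = {}
    for line in text.splitlines():
        match = UTD_LINE.match(line)
        if match:
            vector[match.group("dc")] = int(match.group("usn"))
    return vector


def usn_gap(own_output: str, partner_output: str, dc_name: str) -> int:
    """Partner's USN for dc_name minus the USN dc_name holds for itself.

    A positive result means the partner is ahead of the DC's own view,
    which is the condition worth a closer look.
    """
    own = parse_utdvec(own_output)[dc_name]
    partner = parse_utdvec(partner_output)[dc_name]
    return partner - own


# Invented sample data, for illustration only.
FROM_2008DC = r"""
Default-First-Site-Name\2008DC  @ USN     20480 @ Time 2012-01-01 10:00:00
Default-First-Site-Name\2008DC2 @ USN     31000 @ Time 2012-01-01 09:55:00
"""
FROM_2008DC2 = r"""
Default-First-Site-Name\2008DC  @ USN     20600 @ Time 2012-01-01 10:05:00
Default-First-Site-Name\2008DC2 @ USN     31010 @ Time 2012-01-01 10:05:00
"""

print(usn_gap(FROM_2008DC, FROM_2008DC2, "2008DC"))
```

As with the manual check, a positive gap is only a hint. Weigh it alongside the other symptoms before calling it a rollback.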
How to fix it.
You’ve done your homework and made the determination a domain controller has been rolled back. Now what? Here are some options.
1. Remove the DC. The number one solution per http://support.microsoft.com/kb/875495 and every other blog, forum post, and wiki entry since this KB was written is to remove the problem DC. If removal is an option in your environment I highly recommend this course of action. Admittedly doing so is easier said than done since the domain controller in question won’t, by definition…ahem…replicate, which rules out a graceful demotion. Thus we are stuck doing a metadata cleanup. To some this may seem rash, but don’t be nervous: if we Escalation Engineers had a nickel for every metadata cleanup we’ve done in our time, we’d be having lunch together at Ruth’s Chris. While I won’t walk through the process here in detail, here is an article that shows a couple of different ways of getting this accomplished – http://technet.microsoft.com/en-us/library/cc816907(v=WS.10).aspx. One of the tactical keys to going this route is making sure the DC in question isn’t running other services, because following the metadata cleanup the server should be taken offline.
2. Restore from a good backup. Since this is a domain controller, the backup must include the system state. If you can pinpoint a restore point from prior to the rollback, this can fix your issue. More on why this works in the next section.
3. Sometimes you are down to 1 DC and have no good backup. Getting to a single DC typically happens because a well-meaning admin (or counterpart in escalations) has already removed the other DC or DCs involved in the process of troubleshooting. Not having a good backup can happen when either the server has been in this state for such a prolonged period of time that it has outlasted the backup rotation, or the IT department or IT admin is simply allergic to backing up their systems, which can often result in what I like to call a “resume generating event.” However we got to this point, it’s lucky for you there are still options.
First, as crazy as this sounds, it might be easy to assume we don’t need to fix it. After all, why do I need to worry about replication with a single DC? It doesn’t have any replication partners! It’s highly recommended to have at least 2 DCs for the sake of redundancy, so it would be wise to add a second one at some point. Can’t do that if the first DC isn’t able to replicate. Another consideration is this may be the only DC in the domain, but if there are other domains in the forest it will certainly need to replicate with them. Finally there are other services that just plain aren’t going to function normally (how well do you think the Netlogon service works in a Paused state???) until we get this worked out. So without further delay, here’s what to do:
• Reboot the DC into DSRM. If you aren’t familiar with this process, read more about it here – http://technet.microsoft.com/en-us/library/cc816897(v=WS.10).aspx
• Open Regedit and navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters
• Delete the registry value “DSA Not Writable” or set the value to 1
• Next, right-click the Parameters key, click New, and then click DWORD (32-bit) Value.
• Type the new name “Database restored from backup”, and then press ENTER. Double-click the value that you just created to open the Edit DWORD (32-bit) Value dialog box, and then type 1 in the Value data box. (This is more for the previously mentioned multi-domain forest scenario.)
• Restart the domain controller in normal mode.
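• Enable inbound and outbound replication by opening a command prompt using Run As Administrator and running the commands below. The leading minus sign (a plain hyphen) clears each “disable” option, which turns replication back on:
Repadmin /options -disable_inbound_repl
Repadmin /options -disable_outbound_repl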
• Force the replication
Repadmin /syncall /AeP (Again, more for the previously mentioned multi-domain forest scenario since in a single-domain forest there wouldn’t be anything to replicate with.)
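Once the steps above are done, it’s worth confirming that the two “disable” options are really gone. Here is a small Python sketch that checks the text returned by “repadmin /options”. It assumes that output contains a line such as “Current DSA Options: IS_GC DISABLE_INBOUND_REPL”; that line shape is my assumption, and disabled_directions is a hypothetical helper name.

```python
# Option names as used in the repadmin commands above, mapped to a
# friendlier label for each replication direction.
FLAGS = {
    "DISABLE_INBOUND_REPL": "inbound",
    "DISABLE_OUTBOUND_REPL": "outbound",
}


def disabled_directions(options_output: str) -> set:
    """Return the replication directions still disabled.

    Only the last line containing "DSA Options:" is inspected, so output
    that lists both the current and the new options reflects the final state.
    """
    option_lines = [
        line for line in options_output.splitlines() if "DSA Options:" in line
    ]
    if not option_lines:
        raise ValueError("no 'DSA Options:' line found in the supplied text")
    tokens = option_lines[-1].split(":", 1)[1].split()
    return {label for flag, label in FLAGS.items() if flag in tokens}


# Invented sample text: before the fix both directions are disabled.
before = "Current DSA Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL"
after = "Current DSA Options: IS_GC"

print(sorted(disabled_directions(before)))
print(sorted(disabled_directions(after)))
```

An empty result after the fix means both directions are enabled again. Anything else means one of the commands didn’t take.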
How to avoid it.
Restore backups correctly:
Given what you now know, it would be easy to wonder “why doesn’t this happen when we restore from backup?” After all, we’d be restoring back to a point where the server holds “old” USN metadata. The key difference is the Invocation ID. This is essentially the instantiation number of the AD database. Windows Server Backup automatically resets it when it does a restore, which is why all goes smoothly with WSB restores of a DC. The Invocation ID can also be manually changed using the “Database restored from backup” registry value from the procedure above. When the Invocation ID changes on a given DC, this tells its replication partners “I’ve been restored from backup. Can you fill me in with everything that’s happened since?” Instead of detecting a rollback, the replication partners just update the restored DC with all changes and life for the DC begins anew. It stands to reason then that any backup software that isn’t AD-aware and/or doesn’t do this properly could cause a rollback. It could even happen with Windows Server Backup if used incorrectly. We’ve seen instances in support where an admin will simply try to restore the ntds.dit (which could actually work if done properly). In the end, it’s far easier to let Windows Server Backup restore the system state and do the work for you.
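Here is a rough Python sketch of why the Invocation ID makes the difference. It is a simplified model, not the real replication engine: the Partner class, its method name, and the ID strings are all invented for illustration. The idea it shows is that a partner remembers the highest USN it has seen per Invocation ID, so a restored database that comes back under a new ID is treated as a fresh source instead of a contradiction.

```python
class Partner:
    """Toy replication partner that remembers USNs per Invocation ID."""

    def __init__(self):
        # invocation_id -> highest USN already received from that database
        self.seen = {}

    def evaluate(self, invocation_id: str, highest_usn: int) -> str:
        known = self.seen.get(invocation_id)
        if known is not None and highest_usn < known:
            # Same database identity, older USN: looks like a rollback.
            return "rollback"
        # Either a new identity or a USN that has moved forward.
        self.seen[invocation_id] = highest_usn
        return "accepted"


partner = Partner()
partner.evaluate("invocation-A", 5000)  # normal replication before the restore

# Snapshot-style restore: old USN, same Invocation ID.
print(partner.evaluate("invocation-A", 4000))

# Proper restore: old USN, but the Invocation ID was reset.
print(partner.evaluate("invocation-B", 4000))
```

The first call after the restore is rejected and the second is accepted, even though both present the same “old” USN. That is the whole trick behind an AD-aware restore.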
With virtual DCs, be careful with snapshots:
Whenever it was that virtualization snapshots were invented and the first virtual domain controller came into being, I suspect the first USN rollback happened about 15 minutes later. Admittedly I have absolutely no data to back that up. The point is, take good care with virtual DCs and snapshotting. Here are some guidelines.
• If you are running a Server 2012 Hyper-V host or newer and the VM is 2012 or newer, you will be fine. For more on this read here – http://technet.microsoft.com/en-us/library/hh831734.aspx
• If you are running a Server 2008R2 Hyper-V host, do not restore a virtual DC from snapshot.
• If you are running a Server 2012 Hyper-V host or newer and the VM is 2008R2 or older, do not restore from snapshot.
• If you are running VMware, switch to Hyper-V. I kid. Calm down VMware zealots; I was just checking to see if you were still reading. Make sure the VM is Server 2012 or newer and use the version of VMware advised in this article and others – http://blogs.technet.com/b/keithmayer/archive/2012/08/06/safely-cloning-an-active-directory-domain-controller-with-windows-server-2012-step-by-step-ws2012-hyperv-itpro-vmware.aspx
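The Hyper-V guidelines above boil down to a simple rule, sketched here in Python. The function name and the release list are my own shorthand, and the sketch only encodes the Hyper-V bullets from this post. For VMware, follow the version guidance in the linked article instead.

```python
# Windows Server releases in order, oldest first (shorthand labels).
RELEASES = ["2003", "2008", "2008R2", "2012", "2012R2", "2016", "2019", "2022"]


def snapshot_restore_is_safe(host: str, guest: str) -> bool:
    """Apply the Hyper-V guidelines listed above.

    Reverting a virtual DC to a snapshot is treated as safe only when both
    the Hyper-V host and the guest DC run Server 2012 or newer.
    Unknown release labels raise ValueError.
    """
    threshold = RELEASES.index("2012")
    return (
        RELEASES.index(host) >= threshold
        and RELEASES.index(guest) >= threshold
    )


print(snapshot_restore_is_safe("2012", "2012"))
print(snapshot_restore_is_safe("2008R2", "2012"))
print(snapshot_restore_is_safe("2012", "2008R2"))
```

If either side comes back False, treat the snapshot as off limits for that DC and restore from a system state backup instead.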
Hopefully this gives you an additional tool to put in your toolbox. Just be careful with it. If knowing how to deal with this particular issue is a hammer, not every AD replication problem is a nail. Best of luck and I hope this helps.