Content freshness is a key concept in the life of an Active Directory/DFS guru. Most of us have seen this in action with AD replication. If two DCs aren't replicating, then at best the AD objects and attributes are out of sync, and at worst replication shuts down completely when the disconnect continues beyond the tombstone lifetime. The longer the outage, the more problematic – or even impossible – it becomes to systematically determine which changes should be applied (think: lingering objects in this example). The fundamental idea is that directory data needs to stay fresh. If it's been too long since the data was synchronized, it's best to just discontinue replication. The same is true for DFS.
Let's say there are two servers in a Replication Group called Server1 and Server2. When a DFS Replication group falls out of sync, changes are still being made to the files on both servers. So how do we keep track of them? In the short run, those changes are queued up in the Staging Area of both servers and will be replicated once the communications issue is past. But what if that outage is prolonged? You could have a document on Server1 where the file is deleted, the deletion doesn't replicate, and meanwhile edits continue to be made to that same file on Server2. What is DFSR going to do if the members of the Replication Group actually do see one another again? Does this sound familiar? It's like the DFS version of a lingering object! In other words, content freshness is a critical aspect of replication. If the content gets stale enough – like in this scenario – you actually don't want it to replicate. In response, Microsoft's code sets a replication threshold and shuts replication down for you. In so doing, they've saved us from ourselves. For an even deeper dive on this, and put much more eloquently than I ever could, here is a post from the official DS Team blog on the subject – http://blogs.technet.com/b/askds/archive/2009/11/18/implementing-content-freshness-protection-in-dfsr.aspx
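That staleness threshold is configurable, by the way. As the DS Team post above covers, content freshness protection hinges on the MaxOfflineTimeInDays setting (60 days by default), which you can inspect and change per server via WMI. A quick sketch from an elevated command prompt – the value of 100 here is just an illustration:

```shell
:: Check the current content freshness threshold on this member
:: (default is 60 days)
wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig get MaxOfflineTimeInDays

:: Raise the threshold if you need more headroom before DFSR
:: declares the content stale (run on each member you want to change)
wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=100
```

Note that raising the threshold is a prevention measure, not a cure – it won't un-tombstone a content set that has already tripped the limit.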
So what do we do if we find ourselves in this predicament? That's probably the reason you're here at our blog. You've likely seen an event 4004 in the DFSR log that shows an error 9098 and makes the statement "A tombstoned content set deletion has been scheduled." Content freshness is the reason this has occurred. AFTER fixing the root cause of the blockage (firewalls, DFSR in a dirty shutdown, etc.), here are some options on how to get things back up and running:
Solution 1: Follow the Recovery Procedure in the aforementioned DS blog post.
The specifics are already in the link. However, at a high level this essentially boils down to:
- Get a backup.
- Disable the affected member in the Replication Group.
- Force AD replication.
- Re-enable DFS Replication on the member.
- Force AD replication again.
- The node will then go through initial sync again, where it catalogs the files in the Jet database, so you'll need to wait for that to complete before letting users back in.
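The non-GUI parts of those steps can be sketched from an elevated command prompt. This is just a sketch – the enable/disable of the member itself happens in the DFS Management console per the blog post:

```shell
:: Force AD replication so all DCs see the membership change (run on a DC)
repadmin /syncall /AdeP

:: Tell the local DFSR service to poll AD for the new configuration
:: immediately, rather than waiting for its normal polling interval
dfsrdiag pollad

:: Watch the DFS Replication event log for event 4104 (initial sync done)
wevtutil qe "DFS Replication" /c:10 /rd:true /f:text
```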
As further evidence that this is a good course to follow, if you happen to see an event 4012 in the DFSR log, it suggests the very same. Picture below.
Solution 2: Recreate the Replication Group. Best to consider this a fallback if Solution 1 doesn't pan out.
- Get a backup of the Replicated Folder on both/all nodes.
- In the DFS Management console, delete the Replication Group.
- Force AD Replication.
- From an elevated command prompt run DFSRDiag pollad on each of the former Replication Group members.
- On each (former) member of the RG, delete the hidden DfsrPrivate folder.
- This holds the Staging and ConflictAndDeleted folders. You may want/need to restore some of this from backup after all is said and done, if the authoritative server's copy of a file isn't actually the one you want.
- Where is it? Inside the Replicated Folder itself. If the Replicated Folder is e:\homes, then from a command prompt you can see it at e:\homes\DfsrPrivate. You'll need to unhide protected operating system files in order to see it in Windows Explorer.
- Recreate the Replication Group in the DFS Management console.
- Run dfsrdiag pollad from each RG member so it picks up the new configuration from Active Directory.
- This will kick off an initial sync on the Replication Group, which is going to take some time. You won't want to let users back in until this is complete. Keep checking the DFSR log for an event 4104, which indicates that initial replication is finished.
- As the event suggests, check the PreExisting & ConflictAndDeleted folders for any fallout and don’t be afraid to check the backups for a more relevant version of files from the old Staging folders.
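The file-system portion of Solution 2 can be sketched from an elevated command prompt. Paths follow the e:\homes example above, so adjust for your own Replicated Folder:

```shell
:: DfsrPrivate is marked hidden+system; list it explicitly
dir /a:hs e:\homes

:: Remove the old DfsrPrivate folder (Staging, ConflictAndDeleted) on each
:: former member BEFORE recreating the Replication Group
rmdir /s /q e:\homes\DfsrPrivate

:: After recreating the RG, have each member poll AD for the new config
dfsrdiag pollad
```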
Hopefully this helps fill in your understanding of content freshness and why, in certain circumstances, the Microsoft code is actually built to halt replication. Getting out of this jam isn't perfect, and often data can be lost. However, this builds the perfect business case for why you need a tool to help you monitor DFSR.