Wednesday, June 11, 2008

Back to the Future...

Two weeks ago I had one of the scariest “interruptions” that I’ve ever experienced in my 10 years of working as a professional geek. Let me take you back to two Saturdays ago…


Our building is on a pretty sketchy power grid, and although we have a beefy UPS, we don’t have a generator backup. At about 2:00pm a nasty storm rolled in, and I was relieved that my son’s soccer game was cancelled. At 2:30ish, my phone rang; it was my boss telling me that the building had lost power. I’m the closest engineer to the building, so I grabbed my laptop and headed into the office. We run a series of scripts that power down various systems to shed some load on the UPS (this helps keep the core systems up longer). About 20 minutes later I got into the datacenter and tried to access the server hosting the scripts via the Raritan console. As soon as I hit ctrl-alt-del, all power went out! No lights…no whirring sounds…no alarms…nothing. The UPS was drained. Literally a minute later, building power came back on and systems started coming back up. That’s when the adventure really began…


I went back to my desk so that I could use my workstation to monitor the systems as they came back online. One of my first tasks was to make sure mail was flowing again so I tried to log into one of our Exchange servers. My logon attempts failed and the error message indicated Kerberos problems. The event logs listed the following error:


I logged in locally to check the system time and everything looked ok.

I was able to successfully log into one of my domain controllers and immediately noticed the system time. It had changed to 8:45pm, February 28, 2002. A while ago I had reconfigured our NTP service to synchronize with one of our routers (at the request of our network manager)…I knew that something had to have gone wrong with that router’s time. To correct the issue, I pointed my forest root PDC emulator at the US Naval Observatory’s NTP servers and forced a rediscover (w32tm /resync /rediscover), and the time corrected itself. I then sync’d the time on all of my other DCs.
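
For anyone wondering why a wrong clock on the domain controllers breaks logons, here is a rough Python sketch of the comparison involved. It assumes the default Kerberos clock skew tolerance of 5 minutes; the function name and the "actual" wall-clock time (roughly 3:00pm on May 31, 2008) are my own illustrative assumptions, not values taken from the logs.

```python
from datetime import datetime, timedelta

# Assumption: the default Kerberos maximum clock skew of 5 minutes.
MAX_SKEW = timedelta(minutes=5)


def skew_exceeds_tolerance(local, reference, max_skew=MAX_SKEW):
    """Return True if two clocks differ by more than the allowed skew."""
    return abs(local - reference) > max_skew


# The time the domain controller reported after the outage (from the post).
dc_clock = datetime(2002, 2, 28, 20, 45)
# Hypothetical: approximate real time when I checked, used for illustration.
actual_time = datetime(2008, 5, 31, 15, 0)

if skew_exceeds_tolerance(dc_clock, actual_time):
    print("Clock skew too large; expect Kerberos authentication failures")
```

A difference of six years is obviously far outside any tolerance, but the same check would also trip on a clock that is only ten minutes off.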

Problem solved, right? Wrong! I was able to get the Exchange stores to mount, but found another stomach-turning problem. When I looked at the Directory Services logs, I saw nothing but red:

Because of the time changes, all of my domain controllers believed their replication partners had gone past the tombstone lifetime! Fortunately, the fix was clearly listed in the error message. I couldn’t demote and re-promote all of my DCs, so I took option three listed below:

Event Type: Error
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2042
Date: 5/31/2008
Time: 4:45:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: XXXXXXX

Description:

It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.

The reason that replication is not allowed to continue is that the two machine's views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.

Time of last successful replication: 2002-02-28 20:11:01
Invocation ID of source: 0c86f6c8-f6b8-0c86-0100-000000000000
Name of source: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Tombstone lifetime (days): 60

The replication operation has failed.

User Action:

Determine which of the two machines was disconnected from the forest and is now out of date. You have three options:

1. Demote or reinstall the machine(s) that were disconnected.

2. Use the "repadmin /removelingeringobjects" tool to remove inconsistent deleted objects and then resume replication.

3. Resume replication. Inconsistent deleted objects may be introduced. You can continue replication by using the following registry key. Once the systems replicate once, it is recommended that you remove the key to reinstate the protection.

Registry Key:
HKLM\System\CurrentControlSet\Services\NTDS\Parameters\Allow Replication With Divergent and Corrupt Partner

Creating that registry key allowed replication to resume, and later that evening I disabled the setting via my new best friend: GPO Preferences!
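
To see why event 2042 fired in the first place, here is a small Python sketch of the arithmetic, using the timestamps and the 60-day tombstone lifetime from the event above. The function name is made up for illustration, and this is only my reading of the comparison the event describes, not how Active Directory actually implements it.

```python
from datetime import datetime, timedelta


def replication_gap(last_success, now, tombstone_days=60):
    """Return (blocked, gap): whether the gap exceeds the tombstone lifetime."""
    gap = now - last_success
    return gap > timedelta(days=tombstone_days), gap


# "Time of last successful replication" from the event, stamped while the
# clocks were wrong.
last_success = datetime(2002, 2, 28, 20, 11, 1)
# Date and time the event was logged, after the clocks had been corrected.
event_time = datetime(2008, 5, 31, 16, 45, 23)

blocked, gap = replication_gap(last_success, event_time)
print(f"Apparent gap: {gap.days} days; replication blocked: {blocked}")
```

The replication that was stamped with the bogus 2002 time looked more than six years old once the clocks were fixed, so every partner appeared to be far beyond the 60-day limit.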