Recently, at one client location, we had numerous unexplained cluster failovers. We had SQL Server and Analysis Services running on the clusters. When the failovers were analyzed, the analysts found that the “FlightRecorder” showed errors immediately before each failover, so the Flight Recorder got the blame. When I was brought in, I took a closer look at the logs and noted several things:
1. The core issue was that the Flight Recorder’s drive ‘disappeared’ from the cluster before the Flight Recorder error occurred.
2. SQL Server also complained about the missing drive…but it was slower to complain than the Flight Recorder.
3. Prior to the Flight Recorder error, the “Node Mgr” took one of the nodes out of the cluster. Why? Because a drive disappeared.
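The ordering in the list above is what cleared the Flight Recorder: once events from every source are placed on one timeline, the first entry points at the real cause. Here is a minimal sketch of that idea in Python. The records, timestamps, and field names are all made up for illustration; they are not real log output, and a real investigation would read exported log entries instead.

```python
from datetime import datetime

# Hypothetical, hand-written records for illustration only.
# Timestamps, source names, and messages are assumptions, not real log lines.
events = [
    {"time": "2000-01-01 03:14:09", "source": "SQL Server", "message": "drive not available"},
    {"time": "2000-01-01 03:14:02", "source": "Node Mgr", "message": "node removed from cluster"},
    {"time": "2000-01-01 03:14:05", "source": "FlightRecorder", "message": "write error"},
    {"time": "2000-01-01 03:14:00", "source": "Disk", "message": "drive disappeared"},
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_timeline(records):
    """Return the records sorted by timestamp, oldest first."""
    return sorted(records, key=lambda r: datetime.strptime(r["time"], TIME_FORMAT))


timeline = build_timeline(events)

# Print the merged timeline; the first line is the earliest event.
for event in timeline:
    print(event["time"], event["source"], event["message"])

# The earliest event is the best candidate for the root cause.
first_event = timeline[0]
```

The component that complains first or loudest is not necessarily the cause, which is why sorting every source onto a single timeline helps before assigning blame.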
When I first investigated this issue, I focused on the SQL error logs…because I have extensive experience with them…and they have generally served my purposes. This time, because the clusters are virtual, I found better evidence in the Cluster Event logs. In fact, in the “Cluster Event” logs I also found everything I was used to finding in the SQL logs. In addition, I found the Cluster Event logs faster (by far) to work with than the SQL logs. They are also much easier to search through, with a wider range of filters, and you can save your search criteria. These are all reasons to switch to the native Windows event logs when you are on a modern box.
Learning point: If you’re on a modern cluster…seek out the “Cluster Event Logs” and do your research there. You’ll save time and have an easier experience doing so. (Modern, here, is defined as Windows Server 2008 or newer.)