The day the Exchange cluster died

by Shijaz Abdulla on 24.09.2007 at 08:48

I installed Windows Server 2003 Service Pack 2 on a client’s Exchange Server 2003 cluster on Thursday night (Yeah, I hear you – what a way to spend a weekend!). Everything went well, installation completed, rebooted and everything was happy and kicking.

…until on Friday morning when the Exchange HTTP Virtual Server Instance failed. Since this resource was configured to ‘affect the group’, the failure forced a failover of the whole Exchange cluster group to the passive node.

Within no time, Exchange HTTP Virtual Server Instance failed again, this time on the passive node! Someone press the Panic button!! The initial understanding of the situation was clear – Installation of Windows Server 2003 Service Pack 2 brought the mighty Exchange cluster to its knees.

I rebooted both nodes and normal operation ensued. But after a couple of hours it happened again. In the event logs, I could see things like:

Event Type: Warning
Event Source: MSExchangeIS Mailbox Store
Event Category: General
Event ID: 1115
Description:
Error 0xfffffbbe returned from closing database table, called from function JTAB_BASE::EcCloseTable on table DeletedFolders. For more information, click http://www.microsoft.com/contentredirect.asp.

Event Type: Error
Event Source: MSExchangeCluster
Event Category: Services
Event ID: 1005
Description: Exchange HTTP Virtual Server Instance 100 (servername): The IsAlive check for this resource failed. For more information, click http://www.microsoft.com/contentredirect.asp.

Event Type: Error
Event Source: Srv
Event Category: None
Event ID: 2019
Description: The server was unable to allocate from the system nonpaged pool because the pool was empty. For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

I couldn’t find much on these errors on the Internet, and this is the reason for this post. Here’s what the problem is.

My client is running Windows Server 2003 on a 32 bit server. 32-bit versions of Windows, as we all know, support a maximum of 4 GB RAM. By default, Windows slices the total memory right down the middle: 2 GB is reserved for the OS and 2 GB for the applications. Out of the 2 GB reserved for the OS, 256 MB is reserved for non-paged pool memory.

My client is using the /3GB switch, which forces Windows to limit itself to 1 GB RAM and let the applications use 3 GB. But this causes the non-paged pool memory reservation to be reduced to 128MB instead of 256MB.

Now, 128 MB is a tight little space. IIS uses non paged pool memory for processing requests. On Windows Server 2003 and Windows Vista, IIS stops processing requests once the available non-paged pool memory goes below 20 MB. Event 2019 is evidence for that.

Of course you know, Exchange relies heavily on IIS. So that explains why the Exchange HTTP Virtual Server resource went down! But wait – what’s hogging up the non-paged pool memory? And how do we fix this?

That’s when Microsoft sent in their Poolmon utility, that grabs information on whats in there. The culprit? – Broadcom’s NetXtreme II network card driver! It was incompatible with scalable networking features bundled with Windows Server 2003 SP2 (and the Windows Scalable Networking Pack) and caused a memory leak! I disabled the TCP Chimney with the following command:

Netsh int ip set chimney DISABLED

I also disabled the registry key HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParametersEnableTCPA registry value setting by it to zero on both nodes and other steps mentioned in KB936594. That was all it took to solve the problem!

See my earlier related post: Delayed Logins: Change Password feature in ISA 2006