Discussion:
64-bit Processing host instance hangs
Svetoslav Vasilev
2008-12-12 10:14:03 UTC
Hi, I am investigating a strange behaviour of a 64-bit processing host
instance in my customer's BizTalk Server group. There are 3 different
environments available at the customer's site - TEST, QA and PROD. All of
these have a BizTalk group consisting of 2 BizTalk 2006 R2 servers running on
a 64-bit platform. The TEST and QA BizTalk groups run in VMware virtual
machines, while the PROD one runs on physical hardware. The SQL tier is
implemented as an active/passive failover cluster on physical nodes, where
TEST and QA run on a 32-bit platform while PROD is on a 64-bit platform. In
addition, TEST and QA run as separate SQL instances on the same physical
cluster.
There are 8 hosts in total configured in each group, where the major ones
for the receive, send and process functions respectively are duplicated in
order to have both 32-bit and 64-bit versions. These are instantiated on
both servers in the group. In addition there is one host that runs on only
one node, as well as a tracking-only host that has instances on both servers.
The strange behaviour is that the 64-bit processing host instances (the ones
that run orchestrations) in both the TEST and QA environments appear to hang
(CPU at 100%) rather often. The applications in the 3 environments are the
same, and for some short periods of time, usually after deployment of the
latest version to PROD, they are fully alike in terms of deployed artifacts.
However, this behaviour is not present in the PROD environment.
I have done some debugging of the hanging host instance processes and have
found that a CannotUnloadAppDomainException is thrown in one particular
thread. What strikes me is that on every occasion I have debugged this, the
ID of the thread where the exception is thrown is always the same. The
message in the exception is: Error while unloading appdomain. (Exception
from HRESULT: 0x80131015).
There are also one or more occurrences of a ThreadAbortException thrown in
some other threads. It looks like those threads are doing some sort of
performance counter updates, from what I can interpret from the stack
contents. Here is the call stack:

Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity.UpdateDeltaMACacheRefreshInterval()
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity.RunCacheUpdates(Int32)
Microsoft.BizTalk.MsgBoxPerfCounters.CounterManager.RunCacheThread()
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext,
System.Threading.ContextCallback, System.Object)
System.Threading.ThreadHelper.ThreadStart()

And here are the stack objects for the thread:

Microsoft.BizTalk.Tracing.Trace+HackTraceProvider
System.Threading.ThreadAbortException
System.Threading.ThreadAbortException
Microsoft.BizTalk.Tracing.Trace+HackTraceProvider
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
System.String
System.String
Microsoft.BizTalk.Tracing.Trace+HackTraceProvider
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
System.String
System.String
System.String
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
Microsoft.BizTalk.Tracing.Trace+HackTraceProvider
System.String
System.String
Microsoft.BizTalk.Tracing.Trace+HackTraceProvider
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
System.String
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
System.String
System.String
Microsoft.BizTalk.Tracing.Trace+HackTraceProvider
Microsoft.BizTalk.MsgBoxPerfCounters.CounterManager
System.String
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
System.String
Microsoft.BizTalk.MsgBoxPerfCounters.CounterManager
System.Threading.ThreadAbortException
System.Object[] (System.String[])
Microsoft.BizTalk.MsgBoxPerfCounters.MgmtDbAccessEntity
System.Threading.ContextCallback
Microsoft.BizTalk.MsgBoxPerfCounters.CounterManager
System.Threading.ThreadStart

It appears to me that the thread doing the performance counter update does
not stop when the ThreadAbortException is thrown, thus leading to the
CannotUnloadAppDomainException.
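For what it's worth, the mechanism I suspect can be reproduced outside
BizTalk. This is a minimal C# sketch of my own (not BizTalk code; all names
in it are made up for illustration): a thread that never leaves its finally
block cannot be interrupted by the abort that AppDomain.Unload sends it, so
the unload eventually gives up and throws the same exception and HRESULT I
am seeing in the debugger.

```csharp
using System;
using System.Threading;

class UnloadRepro
{
    // Runs inside the child AppDomain and starts a thread that can
    // never be aborted, because Thread.Abort does not interrupt code
    // executing in a finally block.
    static void StartStuckThread()
    {
        var t = new Thread(() =>
        {
            try { }
            finally
            {
                while (true) Thread.Sleep(100); // survives the abort
            }
        });
        t.IsBackground = true;
        t.Start();
        Thread.Sleep(500); // give the thread time to enter the finally
    }

    static void Main()
    {
        AppDomain domain = AppDomain.CreateDomain("worker");
        domain.DoCallBack(StartStuckThread);

        try
        {
            // Unload aborts the domain's threads; the stuck one never
            // finishes, so after a timeout the CLR throws.
            AppDomain.Unload(domain);
        }
        catch (CannotUnloadAppDomainException ex)
        {
            // "Error while unloading appdomain.
            //  (Exception from HRESULT: 0x80131015)"
            Console.WriteLine(ex.Message);
        }
    }
}
```

If the perf counter thread is likewise stuck (or swallowing the abort), the
host's AppDomain unload would fail in exactly this way.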
I have configured the CLR Hosting parameters for the min and max counts of
worker and IO threads in the registry, in accordance with the best
practices, for each and every host instance in the TEST and QA environments.
In addition, all the send ports with SOAP and HTTP handlers have been
configured not to resend the message if a failure occurs.
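For reference, the per-host CLR Hosting thread settings live under the host
instance's service key in the registry; the fragment below shows the usual
shape for BizTalk 2006/R2 (the host name is a placeholder, and the values
shown are the commonly recommended ones - verify them against the
performance guidance for your version before applying):

```reg
Windows Registry Editor Version 5.00

; Per-host CLR thread-pool settings (DWORD values, read at service start).
; Replace <HostName> with the actual in-process host name.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc$<HostName>\CLR Hosting]
"MinWorkerThreads"=dword:00000019
"MaxWorkerThreads"=dword:00000064
"MinIOThreads"=dword:00000019
"MaxIOThreads"=dword:00000064
```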
What I would like to achieve is, first of all, to identify whether this is
caused by the code that has been deployed to the BizTalk group, by the fact
that it is running in a virtual machine, by some throttling that occurs
(even though the traffic is still minuscule), by some configuration
parameter, or by all of the above together. In addition I would like some
help on whether there is a way to overcome this issue.
Thank you very much in advance.
Erno Marks
2008-12-19 11:26:02 UTC
We are having exactly the same problem here.
Also VMware, 64-bit, an orchestrations-only BizTalk host going to 100% CPU
after a while.
In our situation, the host seems to go to 100% at fixed intervals: xx:00h,
xx:15h, xx:30h or xx:45h. We have the database backup running at these
intervals, but turning that off did not help.

I am currently investigating a 32-bit-only host, but no definitive results
yet. (It looks good so far.)