SCSM: "A monitoring host is unresponsive or has crashed" (Error 4000)

Authored by Nathan Lasnoski

I was recently working on a Service Manager 2012 deployment where the Health Service was regularly crashing approximately 3 to 5 minutes after it was started. As a result, most workflows ran for a minute or two and then stopped when the service crashed. This also prevented connectors from completing, notifications from firing, and essentially every other work process from functioning properly. The Health Service could really be called the "workflow service", since it is responsible for almost every backend process in Service Manager.

The error in the event log was "A monitoring host is unresponsive or has crashed", with an event ID of 4000. In researching the error I found references to SCOM and an SNMP management pack. I proceeded to remove the management pack, disable many workflows, and reverse changes. Finally, we rolled back to an earlier state, since we noticed this had been happening for some time. After the rollback the processes started working properly, but about 8 hours later the Health Service was crashing again.

I then looked further at the workflows using the SQL queries that Travis blogged about, which are also noted on other Microsoft blogs.
Here are some references:

In particular, we ran the following SQL command:

    DECLARE @MaxState INT, @MaxStateDate Datetime, @Delta INT, @Language nvarchar(3)
    SET @Delta = 0
    SET @Language = 'ENU'
    SET @MaxState = (
        SELECT MAX(EntityTransactionLogId)
        FROM EntityChangeLog WITH(NOLOCK)
    )
    SET @MaxStateDate = (
        SELECT TimeAdded
        FROM EntityTransactionLog
        WHERE EntityTransactionLogId = @MaxState
    )

    SELECT
        LT.LTValue AS 'Display Name',
        S.State AS 'Current Workflow Watermark',
        @MaxState AS 'Current Transaction Log Watermark',
        DATEDIFF(mi, (SELECT TimeAdded
                      FROM EntityTransactionLog WITH(NOLOCK)
                      WHERE EntityTransactionLogId = S.State), @MaxStateDate) AS 'Minutes Behind',
        S.EventCount,
        S.LastNonZeroEventCount,
        R.RuleName AS 'MP Rule Name',
        MT.TypeName AS 'Source Class Name',
        S.LastModified AS 'Rule Last Modified',
        S.IsPeriodicQueryEvent AS 'Is Periodic Query Subscription', --Note: 1 means it is a periodic query subscription
        R.RuleEnabled AS 'Rule Enabled', -- Note: 4 means the rule is enabled
        R.RuleID
    FROM CmdbInstanceSubscriptionState AS S WITH(NOLOCK)
    LEFT OUTER JOIN Rules AS R ON S.RuleId = R.RuleId
    LEFT OUTER JOIN ManagedType AS MT ON S.TypeId = MT.ManagedTypeId
    LEFT OUTER JOIN LocalizedText AS LT ON R.RuleId = LT.MPElementId
    WHERE S.State <= @MaxState - @Delta
        AND R.RuleEnabled <> 0
        AND LT.LTStringType = 1
        AND LT.LanguageCode = @Language
        AND S.IsPeriodicQueryEvent = 0
    /*Note: Uncomment this line and use this optional criteria if you want to look at a specific workflow that you know the display name of*/
    --AND LT.LTValue LIKE '%Test%'
    ORDER BY S.State Asc

This showed that one workflow in particular was over 10,000 minutes behind! We then disabled that workflow, after which the Health Service stopped crashing.

If you are running into this error with the Health Service, this query is super helpful. Have a great day!
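If the query returns a long list, it can help to filter the results down to the workflows that are furthest behind. The following is only a rough sketch, not part of the original troubleshooting: it assumes the query results were exported to CSV with the same column headers as the query aliases ('Display Name', 'Minutes Behind'). The function name find_lagging_workflows, the 60-minute threshold, and the sample rows are all hypothetical.

```python
import csv
import io


def find_lagging_workflows(rows, threshold_minutes=60):
    """Return (display name, minutes behind) pairs above the threshold, worst first.

    `rows` is an iterable of dicts keyed by the query's column aliases.
    The threshold is an arbitrary illustration value, not a recommendation.
    """
    lagging = []
    for row in rows:
        raw = (row.get("Minutes Behind") or "").strip()
        # Skip rows with no usable value (for example an exported NULL).
        if not raw or raw.upper() == "NULL":
            continue
        minutes = int(raw)
        if minutes > threshold_minutes:
            lagging.append((row["Display Name"], minutes))
    # Sort so the workflow that is furthest behind comes first.
    return sorted(lagging, key=lambda item: item[1], reverse=True)


# Made-up sample standing in for an exported result set.
SAMPLE_CSV = """Display Name,Minutes Behind,MP Rule Name
Hypothetical workflow A,3,Rule.A
Hypothetical workflow B,10250,Rule.B
Hypothetical workflow C,NULL,Rule.C
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for name, minutes in find_lagging_workflows(sample_rows):
    print(f"{name}: {minutes} minutes behind")
```

With real data you would read the exported file with csv.DictReader instead of the inline sample, then review the top entries before deciding whether any workflow should be disabled.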

Nathan Lasnoski

Chief Technology Officer