Availability Groups contain a lot of moving parts. Planning for them is an involved process, and deploying them involves many steps. However, the real work with AGs begins once you start using them every day. They need to be watched to ensure they are running and performing optimally. So, what are the most important metrics to monitor? There are also a lot of tools for monitoring. Which are the best for AGs?
The three main categories to monitor are replica status, Availability Group status, and database status.
Tools that can be used include T-SQL, the AG Dashboard in SSMS, Performance Monitor, SQL Server Agent Alerts, System Center Management Pack for SQL Server, Extended Events, and Policy-Based Management.
Each AG is comprised of two or more replica servers. It’s important to know if all servers are running, and if the SQL Server services are running on them. Questions to ask:
- Is the server running?
- Is the SQL Server service for each instance running?
- Is the SQL Server Agent service for each instance running?
The tools most frequently used for this are outside of SQL Server – enterprise monitoring tools that check server and services health are used.
The main components of the AG should be monitored: replicas, databases, and the listener. Questions to ask about availability:
- Are all databases online?
- Is data synchronizing between replicas?
- Is the listener available?
There are notifications it's ideal to receive, to be proactive:
- When the status of a replica changes.
- When the status of a database changes.
- When data transfer slows under an acceptable threshold.
- When restoring data slows under an acceptable threshold.
There are many tools that can be used for these tasks – primarily System Center, AlwaysOn Dashboard, AlwaysOn Health XE session, and SQL Server Agent Alerts.
If your organization is using System Center, take advantage of the Management Pack for SQL Server. There is a specific category for AlwaysOn Monitoring included, which covers dashboards, health status, and automatic discovery of AGs.
The Dashboard is built into SSMS and is a great at-a-glance tool for the status of one AG. To access it, open SSMS, connect to the replica, expand AlwaysOn High Availability, expand Availability Groups, right-click the AG name, and select Show Dashboard. The basic view is informative, but can be expanded at the replica or database level by clicking the Add/Remove Columns option.
Recommended selections at the replica level are Availability Mode and Quorum Votes. Recommended selections at the database level are Log Send Queue Size (KB), Log Send Rate (KB/sec), Redo Queue Size (KB), Redo Rate (KB/sec), and, if you have async secondaries, Estimated Data Loss (time).
AlwaysOn Health Extended Events Session
This is a built-in Extended Events session for Availability Groups. To view the information, expand Management, expand Extended Events, expand Sessions, right-click AlwaysOn_health, and select Watch Live Data. Then, use “Choose Columns” to select more information. The events captured by this session are:
- Alwayson_ddl_executed – CREATE, ALTER, or DROP is issued.
- availability_group_lease_expired - Occurs when there is a connectivity issue between the cluster and the Availability Group resulting in a failure to renew the lease.
- availability_replica_automatic_failover_validation – Occurs when the failover validates the readiness of replica as a primary. For instance, the failover validation will return false when not all databases are synchronized or not joined.
- availability_replica_manager_state_change - Occurs when the state of the Availability Replica Manager has changed. When the availability replica manager is not in healthy state, all availability replicas in the SQL Server instance will be down.
- availability_replica_state_change - Occurs when the state of the Availability Replica has changed. Monitor this carefully on all replicas.
- lock_redo_blocked - Occurs when the redo thread blocks when trying to acquire a lock.
- Errors specific to Availability Groups.
The Extended Events session is most useful after something unexpected has happened. It documents the state of replicas before and after failovers, and captures errors, so it is easier to see when something went wrong, and why.
SQL Server Agent Alerts
Several Agent Alerts can and should be set up. These can notify you when something changes – reactive – or when a threshold has been reached – so you can be proactive.
The error numbers I find useful to alert on are 1480 - AG Role Change (failover); 35264 - AG Data Movement – Suspended; and 35265 - AG Data Movement – Resumed.
You can also set up Alerts to be raised when performance counters rise above an acceptable level. There is no magic number for what the level is – you must have a baseline of performance in your AG. The object and counters to monitor:
Object: Database replica
Counter: Log Send Queue. Measured in KB. On primary, amount of log records not yet sent to secondary. Set to alert you when the threshold has gone above a certain number.
Counter: Recovery Queue. Measured in KB. On secondary, amount of log records not yet applied to database. Set to alert you when the threshold has gone above a certain number.
Since the entire database – and all changes to it – are being synchronized between replicas, it’s important to know the rate of changes in each database, and if they are being successfully replicated.Questions to ask about performance:
- What is the speed of data transfer between replicas? (Log send queue)
- What is the speed of data restoring on the replicas? (Redo queue)
An excellent explanation of the data synchronization process can be found in the article “Monitor Performance for AlwaysOn Availability Groups”. The examples this article provides show using Policy-Based Management and Extended Events for monitoring.
Successfully Monitoring AGs
There is no one tool to cover every aspect of your AGs both proactively and reactively, but by combining tools you can see the overall picture. Determine what is most important to your organization to monitor, then review the tools available. Pick those that will give you the most information, and require the least changes to your current monitoring routine.
Being proactive in monitoring pays huge dividends!