When a server hangs, finding the root cause is often difficult. Many factors can be involved, and narrowing them down can be tough. Here, we'll review the general process for troubleshooting a hung server.
Frequency & Patterns
The troubleshooting process for a one-time hang is a little different from the process for a repeated occurrence. If a server has locked up more than once recently, pay attention to any potential patterns. Listing the exact dates, times, and days of the week of each hang is a good start. From there, look for patterns that might indicate a scheduled job is causing the issue. For example, you might see that the hang happens around the same time of day or on certain days of the month. If you notice a pattern, investigate the server for any scheduled or recurring processes that run around that time. They are likely to be related to, or the cause of, the issue.
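As a rough illustration of the pattern check described above, the sketch below tallies a list of hang times by weekday, hour of day, and day of the month. The timestamps and the `summarize_hangs` helper are made up for this example; substitute the times you actually recorded.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def summarize_hangs(timestamps):
    """Count hangs by weekday, hour of day, and day of month."""
    parsed = [datetime.fromisoformat(ts) for ts in timestamps]
    return {
        "weekday": Counter(WEEKDAYS[dt.weekday()] for dt in parsed),
        "hour": Counter(dt.hour for dt in parsed),
        "day_of_month": Counter(dt.day for dt in parsed),
    }

# Hypothetical hang times, for illustration only.
hang_times = [
    "2024-03-03T02:14:00",
    "2024-03-10T02:09:00",
    "2024-03-17T02:21:00",
    "2024-03-20T15:40:00",
]

summary = summarize_hangs(hang_times)
for bucket, counts in summary.items():
    # most_common() lists the most frequent values first.
    print(bucket, counts.most_common(3))
```

A bucket that stands out, such as most hangs falling in the same hour on the same weekday, is a hint to look at what is scheduled on the server around that time.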
If the server hasn't recently locked up, or if there isn't a specific pattern to the occurrences, then you can move on to the next steps.
Troubleshooting a Hung Server
If looking for a pattern hasn't identified the cause of the hang, it's time to troubleshoot the issue in earnest. If this is the first time you are troubleshooting the hang, reboot the server to get it back online. There isn't much use in leaving it in a hung state unless you need to examine it while it is locked up, which only becomes relevant in the later steps. A few different things can cause a server to hang, including sudden memory exhaustion, a misbehaving process, driver bugs, or hardware failure. Check the following, in order, to narrow that list down.
First, review the system and application event logs leading up to the hang. You are looking for two things. One, any errors or warnings that might directly indicate the underlying issue. For example, memory exhaustion will sometimes generate entries in the system log, and a slew of disk errors would indicate a drive issue. Two, a general idea of what the server was doing when it locked up. Are there log entries that indicate a backup was running or that a piece of software was auto-updating? Bugs in these processes, or the additional resource strain they cause, can lead to a hang.
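The two-part review above can be sketched in code. The example assumes the log entries have already been exported into a list of records with `time`, `level`, `source`, and `message` fields. That record layout, the sample entries, and the 60-minute window are all assumptions made for illustration, not a real log format.

```python
from datetime import datetime, timedelta

PROBLEM_LEVELS = {"Critical", "Error", "Warning"}

def split_log_window(entries, hang_time, minutes=60):
    """Return (problems, activity) for entries logged shortly before the hang."""
    start = hang_time - timedelta(minutes=minutes)
    window = sorted(
        (e for e in entries if start <= e["time"] <= hang_time),
        key=lambda e: e["time"],
    )
    # Errors and warnings may point directly at the cause.
    problems = [e for e in window if e["level"] in PROBLEM_LEVELS]
    # Everything else shows what the server was doing at the time.
    activity = [e for e in window if e["level"] not in PROBLEM_LEVELS]
    return problems, activity

# Hypothetical entries, for illustration only.
hang_time = datetime(2024, 3, 17, 2, 21)
entries = [
    {"time": datetime(2024, 3, 16, 22, 0), "level": "Information",
     "source": "UpdateAgent", "message": "Update check completed"},
    {"time": datetime(2024, 3, 17, 2, 0), "level": "Information",
     "source": "BackupAgent", "message": "Nightly backup started"},
    {"time": datetime(2024, 3, 17, 2, 15), "level": "Warning",
     "source": "ResourceMonitor", "message": "Low virtual memory condition"},
    {"time": datetime(2024, 3, 17, 2, 19), "level": "Error",
     "source": "disk", "message": "I/O error on a hypothetical device"},
]

problems, activity = split_log_window(entries, hang_time)
for e in problems + activity:
    print(e["time"], e["level"], e["source"], e["message"])
```

In this made-up data, the backup that started about twenty minutes before the hang appears in the activity list, followed by a memory warning and a disk error in the problem list. That is the kind of sequence worth investigating further.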
Second, look into the major applications installed on the server. Review their logs for the same things you were looking for in the main event logs: any possibly related errors and any specific tasks/processes that were running at the time. As an example, a SQL server might have a maintenance job that started shortly before the hang.
Third, review the monitoring system for any performance issues leading up to the lockup. Usually, the server stops responding to the monitoring system while it is locked up, so the focus is on the period just before that. Some things to watch for are steadily increasing RAM utilization, CPU utilization sustained near 100%, or poor disk metrics such as high response times.
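If the monitoring system can export its samples, a quick check for the three warning signs above might look like the sketch below. The field names (`cpu_pct`, `ram_pct`, `disk_ms`), the sample values, and every threshold are assumptions chosen for illustration. Adjust them to match your monitoring tool and what is normal for the server.

```python
def longest_run(values, predicate):
    """Length of the longest consecutive run of values matching predicate."""
    best = run = 0
    for v in values:
        run = run + 1 if predicate(v) else 0
        best = max(best, run)
    return best

def flag_metrics(samples, cpu_limit=95.0, cpu_run=5,
                 ram_rise=20.0, disk_ms_limit=50.0):
    """Return a list of warning flags for samples taken before a hang."""
    cpu = [s["cpu_pct"] for s in samples]
    ram = [s["ram_pct"] for s in samples]
    disk = [s["disk_ms"] for s in samples]
    flags = []
    # CPU at or above the limit for several samples in a row.
    if longest_run(cpu, lambda v: v >= cpu_limit) >= cpu_run:
        flags.append("cpu_sustained_high")
    # RAM utilization ended much higher than it started.
    if ram and ram[-1] - ram[0] >= ram_rise:
        flags.append("ram_climbing")
    # Any disk response time at or above the limit (milliseconds).
    if any(v >= disk_ms_limit for v in disk):
        flags.append("disk_slow")
    return flags

# Hypothetical samples from the period before the hang.
samples = [
    {"cpu_pct": 40, "ram_pct": 55, "disk_ms": 8},
    {"cpu_pct": 96, "ram_pct": 60, "disk_ms": 9},
    {"cpu_pct": 97, "ram_pct": 68, "disk_ms": 12},
    {"cpu_pct": 99, "ram_pct": 75, "disk_ms": 10},
    {"cpu_pct": 98, "ram_pct": 83, "disk_ms": 11},
    {"cpu_pct": 100, "ram_pct": 91, "disk_ms": 14},
]

flags = flag_metrics(samples)
print(flags)
```

Comparing only the first and last RAM samples is a deliberately crude trend test. A real check might fit a slope or compare against a baseline from a normal day.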
Finally, if the root cause still isn't clear, try simply searching the web for the symptoms of the hang and the server role. For example, there have been known issues with the RDS roles causing specific types of hangs in the past. Sometimes you can find these through simple web searches.
If all else fails, analyzing a memory dump of the hung server is the last resort. If it's a VM, you can snapshot it while it's locked up and then convert the snapshot memory to a dmp file. For a physical server, you'd have to ensure that the dump settings are configured correctly ahead of time, and then force a crash with a non-maskable interrupt (NMI) while it's hung.
Analyzing the dump file is a potentially complex process that won't be covered in depth here. However, here are some of the basics to look for:
Following the steps above will hopefully lead you to the root cause of the server hang. If not, opening a support case is the best next step.