Archive for the ‘Troubleshooting’ Category

Notes on Windows Advanced Troubleshooting


Last year, I was involved in a series of unusually complex problems that required advanced diagnostic techniques. While working on those issues I prepared some notes so that my teams could become more independent and get a sense of how these tasks are accomplished.

Advanced troubleshooting requires a lot of study, patience and dedication. Let’s face it: it’s not easy. But knowing what the tools are and how the process works helps demystify the activity. That is why I’ve decided to release my OneNote Notebook on Live.com so that you can link your OneNote 2010 to it or explore it yourself using the OneNote Web App.


I would like to warn you that this “discipline” changes a lot as tools, hardware and software evolve. Bear in mind that these are my notes from 2010. I will try to keep them updated but, unfortunately, I can’t promise that. Of course, if you would like to contribute, just let me know. I would love to include your contributions. Maybe we can turn it into a collaborative project!


Written by Carlos Veira Lorenzo

May 5th, 2011 at 12:00 am

The case of the Zombie Connections


Recently I came across a weird incident in which a very network-demanding application started to show strange behavior and poor performance. This application consumes lots of sockets, both on the client and on the server. Of course, this shouldn’t be an issue in itself, since there are lots of solutions out there with similar requirements.

In this case, both the client and the server kept orphaned (zombie) TCP connections in the CLOSING and FIN_WAIT_2 states. In other words, no live process owned those connections and, as a result, there was no way to close them and free network resources for the application service.
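To make the idea of an "orphaned" connection concrete, here is a minimal sketch in Python (not the PowerShell 2 approach this post is about). It assumes you already have the connections as (local, remote, state, pid) tuples and a set of PIDs for the processes currently running; the function name `find_orphans` and the sample values are hypothetical, and it also assumes the PID reported for an orphaned connection no longer belongs to a running process.

```python
# States in which the zombie connections described above were observed.
SUSPECT_STATES = {"CLOSING", "FIN_WAIT_2"}


def find_orphans(connections, live_pids):
    """Return connections in a suspect state whose owning PID is not running.

    connections: iterable of (local, remote, state, pid) tuples
    live_pids:   set of PIDs of the processes currently running
    """
    return [
        conn for conn in connections
        if conn[2] in SUSPECT_STATES and conn[3] not in live_pids
    ]


# Invented sample data, for illustration only.
connections = [
    ("10.0.0.5:49152", "10.0.0.9:8080", "FIN_WAIT_2", 2044),
    ("10.0.0.5:49153", "10.0.0.9:8080", "CLOSING", 2044),
    ("10.0.0.5:49154", "10.0.0.7:8080", "ESTABLISHED", 4312),
    ("10.0.0.5:49155", "10.0.0.7:8080", "FIN_WAIT_2", 4312),
]
live_pids = {4, 888, 4312}  # PID 2044 is no longer running

for conn in find_orphans(connections, live_pids):
    print(conn)
```

A connection in FIN_WAIT_2 that still belongs to a running process (the last row) is not reported, because its owner can still close it.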

Although the cause of this issue is still under investigation, everything seems to point to some sort of handle leak. Interesting as it is, I won’t bother you with the internals of the diagnosis; instead, I’ll show you a way to characterize the impact of this issue using PowerShell 2.

Our first problem was how to extract meaningful information from more than 25,000 connections on both sides of the communication. If we succeed, we’ll be ready to answer more questions: could this issue be happening everywhere? If so, how quickly could we evaluate the other servers in the farm?
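As a rough illustration of what "meaningful information" could look like, the sketch below summarizes the text output of `netstat -ano` by TCP state and by remote host. It is written in Python rather than PowerShell 2, so it is not the script from this post; the function names and the sample output are invented, and it assumes the usual five-column TCP rows (Proto, Local Address, Foreign Address, State, PID).

```python
from collections import Counter

SUSPECT_STATES = {"CLOSING", "FIN_WAIT_2"}


def parse_netstat(text):
    """Yield (local, remote, state, pid) for each TCP row of `netstat -ano` output."""
    for line in text.splitlines():
        fields = line.split()
        # TCP rows have five columns; headers and UDP rows (no State) are skipped.
        if len(fields) == 5 and fields[0] == "TCP":
            _, local, remote, state, pid = fields
            yield local, remote, state, int(pid)


def summarize(text):
    """Count connections per state, and suspect connections per remote host."""
    by_state = Counter()
    suspects_by_host = Counter()
    for _local, remote, state, _pid in parse_netstat(text):
        by_state[state] += 1
        if state in SUSPECT_STATES:
            # Strip the port: "10.0.0.9:8080" -> "10.0.0.9"
            suspects_by_host[remote.rsplit(":", 1)[0]] += 1
    return by_state, suspects_by_host


# Invented sample, shaped like Windows `netstat -ano` output.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:49152         10.0.0.9:8080          FIN_WAIT_2      2044
  TCP    10.0.0.5:49153         10.0.0.9:8080          CLOSING         2044
  TCP    10.0.0.5:49154         10.0.0.7:8080          ESTABLISHED     4312
  UDP    0.0.0.0:123            *:*                                    1020
"""

by_state, suspects_by_host = summarize(SAMPLE)
for state, count in by_state.most_common():
    print(state, count)
for host, count in suspects_by_host.most_common():
    print(host, count)
```

Because the summary works on plain text, the same function could be fed output captured from each server in the farm and the per-state counts compared side by side.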


Written by Carlos Veira Lorenzo

December 19th, 2010 at 4:29 pm