Analysing and Debugging Memory Dumps (STOP Errors) With WinDbg

For the last 9 years I've been working for systems integrators that have always had Microsoft Premier Support contracts. So whenever I've had major server issues, I've just flicked the dumps off to Microsoft for analysis.

Well, a couple of years ago I thought that I'd stop being so lazy and learn how to do it myself. And do you know what? It is so easy to do. We had an issue at a customer's site several months ago where their Citrix PS4 servers (Windows 2003) were intermittently blue screening. As part of our build process, a 16MB pagefile is placed on the system drive and the servers are set to write a small memory dump (minidump). So we were already getting memory dumps from these blue screens.

I installed the "Debugging Tools for Windows" from here: http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

Then ran up WinDbg (pronounced WinDebug).

Go to File > Symbol File Path...

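Type in SRV*V:\symbols*http://msdl.microsoft.com/download/symbols

Note: Ensure you set the drive path correctly. In my case I was using V. Also make sure the prefix is SRV, not SVR. The output below was captured with the prefix mistyped as SVR, which is most likely why it complains that the kernel symbols are wrong.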

This allows WinDbg to download the symbols needed to help analyse the dump.

Select OK
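
If you build the symbol path in a script rather than typing it in, a small helper can guard against typos in the prefix. This is only a sketch: the function names are my own, and the check only covers the three-part SRV*cache*server form used in this post.

```python
def build_symbol_path(cache_dir, server="http://msdl.microsoft.com/download/symbols"):
    """Return a symbol-server path of the form SRV*<cache>*<server>."""
    if not cache_dir or "*" in cache_dir:
        raise ValueError("cache_dir must be a non-empty path without '*'")
    return "SRV*{}*{}".format(cache_dir, server)


def looks_like_symbol_server_path(path):
    """Cheap sanity check: three '*'-separated parts, the first being SRV."""
    parts = path.split("*")
    return len(parts) == 3 and parts[0].upper() == "SRV" and all(parts[1:])


if __name__ == "__main__":
    # V:\symbols is the local cache folder used in this post.
    print(build_symbol_path(r"V:\symbols"))
```

The check is deliberately strict about the prefix, because a misspelt keyword is easy to miss when reading the path back.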

Go to File > Open Crash Dump...

On Windows 2003 servers, the mini crash dumps are found in the %SystemRoot%\Minidump folder, which is U:\Windows\Minidump in my case.

Open the relevant minidump.
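
When a server has blue screened several times, the folder fills up with dumps and you usually want the latest one. A minimal sketch for finding it, assuming the dumps use the .dmp extension and that the file modification time is a good enough ordering (the function name is my own):

```python
import os
from pathlib import Path


def newest_minidump(folder=None):
    """Return the most recently modified *.dmp file in folder, or None."""
    if folder is None:
        # Assumption: fall back to C:\Windows if %SystemRoot% is not set.
        folder = Path(os.environ.get("SystemRoot", r"C:\Windows")) / "Minidump"
    dumps = sorted(Path(folder).glob("*.dmp"), key=lambda p: p.stat().st_mtime)
    return dumps[-1] if dumps else None


if __name__ == "__main__":
    print(newest_minidump())
```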

Then we get lots of good information in the WinDbg window.

----------------------------Beginning----------------------------------
Microsoft (R) Windows Debugger  Version 6.6.0003.5
Copyright (c) Microsoft Corporation. All rights reserved.

 

Loading Dump File [U:\WINDOWS\Minidump\Mini071006-04.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SVR*V:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Unable to load image ntoskrnl.exe, Win32 error 2
*** WARNING: Unable to verify timestamp for ntoskrnl.exe
*** ERROR: Module load completed but symbols could not be loaded for ntoskrnl.exe
Windows Server 2003 Kernel Version 3790 (Service Pack 1) MP (4 procs) Free x86 compatible
Product: Server, suite: TerminalServer
Kernel base = 0x80800000 PsLoadedModuleList = 0x808af988
Debug session time: Mon Jul 10 16:41:43.015 2006 (GMT+8)
System Uptime: 0 days 0:02:47.656
Unable to load image ntoskrnl.exe, Win32 error 2
*** WARNING: Unable to verify timestamp for ntoskrnl.exe
*** ERROR: Module load completed but symbols could not be loaded for ntoskrnl.exe
Loading Kernel Symbols
.................................................................................................................
Loading User Symbols
Loading unloaded module list
.....
Unable to load image cdm.sys, Win32 error 2
*** WARNING: Unable to verify timestamp for cdm.sys
*** ERROR: Module load completed but symbols could not be loaded for cdm.sys
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 10000050, {f000b1eb, 0, f62b28ec, 2}

***** Kernel symbols are WRONG. Please fix symbols to do analysis.

*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
Probably caused by : cdm.sys ( cdm+78ec )

Followup: MachineOwner
---------------------------------End-----------------------------------------------------

See the line above..."Probably caused by : cdm.sys". It's giving us a hint already :)
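
If you have a pile of dumps to triage, the two most useful lines of that output (the BugCheck line and the "Probably caused by" line) are easy to pull out of saved debugger text. The following is a rough sketch, assuming the output has been saved as plain text in the format shown above; the regular expressions and the function name are my own.

```python
import re

# BugCheck <code>, {<arg1>, <arg2>, <arg3>, <arg4>}  (all values in hex)
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
CULPRIT_RE = re.compile(r"Probably caused by\s*:\s*(\S+)")


def summarize(output):
    """Extract the bug check code, its arguments and the suspect module."""
    summary = {}
    m = BUGCHECK_RE.search(output)
    if m:
        summary["code"] = int(m.group(1), 16)
        summary["args"] = [int(a, 16) for a in m.group(2).split(",")]
    m = CULPRIT_RE.search(output)
    if m:
        summary["module"] = m.group(1)
    return summary


sample = """BugCheck 10000050, {f000b1eb, 0, f62b28ec, 2}
Probably caused by : cdm.sys ( cdm+78ec )"""
print(summarize(sample))
```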

Now we need to analyse it. Notice the "1: kd>" prompt at the bottom of the command window? This is where we type in commands.

In the command window, type !analyze -v

Note the American spelling: analyze, not analyse.

This runs the analysis with verbose output, which extracts as much information as possible.

----------------------------Beginning----------------------------------
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except,
it must be protected by a Probe.  Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: f000b1eb, memory referenced.
Arg2: 00000000, value 0 = read operation, 1 = write operation.
Arg3: f62b28ec, If non-zero, the instruction address which referenced the bad memory
                address.
Arg4: 00000002, (reserved)

Debugging Details:
------------------

***** Kernel symbols are WRONG. Please fix symbols to do analysis.

*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************

MODULE_NAME:  cdm

FAULTING_MODULE: 80800000 nt

DEBUG_FLR_IMAGE_TIMESTAMP:  43d682a0

READ_ADDRESS: unable to get nt!MmSpecialPoolStart
unable to get nt!MmSpecialPoolEnd
unable to get nt!MmPoolCodeStart
unable to get nt!MmPoolCodeEnd
 f000b1eb

FAULTING_IP:
cdm+78ec
f62b28ec 8b44817c         mov     eax,[ecx+eax*4+0x7c]

MM_INTERNAL_CODE:  2

CUSTOMER_CRASH_COUNT:  4

DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP

BUGCHECK_STR:  0x50

LAST_CONTROL_TRANSFER:  from f62d5270 to f62b28ec

STACK_TEXT: 
WARNING: Stack unwind information not available. Following frames may be wrong.
f48f3968 f62d5270 887a3008 88952e88 f48f3a9c cdm+0x78ec
f48f3a5c 8083f9d0 8a1f7580 887a3008 887a3008 cdm+0x2a270
f48f3a70 8092e269 f48f3c18 8a1f7568 00000000 nt+0x3f9d0
f48f3b58 80936caa 8a1f7580 00000000 888a0488 nt+0x12e269
f48f3bd8 80936aa5 00000000 f48f3c18 00000040 nt+0x136caa
f48f3c2c 80936f27 00000000 00000000 3000f001 nt+0x136aa5
f48f3ca8 80936ff8 0105f654 00100080 0105f63c nt+0x136f27
f48f3d04 8093d023 0105f654 00100080 0105f63c nt+0x136ff8
f48f3d44 80834d3f 0105f654 00100080 0105f63c nt+0x13d023
f48f3d64 7c82ed54 badb0d00 0105f60c 00000000 nt+0x34d3f
f48f3d68 badb0d00 0105f60c 00000000 00000000 0x7c82ed54
f48f3d6c 0105f60c 00000000 00000000 00000000 0xbadb0d00
f48f3d70 00000000 00000000 00000000 00000000 0x105f60c

 

STACK_COMMAND:  .bugcheck ; kb

FOLLOWUP_IP:
cdm+78ec
f62b28ec 8b44817c         mov     eax,[ecx+eax*4+0x7c]

FAULTING_SOURCE_CODE: 

 

SYMBOL_STACK_INDEX:  0

FOLLOWUP_NAME:  MachineOwner

SYMBOL_NAME:  cdm+78ec

IMAGE_NAME:  cdm.sys

BUCKET_ID:  WRONG_SYMBOLS

Followup: MachineOwner
---------------------------------End-----------------------------------------------------

This confirms that the cdm.sys driver is the cause of the blue screens: the faulting instruction (FAULTING_IP) and the top frame of the stack (STACK_TEXT) are both in cdm.
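
To compare several dumps quickly, it can help to reduce STACK_TEXT to a list of module names, top of the stack first. This sketch assumes the 32-bit (x86) frame layout shown above, with five 8-digit hex columns followed by module+offset; the regular expression and function name are my own. Keep in mind the debugger's own warning that, without unwind information, frames below the top one may be wrong.

```python
import re

# Matches x86 frame lines such as:
# f48f3968 f62d5270 887a3008 88952e88 f48f3a9c cdm+0x78ec
FRAME_RE = re.compile(r"^(?:[0-9a-f]{8}\s+){5}(\S+)$")


def frame_modules(stack_text):
    """Return the module name of each module+offset frame, top first."""
    modules = []
    for line in stack_text.splitlines():
        m = FRAME_RE.match(line.strip())
        # Skip raw addresses such as 0x7c82ed54 that have no module name.
        if m and "+" in m.group(1):
            modules.append(m.group(1).split("+")[0])
    return modules


sample = """f48f3968 f62d5270 887a3008 88952e88 f48f3a9c cdm+0x78ec
f48f3a5c 8083f9d0 8a1f7580 887a3008 887a3008 cdm+0x2a270
f48f3a70 8092e269 f48f3c18 8a1f7568 00000000 nt+0x3f9d0
f48f3d68 badb0d00 0105f60c 00000000 00000000 0x7c82ed54"""
print(frame_modules(sample))
```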

Type q in the command window to quit.
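
The same analysis can also be run without the GUI, which is handy when there are many dumps to get through. The sketch below assumes kd.exe from the same Debugging Tools for Windows package is on the PATH and uses its -z (dump file), -y (symbol path) and -c (initial command) options; the function names are my own, and I have only shown the command being built, not real output.

```python
import subprocess


def analyze_command(dump_path, symbol_path, debugger="kd.exe"):
    """Build an argument list that opens a dump, runs !analyze -v and quits."""
    return [debugger, "-z", dump_path, "-y", symbol_path, "-c", "!analyze -v; q"]


def run_analysis(dump_path, symbol_path):
    """Run the debugger and return its text output (Windows only)."""
    result = subprocess.run(
        analyze_command(dump_path, symbol_path),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Paths taken from this post; adjust them for your own machine.
    print(analyze_command(
        r"U:\WINDOWS\Minidump\Mini071006-04.dmp",
        r"SRV*V:\symbols*http://msdl.microsoft.com/download/symbols",
    ))
```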

So now I just had to do some research on the cdm.sys driver. From experience I know that this is a Citrix driver used for the client drive mapping process, but if you Google it, you will find that information anyway. So I then went searching through the Citrix KB and Forums, and found the following hotfix.

Hotfix PSE400R01W2K3064 - For Citrix Presentation Server 4.0 for Windows Server 2003

Three of the listed fixes are:

31. Servers may experience a fatal error, displaying a blue screen on CDM.sys during heavy utilization. The issue is found when the driver verifier is being used.
42. Servers experience a fatal error, displaying a blue screen on CDM.sys. This occurs when an application is accessing drive A in a session using "A:" rather than "A:\" or if the application is the first process to access a client drive in a session.
45. Servers are trapping in CDM.sys with the following STOP error message:
- DRIVER_VERIFIER_DETECTED_VIOLATION (c4)

This patch was deployed immediately, and there have been no more blue screens since. So I resolved in 2 hours what could have turned out to be a 24 to 48 hour process going through Microsoft Premier Support. The customer was chuffed, and so was I.

If you want to play around and learn how to do this, you can get a program called NotMyFault, which was developed by Mark Russinovich, formerly of Winternals. It can create some serious issues for you to practice on.

Get it from here: http://swatrant.blogspot.com/2005/12/notmyfault-fault-maker.html

I hope this info is not only helpful, but gives you confidence to tackle these issues yourself.

 


    9th April 2007