Update 10-Jan-2008: I have added a link to a full trace log from one of our service calls to demonstrate the type and quality of information you should be logging...
One of the biggest areas for improvement in software development is debugging - or, put another way, reducing the time to understand and fix bugs - let's call this Time To Fix (TTF).
Reducing TTF has benefits directly related to "Business Agility" - more bugs can be tackled and fixed in a given period, or a set threshold (severity) of bugs can be fixed in a shorter time. Rather nicely, the means of improving TTF also reduces the number of bugs being produced over time - we'll cover how this works later on.
Objective
Drawing on my "Business Agility through Component Software" concept and on hands-on development over the past three years or so, I have experienced first-hand how inefficient bug fixing can be and how enormous time savings can be made with a better approach to designing and implementing your software. I want to demonstrate practical changes you could make to your software in order to increase its quality and reduce your TTF.
"Through the looking glass"
Read this in order to set the scene on the type of software and environment that has spawned this article.
One of the biggest challenges developers face comes when their software is deployed and testing starts. Once the software is installed outside of a development environment it is on an "operational" footing - you rarely have the luxury of an IDE or serious debugging tools, as they are considered a risk to the platform or environment. Take away the IDE and dev tools and you are left with an enormous hulk of code that offers only a tiny spyhole into why it has broken.
"Software not only has a responsibility to work but also to provide information on why it doesn't when it fails."
Marvellous idea, eh? It's nothing new, obviously - developers have been writing exceptions and debugging information to log files, the event log and so on for a long time, and many libraries exist to help facilitate this (Enterprise Library, Log4Net).
However, my experience is that the quality of the information logged is usually very poor - generally cryptic and almost no help in finding the bug. Why would this be the case? Well, let's start with the notion that the code even has bugs - preposterous! Why worry about the quality of the information when exceptions just don't happen? Widget.cs, line 450 - "NullReferenceException - Object reference not set to an instance of an object" is simply not good enough to diagnose and solve a bug, is it?
The next issue is that a condition or code path often goes wrong long before an exception is actually thrown. A stack trace gives you a pretty clear idea of where the exception occurred, but many is the time I've seen heads scratched and heard the words "...it shouldn't even be in this code..." uttered.
Developers generally don't write software with an operational perspective in mind - I think this is a key reason for the poor information available when trying to fix a bug. Information about what the code is actually doing is vital to diagnosing problems quickly. By this I mean providing a "trace" through the code as it executes. Each trace event should contain information such as the following (a minimal sketch follows the list):
- Instance executing (vital in web environments) - usually a GUID. This will help isolate a single trace when it is stored in a common format or repository
- Location information (service name, assembly, class)
- Point in time (best recorded as UTC for global applications)
- Message/description of event
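What might a single trace event look like? Here is a minimal sketch, assuming a hypothetical AuditTrace helper (the names are mine, not from any particular library) that stamps each event with the executing instance, location, UTC time and message:

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper illustrating the kind of detail each trace event could carry.
public static class AuditTrace
{
    // One GUID per logical execution (e.g. per web request) so a single
    // call can be isolated later in a shared log repository.
    [ThreadStatic]
    private static Guid instanceId;

    public static void BeginInstance()
    {
        instanceId = Guid.NewGuid();
    }

    public static void Write(string serviceName, string className, string message)
    {
        // Instance, location, UTC timestamp and message in one line.
        Trace.WriteLine(string.Format(
            "{0:u} [{1}] {2}.{3} - {4}",
            DateTime.UtcNow, instanceId, serviceName, className, message));
    }
}
```

A call such as AuditTrace.Write("OrderService", "BasketManager", "Loaded 3 items from the basket") then lands in the log alongside everything else carrying the same instance GUID, so one execution can be pulled out of a busy repository.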
Clearly this can get very, very verbose. Plus, the code will be littered with "trace" statements. There are Aspect-Oriented Programming (AOP) techniques to transparently generate "Entering Method X" and "Leaving Method X" trace statements, but knowing where you have been often doesn't tell you why - AOP tracing is not the answer by itself, as it solves one issue (readability) but offers no more real information than a stack trace.
"Fundamental to tracing is recording the logic flow within a method"
My answer to tracing statements "cluttering" the code is simply that it is for the greater good.
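To illustrate the difference, here is a sketch of in-method tracing that records which branch actually executed - the method and its rules are invented for the example, and it reuses the hypothetical AuditTrace helper from above:

```csharp
// Invented example: the trace records the logic flow, not just entry and exit.
public static class DiscountCalculator
{
    public static decimal CalculateDiscount(bool isTradeAccount, decimal orderValue)
    {
        AuditTrace.Write("OrderService", "DiscountCalculator",
            string.Format("Calculating discount for order value {0}", orderValue));

        if (orderValue <= 0m)
        {
            AuditTrace.Write("OrderService", "DiscountCalculator",
                "Order value is zero or negative - no discount applies");
            return 0m;
        }

        if (isTradeAccount)
        {
            AuditTrace.Write("OrderService", "DiscountCalculator",
                "Trade account - applying 10% discount");
            return orderValue * 0.10m;
        }

        AuditTrace.Write("OrderService", "DiscountCalculator",
            "Retail customer - no discount");
        return 0m;
    }
}
```

Reading the resulting trace tells you not just that the method was entered and left, but which decision it took and why.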
What about all those trace statements, the amount of information output and impact on performance?
One thing I will say about every tracing/logging framework I have encountered: they are often too complicated, especially when it comes to "levels" (info, warning, error, critical etc). This plethora of levels leads to many problems...
- Confusion as to the definition of each level
- Coding pace slows as developers struggle with applying these levels
- Inconsistent application of levels to events (developers get sloppy and cut and paste statements and forget to adjust the level - honest, it does happen!)
My answer to both the levels problem and the performance question is to simplify radically (a two-level auditor is sketched after this list):
- There are only two audit levels - "debug" and "business". A debug event is part of the verbose trace used to diagnose a bug; a business event is usually something of significance that must always be recorded - for instance, a modification to a value.
- The debug events are only enabled when required to investigate a problem. Performance is only impacted for the minimum period necessary.
- The quantity of information is not a problem if you can isolate the information you need to solve the problem.
- Use "rolling" event sinks such as the event log debugview or rolling date/size limited files
The developer can also directly help by explicitly seeking out exception conditions before they blow up in an "uncontrolled" manner - e.g. your typical "NullReferenceException". I am talking about pre- and post-conditional code checks. All code, be it an assembly, class or method, operates within a set of limits - these may be datatype limits or conceptual limits (e.g. a UK traffic light has three colours; you cannot set it to any colour other than red, amber or green). Without "spoiling" my future post on Behaviour and Interface driven design, these pre- and post-conditional checks should check the inputs, outputs and assumptions a developer may make before, during and after the execution of a specific piece of code. The typical objective is to "fail hard and fail fast" but I think it should be...
"Fail hard, fail fast and fail informatively"
In code terms this means checking the validity of each input parameter and, where data is returned, doing the same. When an exception condition is encountered, throw the correct type of exception (ArgumentOutOfRangeException, FormatException, ArgumentNullException) to clearly show why it has failed and, if possible, provide a much more informative message - though care must be taken over how public the audience for this message may be, as you could be giving away too much information to a hacker. Remember, though, that the point of the exception is often not the culprit - a null parameter is not the fault of the method that receives it - however tightening up here does bring some nice side-effects.
Consistency in how you organise and lay out your code is not critical, but it certainly plays a minor role in TTF. Pre- and post-checking organises the layout of a method by decluttering it: the pre-checks are performed, the meat of the method can safely assume all is well with the world, and the result is handed off once complete to the post-checks to ensure the method has done what it should. A bonus in .Net is that you can use #regions to hide these checks, clarifying the code even further. Another positive side effect: you can use your nice clear method comments to help you write the checks, or vice versa - once your code is complete you can comment the exact behaviour of the method. The former is my preferred course of coding and, again, a subject for another day.
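To make that concrete, here is a sketch using the UK traffic light example from above - the class and member names are invented, but it shows both the choice of exception type and the pre/post checks tucked away in #regions:

```csharp
using System;

// Illustrative sketch of pre- and post-condition checks.
public class TrafficLight
{
    private static readonly string[] ValidColours = { "red", "amber", "green" };

    public string CurrentColour { get; private set; }

    public void SetColour(string colour)
    {
        #region Pre-conditions
        if (colour == null)
        {
            throw new ArgumentNullException("colour", "A traffic light colour must be supplied.");
        }
        if (Array.IndexOf(ValidColours, colour.ToLowerInvariant()) < 0)
        {
            throw new ArgumentOutOfRangeException("colour", colour,
                "A UK traffic light can only be set to red, amber or green.");
        }
        #endregion

        // The "meat" of the method can safely assume its inputs are valid.
        CurrentColour = colour.ToLowerInvariant();

        #region Post-conditions
        if (Array.IndexOf(ValidColours, CurrentColour) < 0)
        {
            // The method has broken its own promise - fail hard, fast and informatively.
            throw new InvalidOperationException(
                string.Format("TrafficLight is in an invalid state: '{0}'.", CurrentColour));
        }
        #endregion
    }
}
```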
What next?
What are the practical recommendations I can give you to help improve your debugging and TTF?
I believe the magic ingredient is to make the diagnostic output of your application part of your peer review and certainly part of your acceptance or QA tests. I cannot stress this enough - like any part of a system you rely on, you need to test it and test it again to improve its quality. There is absolutely no point in putting ANY debugging/tracing events into your code if they have no value in diagnosing a problem. To ensure they do this job you should use your unit tests to test-drive the trace output. I even suggest that you have unit tests dedicated to this - they should deliberately provoke a problem in a component in order to produce a set of debug events. There is no "programmatic" way of inspecting and passing or failing these; it is simply a means to produce a trace earlier in the development cycle and get it inspected for suitability. It is actually very interesting reading a trace back - I have spotted several bugs because the trace didn't make sense.
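As a sketch of what such a dedicated test might look like (NUnit syntax assumed, reusing the TrafficLight sketch from earlier), the test below deliberately provokes a failure and routes the trace to a file purely so a reviewer can read it back:

```csharp
using System;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
public class TrafficLightTraceTests
{
    // This test exists to produce a trace for human review, not to assert on it.
    [Test]
    public void SettingAnInvalidColour_ProducesAReadableTrace()
    {
        // Route Trace output to a file the reviewer can inspect.
        using (var listener = new TextWriterTraceListener("trafficlight-trace.log"))
        {
            Trace.Listeners.Add(listener);
            try
            {
                var light = new TrafficLight();
                light.SetColour("blue");   // deliberately provoke the failure
            }
            catch (ArgumentOutOfRangeException ex)
            {
                Trace.WriteLine("Expected failure: " + ex.Message);
            }
            finally
            {
                Trace.Flush();
                Trace.Listeners.Remove(listener);
            }
        }
    }
}
```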
Another tip is to formalise the importance of debugging and tracing within the code. Generally there needs to be a contiguous trace throughout the software stack, for the reasons stated already. For a component to be part of the trace, create an interface called "ISupportsAuditing" - this interface has a single get/set property called "Auditor", which provides a handle to an instance of the component that actually does the tracing/auditing. This auditor/tracer component can be created externally (in a web page or web service) and used to set the "Auditor" property (alternatively you can set it in the constructor). You can now easily place calls to the auditor/tracer within your component. Using an interface for the auditing component is a wise move too, as it allows a soft-wiring of your code to the auditing/tracing component.
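In code the interface is tiny - here is a sketch, reusing the hypothetical IAuditor from earlier (the component and method names are invented):

```csharp
// A sketch of the interface described above.
public interface ISupportsAuditing
{
    IAuditor Auditor { get; set; }
}

// A component opts in to the contiguous trace by implementing it.
public class OrderProcessor : ISupportsAuditing
{
    public IAuditor Auditor { get; set; }

    public OrderProcessor()
    {
    }

    // Alternatively the auditor can be supplied via the constructor.
    public OrderProcessor(IAuditor auditor)
    {
        Auditor = auditor;
    }

    public void Process(int orderId)
    {
        if (Auditor != null)
        {
            Auditor.Debug(string.Format("Processing order {0}", orderId));
        }
        // ... the actual work ...
    }
}
```

In a web page or web service you would then soft-wire it up with something like new OrderProcessor { Auditor = new TraceAuditor { DebugEnabled = true } }, keeping the component coupled only to the interface rather than any particular logging library.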
To summarise...
- Diagnosing and resolving bugs within operational software requires good quality information. This allows you to find the bug quicker and fix it - TTF is reduced and software quality is improved.
- Simplify the trace/audit levels to avoid confusion and inconsistency in application. You can do this by creating your own auditing/tracing sub system to simplify and wrap standard libraries like Enterprise Library or Log4Net.
- Formalise the auditing/tracing within the code by explicitly implementing an "I Support Auditing/Tracing" interface on complex or mission critical components.
- Road test the trace as soon as possible to ensure it does the job it is intended for. If it doesn't then it is pointless! Use unit tests to produce trace output for inspection and review these as part of peer/QA/code review and acceptance. Could you zero in on a bug (someone else's) from the trace alone? Could you offer three ideas for where/why the bug occurred, prompted by the trace alone?
- Ensure the information in the trace allows isolation and location of information - if there is a lot of it you will need to be able to find the particular call from potentially hundreds!
- I strongly recommend you use System.Diagnostics.Trace (which writes to the Windows debug output) to capture output - then use DebugView to capture this to file. It has several options for file rollover, which makes it suitable for long-term capture (say in a soak test or staging environment). A minimal example follows this list.
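On the code side, very little is needed - the default trace listener emits via OutputDebugString, which DebugView captures and can roll to file (the message text below is just an example):

```csharp
using System.Diagnostics;

class SoakTestProbe
{
    static void Main()
    {
        // The default trace listener emits via OutputDebugString,
        // so DebugView can capture this line and write it to a rolling file.
        Trace.WriteLine("OrderService: soak test iteration complete");
        Trace.Flush();
    }
}
```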
Finally - my experience is that the operational aspect of software is often overlooked. This mistake is made again and again, and the time it takes to fix a "bug" is never questioned. I think this "time to fix" (TTF) can be reduced if you ensure you provide the information needed to do so. To ensure that this information is of sufficient quality it needs to be inspected and tested as part of your software QA/testing process. Either way you are a winner by reducing your TTF - you either fix N bugs more quickly, saving cost, or you fix more bugs in the same amount of time, improving the quality of your software. In both cases, inspecting the debug trace often reveals subtle bugs because the trace just doesn't read "right".
Think about it - are you spending too long fixing your bugs?