Code Distance Visualizer


Code Distance Visualizer is a static program analyzer. It learns patterns from user-defined instances of faulty and correct code and uses visualization to indicate which fragments of a program's source code are most similar to these instances. The tool is trained incrementally by feeding it training instances; in the user interface this is done, for example, by selecting a code fragment and marking it as faulty.

The tool is now publicly available at sourceforge.net

Current Version: 0.4.0 [2008-07-08] (download it here!)

Introduction:

Code Distance Visualizer presents a new technique that incorporates machine learning and information visualization into static code analysis. The technique learns patterns in a program's source code using the normalized compression distance (NCD) and applies them to classify code fragments as faulty or correct. Since the classification is frequently not perfect, the training process plays an essential role.
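For readers unfamiliar with NCD, the usual definition is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s and xy is the concatenation of x and y. The following is a minimal sketch of that formula, not the tool's own code: it assumes zlib as the compressor, and the function names and the three C-like sample fragments are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: zlib stands in for whatever compressor the real tool uses.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical fragments, written here only to exercise the function.
    leak = b"p = malloc(n); if (!p) return -1; use(p); return 0;"
    fixed = b"p = malloc(n); if (!p) return -1; use(p); free(p); return 0;"
    other = b"for (i = 0; i < len; i++) sum += table[i] * weight[i];"
    print("leak vs fixed:", ncd(leak, fixed))
    print("leak vs other:", ncd(leak, other))
```

Fragments that share most of their text compress well together, so their distance comes out small; unrelated fragments give a value closer to 1.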

Black box approach to learning

A major weakness of systems that use machine learning techniques is that they tend to act like black boxes, i.e. it is difficult for the user to determine what exactly has been learned.

A visualization element is used in the hope that it lets the user better understand the inner state of the classifier, making the learning process transparent.

White box approach to learning

The coloring is done interactively, so the user receives immediate feedback and can see how the training affects the outcome of the classifier.
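One way to picture such a coloring is as a mapping from classifier distances to a color scale. The sketch below is purely illustrative and makes several assumptions that do not come from the tool: that each fragment has one distance to the faulty examples and one to the correct examples, that the two are combined into a single score as shown, and that the scale runs from green to red. The function names are hypothetical.

```python
def faultiness(d_faulty: float, d_correct: float) -> float:
    """Score in [0, 1]; values near 1 mean the fragment is closer to faulty examples.

    Assumption: both distances are non-negative (as NCD values are).
    """
    total = d_faulty + d_correct
    if total == 0:
        return 0.5  # equally close to both kinds of example: undecided
    return d_correct / total


def heat_color(score: float) -> str:
    """Blend linearly from green (score 0) to red (score 1), as #rrggbb."""
    score = min(1.0, max(0.0, score))  # clamp out-of-range scores
    red = round(255 * score)
    green = round(255 * (1.0 - score))
    return f"#{red:02x}{green:02x}00"


if __name__ == "__main__":
    # A fragment at distance 0.2 from faulty code and 0.8 from correct code.
    print(heat_color(faultiness(0.2, 0.8)))
```

Recomputing these colors after every newly marked fragment is what would give the user the immediate feedback described above.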

Code Distance Visualizer is written in C and makes heavy use of the following open source tools (for more details see the Documentation):

Sparse - a semantic parser for C. Using Sparse, the entire program's source code is translated into another textual representation that emphasizes the code's structure rather than syntactic details. It is also used for chopping the code up into fragments.
CompLearn - a suite of utilities that support compression-based learning. It was used to implement the naive NCD classifier.
GTK+ - a widget toolkit for creating graphical user interfaces. It was also used to implement the heatmap.

To run it, you therefore need to have these tools (and many others) installed on your machine.
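To make the idea of a naive NCD classifier with incremental training more concrete, here is a minimal sketch. It is an assumption-laden illustration rather than the tool's implementation: it uses zlib instead of CompLearn, works on raw text instead of the Sparse representation, and applies a simple nearest-example rule. The class name, method names and sample fragments are hypothetical.

```python
import zlib
from typing import Tuple


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib standing in as the compressor."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


class NearestExampleClassifier:
    """Hypothetical classifier: a fragment gets the label of its closest example."""

    def __init__(self) -> None:
        self.examples = []  # list of (fragment, label) pairs

    def train(self, fragment: bytes, label: str) -> None:
        """Incremental training: remember one more labelled fragment."""
        self.examples.append((fragment, label))

    def classify(self, fragment: bytes) -> Tuple[str, float]:
        """Return the label of the nearest stored example and its distance."""
        if not self.examples:
            raise ValueError("classifier has no training examples yet")
        distance, label = min(
            (ncd(fragment, example), label) for example, label in self.examples
        )
        return label, distance


if __name__ == "__main__":
    clf = NearestExampleClassifier()
    clf.train(b"p = malloc(n); if (!p) return -1; use(p); return 0;", "faulty")
    clf.train(b"for (i = 0; i < len; i++) sum += table[i] * weight[i];", "correct")
    print(clf.classify(b"q = malloc(len); if (!q) return -1; use(q); return 0;"))
```

Because training only appends to the example list, each newly marked fragment takes effect on the next classification, which fits the click-by-click training style described on this page.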

Documentation:

The master's thesis, which describes a novel approach to static code analysis, goes into more detail about this technique and how Code Distance Visualizer works, and includes some experiments on real-world source code. It is also the closest thing to a user manual there is.

Screenshots:

See some screenshots at sourceforge.net

Demo:

The file CDV_demo.zip contains a short demonstration (no more than 5 minutes) of the Code Distance Visualizer. It shows how the tool is trained to recognize memory leaks after being fed a few examples of correct and faulty code. The code samples are taken from the open source project Samba. The original files consist of more than 8K lines of code, so for demonstration purposes we use only the excerpts that contain bug fixes, to avoid scrolling up and down in search of a particular code fragment.

Besides demonstrating the technique, this small example makes the strengths and weaknesses of the learning process visually apparent: the user can discern when the tool's output is correct and when it is not, and can take corrective action (further training or retraining) interactively, "click by click" as the demo shows, until the desired level of performance is reached. This would not be true of tools that work along the same principles but lack visualization support.
