Welcome to Code Distance Visualizer
Code Distance Visualizer is a static program analyzer. It
learns patterns from user-defined instances of faulty and correct
code and uses visualization to indicate which fragments of
a program's source code are most similar to these
instances. The tool is trained incrementally by
feeding it training instances (in the user interface,
this is done by selecting a code fragment and marking it as
faulty, for example).
Finally it is publicly available at sourceforge.net
Current Version: 0.4.0 [2008-07-08] (download it here!)
Code Distance Visualizer presents a new technique that incorporates machine
learning and information visualization into static code analysis.
The technique learns patterns in a program's source code using a normalized
compression distance (NCD) and applies them to classify code fragments as
faulty or correct. Since the classification is frequently not perfect, the training
process plays an essential role.
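To make the idea of classifying by compression distance more concrete, here is a minimal sketch in Python. It is not the tool's own code (Code Distance Visualizer is written in C), and everything in it is an assumption made for illustration: zlib stands in for the compressor, the helper names compressed_size, ncd and classify are hypothetical, and a simple nearest-instance rule is used as the classifier. The distance itself follows the usual NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length.

```python
import zlib
from typing import Sequence


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(fragment: bytes,
             faulty: Sequence[bytes],
             correct: Sequence[bytes]) -> str:
    """Label a fragment by its nearest training instance (assumed rule)."""
    d_faulty = min(ncd(fragment, f) for f in faulty)
    d_correct = min(ncd(fragment, c) for c in correct)
    return "faulty" if d_faulty < d_correct else "correct"
```

With only a handful of training instances such a rule will often be wrong, which is why the training process matters: each newly marked fragment extends the faulty or correct set and can change the label of every other fragment.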
A major weakness of systems that use machine learning techniques is that
they tend to act like black boxes, i.e. it is difficult for the user to determine
what exactly has been learned.
A visualization element is used in the hope that it
lets the user better understand the inner state of the
classifier, making the learning process transparent. The
coloring is done interactively, so the user receives
immediate feedback and sees how their training affects the
outcome of the classifier.
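One way such a coloring could be derived is sketched below in Python. This page does not say how the tool computes its colors, so the scheme here is purely an assumption: the hypothetical helpers heat_score and heat_color take a fragment's distances to its nearest faulty and nearest correct training instance and map them onto a green-to-red scale.

```python
from typing import Tuple


def heat_score(d_faulty: float, d_correct: float) -> float:
    """Return a value in [0, 1]; values near 1 mean the fragment lies
    much closer to the faulty instances than to the correct ones."""
    total = d_faulty + d_correct
    if total == 0:
        return 0.5  # no information either way
    return d_correct / total


def heat_color(score: float) -> Tuple[int, int, int]:
    """Map a score in [0, 1] to an RGB triple, green (0) to red (1)."""
    score = min(1.0, max(0.0, score))  # clamp out-of-range input
    red = round(255 * score)
    green = round(255 * (1.0 - score))
    return (red, green, 0)
```

Recomputing the scores after every newly marked instance and redrawing the fragments is what would give the user the immediate feedback described above.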
Code Distance Visualizer is written in C and makes
heavy use of the following open source tools (for more details
see the Documentation):
- Sparse - a semantic parser for C. Using Sparse, the entire
program's source code is parsed into another textual
representation that emphasizes the code's structure rather
than syntactic details. It is also used for chopping
up code into fragments.
- a suite of utilities that support compression-based
learning. It was used to implement the naive NCD
- GTK+ - a widget
toolkit for creating GUIs. It was also used for
the heatmap implementation.
So in order to run it, you need to have these tools (and many others) installed on your machine.
The master's thesis that describes this novel approach to static
code analysis goes into more detail about the technique and how Code Distance Visualizer works,
including some experiments on real-world source code. It is also the closest thing to a
user manual there is.
Check some screenshots at sourceforge.net
The file CDV_demo.zip contains a
short (no more than 5 minutes) demonstration of the Code
Distance Visualizer. It shows how the tool is trained to
recognize memory leaks after feeding it a few examples of
correct and faulty code. The code samples are taken from
the open source project Samba. The original files consist of more
than 8K lines of code, so for demonstration purposes we
use just the excerpts that contain bug fixes (to
avoid scrolling up and down looking for a particular piece of code).
Besides demonstrating the technique,
this small example makes the strengths and weaknesses of the
learning process visually apparent: the user can
discern when the tool's output is correct and when it is
not, and can take corrective action (further training or
retraining) interactively (as you see, "click-by-click")
until the desired level of performance is reached. This would
not be possible with tools that work on the same principles
but offer no visualization support.
Copyright © 2008 Darius Sidlauskas & Denis Kacan (dsidlauskas & denis.kacan [at] gmail.com)