
What is a delta?

I've been throwing around this concept of deltas, but I haven't stopped to describe them.

A tree is a hierarchy of folders and files. A delta is the difference between two trees. In
theory, those two trees do not need to be related. In practice, however, we usually
calculate the difference between them because one of them is derived from the
other. Some developer started with tree N and made one or more changes, resulting in tree
N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term
"changeset" for exactly this purpose. A changeset is merely a list of the changes which
express the difference between two trees.

For example, let's suppose that Wilbur starts with tree N and makes the following changes:

1. He deletes $/top/subfolder/foo.c because it is no longer needed.
2. He edits $/top/subfolder/Makefile to remove foo.c from the list of file names.
3. He edits $/top/bar.c to remove all the calls to the functions in foo.c.
4. He renames $/top/hello.c and gives it the new name hola.c.
5. He adds a new file called feature_creep.c to $/top/.
6. He edits $/top/Makefile to add feature_creep.c to the list of file names.
7. He moves $/top/subfolder/readme.txt into $/top.

At this point, he commits all of these changes to the repository as a single
transaction. When the SCM server stores this delta, it must remember all of these changes.
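
To make the idea concrete, here is a minimal sketch of how Wilbur's changeset might be written down as data. The `Change` record and its field names are hypothetical, invented for this illustration; no particular SCM tool stores changesets this way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One entry in a changeset (hypothetical layout, for illustration only)."""
    kind: str                       # "delete", "edit", "rename", "add" or "move"
    path: str                       # path in tree N (for an add, the path in tree N+1)
    new_path: Optional[str] = None  # only set for renames and moves


# Wilbur's seven changes, committed together as one transaction.
changeset = [
    Change("delete", "$/top/subfolder/foo.c"),
    Change("edit", "$/top/subfolder/Makefile"),
    Change("edit", "$/top/bar.c"),
    Change("rename", "$/top/hello.c", "$/top/hola.c"),
    Change("add", "$/top/feature_creep.c"),
    Change("edit", "$/top/Makefile"),
    Change("move", "$/top/subfolder/readme.txt", "$/top/readme.txt"),
]

for number, change in enumerate(changeset, start=1):
    print(number, change.kind, change.path, change.new_path or "")
```

This list only names the changes. The "add" and "edit" entries would also need to carry file contents, or a file delta, before tree N+1 could be rebuilt from tree N.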

For changeset item 1 above, the delete of foo.c is easily represented. We simply remember
that foo.c existed in tree N but does not exist in tree N+1.

For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we
need each object in the repository to have an identifier which never changes, even when the
name or location of the item changes.

For changeset item 7, the move of readme.txt is another example of why repositories need
IDs for each item. If we simply remember every item by its path, we cannot remember the
occasions when that path changes.
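
As a rough sketch of why IDs help, suppose each tree is a mapping from a permanent item ID to the item's current path. The function name `tree_delta` and the numeric IDs below are assumptions made up for this example. With IDs in hand, a rename or move shows up as "same ID, different path" instead of an unrelated delete plus add.

```python
import posixpath


def tree_delta(old, new):
    """Compare two trees given as {item_id: path} dicts and list the changes."""
    changes = []
    for item_id, old_path in old.items():
        if item_id not in new:
            changes.append(("delete", old_path))
            continue
        new_path = new[item_id]
        if new_path == old_path:
            continue  # same ID, same path: nothing to record here
        if posixpath.dirname(new_path) == posixpath.dirname(old_path):
            changes.append(("rename", old_path, new_path))  # same folder, new name
        else:
            changes.append(("move", old_path, new_path))    # different folder
    for item_id, new_path in new.items():
        if item_id not in old:
            changes.append(("add", new_path))
    return changes


# Hypothetical IDs for a few of the items in Wilbur's trees.
tree_n = {
    1: "$/top/subfolder/foo.c",
    2: "$/top/hello.c",
    3: "$/top/subfolder/readme.txt",
}
tree_n_plus_1 = {
    2: "$/top/hola.c",
    3: "$/top/readme.txt",
    4: "$/top/feature_creep.c",
}

delta = tree_delta(tree_n, tree_n_plus_1)
for change in delta:
    print(change)
```

This sketch tracks only names and locations. Detecting edits would additionally require comparing file contents or version numbers, which is left out here.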

Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item
we need to remember that tree N+1 has a file called feature_creep.c which was never
present in tree N. However, a full representation of this changeset item needs to contain the
entire contents of that file.

Changeset items 2, 3 and 6 represent situations where a file which already existed has been
modified in some way. We could handle these items the same way as item 5, by storing the
entire contents of the new version of the file. However, we will be happier if we can do
deltas at the file level just as we are doing deltas at the tree level.

File deltas

A file delta merely expresses the difference between two files. Once again, we
calculate a file delta because we believe it will be smaller than the file itself, usually
because one of the files is derived from the other.

For text files, a well-known approach to the file delta problem is to compare line-by-line and
output a list of lines which have been inserted, deleted or changed. This is the same kind
of result that the Unix 'diff' command produces. The bad news is that this
approach only works for text files. The good news is that software developers and web
developers have a lot of text files.
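
The line-by-line comparison can be tried with Python's standard `difflib` module, which produces output in the same style as `diff -u`. The Makefile contents below are invented for illustration; they stand in for the edit that removes foo.c from a list of file names.

```python
import difflib

# Hypothetical before and after contents of a small Makefile.
old_lines = [
    "OBJS = foo.o bar.o\n",
    "all: $(OBJS)\n",
    "\tcc -o app $(OBJS)\n",
]
new_lines = [
    "OBJS = bar.o\n",
    "all: $(OBJS)\n",
    "\tcc -o app $(OBJS)\n",
]

# unified_diff yields header lines, then lines prefixed with
# "-" (removed), "+" (added) or " " (unchanged context).
diff = list(difflib.unified_diff(
    old_lines, new_lines,
    fromfile="Makefile (tree N)", tofile="Makefile (tree N+1)",
))
print("".join(diff), end="")
```

Only the first line differs, so the output records one removed line and one added line, with the untouched lines shown as context.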

CVS and Perforce use this approach for repository storage. Text files are deltified using a
line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the
penalty somewhat by compressing them.

Subversion and Vault are examples of tools which use binary file deltas for repository
storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284. This
algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been
changed. This means it can handle any kind of file, binary or text. As an ancillary benefit,
the VCDiff algorithm compresses the data at the same time.
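
To give a feel for what "byte-oriented" means, here is a toy delta built from "copy this range of the old file" and "add these new bytes" instructions. It is emphatically not VCDiff and does not follow the RFC 3284 format. The matching is borrowed from Python's `difflib.SequenceMatcher`, the names `make_delta` and `apply_delta` are made up for this sketch, and there is no compression step.

```python
import difflib
import random


def make_delta(old: bytes, new: bytes):
    """Describe `new` as ("copy", offset, length) and ("add", data) instructions."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))   # reuse bytes already in the old file
        elif tag in ("replace", "insert"):
            ops.append(("add", new[j1:j2]))     # bytes that exist only in the new file
        # "delete": those old bytes are simply never copied
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# A 2048-byte pseudo-random "binary file" with three bytes changed in the middle.
rng = random.Random(0)
old = bytes(rng.randrange(256) for _ in range(2048))
changed = bytearray(old)
for i in range(1000, 1003):
    changed[i] ^= 0xFF  # flipping every bit guarantees the byte really differs
new = bytes(changed)

ops = make_delta(old, new)
added = sum(len(op[1]) for op in ops if op[0] == "add")
print(len(ops), "instructions,", added, "literal bytes stored")
```

Nothing here depends on line breaks or text encoding, which is why the same approach works for binary files. A real tool would use a much faster matching algorithm than `SequenceMatcher`, which is far too slow for large files.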

Binary deltas are a critical feature for some SCM tool users, especially in situations where
the binary files are large. Consider the case where a user checks out a 10 MB file, changes
a few bytes, and checks it back in. In CVS, the size of the repository will increase by 10
MB. In Subversion and Vault, the repository will only grow by a small amount.

Deltas and diffs are different

Please note that I make a distinction between the terms "delta" and "diff".

• A "delta" is the difference between two versions. If we have one full file and a delta,
  then we can construct the other full file. A delta is used primarily because it is
  smaller than the full file, not because it is useful for a human being to read. The
  purpose of a delta is efficiency. When deltas are done at the level of bytes instead of
  textual lines, that efficiency becomes available to all kinds of files, not just text files.
• A "diff" is the human-readable difference between two versions of a text file. It is
  usually line-oriented, but really cool visual diff tools can also highlight the specific
  characters on a line which differ. The purpose of a diff is to show a developer exactly
  what has changed between two versions of a file. Diffs are really useful for text files,
  because human beings tend to read text files. Most human beings don't read binary
  files, and human-readable diffs of binary files are similarly uninteresting.

As mentioned above, some SCM tools use binary deltas for repository storage or to improve
performance over slow network lines. However, those tools also support textual
diffs. Deltas and diffs serve two distinct purposes, both of which are important. It is merely
coincidence that some SCM tools use textual diffs as their repository deltas.
