
Garbage Collection Overview 2003-02-18 14:48:44 -0800

Garbage Collection Overview


Terence Parr

Introduction
For much of my career (prior to 1995, when Java appeared on the scene), debugging software was all
about answering a frustrating but critical question: "is there a problem with my algorithm, or have I
corrupted memory and, hence, the runtime system?" I traced problems down to memory
allocation/deallocation issues many, many times.

Fortunately, software can help us write software. Just as we use IDEs and debuggers to make us more
efficient, we can use automatic garbage collection (GC) to avoid dynamic memory programming errors and
increase productivity dramatically. Humans are not good at tracking garbage manually, and programmers
typically do lots of extra defensive copying just to resolve "who frees what" disputes.

Simply put, GC relieves the programmer from having to track and deallocate dynamic memory--you do
not have to write code to deallocate data structures. GC reduces cognitive load. Estimates put the cost of
automatic collection at roughly 10% of CPU time. An excellent trade.

The basic idea and terms

The core GC strategy is:

At some frequency, distinguish live from dead objects (tracing) and reclaim dead ones.

Live objects are reachable from roots: globals, locals, the registers of active methods, and so on.
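The live-set computation is just graph reachability from the roots. Here is a minimal sketch of that walk; the `Obj` class and its `refs` field are illustrative stand-ins, not anything from a real runtime:

```python
# Sketch: liveness as graph reachability from roots (illustrative types).
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []          # outgoing pointers to other objects

def trace_live(roots):
    """Return the set of objects reachable from the roots."""
    live, worklist = set(), list(roots)
    while worklist:
        obj = worklist.pop()
        if obj not in live:
            live.add(obj)
            worklist.extend(obj.refs)
    return live

# Heap: a -> b -> c, plus an unreachable d.
a, b, c, d = (Obj(n) for n in "abcd")
a.refs.append(b); b.refs.append(c)
live = trace_live([a])          # 'a' plays the role of a rooted variable
garbage = {a, b, c, d} - live
print(sorted(o.name for o in garbage))  # ['d']
```

Anything the walk never visits is, by definition, garbage.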

Analogy: Imagine walking to the refrigerator and getting out a bowl of grapes. You pick up the bunch by
the stem and look in the bottom of the bowl; there are a bunch of black and blue moldy grapes--that's
the "garbage". Anything not reachable from the stem has gone "bad." [This analogy is attributed to Randy
Nelson, now at Pixar, formerly of the Flying Karamazov Brothers].

Imagine a heap of dynamic memory for a running program. Certain variables point into the heap to your
data structures. When those variables no longer point at a data structure, nothing can reach it, so it's
garbage.

http://www.cs.usfca.edu/~parrt/course/652/lectures/gc.html Page 1
After garbage collecting, only the structures still reachable from those variables remain.

GC often implicitly assumes that the objects in memory are typed so you know how to find the pointers
within, say, an AST node. This does not mean that GC only works in interpreters--compiled languages
such as C++ can store runtime type information now.

Further, without type information you can still do conservative collection: if a value looks like a pointer,
assume it is one. That unfortunately leaves some objects around, because the collector (erroneously)
thinks it still has a pointer to them when the values in question are probably just integers. Java carries
lots of type information in its objects, as well as in the bytecodes that operate on data, and hence does
not have to be conservative in general. Conservative collection is bad in my opinion, since conservative
GC implies leaking memory.
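The pinning problem can be simulated in a few lines. This toy scan (illustrative, not a real collector) treats any stack word that happens to match a heap address as a pointer, which is exactly how an integer can keep an object alive:

```python
# Sketch: conservative scanning simulation (illustrative, not a real collector).
heap = {0x1000: "node A", 0x2000: "node B"}   # address -> object

def conservative_roots(words):
    """Treat any word that looks like a heap address as a pointer."""
    return {w for w in words if w in heap}

stack_words = [42, 0x2000, 7]   # 0x2000 is really just an integer here
pinned = conservative_roots(stack_words)
print(pinned)  # {8192}: "node B" is retained even with no real pointer to it
```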

There are two main user-perspective categories of GC: disruptive and nondisruptive (often called
pauseless). I can remember Smalltalk and LISP programs halting in the middle and saying
"Sorry...garbage collecting...". Ack! A disruptive GC is one that noticeably halts your program; it usually
means the collector is doing a full collection of a memory space, literally turning off your program for a bit.
If you interleave little bits of collection alongside the running program, you have an incremental collector.
Incremental collectors are a lot harder to implement because you must deal with the program altering
data structures currently being examined by the collector. If you are building a real-time system, however,
an incremental collector is pretty much a requirement.

Common Strategies
Reference Counting

The simplest GC strategy is reference counting, which adds a counter to every "object" in your system.
When you copy a reference to an object, you increment its count by one. When a reference to an object
goes away (such as when a local variable goes out of scope), the count is decremented by one. If, at that
point, the count reaches 0, the object is garbage: the counts of everything it points to are decremented in
turn, and the object is reclaimed (as are any other objects whose counts reach 0 during the count
update phase).
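The increment/decrement/cascade mechanics can be sketched in a few lines. This is a hand-rolled illustration (real systems do the counting in the runtime, not in user code; the `RCObject` name and helpers are made up for this example):

```python
# Sketch of manual reference counting (illustrative; real runtimes do this
# automatically on every pointer copy and scope exit).
class RCObject:
    def __init__(self):
        self.count = 0
        self.refs = []           # outgoing references to other objects

def incref(obj):
    obj.count += 1

def decref(obj, reclaimed):
    obj.count -= 1
    if obj.count == 0:           # dead: release everything it points to
        reclaimed.append(obj)
        for child in obj.refs:
            decref(child, reclaimed)

reclaimed = []
parent, child = RCObject(), RCObject()
incref(parent)                   # a local variable points at parent
parent.refs.append(child); incref(child)
decref(parent, reclaimed)        # the local goes out of scope
print(len(reclaimed))  # 2: parent dies, and the death cascades to child
```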

Reference counting is mostly nondisruptive, but can't handle cycles. A cycle looks like this:

and then, after you drop your reference to the object and decrement its count, the count is still greater
than 0 because an object inside the heap refers to it. These objects are no longer reachable from a root,
yet they will never be reclaimed.

Cycles occur often enough in practice: circular queues, doubly-linked trees, and so on.
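CPython makes a nice demonstration of this, because it uses reference counting but keeps a backup tracing collector (the `gc` module) precisely for cycles. Disabling the backup shows the leak; an explicit tracing pass then reclaims it:

```python
import gc
import weakref

# Sketch: a reference cycle defeats pure reference counting. We disable
# CPython's backup cycle detector, drop the roots, and observe the leak.
class Node:
    pass

a, b = Node(), Node()
a.partner, b.partner = b, a      # the cycle: each keeps the other alive
probe = weakref.ref(a)           # lets us observe when 'a' is truly reclaimed

gc.disable()                     # isolate pure refcounting behavior
del a, b                         # drop the root references
leaked = probe() is not None     # still alive: counts never reached zero
gc.collect()                     # a tracing pass finds the unreachable cycle
print(leaked, probe() is None)   # True True
```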

The cost is also high, since it is proportional to the amount of work the program does: every pointer copy
and every scope exit touches a count.

Disruptive, Stop-And-Collect Schemes


Mark-and-sweep collectors are two-phase collectors that first walk the live objects, marking them, and
then sweep through memory to find all the dead objects (i.e., anything that is not marked).

This is pretty easy to understand and build but badly fragments memory, wastes time walking dead
objects (assuming you don't have to run destructors), and has bad virtual memory characteristics.
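The two phases can be sketched over a toy heap. Note how `sweep` must visit every object, live or dead, which is exactly the "wastes time walking dead objects" cost (class and function names here are illustrative):

```python
# Sketch of a two-phase mark-and-sweep pass over a toy heap (illustrative).
class Cell:
    def __init__(self):
        self.marked = False
        self.refs = []

def mark(obj):
    """Phase 1: trace from a root, marking everything reachable."""
    if not obj.marked:
        obj.marked = True
        for child in obj.refs:
            mark(child)

def sweep(heap):
    """Phase 2: walk *every* object, freeing the unmarked ones."""
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False         # reset mark bits for the next collection
    return live

x, y, z = Cell(), Cell(), Cell()
x.refs.append(y)                 # z is unreachable
heap = [x, y, z]
mark(x)                          # x stands in for a root
heap = sweep(heap)
print(len(heap))  # 2
```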

An improvement on this strategy is called mark and compact because it walks memory moving all live
objects to the front of the heap, thus, leaving a nice big contiguous block of free memory afterwards.
This removes fragmentation concerns and helps locality for virtual memory because objects are kept in
the same order in which they were allocated. You still have to walk a lot of memory, however, and you
must update pointers since you are moving things around.

If instead of moving live objects to the head of a single heap, you copy live objects to another memory
space, you have a copying collector. You only have to walk the live objects (updating their pointers as you
move them)--anything left in the old space is garbage. The term scavenging is often used to refer to this
process. This has the advantages of the mark and compact algorithm but is more efficient.
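A copying collector can be sketched in the style of Cheney's breadth-first scan: copy the roots into to-space, then scan to-space, copying each object's children as you encounter them. Forwarding is modeled here with an attribute; all names are illustrative:

```python
# Sketch of a semispace copying collector (Cheney-style breadth-first scan;
# illustrative names, forwarding modeled as an attribute on the old copy).
class Box:
    def __init__(self, value):
        self.value = value
        self.refs = []
        self.forward = None      # set once the object is copied to to-space

def copy_collect(roots):
    to_space, scan = [], 0
    def copy(obj):
        if obj.forward is None:          # not yet evacuated
            clone = Box(obj.value)
            clone.refs = list(obj.refs)  # old pointers, fixed up below
            obj.forward = clone
            to_space.append(clone)
        return obj.forward
    new_roots = [copy(r) for r in roots]
    while scan < len(to_space):          # scan to-space, copying children
        obj = to_space[scan]
        obj.refs = [copy(c) for c in obj.refs]
        scan += 1
    return new_roots, to_space

p, q, junk = Box("p"), Box("q"), Box("junk")
p.refs.append(q)                 # junk is unreachable and is never touched
(root,), heap = copy_collect([p])
print(root.value, len(heap))  # p 2
```

Note that `junk` is simply never visited; abandoning from-space wholesale is what makes this cheaper than sweeping.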

Note that if you have a finalize() method (a destructor), you must walk the garbage to run finalizers,
even when your strategy does not otherwise require it.

Allocation for any copying collector is fast because you just have to bump a pointer in the heap; all free
memory is contiguous after collection.

Nondisruptive, Generational Schemes

The disruptive stop-and-collect schemes are so disruptive because they have so much work to do--they
must deal with the entire heap. If your heap is 600M, then it has lots to do.

Observation: most objects live only a short time while some tend to live a long time (think about System
object in Java).

A generational collector takes advantage of this observation by maintaining a "younger" and an "older"
generation. Objects that survive a few "generations" (i.e., collection runs) are moved to the "older"
generation, which reduces the number of live objects the collector must traverse in the younger
generation.

Note the similarity to the copying algorithm; here, though, the "when to copy" policy is very
different. A generational collector moves objects to another generation only once they have survived a
few collections, whereas a plain copying collector copies all live objects upon each activation, thus never
significantly reducing its future workload. Also, there may be many generations, not just the two spaces
of a semispace copying scheme.
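The promotion policy can be sketched as an age counter per object: survive the young-generation collection a few times and you graduate. The threshold, the `live_names` stand-in for a real trace, and all names here are illustrative:

```python
# Sketch of generational promotion (threshold and names are illustrative).
PROMOTE_AFTER = 2                # survive this many collections, then move up

class GCObject:
    def __init__(self, name):
        self.name = name
        self.age = 0

def collect_young(young, old, live_names):
    """One young-generation collection: age survivors, promote the old-timers."""
    survivors = []
    for obj in young:
        if obj.name in live_names:       # stand-in for a real reachability trace
            obj.age += 1
            (old if obj.age >= PROMOTE_AFTER else survivors).append(obj)
    return survivors                     # dead objects are simply dropped

young, old = [GCObject("a"), GCObject("b")], []
for _ in range(2):                       # 'a' stays live across both runs
    young = collect_young(young, old, live_names={"a"})
print(len(young), [o.name for o in old])  # 0 ['a']
```

Short-lived `b` dies in the first pass; long-lived `a` is promoted out of the collector's way.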

In the end, even generational schemes must stop-and-collect the older generations. Hopefully this can
be hidden from the user such as when the system is waiting for user input. Java does this by making the
collector the lowest priority thread. When nothing else is running, the collector starts up (and hopefully
finishes).

Nondisruptive, Incremental Tracing Collectors

If you must avoid stopping to do collection, you can interleave collection with the running program. Such
an incremental collector faces a concurrency problem with the mutator (the program) because while it's
looking for (tracing) live objects, the mutator can rearrange the objects.

One means of dealing with this concurrency is a write barrier, which tracks pointer writes. The write
barrier can record writes of pointers into objects and re-traverse some objects later. This allows writes
to occur while the collector is operating. Reads are obviously not an issue here.
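A write barrier is a hook on every pointer store. This sketch abuses Python's `__setattr__` to model it, logging the mutated object in a "remembered" set the collector can re-scan (all names are illustrative):

```python
# Sketch of a write barrier: every pointer store runs through a hook that
# records the mutated object so the collector can re-scan it later.
remembered = set()               # objects whose fields changed mid-collection

class Barriered:
    def __init__(self, name):
        object.__setattr__(self, "name", name)   # bypass barrier during init

    def __setattr__(self, field, value):
        remembered.add(self.name)                # the barrier: log, then store
        object.__setattr__(self, field, value)

n = Barriered("n1")
n.child = Barriered("n2")        # mutator stores a pointer during tracing
print(sorted(remembered))  # ['n1']
```

Real collectors implement this in the compiler or via page protection rather than in user code, but the shape is the same: trap the write, remember the object.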

Incremental collectors can be combined with generational schemes for added efficiency. Hard real-time
systems use incremental collectors--you need fixed-cost GC operations.

Java
Pre-JDK 1.2, Java had a conservative mark-and-sweep collector. Since JDK 1.2 / HotSpot, Java has used a
generational scheme with the "train algorithm" for fast GC of the mature generation. I find that, for very
little CPU cost, the -Xincgc incremental collector is very nondisruptive. The latest Java collector has a
very fast allocator as well.
