Quines (Self-Replicating Programs)

13/3/2014 Quines (self-replicating programs)
http://www.madore.org/~david/computers/quine.html#rmk_miscode 1/20
Table of contents
What is a quine? What is this page?
Introduction
A first attempt and example
Principles for writing a quine
A second example: added clarity
The fixed-point theorem
Multi-quines: making use of introns
Bootstrapping: recovering the code from the data
Recapitulation
Self-interpretation: using data as code
Conclusion
Links related to quines
What is a quine? What is this page?
A quine (or selfrep) is a computer program which prints its own listing. This may
sound either impossible, or trivial, or completely uninteresting, depending on your
temper and your knowledge of computer science. Actually, it is possible, and there are
some interesting ideas involved (in particular, writing a quine is not a hack that only
works because the programming language has certain nice properties: it is a
consequence of the general so-called fixed-point theorem, itself an instance of
Cantor's ubiquitous diagonal argument).
Quines are so named after the American mathematician and logician Willard van
Orman Quine (1908/06/25 to 2000/12/25) who introduced the concept. This page is
dedicated to his memory.
I also dedicate this page to Douglas R. Hofstadter, who coined the name (in his justly
famous book Gödel, Escher, Bach) and who so clearly explained quines' importance and
their relation with Gödel's incompleteness theorem.
Introduction
A quine is a program which prints its own listing. This means that when the program
is run, it must print out precisely those instructions which the programmer wrote as
part of the program (including, of course, the instructions that do the printing, and the
data used in the printing).
The easiest way to do that, of course, is to seek the source file on the disk, open it, and
print its contents. That may be done, but it is considered cheating; besides, the
program might not know where the source file is, it may have access to only the
compiled data, or the programming language may simply forbid that sort of
operation.
The interesting thing is that writing a quine does not depend on any kind of hack such
as being able to read a source file, or even being able to represent quotes in several
different ways. Any programming language which is Turing complete, and which is
able to output any string (by a computable function of the string as program; this is
a technical condition that is satisfied in every programming language in existence) has
a quine program (and, in fact, infinitely many quine programs, and many similar
curiosities), as follows from the fixed-point theorem. Moreover, the fixed-point theorem is
constructive, so the construction of the quine is merely a matter of patience, not
guesswork (or intelligence as some prefer to call it ;-). This is not to imply, of course,
that actually writing a short or interesting quine may not demand a lot of cleverness.
Still, it says that there is nothing magical behind quines; and also nothing says that
they have to be obfuscated, difficult to read, or devoid of comments, as they often are.
A first attempt and example
We try writing a quine in C. We choose C because it is widely known, and also because
the printf() function has features which will make writing a quine considerably easier
(this is a mixed blessing: it is a gain because it makes the quine smaller, but it also
makes it noticeably more obscure and hackish).
We will want the quine to be correct C code, so it will probably have to begin
something like this:
#include <stdio.h>
int
main (void)
{
The first thing we want to do is print everything that precedes. Naïvely, we could write:
printf("#include <stdio.h>\n\nint\nmain (void)\n{\n");
Then we need to print this line itself:
printf("printf(\"#include <stdio.h>\\n\\nint\\nmain (void)\\n{\\n\");\n");
And so on. It should be obvious that this is not going to work (except if we intend to
produce a quine of infinite length, which we do not).
This is the sort of reasoning which makes some people believe that quines don't exist.
The problem is that we need to print something, so we use a character string (say s) to
print it, and then we need to print s itself, so we use another character string, and so
on...
But wait! If we intend to print s, we don't need another string: we can use s itself. So let's
give it another try:
char *s="#include <stdio.h>\n\nint\nmain (void)\n{\n";
printf(s); printf("char *s=\"%s\";\n",s);
Well, it still doesn't work. But we have introduced one of the central ideas in quine-
writing lore: whereas it is probably necessary to use some data to represent the code to
be printed, on the other hand it is possible to reuse these data to print the data
themselves. Here we're still a bit naïve: we're using s as it stands, but that won't work
because it contains some backslashes; these would need to be further backslashified.
So we have two paths before us: the King's way is to proceed with backslashification,
which will work because this is a computable process. However, since we are writing
in C, we choose a shortcut which uses the nice properties of the printf function:
char *s1="#include <stdio.h>%c%cint%cmain (void)%c{%c";
char *s2=" char *s1=%c%s%c;%c char *s2=%c%s%c;%c";
char n='\n', q='"';
printf(s1,n,n,n,n,n);
printf(s2,q,s1,q,n,q,s2,q,n);
This is a partial quine: it prints the beginning of its own listing (something in no way
remarkable, since any program which doesn't print anything is a partial quine).
Here we have passed the catching-up point: by this I mean that the program data
printed includes the data representation itself. It is then generally trivial to complete
the quine (here, things are still a bit tricky because we've been doing things in a more
or less ad hoc manner, and some of the data are actually hidden in the printf()
statements). Nevertheless, it is not very difficult to finish:
#include <stdio.h>
int
main (void)
{
char *s1="#include <stdio.h>%c%cint%cmain (void)%c{%c";
char *s2=" char *s%c=%c%s%c;%c char *s%c=%c%s%c;%c";
char *s3=" char n='%cn', q='%c', b='%c%c';%c";
char *sp=" printf(";
char *s4="%ss1,n,n,n,n,n);%c";
char *s5="%ss2,'1',q,s1,q,n,'2',q,s2,q,n);%ss2,'3',q,s3,q,n,'p',q,sp,q,n);%c";
char *s6="%ss2,'4',q,s4,q,n,'5',q,s5,q,n);%ss2,'6',q,s6,q,n,'7',q,s7,q,n);%c";
char *s7="%ss2,'8',q,s8,q,n,'9',q,s9,q,n);%ss2,'0',q,s0,q,n,'x',q,sx,q,n);%c";
char *s8="%ss3,b,q,b,b,n);%ss4,sp,n);%ss5,sp,sp,n);%c";
char *s9="%ss6,sp,sp,n);%ss7,sp,sp,n);%ss8,sp,sp,sp,n);%c";
char *s0="%ss9,sp,sp,sp,n);%ss0,sp,sp,n,n,n);%c return 0;%c}%c";
char *sx="--- This is an intron. ---";
char n='\n', q='"', b='\\';
printf(s1,n,n,n,n,n);
printf(s2,'1',q,s1,q,n,'2',q,s2,q,n); printf(s2,'3',q,s3,q,n,'p',q,sp,q,n);
printf(s2,'4',q,s4,q,n,'5',q,s5,q,n); printf(s2,'6',q,s6,q,n,'7',q,s7,q,n);
printf(s2,'8',q,s8,q,n,'9',q,s9,q,n); printf(s2,'0',q,s0,q,n,'x',q,sx,q,n);
printf(s3,b,q,b,b,n); printf(s4,sp,n); printf(s5,sp,sp,n);
printf(s6,sp,sp,n); printf(s7,sp,sp,n); printf(s8,sp,sp,sp,n);
printf(s9,sp,sp,sp,n); printf(s0,sp,sp,n,n,n);
return 0;
}
Here we have a real quine (if you find it obscure, do not worry, much clearer examples
will be given further below). Note the use of the s2 string to print several lines modeled
on the same pattern. Also note how the backslash required no special treatment. And
note the sx string, which goes to show that the classical belief that everything in a
quine must be doubled is false (the meaning of the term intron, which comes from
molecular biology, will be made clearer below).
This quine is intermediate in elegance: on the one hand it does not assume that the
computer is using an ASCII character set (you see a lot of C quines which use the fact
that double quotes have ASCII code 34 and that line feed has code 10), it is valid ANSI
C (with a warning, however, owing to the fact that I should have written const char *
rather than just char *; this is much better than many quines which omit the return
0 at the end or similar things), and the longest lines are just 80 characters (often
quines have terribly long lines). On the other hand, the formatting is inelegant: don't
conclude from the above example that quines need be so badly presented. Also,
nothing says you can't have comments within quines. We will give much more elegant
examples later.
Principles for writing a quine
The basic idea is this:
It is impossible (in most programming languages) for a program to
manipulate itself (i.e. its textual representation or a representation from
which its textual representation can be easily derived) directly.
So to make this possible anyway, we build the program from two
parts, one which we call the code and one which we call the data. The data
represents (the textual form of) the code, and it is derived in an algorithmic
way from it (mostly, by putting quotation marks around it, but sometimes
in a slightly more complicated way). The code uses the data to print the
code (which is easy because the data represents the code); then it uses the
data to print the data (which is possible because the data is obtained by an
algorithmic transformation from the code).
This idea is summarized by the sentence "quine 'quine'". Here, the verb "to quine"
(invented by Douglas R. Hofstadter) means "to write (a sentence fragment) a first
time, and then to write it a second time, but with quotation marks around it" (for
example, if we quine "say", we get "say 'say'"). Thus, if we quine "quine", we get
"quine 'quine'", so that the sentence "quine 'quine'" is a quine. In this linguistic
analogy, the verb "to quine" plays the role of the code, and "quine" in quotation
marks plays the role of the data.
We will henceforth use the words "code" and "data" a lot, to designate the code and
data parts of the quine as just explained.
If we are to take an analogy with cellular biology (thanks to Douglas Hofstadter again),
what I have called the code would be the cell, and the data would be the cell's
DNA: the cell is able to create a new cell using the DNA, and this involves, among
other things, replicating the DNA itself. So the DNA (the data) contains all the
necessary information for the replication, but without the cell (the code), or at least
some other code to make the data live, it is a useless, inert, piece of data.
Note how the data may contain (depending on how it's interpreted) bits that aren't
used to write the code, but are still copied when the data is written on the output. Such
bits are called introns, in analogy with the parts of the genetic code which aren't used to
produce proteins. The example we gave above had an intron (the string sx), clearly
marked as such. Quite obviously an intron can be modified with great ease; it is a kind
of subliminal information that is reproduced with the quine, although it is not
necessary to the quine. The possible existence of introns will be the key feature making
multi-quines (something we will talk about later) possible.
One word of warning: this code/data distinction in quines is pleasant and often
helpful. It is not, however, completely valid in all circumstances. Sometimes the code
and the data are not well distinguished, sometimes part of the code plays a data role, or
vice versa. Some quines are far beyond my own modest understanding and beyond
my feeble attempts at classification and order. As in all things, caveat emptor. See this
remark later in the text, however.
A second example: added clarity
We now use the principles outlined above to construct another quine, one which will be
more elegant in its formatting (but a bit less portable because we will assume an ASCII
coding of characters).
This time, we gather all the data in one place, one array containing the ASCII values of
the characters making up the code, and we place this array at the beginning of the
program. The code will use the array to first print the array (by printing it as a list of
hexadecimal integers with a proper formatting) and then print the code (by converting
the ASCII values to characters).
This is completely straightforward, and while this quine is far from the shortest, I
think it is the clearest I have ever seen:
/* See comments below */
const unsigned char data[] = {
/* 000000 */ 0x2f, 0x2a, 0x20, 0x54, 0x68, 0x69, 0x73, 0x20,
/* 0x0008 */ 0x69, 0x73, 0x20, 0x61, 0x20, 0x73, 0x65, 0x6c,
/* 0x0010 */ 0x66, 0x72, 0x65, 0x70, 0x20, 0x28, 0x71, 0x75,
/* 0x0018 */ 0x69, 0x6e, 0x65, 0x29, 0x20, 0x70, 0x72, 0x6f,
/* Several lines snipped. See the original file for a complete listing. */
/* 0x02c0 */ 0x20, 0x28, 0x64, 0x61, 0x74, 0x61, 0x5b, 0x69,
/* 0x02c8 */ 0x5d, 0x29, 0x3b, 0x0a, 0x20, 0x20, 0x72, 0x65,
/* 0x02d0 */ 0x74, 0x75, 0x72, 0x6e, 0x20, 0x30, 0x3b, 0x0a,
/* 0x02d8 */ 0x7d, 0x0a,
};
/* This is a selfrep (quine) program. It uses the above data (which
* is no other than the ASCII representation of everything starting
* from this comment) to print its own listing. */
#include <stdio.h>
int
main (void)
/* The main program. We output the data in the format used at
* the top of this file, and then we use it to generate the rest
* of this file. */
{
unsigned int i;
printf ("/* See comments below */\n\n");
printf ("const unsigned char data[] = {");
for ( i=0 ; i<sizeof(data) ; i++ )
{
if ( i%8 == 0 )
printf ("\n/* %0#6x */",i);
printf (" %0#4x,", data[i]);
}
printf ("\n};\n\n");
for ( i=0 ; i<sizeof(data) ; i++ )
putchar (data[i]);
return 0;
}
This should make it obvious that there is nothing difficult at all in writing quines. In
fact this is the sort of quine we obtain by directly applying the fixed-point theorem.
As mentioned, the code contains two parts: that which copies the data (the nine lines
following the blank one in the main() function) and that which uses the data to copy
the code (the next two lines).
Naturally, the coding of the data might be much more complex than a straightforward
ASCII encoding. We will return to that subject. Also note that here there are no introns,
because ASCII does not permit this (there are no comments or any such things).
However, we could trivially add an intron: create a new array, const unsigned char
intron[], say, put whatever data we want in it, and use the same printing routines for
intron[] as we did for data[] (of course, we need to modify the code, hence the data
also, to do this, but once it is done, we can put anything in the intron without
modifying anything).
Another point is to be noted: in what precedes I have omitted a great many lines from
the data. Had I not given a pointer to the original file, could you have reconstructed the
data? Evidently, yes, and without much difficulty: just take the code, take the ASCII
value of each character, and tabulate them. This violates the so-called Central Dogma,
stating that the data must be used (by the code) to deduce (i.e. to print) the code, but
not the converse. In practice, though, there is nothing wrong with violating the Central
Dogma: in fact, you can guess that I wrote the program by first writing the code and
then calculating the data; however, introns cannot be reconstructed in that way (since
the very point about introns is that all possible data will work).
What if a part of the code had been missing? Then things are much better off. For
example, if the comments had been gobbled, running the program itself would have
restored them (from their value encoded in the data). Even without any code at all, you
would probably have guessed that the data was the ASCII representation of something
and been able to restore the something in question. But see the section on
bootstrapping for more about this.
The fixed-point theorem
I have mentioned the fixed-point theorem and stated that it is at the heart of the
existence of quines. I will now explain what this theorem states.
(Note that this is just one of very many fixed-point theorems abounding in
mathematics. This has nothing to do, for example, with Brouwer's fixed-point theorem.
I don't know that any specific name is attached to this one, but I suspect it would be
something like Kleene's fixed-point theorem.)
I assume no familiarity with the theory of computability. However, it will help: if you
are not familiar with it, what I am going to say may sound a bit vague (but read it
anyway, because you probably will grasp the idea even if the details are obscure).
Before I can state (and prove) the fixed-point theorem, I will recall some basic notions:
A computable (or general recursive) function (of several integer variables, and
with integer values) is one which is calculated by some program (i.e. some Turing
machine operating on the variables as input, i.e. some algorithm, by whatever
definition you want to take of the word algorithm, since by the Church-Turing
thesis they're all equivalent). By a partial function we mean one which is not
necessarily defined on all possible values of the input variables. By a total
function we mean one which is.
We choose some numbering of programs. Any (primitive recursive) numbering
will do (e.g. associating with a program the value obtained by considering the
program as the binary representation of some integer). We write $\varphi_n(\ldots)$ for the
result of the n-th program when fed the input represented by the ellipsis. In
particular, any computable function is equal to $\varphi_n$ for some n. We will not
distinguish a program from its associated number.
The universality theorem states that the (partial) function $\varphi_n$, considered as a
function of n plus its other values, is itself computable. In other words, there is a
u (a universal Turing machine if you want, or, in simpler terms, an interpreter) such
that $\varphi_u(n,\ldots)=\varphi_n(\ldots)$. So, effectively, this means that you can construct a program
u that will take a program n and some arguments, and return the value of n
applied to the arguments in question. This means that u is an interpreter, which
takes a program and interprets it, so the universality theorem merely states the
existence of an interpreter (of the programming language considered, written in
the programming language considered). The universality theorem is a
consequence of the Church-Turing thesis, i.e. our belief that we have grasped all
notions of computability.
The smn theorem is essentially the converse of the universality theorem. It states
that if $g(\ldots)=\varphi_n(\ldots)$ is computable then for any x the function $g(x,\ldots)=\varphi_n(x,\ldots)$
(obtained by fixing the value of one of the input parameters) is computable. So
there exists a (computable, total, and in fact primitive recursive) function s such
that $\varphi_{s(n,x)}(\ldots)=\varphi_n(x,\ldots)$. This is really a triviality: it states that if you have a
program n taking some input, you can (for every x) construct a program s(n,x)
that will act as n except that it takes x as input; moreover, this other program is
derived algorithmically from the first. So, in effect, you can substitute a value for
an input in a program.
Using the smn theorem and the universality theorem we can prove the fixed-point
theorem. This states that for any computable total function h there exists an index (a
program) n such that $\varphi_n(\ldots)=\varphi_{h(n)}(\ldots)$.
In plain English, this means that if you have any algorithmic transformation h on
programs then there exists a program n such that the program n does the same thing
as the program h(n) resulting from the transformation. We will explain this with further
examples in a second, but first we prove the statement.
For a given program t, we consider the program s(t,t) (given by the smn
theorem). Essentially, s(t,t) performs what t does when it is fed itself as
input. We further consider the program h(s(t,t)) which results from the
transformation h applied to s(t,t). Now by the universality theorem, there
exists an index m such that $\varphi_m(t,\ldots)=\varphi_{h(s(t,t))}(\ldots)$. In other words, there is a
program m which takes a program t as input, and performs what the
program h(s(t,t)) does. Then I claim that the program n=s(m,m) is the
desired fixed point. Indeed, $\varphi_n(\ldots)=\varphi_{s(m,m)}(\ldots)$. But by definition of s, this is
$\varphi_m(m,\ldots)$, which in turn, by definition of m, is
$\varphi_{h(s(m,m))}(\ldots)=\varphi_{h(n)}(\ldots)$, quod erat demonstrandum.
To summarize the proof, we have taken the program m which, given a program t,
interprets the program resulting from applying the given transformation h to t acting on
itself, and we have applied that program to itself.
How does the fixed-point theorem prove the existence of quines? This is very simple:
for a given program t, consider the program h(t) that prints the listing of t. Obviously
this h is computable. Now the fixed-point theorem tells us that there is a program n
such that h(n) and n do the same thing, i.e. printing the listing of n. So n prints the
listing of n.
In practice, how do we construct n? Well, the proof of the fixed-point theorem answers
this question as well. Since the proof used the universality theorem, it may seem like
we need to construct an interpreter to apply the theorem. In fact, we need not: indeed,
if you look closely at the proof, you will see that we used the universality theorem only
for programs of the form $h(\ldots)$, so that we need only construct an interpreter for those
programs; for our particular choice of h, this is trivial.
So consider a program t, taking an argument. We will assume that this argument is
given as a variable data to be linked with the program. Then s(t,t) is the program
obtained by setting this variable data to the textual value of the program (as a string,
say, or as whatever coding we have chosen). Our program m takes an argument t (in
the form of the data variable) and performs what h(s(t,t)) does, i.e. it prints the listing
of s(t,t), which is none other than the listing of t with a definition of the variable data to
be the text of t. And finally for our program n we take s(m,m), that is, we take this
program m and link the variable data to be the text of the program. Quite evidently,
this is precisely what we have been doing.
The fixed-point theorem has other amusing applications. Essentially, its intuitive (and
effective) content is that a program may use its own source as a variable, i.e. adding to
a programming language the ability for a program to manipulate itself (its source
code) does not add to its expressive power. So there exists a program that compresses
its own listing; there exists one which prints its own MD5 checksum (this is much
easier than finding a program, or indeed any file, that contains its MD5 checksum;
still, someone I know thought it was impossible except by brute force (how rude!),
so I wrote such a program and won a bet like that); there exists a program that prints a
second, different, program, that prints the first one again (here, h(t) would merely be a
program that prints a lot of print calls for the various lines of t's listing); and so on.
(A passing note, which you may find a bit difficult to understand if you're not used to computability
theory.) A different, perhaps more satisfactory, way of stating the fixed-point theorem would be to
eliminate the universality theorem from it, and to say: for every computable function k there exists an n
such that $\varphi_n(\ldots)=k(n,\ldots)$. This corresponds more precisely to the intuitive content we have described. It
is proved without the use of the universality theorem, using only the smn theorem (for the actual proof,
take the proof we have just given, and replace $\varphi_{h(x)}(\ldots)$ by $k(x,\ldots)$ everywhere). The advantage of
formulating things like this is we see that it also works for primitive recursive functions (which satisfy
smn but not universality), so in effect a primitive recursive function can also make use of its own
number. By applying the universality theorem (the function $\varphi_{h(x)}(\ldots)$ is computable, so we can call it
$k(x,\ldots)$) we recover the fixed-point theorem as we have stated it. The examples we have given of the
fixed-point theorem actually use the more restrictive (non-universal) form we have just stated. The following
examples will use universality (and don't work for primitive recursive functions, which is clear because
primitive recursive functions always terminate).
There also exists a program that interprets its own listing: we will return to this. Also, if
we take for h the function which to a program x associates the program which
calculates what x does, and, at the end (provided x terminates, of course) adds 1, we
would have a program x which does the same thing as running x and adding 1 to the
result, and that is only possible if x does not terminate, so that the fixed-point theorem
also proves the existence of an endless loop.
Exercise: Louis Reasoner believes that the fixed-point theorem proves the existence of polyglot programs
(i.e. programs that are valid and do the same thing in several different programming languages). His
argument is this: for a given program t (in a first programming language) consider a translation of t in
a second programming language, and interpret this program literally in the first language, giving h(t).
By the fixed-point theorem, there exists n such that h(n) and n have the same effect, i.e. the text of the
program h(n) has the same effect in the first language (that is h(n)) and in the second (that is n). What
do you think of this argument?
Answer to the exercise (in rot13): Ybhvf vf rffragvnyyl pbeerpg, ohg gurer vf abguvat cebsbhaq urer.
Gurer vf n uvqqra nffhzcgvba, anzryl gung gur frpbaq ynathntr vf noyr gb vagrecerg nal cebtenz
gung vg vf srq: gurer vf ab jnl gb erfgevpg gb inyvq cebtenzf (naq pregnvayl vs gur svefg ynathntr
npprcgf bayl cebtenzf ortvaavat jvgu na N naq gur frpbaq ynathntr bayl cebtenzf ortvaavat jvgu n O, jr
jbhyq unir n uneq gvzr svaqvat n cbyltybg). Abgvpr gung gur frpbaq ynathntr qbrfa'g rira unir gb or
Ghevat-pbzcyrgr. Fhowrpg gb gur vagrecergngvba tvira nobir bs gur svkrq-cbvag gurberz, jung
Ybhvf' nethzrag nzbhagf gb vf guvf: gur cebtenz (jevggra va gur svefg ynathntr) jvyy eha na
vagrecergre bs gur frpbaq ynathntr ba vgf bja fbhepr pbqr (fbzrguvat jr pna qb gunaxf gb gur svkrq-
cbvag gurberz); abj rivqragyl fhpu n cebtenz qbrf gur fnzr guvat va obgu ynathntrf, anzryl vagrecerg
gur fbhepr pbqr va gur frpbaq ynathntr. Guvf vf abg irel hfrshy sbe pbafgehpgvat n P/Crey cbyltybg
sbe rknzcyr!
The fixed-point theorem gives a different point of view on quines from the one we have
given so far. The ideas we have already expressed, notably the code/data dichotomy,
are perhaps not very clearly apparent. Still, they are present: we should consider the s
function from the smn theorem as a means of adding data to a program (which would
otherwise receive this data as an input), so the expression s(m,m) which we have seen
says, in effect, "add to the program m (the code) a representation of the program m itself
(the data)". Introns can exist because the function s is free to add extra data to the data
required of it, if it wants.
Multi-quines: making use of introns
We start by saying what a bi-quine (or more generally a multi-quine) is. To begin, here
is what it is not: a bi-quine is not a program which prints a second program, which in
turn prints the first again (actually, it is that, but things are a bit more subtle). This is
too easy to do (we have proved the existence of such using the fixed-point theorem):
one program is almost a quine, and the other is merely a sequence of calls to print the
code of the other one.
A multi-quine is also not a polyglot quine (a quine that can be read, and is a quine, in
several different languages). True, polyglot quines actually are multi-quines if you
think about it carefully (the converse is not true), but polyglot quines don't exist for every
combination of programming languages (although it is true that some people have
been incredibly smart at constructing them) whereas multi-quines do: polyglot
quines are a hack whereas multi-quines are a general phenomenon.
A bi-quine is a very interesting kind of program: when run normally, it is a quine. But if
it is called with a particular command line argument, it will print a different program, its
brother. Its brother is also a quine, but in a different programming language, so its
brother prints its own listing when run normally. But when run with a particular
command line argument, the brother prints the listing of the original program. So in
effect, a bi-quine is a set of two programs each of which is able to print either of the
two. More generally, a multi-quine is a set of r different programs (in r different
languages; without this condition we could take them all equal to a single quine),
each of which is able to print any of the r programs (including itself) according to the
command line argument it is passed. (Note that cheating is not allowed: the command
line arguments must not be too long: passing the full text of a program is considered
cheating ;-).
There are several ways to prove the existence of multi-quines using fixed-point theorems. Here is one
(we leave it to the reader to fill in the missing details). We just consider the case of a bi-quine, i.e. r=2.
We consider, in language 1, a program of two parameters that will normally print the first, but that
will print the second if a special argument is passed to it. By the fixed-point theorem, we can assume
that the first text is its own listing, so that we get a program of one parameter that will print its own
listing except that it will print the parameter if called with a special argument. Do the same for
language 2. We now have two programs. Substitute one in the other: there is a program, of one
parameter, in language 1, that will print its own listing, except when it is called with a special
argument, in which case it will print a program, in language 2, which prints its own listing except
when it is called with a special argument, in which case it will print the initial parameter (passed to the
first program). Finally, apply the fixed-point theorem to that. Voilà, we have the bi-quine.
So, to create multi-quines, we make use of introns (following, essentially, the proof
given just above). We have r programs, so r code sets (one in each language); besides,
each of the r programs has, in addition to its code set, r data sets, one representing each
of the r code sets (so r-1 of the data sets are introns as far as the quine structure goes)
in a given coding (in principle it would be possible for each of the $r^2$ data sets to use a
different coding, but there is no reason to use a different coding for various data sets
in the same program, and even between programs it is reasonable to use more or less
similar codings, at least insofar as the programming languages allow this). When
program i (running code set i in language i) is asked to produce the listing of program
j, it will use its j-th data set to produce the j-th code set, and then it will use all of its r
data sets to produce the r data sets of program j (coded in the same or in a similar
way).
In practice, we write a quine program similar, say, to the second example we have
given on this page, to which we add an intron. Using this intron, the quine is able,
when passed a particular parameter, to produce a representation (valid in the second
programming language) of the two data sets (the actual data of the quine and the
intron) followed by some data specified by the intron. Then we do the same in the
other programming language, with the data representation we have elected to produce
(and the second program, when passed the special argument, must produce the same data
representation as we have used in the first program). Finally, we synchronize the
introns: we use the intron of the first program to represent the code of the second
program and the intron of the second to represent the code of the first. (Remember, the
nice thing about introns is that we can change them after the quine has been written,
without removing its quinishness.)
If you would feel more comfortable with an example, I have written a C/Perl bi-quine.
(For fun, I only give out the C version: if you want the Perl version you will have to run
the program with the magic word as argument.) In the C version, c_data is the main
data set and perl_data is an intron; in the Perl version, of course, things are reversed.
(The coding is not quite the same, also, although both are hexadecimal.)
Bootstrapping: recovering the code from the data
As we have already explained and illustrated, a quine is basically a bunch of data, plus
an active part, the code, which reads the data twice: once to reproduce the data, and
once to reproduce the code; the data represents the code, and the code interprets that
representation and recovers the code. There are two parts in the code: that which uses
the data to copy the data and that which uses the data to copy the code.
Now what if we are given only the data part of the quine? In the analogy I have given
with cellular biology, this is the equivalent of having the DNA (the genetic code; the
terminology is unfortunate, because what the biologists, quite reasonably, call the
code is what I have been calling the data, ugh) and wanting to reconstruct a cell.
Well, it is a matter of how difficult the coding (another word to beware) is. If I give you
the following quine fragment (the data part):
const char data [] =
"#include <stdio.h>\n\nint\nmain (void)\n{\n unsigned int i;\n\n p"
"rintf (\"const char data [] =\");\n for ( i=0 ; data[i] ; i++ "
")\n {\n if ( i%60 == 0 )\n\tprintf (\"\\n\\\"\");\n switc"
"h ( data[i] )\n\t{\n\tcase '\\\\':\n\tcase '\"':\n\t printf (\"\\\\%c\", d"
"ata[i]);\n\t break;\n\tcase '\\n':\n\t printf (\"\\\\n\");\n\t break;\n"
"\tcase '\\t':\n\t printf (\"\\\\t\");\n\t break;\n\tdefault:\n\t printf"
" (\"%c\", data[i]);\n\t}\n if ( i%60 == 59 || !data[i+1] )\n\t"
"printf (\"\\\"\");\n }\n printf (\";\\n\\n\");\n for ( i=0 ; data["
"i] ; i++ )\n putchar (data[i]);\n return 0;\n}\n";
you probably won't have much trouble recovering the complete quine. This is because
the representation chosen here is completely trivial. We can proceed as follows: just
run the tiny instruction printf ("%s", data); on the above data and you get the code;
put the code and the data together, and you get a first program which is almost the
quine (it may differ in inessential factors, for example if you put the data after the code
rather than before); but this program will produce the original quine when run. This
process is called bootstrapping, and it is similar to the process of bootstrapping, say, a
C compiler (you start with an initial C compiler, which may be much simpler, much
less featureful, or much less efficient, than the C compiler you want to build, and you
run it on the sources of the desired C compiler, giving a first binary C compiler, which
you use a second time to recompile its own sources).
The possibility of bootstrapping means that to some extent quines are self-healing: if
the code is damaged but still able to use the data to recover the original code,
bootstrapping can be performed.
However, nothing says a quine must use a simple coding like ASCII. I have written a
quine that stores, in its data, a compressed (gzipped) representation of the code. This
means that whereas the code that uses the data to produce the data is trivial (it is the
same as that used in our previous example), on the other hand the code that uses the
data to produce the code is much more involved, because it must actually uncompress
the data. (The gzip format is very strange and very unpleasant to uncompress. I have
written a set of routines to decode it, which are included in the quine of course, and
which I put in the public domain if they can be useful to anyone.) Here, the gzip
program (plus a bit of interpreting the data as binary) could serve to bootstrap.
Similarly, if I give you the following piece of data:
const char data [] =
"#vapyhqr <fgqvb.u>\n\nvag\nznva (ibvq)\n{\n hafvtarq vag v;\n\n c"
"evags (\"pbafg pune qngn [] =\");\n sbe ( v=0 ; qngn[v] ; v++ "
")\n {\n vs ( v%60 == 0 )\n\tcevags (\"\\a\\\"\");\n fjvgp"
"u ( qngn[v] )\n\t{\n\tpnfr '\\\\':\n\tpnfr '\"':\n\t cevags (\"\\\\%p\", q"
"ngn[v]);\n\t oernx;\n\tpnfr '\\a':\n\t cevags (\"\\\\a\");\n\t oernx;\n"
"\tpnfr '\\g':\n\t cevags (\"\\\\g\");\n\t oernx;\n\tqrsnhyg:\n\t cevags"
" (\"%p\", qngn[v]);\n\t}\n vs ( v%60 == 59 || !qngn[v+1] )\n\t"
"cevags (\"\\\"\");\n }\n cevags (\";\\a\\a\");\n sbe ( v=0 ; qngn["
"v] ; v++ )\n {\n vs ( ( qngn[v] >= 'N' && qngn[v] < 'A"
"' )\n\t || ( qngn[v] >= 'n' && qngn[v] < 'a' ) )\n\tchgpune (q"
"ngn[v] + 13);\n ryfr vs ( ( qngn[v] >= 'A' && qngn[v] <="
" 'M' )\n\t\t|| ( qngn[v] >= 'a' && qngn[v] <= 'm' ) )\n\tchgpune "
"(qngn[v] - 13);\n ryfr\n\tchgpune (qngn[v]);\n }\n erghe"
"a 0;\n}\n";
you will have no trouble recovering the original program if you have a little bit of geek
culture, but you probably get my point anyway.
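For readers without that bit of culture: the coding is ROT13, where every letter is rotated by 13 places in the alphabet, so that applying it twice gives back the original. A minimal decoder might look like this (the names rot13_char and rot13_decode are invented for this sketch; the comparisons mirror the ones visible in the enciphered listing above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rotate one ASCII letter by 13 places; leave every other character alone. */
char rot13_char(char c)
{
    if ((c >= 'A' && c < 'N') || (c >= 'a' && c < 'n'))
        return (char)(c + 13);
    if ((c >= 'N' && c <= 'Z') || (c >= 'n' && c <= 'z'))
        return (char)(c - 13);
    return c;
}

/* Decode in into out, which must hold at least strlen(in) + 1 bytes. */
void rot13_decode(const char *in, char *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; in[i] != '\0'; i++)
        out[i] = rot13_char(in[i]);
    out[i] = '\0';
}
```

Running rot13_decode on the contents of the data array (after the compiler has processed the escape sequences) yields the code; for instance the first line, #vapyhqr <fgqvb.u>, becomes #include <stdio.h>.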
In fact, let us take an extreme example: I have written a quine that stores its code
enciphered with the blowfish cryptographic algorithm (by Bruce Schneier) in its data.
Of course, the key is part of the code (without the key, the data is useless). Moreover, I
have added an intron to the program, which is encrypted with the same key. When the
program is run with the magic word as argument, it deciphers (and prints) the intron
rather than printing its own listing. This has an amusing consequence: if the key is
removed from the listing, then practically nothing is missing from the code, and yet it
is impossible to bootstrap; even though we have most of the plain code, the complete
ciphered data and secret, we can't do much with it because all is locked by a key (and
blowfish is not known to be vulnerable to a known-plaintext attack). In fact, the
situation is even more ironic than that, since the key is present in the encrypted data: we
are, essentially, in the situation of someone locked outside his home with the key
inside.
(Note that in writing this quine I have implemented the blowfish encryption and decryption algorithm;
in fact, the quine contains the full functions, far more than are necessary for what it does. I put these
functions in the public domain: you can find them here without the quine part. Be careful: although I
am using this just for fun, this is nevertheless strong crypto. So be careful about your local crypto
laws.)
A point might be made here about the distinction between code and data: here I claim
that the key is part of the code and not the data. The difference is not so much in how
the key is used as in how it is stored. In fact, if the key is in the code (as in my quine)
the program's skeleton is basically this:
const unsigned char data[] = {
  /* Lots of encrypted data corresponding to everything starting
   * from the next comment. */
};
/* Code starts here */
/* Decryption routines omitted. */
const char key[] = "Foobar";
int
main (void)
{
  /* First use of the data: reproduce the data section itself. */
  printf ("const unsigned char data[] = {\n");
  pretty_hexadecimal_printout (data);
  printf ("};\n\n");
  /* Second use of the data: decipher it to reproduce the code. */
  decipher (key, data);
  return 0;
}
and as explained, if the key is removed, it is locked inside the house. However, if we
had some magical way of deciphering blowfish, we could recover the key (even if our
magical method did not let us do this a priori) because it is part of the code, so it is
stored among the encrypted data. On the other hand, if the key is data, the program
looks like this:
const char key[] = "Foobar";
const unsigned char data[] = {
  /* Lots of encrypted data corresponding to everything starting
   * from the next comment. */
};
/* Code starts here */
/* Decryption routines omitted. */
int
main (void)
{
  /* The key is printed in clear, like the rest of the data section. */
  printf ("const char key[] = \"%s\";\n\n", key);
  printf ("const unsigned char data[] = {\n");
  pretty_hexadecimal_printout (data);
  printf ("};\n\n");
  /* The deciphered code no longer contains a copy of the key. */
  decipher (key, data);
  return 0;
}
This may not appear very different, but it is. This time, there isn't a copy of the key
inside the house. The key is part of the data, it is the only part of the data that is
stored in clear. I think there is something to this idea of distinguishing the code and
data parts of a quine not by what they are used for but how they are printed.
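Both skeletons call a helper named pretty_hexadecimal_printout without showing it. As a rough idea of what such a helper does, here is a sketch under assumptions of my own choosing: the name hex_printout, the explicit length parameter, the layout of eight bytes per line and the output buffer are all made up for this illustration and are not taken from the actual quine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Tabulate len bytes as C hexadecimal initializers, eight per line,
 * writing the text into out (of capacity cap). */
void hex_printout(const unsigned char *data, size_t len, char *out, size_t cap)
{
    size_t n = 0;

    assert(data != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Indent at the start of a line, otherwise separate with a space. */
        const char *lead = (i % 8 == 0) ? "  " : " ";
        /* Break the line after eight bytes or after the last byte. */
        const char *tail = (i % 8 == 7 || i + 1 == len) ? "\n" : "";
        int m = snprintf(out + n, cap - n, "%s0x%02x,%s",
                         lead, (unsigned)data[i], tail);

        assert(m > 0 && (size_t)m < cap - n); /* buffer must be large enough */
        n += (size_t)m;
    }
}
```

The point to notice is that this routine never needs the key: the data section is reproduced from the data by a purely mechanical tabulation, whatever coding it holds.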
While it is true that some parts of the code can be recovered by a bootstrapping
process, on the other hand, the data can never be recovered in that way. Any part of a
quine which, if it is modified, does not change the program output (meaning that the
program output is still the original quine), is not data, it is code. (This applies, for
example, to the comments inside the data section of the program.) (Well, all right, I
guess there is room for discussion.)
However, the data contains parts of a different nature: when they are modified, the
output produced by the program is modified, but it remains a quine. Those are the
introns we have already much talked about. In a way, introns represent the exact
opposite of the principle of bootstrapping: in the case of bootstrapping, we hope that
after a certain number of iterations we will hit the original program again; but if we
modify an intron, the program remains a quine, so it will not heal itself, it will just
remain in its modified form.
Recapitulation
I have been introducing a great many names and concepts. I will summarize them here.
A quine (or selfrep) is a program that prints its own listing.
A multi-quine is a collection of several quines, each one of which is able to print
either its own listing or any of the other ones.
The code section of a quine is that which uses the data to print the program; it is
printed by interpreting the data section (which may imply unlocking it with a key
or some complicated operation of the sort).
The data section of a quine is that which represents the code section. It is derived
from the textual form of the code, and the code's role is to perform this operation
backward; the data is printed by reading the data and representing it in a more or
less trivial fashion (for example, tabulating it in hexadecimal).
An intron is a part of the data section of a quine which can be modified in such a
way that the program remains a quine (in other words: it is modified and the
output produced by the program changes so as to follow the data modification).
Irrelevant code is a part of the quine's code section which can be modified (or
removed) so that the program still produces the same output (ergo the original
quine). In other words, bootstrapping the quine will heal the irrelevant code.
Key code is a part of the quine's code section which cannot be modified at all (if it
is modified, the program either is not correct, or does not function, or produces
gibberish; this is in contrast with an intron which if modified does not make the
quine any less quinish, or irrelevant code which if modified still produces the
same program).
Bootstrapping is the operation of running one or more times a modified version of
a quine to recover the original quine. For example, a quine can be bootstrapped
from the knowledge of its data section and of some code that will perform the
function of the key code. This is a healing process that will recover the
irrelevant code.
There are analogies with compilers (or interpreters) of course. An intron within a
compiler would be something that cannot be bootstrapped, essentially because the
compiler (or interpreter) merely copies the behavior of the underlying system
(compiler) to itself. This is what Ken Thompson explains (he gives the example of \v in
C) in his Turing Award speech quoted in the links section below. Irrelevant code
differences are differences between two compilers which perform the same task (i.e.
output the same binaries) but in a different way (their binaries are different), for
example the same compiler compiled with two different compilers; then we can do a
bootstrapping, i.e. recompile the compiler and obtain the fixed-point version.
Self-interpretation: using data as code
In this section I must give my examples in Scheme rather than in C because Scheme
permits the manipulation of programs (meta-expressions) as data (symbolic
expressions).
Consider the two following elegant Scheme quine programs. First this one:
(define (line-write x) (write x) (newline))
(define (d l) (map line-write l))
(define (mid) (display "(do '(") (newline))
(define (end) (display "))") (newline))
(define (do l) (d l) (mid) (d l) (end))
(do '(
(define (line-write x) (write x) (newline))
(define (d l) (map line-write l))
(define (mid) (display "(do '(") (newline))
(define (end) (display "))") (newline))
(define (do l) (d l) (mid) (d l) (end))
))
and second this one
(define x '(
(display "(define x '(")
(newline)
(map (lambda (s) (write s) (newline)) x)
(display "))")
(newline)
(display "(map eval x)")
(newline)
))
(map eval x)
The first one is easy enough to understand, and follows the usual pattern well: the five
lines ending with the second-to-last are the data (as well as the two character
strings, I suppose), and the rest is the code. The code (the do function essentially)
uses the data (the l variable essentially) to print the code (the first (d l)) and then
print the data (the second (d l)).
But the second example is a bit strange: evidently the x variable (the lines from the
second to the eighth) is data. The code, essentially, is limited to the single instruction
(map eval x). If you are unfamiliar with Scheme, this means: consider x as a list of
Scheme instructions and execute them. So what we are doing here is using the data,
in effect, as code. This is curious because the whole point of a quine, really, is to use
code as data, and here we are using data as code. But in a way it makes sense if you
consider x to be written in a programming language which is just like Scheme except
that the code can be accessed as data through the variable x. Then x's role is to print
x itself plus the interpreter ((map eval x)).
I have also written a quine in Bourne shell along the same principles. It is rather subtle
to understand, but I think it is worth the trouble. If you prefer the dc programming
language, then compare this quine (along the lines of the first Scheme program above,
i.e. the normal lines) and that one (which also uses the data-as-code principle, and
is shorter).
I'm not entirely sure whether this way of writing quines is actually qualitatively
different from the normal way. (For example, do they correspond to a different proof
of the fixed-point theorem, perhaps one that uses the universality theorem one more
time? I can manufacture such a proof, but it is not really convincing.) It is true
that if we compare the two Scheme programs, or the two dc programs, given above,
there seems to be an important difference (namely, that there is much more
redundancy in the first than in the second). But maybe that is just a naïve way of
thinking. Still, I can't help but think there is some relation with the two ways of writing
the Curry Y (fixed-point) combinator: as λf.((λx.(f(xx)))(λx.(f(xx)))) or as λf.((λx.(xx))
(λx.(f(xx)))). But maybe I've gone totally off my rocker there.
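For what it is worth, the two ways of writing the combinator are one β-reduction step apart, which a short calculation makes visible. The symbol W_f below is a shorthand introduced only for this note; it is not standard notation.

```latex
\begin{align*}
W_f &\equiv \lambda x.\, f\,(x\,x) \\
(\lambda x.\, x\,x)\; W_f &\to_\beta W_f\; W_f \\
W_f\; W_f &\to_\beta f\,(W_f\; W_f)
\end{align*}
```

Applied to f, the first form gives W_f W_f directly, with the self-applying term written out twice; the second form gives (λx.(xx)) W_f, where W_f is written once and an explicit duplicator makes the copy. The first step above turns the second form into the first, and the last step shows that W_f W_f is a fixed point of f.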
To conclude this section, I'd like to mention one program I wrote that I'm particularly
fond of. It is not a quine and it is in no way so impressive; but in fact it was
considerably more difficult to write than a quine. It consists of a (rather minimal)
Scheme interpreter, written in Scheme. And that interpreter is applied to itself acting
upon itself. So it is a Scheme interpreter trying to interpret a Scheme interpreter
interpreting a Scheme interpreter interpreting... well, you get the picture. As each
interpreter prints some debugging information about the program it is interpreting,
this leads to a lot of output data (with curious properties; for example, search for the
string Now starting evaluation without quotes around it, and see how it becomes
logarithmically rarer and rarer). If you have read the cryptic comment I have made a
while back on the use of the universality theorem in the fixed-point theorem, this is a
case where we need the universality theorem, and indeed, it is the central part of our
program (writing an interpreter). You should also note the analogy with Gödel's
theorem, because this self-interpreting program is much closer to Gödel's theorem
than ordinary quines. Naturally, if we allow the use of the eval function (but that's
cheating), we can rewrite my program in a much simpler way:
((lambda (expr) (eval `(,expr ,expr)))
'(lambda (expr) (eval `(,expr ,expr))))
(a cute endless loop).
Conclusion
Well, I've written much more than I intended to. I wanted to make this a small page on
The Art Of Quine Programming, and it turned out to be quine (oh, what a strange slip!
I meant quite of course) a monument.
I haven't given enormously many examples, but I hope the examples I've given were
clear enough so that, if you didn't know how to write quines initially, now you do (even
if you didn't understand all that's on this page). If you want more examples, have a
look at my personal quines collection (all written by yours truly), which you can also
access by FTP, or download as a single tarball. Also look at some of the links below,
where a great many more quines can be found.
Yow! I've just lost the SOURCE CODE for all my QUINE PROGRAMS! What will I DO
NOW with just the BINARIES?
Links related to quines
The Quines node in the Open Directory
The quine entry of the FOLDOC
The Quine entry in the Wikipedia
The quine node on Everything2
The Quine Page by Gary P. Thompson II
About the Unlambda Quine Contest
My personal collection of quines (quite small in comparison with those of Ben
Olmstead or Gary P. Thompson II)
An explanation of the above collection I posted on a local newsgroup of the ENS
(so it's in French)
A famous text (somewhat distantly related): Ken Thompson's Turing Award
speech, Reflections on Trusting Trust
Gone without forwarding address: the Quines List (formerly at
http://www.mines.edu/students/b/bolmstea/quines/index.html) by Ben Olmstead seems to have
entirely disappeared; Cat's Eye Technologies (whose address changes every year or so) used to
have a section on Quines, but now I can't find it.
David Madore (david+www@madore.org)
