badness-0-knuth
I am... not like this. If a character in a TV commercial is handling a coffee cup but I can infer from its moment of inertia that the cup does not contain any liquid, I immediately lose suspension of disbelief and will not purchase the product featured in the commercial. I literally projectile vomit if Auto-Motion Plus is enabled on a television in the hotel I’m staying in, even if the TV is not turned on, or if someone misuses the word “literally.” If I see a paragraph missing a period at its end on Wikipedia, I will spend dozens of hours writing software to organize and semi-automate a distributed effort to fix all the missing periods on Wikipedia. [2] And worse, each time I learn of a new type of mistake, I am forever cursed to notice that mistake

Seriously: One time I found myself spell-correcting someone else’s lorem ipsum text in a slide. It said “lorem epsom,” which is funny. I think about that incident all the time. The person that wrote the slide probably thinks about things like leveraging synergy, generative AI, metaverses, blockchain 3.0, snackable content, being eco-green, and so on, without it occurring to him that these things could have nuance and meaning separate from their names. He has probably never even read the Wikipedia article on Lorem Ipsum. He is successful and rich.

Another successful person is the congressperson Bill Cassidy. Criticizing a proposed bill that would reduce the standard work week in the US by 8

For another thing: This guy mixes fancy typographic quotes and ASCII ones.

But the main thing I want to talk about is: What? No! 32/40 is four fifths, not three quarters. This is not, like, complicated math. It uses some of the world’s smallest integers. Everybody knows that the work week is 40 hours, and that a work day is 8 hours, and that the proposed bill reduces it by one day, giving four of five days. I don’t really mind if someone makes an error in calculation (well, I do mind, but I am certainly prone to doing it). The infuriating realization here is that this person does not even think of “three-quarters” as a kind of thing that can be right or wrong. He says three quarters because it makes smaller number feelings. You could imagine him having the conversation (with me, perhaps): “You say four-fifths, I say three-quarters.” Me: “But it is four fifths. And why are you always hyphenating it?” Him (smiling patronizingly): “I guess we just have to agree to disagree.”

The opposite of this person is the hero called Donald Knuth.

I’m not saying that Donald Knuth isn’t successful and rich. According to the website “Famous Birthdays,” [4] which is probably generated by AI or at least by people whose economic output is measured in a count of words, and words whose value is computed by their ability to drive ad clicks, Donald Knuth is “is one of the most popular and richest Mathematician who was born on January 10, 1938 in Wisconsin, Wisconsin, United States. Mathematician and engineer who was arguably most recognized as the Professor Emeritus at Stanford in Palo Alto, California.” As one of the richest Mathematician from United States, according to the analysis of Famous Birthdays, Wikipedia, Forbes & Business Insider, “Donald Knuth’s net worth $3--5 Million. *”

I suppose it is arguable that he is the Professor Emeritus. And it is very likely true that he is the only popular and rich mathematician born on that specific day in Wisconsin, making the singular “Mathematician” perhaps a technical master-stroke. But more likely this is just an amusingly dense series of imprecisions. The asterisk of course does not have any referent on the page.

What I mean when I say that Donald Knuth is the opposite of this person is that Knuth is interested in unpacking a single unnecessary detail, recursively, until it is completely solved. According to the website Famous Bibliophiles, one day Donald Knuth set out to write down the entire subject of computer science in a single book called The Art of Computer Programming. As he was doing so, he realized that describing computer algorithms in a lasting form would require a programming language that was not subject to constant revision, so he invented the MIX instruction set for an idealized computer. After writing some 3000 pages out in longhand, he found that it was impractical to print them all in one book, so the plan expanded to be multiple volumes. Then when he got a draft of one of the books back from the typesetter, he was unhappy with the details of the typography, and so he paused his work writing down all of computer science to create some new computer science: First an algorithm for determining where to place line breaks in order to make text optimally beautiful, then algorithms for hyphenating words, then generalizations of these for typesetting mathematics, and then a full computer typesetting system that is still in wide use today, called TeX. Along the way he was unsatisfied with the specific typefaces that existed in the world, and unsatisfied with the way that typefaces were described at only one weight, and so he created the parameterized METAFONT system and several new typefaces. Undeterred by these excursions, he returned to his original task of writing down the entirety of computer science, using all the technology he had built. By the time he finished this, much more computer science had been invented, including by his own hand, and so he needed to rework MIX for the next volume, and update the first. The revised plan of eight volumes remains the intention in 2024. However, he found that the volumes were getting rather long, and began releasing portions of volumes (“fascicles”). So far, Volume 4 has been partially published as books 4A [5] (fascicles 0–4; 912 pages) and 4B [6] (fascicles 5–6; 736 pages). It is unknown how many more episodes remain in Volume 4. I expect that every conversation that Knuth has with his editor goes like this. Editor: “Hey, Donald, I hope you’re well. Just wondering if you have an update on when 4C will be ready? Or any more icicles?” Donald E. Knuth: “I am working diligently on fascicles for Volume 4C. As I’ve mentioned in the past, it’s impossible to tell how long it will be, since mathematics does not obey the rules of project management.” Editor: “I just need a date to tell the publishers.” Donald E. Knuth: “Like I’ve said, any date would be very low confidence, other than the fact that it will be in the future.” Editor: “I just need a date.” Donald E. Knuth: “Would you like me to say a date, knowing that it’s a very low confidence guess, and that I would be extremely likely to miss that date, or even deliver early?” Editor: “Early! Now we’re talking.” Donald E. Knuth: “What use is the date if you’re excited about the possibility of it being early, relative to some unknown date?” Editor: “I just need a date for the publishers.” Donald E. Knuth: “2030.” Editor: “Thanks Donald, you’re the best!”

Volume 5 is estimated to be ready in 2030, when Knuth will be 92.

That’s a large amount of language!

Nightmare on LLM street

Then we have Large Language Models. [7] One of the irritating things about LLMs is that they are so buzzwordy, but unlike most buzzwordy trends, they are actually substantive. They produce remarkably fluent text. With no additional training they frequently beat purpose-built models that have been in development for decades. They generalize to completely new situations.

So many things about “AI” distress me. Dolor sit amet! I worry about the devaluation of human creativity, about large-scale disinformation and spam ruining the beautiful library of knowledge that humans have created, about extreme concentration of wealth. And yes, I worry about competing with
AI. Being able to work tirelessly and thousands of times faster than humans is a huge competitive advantage. Of course, I find some solace in the significant possible upsides. It might help us solve hard problems like climate change and AI. But even in the best scenarios we will not be able to ignore it: Even if it never gets as smart and precise as Knuth, it’s already too economically useful in its Lorem Epsom state (just like Lorem Epsom himself).

On the other hand, the technology is pretty neat and lends itself to some nice abstractions. I love playing with words. So one of my side quests is to masticate this whole scenario by experimenting with LLMs in practical and impractical applications, and to try to make it fun (for me) to program with them.

Many things irritate me, so this is something I have ample experience with. I have a myriad of strategies for digestion of them. For this work I’m inspired by the “Hurry-Coward So-so-morphism,” where I make connections between topics based solely on confusion of superficial lexical similarities without regard to their underlying meaning. So for example we have “ML” meaning both “Machine Learning” and “Meta Language”, as well as “type” both as in “typeface” and as in “type systems for programming languages.” [8] And because machine learning has claimed so many words, there are a great many shared with typography as well:

              +----------+
              |typography|
              +----------+
                  /  \  “baseline”
   “fixed point” /    \ “floating point”
                /      \ “weight” “vector”
        “type” /        \ “descent”
              /          \ “kerning trick”
             /            \ “dingbats”
            /              \ “gradient”
    +-----------+         +--------+
    | functional|---------|machine |
    |programming|         |learning|
    +-----------+  “ML”   +--------+
     “lambda”             “generalization”
     “parameter”          “tensor”

have something that we want: Perfect typography? This paper is about a new typesetting system, BoVeX, which allows for the controlled exchange of precision for beauty. It essentially gives us a dial between Lorem Epsom and Donald Knuth. To illustrate, we’ll first look at a simpler case by inspecting one of my other interests: Super Metroid.

    The scientists’ findings were astounding! They discovered that the powers of the Metroid might be harnessed for the good of civilization!

Metroid is a video game series about a brain that has been enslaved inside a jar in an underground datacenter on the planet Zebes. This brain is called Mother Brain and its goal is to control the hypercapitalists called Space Pirates to increase their “score” as high as possible by conquering planets throughout the galaxy. Mother Brain was invented by the Space Pirates, although it is not clear whether the current situation was actually intended by the Space Pirates. The most super version of Metroid is Super Metroid.

In the 1990s the website gamefaqs.com collected plain text “FAQs” for classic video games, then just known as video games. On this site another hero was born. They were writing the definitive guide to speedrunning the SNES game Super Metroid when they saw that some of their ASCII lines ended up exactly the same length, and that it looked good:

    Once you save the game at your ship (about 1 hour 15 minutes is good), go
    down to Tourian. Do not save your game in Tourian if you have intentions of
    returning to any previously explored section on Planet Zebes. There will be
    a few Metroids to kill before you reach Mother Brain, and they must all die
    in order to continue to Mother Brain. Read the boss guide for more details.
    Once Mother Brain is defeated, you will need to hurry back to your ship. By
    now you will already have the HYPER BEAM. From Mother Brain’s room, go west
    and then south. Take the blue door at the bottom and speed dash east. Super
    jump up, and continue north. Once you land up top and are running east, aim
    diagonally down to the right and shoot an unseen door. Eventually, you will
    get to this door since lava will start to rise from the floor in this area.
    Speed dash through the door you preopened, and charge for a super jump. Hug
    either the left or right wall in the Craterian shaft and super jump up. Now
    quickly get to your ship before the planet explodes. There should be almost
    a minute left on the timer. Sit back and watch the ending! Did you beat the
    game within 1 hour and 20 minutes?

and so they wisely decided to wordsmith the entire 28-page guide so that every line was exactly the same length, with no extra spaces or other cheating, just because it could be done. [9]
There is one object type obj in BoVeX. A value of this type has an arbitrary set of named fields whose types are known; they can only be the base types int, float, string, bool, layout, or obj. Fields are distinct if they have different types. An object can be introduced with an expression like {() field1 = exp1, field2 = exp2}, provided that each field’s type can be synthesized from the expression itself (in the bidirectional type-checking sense). Alternatively, the program can declare an object name O:

    Another use is in the [tt[layout]] type.
    This is a primitive type that most of a
    document’s text is written in. It is a
    tree structure with optional attributes
    on each node, which are represented with
    an object. For example, this paragraph is
    written in the [tt[paper.bovex]] source
    file as:

The square brackets are used to write a layout literal (the main body of the document is inside one large literal). Layout literals can also embed expressions (of type layout) with nested square brackets. Here the function tt is applied to a layout literal that contains text like paper.bovex. The tt function just adds the font-family attribute with value "FixederSysLight" to the layout node. This is a custom monospaced bitmap font that I made for this paper using software I wrote. It is part of the FixederSys family. [25] Functions like b and it apply bold and italic text styles, but functions can do anything that you can do in a general-purpose programming language.

Primops

The other thing that objects are used for is interfacing with the runtime that is executing the BoVeX bytecode. There are about 50 different builtin primops that can be used by the BoVeX program. This includes simple things like integer and floating point addition, but also heavyweight operations like “load and register this collection of TrueType font files as a font family” or “invoke the boxes-and-glue packing algorithm with these parameters.” The primops in the former category work naturally on simple base types, but the heavyweight ones need to be able to pass complicated tree-structured heterogeneous data between the BoVeX bytecode executor and the runtime. It would be possible for the runtime to consume and create BoVeX values like tuples and lists, but this has two problems: One, many types like list are declared as user code (in the BoVeX standard library); they are not special, and we don’t want to make them special by informing the runtime of them. Two, requiring specific representations at the runtime boundary inhibits optimization; for example we can normally analyze the whole program to flatten data structures or remove record fields that are never used. The runtime typically uses obj to communicate structured data.

For example, the internal-pack-boxes primitive runs the boxes-and-glue algorithm. It takes some layout (which is expected to be a series of box nodes, with attributes giving their size, glue properties, and so on) and configuration parameters like the type of justification and algorithm to use. It returns an object with a new layout (the boxes grouped into lines, with new glued-up widths) as well as the total badness. Inside the BoVeX layout support code, this primop is wrapped as pack-boxes with a native, typed interface, so programmers do not need to think about that implementation detail. Other typographic features that benefit from runtime support are implemented this way as well.

Typographic features

BoVeX offers the pack-boxes algorithm, which can be used to nicely justify text. It can also be used to distribute paragraphs into columns, by thinking of the paragraphs as “words” (acceptable to break at any line, but bad to break near the start or end of a paragraph) and the columns as “lines.” It could be used by the document author for other purposes, I guess. There are other typographic features available.

Most of the layout of the document itself is by BoVeX code, which is either part of the standard library or part of your document, depending on how ambitious you feel. The function main-text parses the document layout into paragraphs and removes whitespace that is not really part of the text. It normalizes text properties across those paragraphs so that they can be manipulated individually. For each paragraph it uses the built-in get-boxes to break the words into fixed-size boxes with appropriate glue and hyphenation (see the next two sections), and then uses the pack-boxes routine to optimize their layout. The heights of the resulting lines are measured, and the lines are spaced according to the line spacing, then packed into columns. Once their final placement is known, boxes become stickers, which are sizeless elements that only know their position and contents. In this way, the BoVeX rendering pipeline is itself a bit like a compiler: It transforms programmer-written source layout into formatted paragraphs, then into boxes of known size, then into stickers of known position. At the end, it outputs the document as a PDF.

Any part of the rendering process can report “badness,” by calling the emit-badness primop. Nominally, badness is measured in square points of area that is outside of its container. Worse situations (such as text overlapping other text) have their badness scaled up per the same area of typographic horror. Less serious infractions (such as a little too much space between words) have badness scaled down. You have to use your heart to tell you what these scaling factors should be.

Fonts

BoVeX can render your document in plain Times Roman if you don’t care about anything, or access 13 other boring built-in PDF fonts, or it can load
any TrueType font from font files. (They do not need to be “installed,” and it won’t help to install them. You just put them in the directory with your document.) It loads their kerning tables and applies kerning properly, by generating rigid boxes at the sub-word level with unbreakable glue. I was disappointed to find that most fonts include only a few dozen kerning pairs. They do this in order to “save space” in the font file, which is utterly rich coming from someone that would try to save space inside of words by squeezing letters together! In the current font Palatino, the word “BoVeX” is not kerned correctly because the rare bigraph “oV” does not have a kerning pair. I hope to improve this detail in a future version (perhaps for the presumably forthcoming video version of this paper).

Hyphenation

Johannes Gutenberg invented the hyphen in A.D. 1455 for his Gutenberg Bible, then just known as Bible. [26] His printing process actually required the lines to all be the same length, so he had to stick these little guys all over the place. His hyphens looked like this: ⸗. Later on we straightened these out and decided we only needed one at a time, and today we use them not because we require our lines to all be the same length, but because we like the cognitive challenge of remembering the beginning of the word while we move our eyes to the beginning of the next line while reading.

BoVeX supports hyphenation using the same approach as TeX: We break each word into boxes at legal hyphenation points, and mark these points as sort-of-bad to break; if you do break there, you need to insert the hyphen character and use a little more space. By default in BoVeX, the hyphen sticks out of the end of the line a little bit. This is actually a bug but I like it.

I use the same hyphen dictionary as TeX, which is cleverly represented as a prioritized set of patterns in order to fit compactly in memory. [27] Again, you have to respect Knuth and crew’s attention to detail, although to be fair this algorithm also dates to a time when storing a spell check dictionary in a computer’s memory was described as “not feasible.” So some of this was out of necessity. One of the nice things about the representation is that it generalizes to words that were not in the 1974 Merriam-Webster Pocket Dictionary. For example it hyphenates SIGBOVIK correctly.

The details really keep going, too. The hyphenation dictionary is stored in a file called hyph-en-us.tex. “hyph” here of course stands for hyphens, and “en-us” means “English (United States).” In fact it is the standard language code for US English in the Small Language Model called IETF BCP 47. [28] But then we have “hyph-en”, which is a plausible hyphenation of “hyphen”! You could even read it as “hyphen us, tex”, as a request for TeX to hyphenate the words in this file. This is the kind of detail I’m talking about! (There is also hyph-uk, which for once sounds a little less dignified than the US accent.)

Rephrasing

And of course, BoVeX includes a facility for using the LLM to rephrase text so that it renders more beautifully.

In contrast to the algorithm I described for monospaced text, it is not straightforward to know whether a prefix of some text will pack neatly with a proportional font. It depends on all sorts of contingencies, like kerning, whether we will split mid-word and hyphenate, or change fonts mid-sentence, or include an in-line image, and so on. Unlike monospaced text, a line of proportional text basically never fits exactly (badness 0); we need to apply some glue to make it fit, which generally has some small cost even when the text looks great.

One of the fiddliest parts of this is that we can’t just work with plain text, which is what the LLM enjoys best. Me too. This is because the paragraph being rephrased is some layout value, which contains some structure. Sending the original BoVeX code for the paragraph would maybe be possible in principle, although it would require very invasive changes to the compiler, and forbidden obscenities like “eval” to run the code it generated, and much better error recovery for the presumably vigorous stream of broken BoVeX code generated by the LLM. So I didn’t try that. Instead, I generate a textual representation for the paragraph to be rephrased, and feed that to the LLM. The prompt looks like this:
    Exercise in rephrasing text. The following paragraph, which appears between <P> and </P> tags, needs to be rephrased so that it retains its precise meaning, but with minor variations in the specific choice of words, punctuation, and so on. No new facts should be introduced or removed, and all the ideas from the original paragraph should appear. However, it is good to use synonyms and change the word order and phrasing.

    The text contains markup as well. There are two types: <span class="c0">text goes here</span> and <img src="image.png">. These should be preserved in the rephrased text. <img> tags absolutely need to be retained and should not change their sources, although it is permissible to move them around in the text. <span> should generally be retained, but the contents could change. The classes of spans may not change, and only the classes that appear in the original text may be used.

The first part is basically the same as what I used for the monospaced version, except that I ask the LLM to delimit the paragraph. This is important so that I know when it thinks it’s done, and seems to work better than looking for newlines or the end-of-stream token. The second part is new. I translate the layout into plain text where uninterpreted subtrees are replaced with <img src="img1.png">. These are generally boxes whose contents are not text. This could be an actual inline image or layout used to control rendering, like some bit of horizontal space. Nodes that are used to set text properties of the subtrees with attributes (like fonts, colors, sizes, etc.) are translated into distinct classes and marked up with <span class="c0">...</span>. The LLM has seen plenty of HTML, so it’s able to use these reasonably well.

After generating a rephrasing, I parse the output HTML and match it up with the original layout. If I find any broken HTML, it is rejected. If I find any <img> tag referencing a src not in the original, it is rejected. If I find any <span> tag referencing a class not in the original, it is rejected. The more complexity that the original layout has, the higher the chance of a rejection, but rephrasing generally succeeds. But rejecting samples slows us down, so I leave off the second part of the prompt in the common case that the input paragraph is plain text. That way the LLM doesn’t even try using markup.

With the HTML and original layout matched up, BoVeX can reconstitute the layout with the new rephrased text. This preserves any nested layout and attributes. It then continues with the rendering process.

But, how do we know whether we have a good rephrasing? When we run the boxes-and-glue algorithm, we get a “badness” score for the paragraph’s line breaks, which tells us how bad the paragraph’s line breaks are. When we run the rephrasing algorithm, the probability of the text we generated tells us how semantically good it is, and so we can call 1 - p the semantic loss. Combining those two somehow tells us how bad this is overall, and of course we want to find a rephrasing that minimizes the overall badness.

I wish that I could tell you that I solved this one with a beautiful algorithm! But so far I just have something reasonable that works. I generate many different rephrasings (with their semantic loss), and run each of them through the boxes-and-glue algorithm (to get the typographic badness). I choose the one that optimizes the preferred tradeoff between semantic loss and typographic badness. This process is controlled by BoVeX code (i.e. it is in the source code of this very paper) and so it can be modified by the document author. Knuth has a very low tolerance for semantic loss, and knows that his algorithms produce good results without rephrasing. Lorem Epsom just wants it to look good and sound good. Both have published in SIGBOVIK 2024.

How to generate many different rephrasings? The simplest thing would be to sample randomly, like we did for the monospaced version. But since we prefer rephrasings that maximize probability, it is better to explore them systematically. Consider the model at the end of the prompt to be the root of an infinite tree. Each node in the tree represents an LLM state (sequence of previous tokens) and its children are the possible next tokens. Each of these tokens has a probability. All the model does is allow us to access that probability distribution for a node. Each possible rephrasing is a path in this tree that ends with </P>. We begin by sampling the most likely (as far as we know) path: At each node we see, we take the first (most probable) token. This is our first rephrasing, and it usually matches the original text exactly. Say that we “skipped” probability mass if we sampled a token that is less probable than it. We compute the semantic loss as the average probability mass skipped over all the tokens in the path. For this first path, we always took the most probable token, so this is 0.0 by definition.
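The exploration can be illustrated with a small toy. What follows is a minimal sketch in Python, not BoVeX code: it covers the greedy first path and the skipped-mass semantic loss described above, together with the score-based expansion that the next paragraphs describe. The names (`toy_model`, `explore`, `semantic_loss`, and so on), the three-token toy distribution, and the choice to treat the root's "average ancestor probability" as 1.0 are all assumptions made for illustration; a real model would return a distribution over its whole vocabulary.

```python
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]          # token -> probability at one node
Path = Tuple[str, ...]           # sequence of tokens from the root
Model = Callable[[Path], Dist]   # hypothetical stand-in for the LLM
END = "</P>"

def ranked(dist: Dist) -> List[Tuple[str, float]]:
    """Tokens at a node, most probable first."""
    return sorted(dist.items(), key=lambda kv: -kv[1])

def skipped_mass(dist: Dist, token: str) -> float:
    """Total probability of the tokens that were more probable than `token`."""
    p = dist[token]
    return sum(q for q in dist.values() if q > p)

def semantic_loss(model: Model, path: Path) -> float:
    """Average probability mass skipped over all tokens in the path."""
    skipped = [skipped_mass(model(path[:i]), tok) for i, tok in enumerate(path)]
    return sum(skipped) / len(skipped)

def greedy_finish(model: Model, prefix: Path, max_len: int = 50) -> Path:
    """Extend `prefix` with the most probable token until END (or max_len)."""
    path = list(prefix)
    while (not path or path[-1] != END) and len(path) < max_len:
        path.append(ranked(model(tuple(path)))[0][0])
    return tuple(path)

def mean_prob(model: Model, node: Path) -> float:
    """Average probability of the tokens leading to `node` (assumed 1.0 at the root)."""
    if not node:
        return 1.0
    probs = [model(node[:i])[tok] for i, tok in enumerate(node)]
    return sum(probs) / len(probs)

def explore(model: Model, count: int) -> List[Path]:
    """Return up to `count` complete paths, most promising first (heuristically)."""
    paths: List[Path] = []
    tried: Dict[Path, int] = {}  # node -> how many children, in rank order, were taken

    def add(path: Path, first_new: int) -> None:
        # Nodes from index `first_new` on were created by greedy sampling,
        # so exactly their top-ranked child has been explored.
        for i in range(first_new, len(path)):
            tried[path[:i]] = 1
        paths.append(path)

    add(greedy_finish(model, ()), 0)
    while len(paths) < count:
        best = None
        for node, k in tried.items():
            options = ranked(model(node))
            if k >= len(options):
                continue  # every child of this node has been explored
            score = mean_prob(model, node) * options[k][1]
            if best is None or score > best[0]:
                best = (score, node, options[k][0])
        if best is None:
            break  # the whole (finite) toy tree is exhausted
        _, node, token = best
        tried[node] += 1
        add(greedy_finish(model, node + (token,)), len(node) + 1)
    return paths

def toy_model(prefix: Path) -> Dist:
    """A made-up model whose distribution depends only on the position."""
    table = [{"The": 0.7, "A": 0.3}, {"cat": 0.6, "dog": 0.4}, {END: 0.9, "sat": 0.1}]
    return table[len(prefix)] if len(prefix) < len(table) else {END: 1.0}
```

With the toy model, `explore(toy_model, 3)` visits the greedy path first, then diverges at the root (score 1.0 × 0.3), and then diverges after "The" (score 0.7 × 0.4); `semantic_loss` is 0.0 for the greedy path because nothing was skipped. Changing how `mean_prob` or the score is computed gives a different exploration order, which matches the text's point that the scores are only a heuristic.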
The next path we explore will diverge from this path at some node (maybe the root). We pick a node that is likely to result in a good final loss, by scoring each node in the tree. The score is the average probability of all ancestor nodes times the probability of the next highest-probability token that we have not yet explored. The node with the highest overall score is the one we expand, by choosing that next highest-probability token. We are now in an unexplored part of the tree, and so we sample the most probable nodes repeatedly until we reach </P>. Speaking of which, BoVeX has a heck of a time trying to rephrase these last few paragraphs because they literally contain the text </P> in them.

The scores should be seen as heuristic; we would get different results by choosing different ways of computing the score. This is an example of a “beam search” algorithm, which is good because it connects this project again to Super Metroid. As described in the earlier excerpt from the speedrun document that inspired this work, one of the final things you do in that game is acquire the “hyper beam” to defeat Mother Brain.

Since we will run the boxes-and-glue algorithm on multiple related texts, I generalized that algorithm to work on tree-structured input. This is clean; the memo table keeps the same dimensions, but records an additional fact. Now we store the penalty, whether to break after this token, and what the best subtree is. We have to consult each subtree when computing the score for a node, but this does not affect the asymptotic runtime. The table size is still at most O(n²), and although we explore more children per node, branches in the tree reduce the maximum depth to the root, which actually reduces one of the factors of n to log(n) as the tree becomes complete. However, as the SIGBOVIK deadline crept upon us, I never actually hooked this functionality up. It would require additional (programming) work to merge the trees, and the layout process is so fast that it doesn’t matter; I can easily run the full layout algorithm on hundreds of rephrasings per paragraph.

I would like to improve the algorithm, because it does seem like there should be a way to integrate the boxes-and-glue dynamic programming algorithm with the path extension algorithm so that we prioritize exploring nodes that are likely to generate the best balance of typographic and semantic quality. It won’t be as satisfyingly optimal as boxes-and-glue itself because we have incomplete information (we never know whether one of the exponentially many paths starts out with improbable tokens but then ends with a miracle streak of probable tokens). But it can certainly be more satisfying. Knuth would not stop here (but this is an Any% Knuth speedrun).

Instead I spent my time implementing an achievement system in BoVeX. The first time certain conditions are met, the system permanently awards you an achievement and prints a nice color trophy on your terminal. For example, you can get the “Not bad” achievement for generating a document that is at least 5 pages and has less than 1000 badness per page.

Advantages of rephrasing

Another nice thing is that the manual rephrasing that consumes valuable brain sugars when writing can become optional. For example, when I wrote the opening paragraph of this paper and listed a variety of trivial details, I might not need to think of different ways to say “unconcerned.” I could just write “unconcerned” each time and let the typographic considerations determine which synonym to use each time.

Conclusion

In this paper (and with this paper) I presented BoVeX, a new computer typesetting system. It follows the tradition of TeX, but with modern amenities such as requiring over 128 gigabytes of RAM. Though some may consider the addition of AI features to TeX to be an unnecessary perversion, I find this use of LLMs to be fully justified.

Future work

Typographic features. Many more typographic features are desirable. Footnotes! It is so hard to write a paper without footnotes. Where am I supposed to put the bonus digressions? The layout of footnotes is tricky and should be part of a general floating figure implementation. Endnotes are actually easy, but I don’t want endnotes. I want them to be little footnotes so that you can’t help but read them.

BoVeX does not support page numbers, which is good because they are forbidden by the SIGBOVIK program committee.

TeX is famous for its mathematical typesetting as well. It would fit neatly into BoVeX in the
same way, since both use the same fundamental boxes-and-glue engine. BoVeX does not have "macros" or "modes" like TeX, but it would work cleanly to write a BoVeX function math (or, if you like, $) that parses a custom syntax. In fact it would be natural to have different parsers for different maths, so that you don't need to parse -> as minus greater than in mathematical contexts that don't use minus or greater than at all.

Optimization. There are many opportunities to make BoVeX code faster. This is mostly important for when it is being run in a loop in order to try out many different rephrased texts. (That said, I do not wish to preclude what could be done with BoVeX by assuming its execution is doing only typesetting tasks. For example, shouldn't you be able to challenge your paper's reviewers to a game of chess against a strong engine embedded within your document?) The first thing to fix is that it manipulates too many strings at runtime (e.g. the code, record labels, object fields, and "registers"). This is easy to fix since these are all known at compile time. There are lots of high-level optimizations left to do for the IL code (common subexpression elimination, constant argument removal, uncurrying, etc.) and lots of peephole and control-flow optimizations left to do for the bytecode (currently no optimizations are performed at all). All of this becomes more important if I add another planned feature, which is the ability for the document to be globally optimized by applying a black-box optimizer to a set of user-specified parameters. For example, the column width, line spacing, or font size could be tweaked to make the document fit better. This feature is "Auto-Margin Plus." Things are already set up to do this pretty straightforwardly; we would simply generate the document over and over while searching over the parameter space, and choose the one with the least badness. This may also affect which rephrasings look best. But instead I spent my precious time implementing 3D text. [29]

Reproducibility. The algorithm for rephrasing text tries to find the best place to explore the next most likely token from the probability distribution. This expects the generation of these distributions to be deterministic. Mathematically, inference is deterministic (it is just a bunch of matrix multiplications), so this "should work." But in practice the enormous calculation is performed in an unpredictable order as it is executed in parallel (in multiple CPU and GPU cores). Because floating point arithmetic is not associative (or distributive, commutative, or other properties you'd like), inference can sometimes generate different answers due to floating point round-off error. [30] Alas, these are not even necessarily related to the final probabilities in the model, as billions of non-linear operations happen within the hidden layers of the network. The effect is not particularly grave; we might miss out on a highly likely path because the probability distribution was different the second time we looked at it. There are already lots of ways we might fail to find highly likely paths, so this is not some kind of reproducibility crisis. It is mostly just a bit unsatisfying.

Unicode support. This would have been helpful when above I decided to show you Gutenberg's funny hyphen, [image], for which I had to settle for embedding a crappy hand-drawn PNG file. Instead I could have used U+2E17, which, since this exotic code point is not present in the font Palatino, you could have experienced as [missing glyph]. BoVeX is written with some Unicode support, with the main exception being that the PDF output code only supports the embarrassingly diminutive WinAnsiEncoding. [31]

Deadlines. Although BoVeX itself is very fast, rephrasing is very slow. This presents a problem for the typical way that academic papers are written, which is to do all the work in a coffee-fueled fugue in the last few days before the deadline, then stay up all night writing the paper and finding citations for the pro-forma "related work" section which you did last but you know that the reviewers will insist upon, and tweaking \vspace and \begin{figure}[h!] until it fits within the page limit. On the one hand, BoVeX does potentially free the author from the visual tweaking process. But on the other hand, the LLM inference for the rephrasing process can be quite slow, and it can take many hours or days to fully bake a long paper! For this reason, it may be better to change conference deadlines to a system where the pre-rephrasing text is submitted. The publishers (what do they even do?) can be the ones to execute the rephrasing in the cloud as they produce the "camera-ready copy." With straightforward extensions, this would also allow the rephrasing to adapt to changes in the overall volume style, or to adjust to avoid embarrassing typographic coincidences with other articles in the same volume (such as using the same notation with a different meaning). In principle, the paper could edit itself to respond to feedback from reviewers, in a way that minimizes the semantic distance from the original. This rapid feedback loop could reduce the time to publication, perhaps to
mere months, or even weeks!

Other ways to minimize badness. The BoVeX system allows the document author to exchange semantic consistency for higher quality typography. Although we achieve state-of-the-art results, there are likely points that are more Pareto-efficient than what BoVeX can reach. BoVeX uses one of the most powerful publicly available LLMs, but that model is limited to rewriting the text within narrow constraints. Irresponsible research has demonstrated that language models are capable of volition, taking actions and using tools to accomplish goals. With minor modifications, it is likely possible to expand the Pareto frontier of the semantic/typographic tradeoff. For example, sometimes we could improve the typographic quality of the text without any semantic loss, by acting on the world to make the reworded text true. Human authors do this already: Earlier when I was describing internal-pack-boxes, rather than explain the somewhat awkward implementation, I went back and changed the already-working code so that it would serve as a simpler example of how primops use obj, but still be truthful. Now imagine the difficulty in typesetting a statement like "The universe contains approximately 1,000,000,000 paperclips," and how much more beautiful the text could be if that number were instead 10,000,000,000,000,000,000,000,000,000,000,000,000!

In the meantime there is an easier way to get zero badness: Delete the whole document! As a wise person once said, "If you can't say something with nonzero typographic or semantic loss, don't say anything at all."

Acknowledgements. Supposing his name survives rephrasing, I'd like to shout out to one of my advisors, Karl Crary. 20 years ago, he set out with me on an ill-advised and ill-fated attempt to replace LaTeX with an SML-like language mTeX, which compiled into TeX macros. The nesting square brackets syntax was Karl's idea, and BoVeX shares genetic material with mTeX for sure.

See you next mission,

Tom 7

Bibliography

[31] Adobe. "PDF reference: Sixth edition". October 2006.

[7] You can just go to arxiv.org and click on any random article these days.

[16] N Bijlage. "Knuth meets NTG members". NTG: MAPS, 16. March 1996. pp. 38–49.

[3] Russ Bynum. "Bernie Sanders wants the US to adopt a 32-hour workweek. Could workers and companies benefit?". March 2024.

[12] https://github.com/ggerganov/llama.cpp. ggerganov. March 2024.

[26] Johann Gutenberg. "Bible". 1455.

[4] https://allfamousbirthday.com/donald-knuth/. February 2024.

[23] https://ocaml.org. "The Objective Caml system". 2023.

[21] Graham Hutton. "Higher-order functions for parsing". Journal of functional programming, 2(3). 1992. pp. 323–343.

[5] Donald E Knuth. "The Art of Computer Programming: Volume 4A, Combinatorial Algorithms Part 1". Addison-Wesley. January 2011. 912 pages.

[6] Donald E Knuth. "The Art of Computer Programming: Volume 4B, Combinatorial Algorithms Part 2". Addison-Wesley. October 2022. 736 pages.

[15] Donald E Knuth, Michael F Plass. "Breaking paragraphs into lines". Software: Practice and Experience, 11(11). November 1981. pp. 1119–1184.

[27] Franklin Mark Liang. "Word Hy-phen-a-tion by Com-put-er". 1983.

[17] Robin Milner, Mads Tofte, Robert Harper, David MacQueen. "The definition of Standard ML (Revised)". MIT Press. May 1997. 114 pages.

[8] John C Mitchell. "Type Systems for Programming Languages". Van Leeuwen, Jan, ed. Formal Models and Semantics. 1990. pp. 365–458.

[1] Tom Murphy VII. "Badness 0 (Epsom's version)". SIGBOVIK. April 2024. 14 pages.

[30] Tom Murphy VII. "GradIEEEnt half decent". SIGBOVIK. March 2023. pp. 33–56.
[18] Tom Murphy VII. "Modal Types for Mobile Code". January 2008.

[13] Tom Murphy VII. "NaN gates and flip FLOPS". SIGBOVIK. April 2019.

[10] Tom Murphy VII. "The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel. After that it gets a little tricky". SIGBOVIK. April 2013. pp. 112–133.

[20] Tom Murphy VII. "The Wizard of TILT: Efficient(?), Convenient and Abstract Type Representations". Carnegie Mellon tech report CMU-CS-02-120. March 2002.

[29] Tom Murphy VII. "The glEnd() of Zelda". SIGBOVIK. April 2016. pp. 105–112.

[14] Tom Murphy VII. "ZM~~ # PRinty# C with ABC!". SIGBOVIK. April 2017. pp. 129–148.

[11] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. "Llama 2: Open foundation and fine-tuned chat models". ArXiv.org. July 2023.