-
Notifications
You must be signed in to change notification settings - Fork 0
Decoding Hoon
These are the notes from abliss's attempt to decipher the Hoon language. It is current as of commit 2ea195, but the code samples below are not live (can github wiki do that?), so this document will quickly fall out of date. I still haven't figured it all out, so please contribute if you know more than I do or you have ideas for a better presentation.
I will try to avoid putting in too many spoilers, mostly by writing this as I go. But for maximum enjoyment of the puzzle, you should try to figure it out without any help.
Over in the sidebar I'll be collecting a "cheat sheet" of syntax rules, in the order I figure them out. Each symbol or digraph is linked to the place in this document where I explain what I think it does.
Hoon has scarce documentation, so it would seem the best way to learn Hoon is to understand its compiler, hoon.hoon. By examining its form and its content, we will gradually discover Hoon's alien syntax.
The best place to start is probably section 2eF, parsing (ascii)
, which will teach you the three-letter-name for each ascii symbol. Like the docs say, you just need to memorize these. A set of flashcards would not be a bad idea. In case it helps, I've relisted them below in a different order, with some mnemonics added in the comments (which are clearly supposed to be set off by ::
):
++ ace (just ' ') :: spACE
++ gal (just '<') :: GreAter Left
++ gar (just '>') :: GreAter Right
++ pel (just '(') :: ParEnthesis Left
++ per (just ')') :: ParEnthesis Right
++ kel (just '{') :: curly bracKEt Left
++ ker (just '}') :: curly bracKEt Right
++ sel (just '[') :: SquarE bracket Left
++ ser (just ']') :: SquarE bracket Right
++ doq (just '"') :: DOuble Quote
++ soq (just '\'') :: Single (O) Quote
++ col (just ':') :: COLon
++ sem (just ';') :: SEMicolon
++ com (just ',') :: COMma
++ dot (just '.') :: period looks like a DOT
++ tec (just '`') :: backT(E)iCk
++ sig (just '~') :: ?people use tilde in their SIGnature? or ?SquIGggle?
++ zap (just '!') :: onomatopoeia. often followed by bang character, or the word "bang", in comics.
++ pat (just '@') :: (P)AT sign
++ hax (just '#') :: ?HAsh tag (X)?
++ buc (just '$') :: BUCk is american slang for dollar
++ cen (just '%') :: perCENt sign
++ ket (just '^') :: (K)carET? or looks like a rotated version of what physicists call the bra-KET?
++ pam (just '&') :: AMPersand (jumbled so as to be consonant-vowel-consonant)
++ tar (just '*') :: sTAR
++ hep (just '-') :: HyPhEn (jumbled so as to be consonant-vowel-consonant)
++ cab (just '_') :: ???
++ lus (just '+') :: pLUS sign
++ tis (just '=') :: old english contraction for "it is", connoting equality
++ bar (just '|') :: pipe looks like a vertical BAR
++ bas (just '\\') :: BAckSlash
++ fas (just '/') :: ForwArd Slash
++ wut (just '?') :: lolcat misspelling of "what?", connoting interrogative
By the way, if we didn't already know it, we can start to get some hints about basic hoon syntax: ++
is used to define a parsing-unit, whatever that is; just
is somehow used to refer to a single character; '
is used to quote text; and ()
is used as in Lisp to apply a function to an argument.
After this section things get confusing, but if we skip down a bit there are some more good hints. Let's take a look at section 2eH, parsing (idioms)
. Again, I've reordered it and added mnemonics in the comments.
++ hig (shim 'A' 'Z') :: HIGh/upper case letters
++ low (shim 'a' 'z') :: LOWer case letters
++ nud (shim '0' '9') :: NUmerical Digits
It seems that shim
defines a range of ascii characters by taking its
start and end, inclusive.
++ alf ;~(pose low hig) :: AL(F)phabetical letters, any case
++ aln ;~(pose low hig nud) :: ALphaNumeric letters/numbers
++ alp ;~(pose low hig nud hep) :: ALphanumerics, possibly hyPhenated
Apparently we can use pose
to comPOSE multiple parsing-units in a
logical union. The ;~
is still a mystery.
++ dit (cook |=(a=@ (sub a '0')) (shim '0' '9')) :: Decimal digIT
++ but (cook |=(a=@ (sub a '0')) (shim '0' '1')) :: Binary (U) digiT
Here we encounter the cook
construct. This is some kind of mapping operation
which serves to convert the ascii character range into its semantic values. It
uses sub
to SUBtract the base ascii value of '0'
, which is 48, from each
ascii value in the range. In these lines, a
appears to be a bound variable in
something like a lambda construct (perhaps this is what |=
is for).
++ hit ;~ pose :: Hexadecimal digIT
dit
(cook |=(a=char (sub a 87)) (shim 'a' 'f'))
(cook |=(a=char (sub a 55)) (shim 'A' 'F'))
==
This is the first multi-line parsing-unit, but we've seen all its parts
before. This is the "tall form" of the ;~(pose x y z)
construct we met above;
==
is the terminator for the list.
Combining our previous knowledge, we can see how this builds the parsing-unit
for a hexadecimal digit. Like the decimal and binary digits, this converts an
ascii character to a semantic numerical value. It uses pose
to cover the three
possibilities of the character: either it is a decimal digit with its base-10
semantic value (covered by dit
), or a lower-case letter in the range a-f with
its ascii value minus 87 (which is the ascii value of 'a'
minus 10), or an
upper-case letter in the range A-F with its ascii value minus 55 (which is the
ascii value of 'A'
minus 10).
Note that the lambda construct in the cook
used here differs slightly from the
previous one: it has char
instead of @
. What could that mean?
Also, note the lovely way these parsing-units fit together. Each parsing-unit
neatly serves two purposes: defining exactly which characters it accepts, and
assigning each of those characters a semantic value. pose
manages to combine
both purposes of all its arguments to build a larger parsing-unit.
++ mes (cook |=([a=@ b=@] (add (mul 16 a) b)) ;~(plug hit hit)) :: ???
I have no clue what mnemonic inspired the name mes
. But if hit
is a single
hexadecimal digit, this must assemble two of them into a byte value using add
and mul
to do the arithmetic. Also, we see plug
, which seems to take two
parsing units and join them together in some way that's compatible with the |=
lambda construct (which we now know how to write with two arguments using []
).
The arguments are back to using @
instead of char
. Thinking carefully about
the semantics of these last three lambdas, the meaning starts to become
clear. We use char
to declare that we are accepting a CHARacter and computing
from its ascii value. We use @
to declare that we are accepting an ATom and
computing from its numerical value (recall that hit
has already translated
from ascii to a number). So these must be type declarations.
++ vit :: ???
;~ pose
(cook |=(a=@ (sub a 65)) (shim 'A' 'Z'))
(cook |=(a=@ (sub a 71)) (shim 'a' 'z'))
(cook |=(a=@ (add a 4)) (shim '0' '9'))
(cold 62 (just '-'))
(cold 63 (just '+'))
==
Okay, this is another pose
composition of multiple cases, in tall form. The
first three cover alphanumerics like before, but now they are cook
ed into
different values. A-Z are mapped to 0-25; a-z are 26-51; 0-9 are 52-61. With two
more characters -
and +
to serve as 62 and 63, it looks like we've got
ourselves a base64 digit (but this isn't any of the standard encodings). And
what's cold
? It appears to take a parsing-unit and simply change its output
value. Why "cold"? Just guessing, but, this operator is a bit like the K
combinator (K for the word
"Constant" in German), which "hOLDs a value for later use"--(but note that it
keeps its argument's "range of acceptable inputs" and only throws away its "map
to semantic value").
++ bet ;~(pose (cold 2 hep) (cold 3 lus)) :: ???
Based on what we know so far, this will accept a +
character and map it to 3,
or a -
character and map it to 2. No clue what "bet" stands for, or why these
values were chosen. Perhaps this will be used for parsing the sign of a
numerical literal? Indeed, 2 and 3 in the Nock virtual machines are the basic
left/right selection operators, and therefore can act like booleans.
++ gul ;~(pose (cold 2 gal) (cold 3 gar)) :: ???
Similarly, gul
will map a <
to a 2, and a >
to a 3. That would be useful
for parsing XML-like tags.
++ poy ;~(pfix bas ;~(pose bas soq mes)) :: ???
The inner pose
is going to take a \
, a '
, or a two-character hex
byte; this is preceded by a \
. This looks useful for parsing an escaped
entity inside a single-quoted string literal. That means pfix
must create
a PreFIX parse-unit from two parse-unit arguments.
++ qit ;~(pose (shim 32 38) (shim 40 91) (shim 93 126) (shim 128 255) poy)
Looking at an ascii chart, these ranges define the printable characters minus 39
('
) and 92 (\
). So these are the Quo(I)Table characters, and with the
addition of poy
this parsing-unit can represent anything found inside a
single-quoted string.
++ bin (bass 2 (most gon but)) :: BINary number
++ dem (bass 10 (most gon dit)) :: DEciMal number
++ hex (bass 16 (most gon hit)) :: HEXadecimal number
Okay, clearly these parsing-units must assemble binary, decimal, and hexadecimal
digits into numbers. (most gon but)
must mean "take the MOST characters you
can which match but
, and return a list of their parsed values". bass
must
take a radix r
and a list of numbers in 0..r-1
, then convert them from a
sequence of base-r digits into the total semantic value. Presumably bass
is a
pun on "BASS", which in music is pronounced like the word "base". (Why isn't
there one for base64?) The purpose of gon
is not yet clear.
++ qut (ifix [soq soq] (boss 256 (more gon qit))) :: QUoTed string
This probably works similarly to the above. I assume more
works like most
(what's the difference?) and eats all quotable characters. boss
must work like
bass
(what's the difference?) and converts the string of byte values into a
single atom (big-endian or little?) Recalling that soq
means '
, it's clear
that ifix
must mean InFIX: it takes a list of two parsing-units to serve as
the delimiters, then performs its third argument on the contents. Again, what's
that gon
for?
++ nix (boss 256 (star ;~(pose aln cab))) :: uNIX ?
This is another way to build an atom out of a string of characters, but it only
accepts alphanumerics (aln
) and underscores (cab
). Maybe this is for
restricting to strings that could be safe names on a unix filesystem. What's
star
? Could that be a reference to regular expressions, requiring zero or more
of its argument? Oh, that might explain the difference between more
/boss
and
most
/bass
: the former should allow for empty lists (presumably assigning the
value 0); the latter shouldn't. This means the empty string ''
would parse to
the atom 0. (I guess the 'o's in "more" and "boss" are supposed to remind us
that 0 is a possible value.)
There are only three more parsing-units in this section: sym
and ven
, which
look complicated, and gon
, which depends on gay
which is in the whitespace section.
Turning our attention to section 2eG, parsing (whitespace)
, we see some
now-familiar cosntructs:
++ dun (cold ~ ;~(plug hep hep)) :: ???
++ duq (cold ~ ;~(plug tis hep)) :: ???
++ duz (cold ~ ;~(plug tis tis)) :: ???
These parse the strings --
, =-
, and ==
, evaluating all of them to the same
value which is represented by ~
. We don't know what that value is, yet.
++ vul (cold ~ ;~(plug col col (star (shim 32 126)) (just ^-(@ 10))))
According to our theories, this should match two colons, followed by zero or
more printable characters, followed by... well, that's presumably ascii 10, the
newline character. This must be for parsing comments, which suggests the
mnemonic is VULgate. This also suggests the meaning of cold ~
as giving the
characters zero semantic value, or possibly meaning they must come at the end of
a line or expression: recall that we saw ==
before, on a line by itself, as
the tall-form list-terminator.
++ gap (cold ~ (plus ;~(pose vul (mask [^-(@ 10) ' ' ~]))))
The inner pose
will match a comment or... well, seemingly either a newline, or
a space, or maybe whatever ~
means. So mask
can take a list of ascii values
and put them together into a parsing-unit, sort of like shim
but not
necessarily continuous. It's probably named for bitMASK, a quick way to test
inclusion against a predetermined set.
If we're right that star
is named for the regular expression mark indicating
"zero or more of these", perhaps plus
is named for the corresponding mark
indicating "one or more of these". Whatever is found, the cold
zeros out its
semantic meaning, so gap
seems to represent the GAP between two expressions.
++ dog ;~(plug dot gay) :: ???
++ doh ;~(plug ;~(plug hep hep) gay) :: ???
++ gon ;~(pose ;~(plug bas gay fas) (easy ~)) :: ???
++ gay ;~(pose gap (easy ~)) :: ???
So dog
matches a .
followed by whatever gay
matches. doh
, similarly,
matches --
followed by gay
. But note that it isn't written ;~(plug hep hep gay)
, nor ;~(plug dun gay)
, both of which ought to match the same things. The
latter would throw away the meaning of the --
; the former would combine their
meaning with the meaning of gay
. Whatever doh
is used for it, it must
require the meaning of --
and the meaning of gay
separately.
gon
is leftover from the idiom section, but I tossed it in here because it
seems to fit. Its plug
is a \
, a gay
, and a /
. While scrolling around in
the file, I did notice this in section 2cE
:
^= sis
'dozmarbinwansamlitsighidfidlissogdirwacsabwissib\
/rigsoldopmodfoglidhopdardorlorhodfolrintogsilmir\
...
/nimlarfitwalrapsarnalmoslandondanladdovrivbacpol\
/laptalpitnambonrostonfodponsovnocsorlavmatmipfap'
So maybe gon
is used to break up a long string literal onto multiple lines,
and gay
is something like "a newline followed by an indent", which is
supported by the presence of gap
in its pose
. But (easy ~)
is still a
mystery.
The next section that looks simple enough to tackle is 2eL, formatting (path)
. From the existing docs we know a bit about what paths look like, so
maybe this won't be too bad.
++ sid (cook |=(a=@ (sub a '0')) (shim '0' '9'))
++ sib (cook |=(a=@ (sub a '0')) (shim '0' '1'))
Thes are identical to dit
and but
above. Why the duplication?
++ sed (cook |=(a=@ (sub a '0')) (shim '1' '9'))
++ seb (cold 1 (just '1'))
Decimal/Binary digits Excluding 0. Maybe for the leading character of a numerical literal?
++ sox (cook |=(a=@ (sub a 87)) (shim 'a' 'f'))
++ six ;~(pose sid sox)
++ sex ;~(pose sed sox)
This is just like hit
except it only accepts lower-case.
++ sov (cook |=(a=@ (sub a 87)) (shim 'a' 'v'))
++ siv ;~(pose sid sov)
++ sev ;~(pose sed sov)
Looks like a base-32 digit.
++ sow ;~ pose
(cook |=(a=@ (sub a 87)) (shim 'a' 'z'))
(cook |=(a=@ (sub a 29)) (shim 'A' 'Z'))
(cold 62 (just '-'))
(cold 63 (just '~'))
==
++ siw ;~(pose sid sow)
++ sew ;~(pose sed sow)
Apparently even the Martians aren't immune to the law that every software project must contain multiple incompatible base64 encoders.
By now the pattern for these names should be pretty obvious: first character s;
second character 'o' for extended digits only, 'i' for all digits, 'e' for
digits Excluding 0; last character d for Decimal, b for Binary, x for heX, v for
base32 (highest digit is v
), w for base64.
++ voy ;~(pfix bas ;~(pose bas soq bix))
voy
is very much like the escape-processor poy
above, except instead of
using mes
(two hex digits encoding a byte), this uses bix
:
++ bix (bass 16 (stun [2 2] six))
So presumably bix
uses six
like mes
uses hix
, i.e. it's a lowercase-only
two-character hex byte. Instead of manually assembling it with cook
, add
,
and mul
, this one makes use of bass
and the new stun
, which presumably
allows a specific number of repeats of its arguments like {2,2}
in a
regex. Here are a bunch more examples:
++ qib (bass 2 (stun [4 4] sib))
++ tid (bass 10 (stun [3 3] sid))
++ qix (bass 16 (stun [4 4] six))
++ piv (bass 32 (stun [5 5] siv))
++ piw (bass 64 (stun [5 5] siw))
++ til (boss 256 (stun [3 3] low))
++ qeb (bass 2 ;~(plug seb (stun [0 3] sib)))
++ ted (bass 10 ;~(plug sed (stun [0 2] sid)))
++ qex (bass 16 ;~(plug sex (stun [0 3] hit)))
++ pev (bass 32 ;~(plug sev (stun [0 4] siv)))
++ pew (bass 64 ;~(plug sew (stun [0 4] siw)))
Again, the last character tells us the base, and the middle character whether it may begin with a zero. The first character seems to say something about the length.
Down a little lower, something odd jumps out: a new definition of qut
. Here it
is side-by-side with the old one:
++ qut (ifix [soq soq] (boss 256 (more gon qit))) :: v1, above
++ qut %+ ifix [soq soq] :: v2
%+ boss 256
%- star ;~ pose
;~(pfix bas ;~(pose bas soq bix:ab))
;~(pose (shim 32 38) (shim 40 91) (shim 93 126))
==
There's a new digraph here, %+
-- how does it differ from the ;~
we've been
seeing? And there's also bix:ab
; how is that different from just bix
? v2
seems to differ from v1 in a couple ways: it doesn't allow ascii chars 128-255;
and by using bix
instead of the mes
that appears in the poy
used by qit
,
it allows only lower-case hex in the escape chars.
This section also includes a second sym
, which we should compare to the sym
we never looked at in the "idiom" section above. But it depends on these two, so
let's check them out first:
++ vym (bass 256 ;~(plug low (star ;~(pose low nud))))
++ vyn (bass 256 ;~(plug hep vym (easy ~)))
vym
is encoded like nix
above, but must match the regex
[a-z][a-z0-9]*
. vym
looks like a Negative version of vym
since it's
preceded by a hyphen, and is terminated by that strange (easy ~)
thing. These
are only used by the new sym
:
++ sym ::v1
%+ cook
|=(a=tape (rap 3 ^-((list ,@) a)))
;~(plug low (star ;~(pose nud low hep)))
++ sym (cook |=(a=(list ,@) (rap 3 a)) ;~(plug vym:ab (star vyn:ab))) ::v2
This parallelism illustrates the purpose of %+
: it creates a function-application-list like ()
, but in tall form (over multiple lines). We still don't understand :ab
or the (easy ~)
in the vyn
. But if we just ignore those, it would seem that both versions
of sym
accept the same things: a lowercase letter, followed by zero or more
lowercase, numerical, or hyphen characters. The two versions then cook
up
values in a slightly different way, though they look pretty similar. In v2, the
argument to the lambda construct has type (list ,@)
, which might mean a list
of atoms; in v1, the argument is tape
, but looks like it's later cast to a
(list ,@)
--this is reminiscent of java's varargs. In both cases we rap 3
the
list of arguments. I wonder what that does? It's used again here:
++ siq ;~ pose
(shim 'a' 'z')
(shim 'A' 'Z')
(shim '0' '9')
hep
(cold 32 dot)
;~(pfix sig ;~(pose sig dot bix))
==
++ tyq (cook |=(a=(list ,@) (rap 3 a)) (plus siq:ab))
siq
seems pretty straightforward. It accepts the chars [a-zA-Z0-9]
, plus
-
, and takes them at their ascii codepoints. It also accepts .
but changes
its code from 46 to 32 for some reason. Finally, it can use ~
to escape a .
,
another ~
, or a lowercase hex byte. tyq
takes a list of one or more siq
,
with that mysterious :ab
caveat, and applies the mysterious rap 3
operation. And here it is one more time:
++ urs %+ cook
|=(a=tape (rap 3 ^-((list ,@) a)))
(star ;~(pose nud low hig hep dot sig cab))
This time it's used on a sequence of zero or more of [-0-9a-zA-Z.~_]
.
The rest of the path functions are too complicated to investigate now, so let's leave them for later and move on.
But first, what about those duplications? How would the parser know which sym
is being referred to? Looking carefully, we can notice that they are at different indentation levels. The v1 sym
is at the first column, but the v2 ++ sym
is indented two spaces:
:: section 2eL, formatting (path)
++ ab
|%
++ bix (bass 16 (stun [2 2] six))
:: ... more lines at this indent ...
++ sym (cook |=(a=(list ,@) (rap 3 a)) ;~(plug vym:ab (star vyn:ab)))
:: ... more lines at this indent ...
--
++ ag
|%
++ ape |*(fel=_rule ;~(pose (cold 0 (just '0')) fel))
:: ... more lines at this indent ...
--
:: ... back to your regular programming ...
So somehow that |%
digraph is gathering together all of these parsing-units under the umbrella of ab
. Maybe that's why the :ab
suffix is needed at a few points in this section. After the end of the ++ ab |%
comes a similar ++ ag |%
, and we can see :ag
used similarly elsewhere in the file.
Note that |%
seems to open a tall-form list to be terminated by --
, whereas ;~
opened a tall-form list terminated by ==
. (When you read that sentence, did you say the digraph sounds "barcen", "hephep", "semsig", and "tistis" to yourself? If not, it's time to go hit those flashcards again.)
There's a lot more to learn from paths, but we've picked the low-hanging fruit. Time to move on.
We're now familiar with quite a few functions, but we know them by what they're supposed to do, not how they're implemented. Let's look at their definitions and see if we can learn more syntax.
++ shim
~/ %shim
|= zep=[p=@ q=@]
~/ %fun
|= tub=nail
^- (like char)
?@ q.tub
(fail tub)
?. ?&((gte i.q.tub p.zep) (lte i.q.tub q.zep))
(fail tub)
(next tub)
Recall that shim
takes two chars and creates an inclusive range. The first thing to notice is that ~/
. If you search the file, you'll notice it appears many times, and 99% of the time is followed by a %
and a tag of 3-4 lowercase letters. Usually, it's the first line under a function definition, with the tag matching the function name. Otherwise, the tag is very often fun
or else some part of the previous line. These look like debug messages. My guess is that they're used to create a human-readable execution trace. Assuming that's true, I'm going to remove them from code quotes from now on, for brevity.
The next thing in shim
is a |=
, which we know introduces a lambda expression. This one seems to take two arguments, named p
and q
, both of type atom. zep
seems like it might be an introduced name for the list of both arguments? The return value starts with |=
so it's another lambda, one argument (named tub
), type nail
. (If the idea of a function of two arguments returning a function of one argument confuses you, you might not have enough functional-programming background for this document.) So, what we've been thinking of as a "parsing-unit" is a "function of nail". What's the return type? Probably something to do with that ^- (like char)
line.
Our eye is next drawn to this line:
?. ?&((gte i.q.tub p.zep) (lte i.q.tub q.zep))
In a rare display of prolix, verbose, sesquipedalian loquacity, gte
and lte
have actually been labelled as "greater-equal" and "less-equal" respectively. This should be expected from what we know shim
must do. And what's being compared? i.q.tub
with p.zep
and q.zep
. We haven't seen i
declared anywhere before. But we know zep
to composed of a p
and a q
. So this must be backwards member addressing (just like :ab
and :ag
were used for backwards namespacing, above) . So we can deduce that whatever a nail
is, it must have a member q
which has a member i
which corresponds to an ascii code.
It also looks like ?&
is probably a logical AND; ?@
and ?.
are branching constructs; fail
is some kind of exception-raising or error-flagging code, and (next tub)
is what a parsing-unit returns on success. Let's check this understanding by looking at the other simple parsing-unit we know:
++ just
|= daf=char
|= tub=nail
^- (like char)
?@ q.tub
(fail tub)
?. =(daf i.q.tub)
(fail tub)
(next tub)
Seems legit. Specifically, ?@ x y
looks like "unless x is defined/present/valid, execute y" and ?. x y z
looks like "if x, then return z, else return y". Of course these are imperative-programming summaries, not functional ones. It's really true what the docs say: Hoon is a functional language that looks imperative.
Also, apparently =
can used as a prefix to test equality of two atoms.
Note that just
takes an argument of type char
whereas shim
declares its argument as @
. We know that both are used the same way, so these must be compatible types. just
also has a ^- (like char)
. Searching the file for other appearances of this string leads us to mask
:
++ mask
|= bud=(list char)
|= tub=nail
^- (like char)
?@ q.tub
(fail tub)
?. (lien bud |=(a=char =(i.q.tub a)))
(fail tub)
(next tub)
Recall that mask
takes a list of characters and checks that its argument, the nail
, matches one of them. Its code looks a lot like just
. By comparing them, we can basically see the meaning of lien
: it takes a list of items and a boolean function on the items, and returns true if the function is true on any of the items. If you know perl, this looks basically like the grep
function.
Let's check out what nail
actually looks like:
++ nail ,[p=hair q=tape]
++ hair ,[p=@ud q=@ud]
++ tape (list char)
++ char ,@tD
++ list |* a=_,*
$|(~ [i=a t=(list a)])
A nail
is a pair of a hair
(a pair of two @ud
s) and a tape
, which is a list
of char
, which is just a @tD
.
@tD
is not used anywhere else in hoon.hoon
, so we'll assume it's just a fundamental character type. Searching for @ud
, we find the best clue to its meaning down in the calendaring code:
++ moh `(list ,@ud)`[31 28 31 30 31 30 31 31 30 31 30 31 ~]
That looks like the number of days in each month of the year, written as list of
numeric literals, then cast (using backticks) to a list of @ud
. So @ud
is probably some
unsigned numeric type.
list
is our first recursive data type, and also our first generic one. It appears to use |*
to declare a single unrestricted parameter type a
, and use $|
to compose itself of two possible types: either the atom ~
, which seems to serve as list termination, or a cell containing i
for the head and t
for the tail. That must be the declaration of i
which we saw used in shim
and mask
above: i.q.tub
means "in tub
, which is of type nail
, access the member q
, which is of type tape
, aka (list char)
, and therein access the member i
, which is of type char
". (What's the ,
for in (list ,@ud)
? Some kind of type quoting.)
(Now that we understand ~
as a list terminator, its usage in several places above should make more sense.)
So what gets returned from a parsing-unit upon success? So far it always seems to be (next tub)
, q.v.:
++ next
|= tub=nail
^- (like char)
?@ q.tub
(fail tub)
=+ zac=(lust i.q.tub p.tub)
[zac [~ i.q.tub [zac t.q.tub]]]
::
++ lust |= [weq=char naz=hair]
^- hair
?:(=(10 weq) [+(p.naz) 1] [p.naz +(q.naz)])
As expected, next
takes a nail
as its argument (and again names it tub
). Its return type is (like char)
just like just
, shim
, and mask
. The new =+ x=y z
construct looks like "let x=y in z" since it's introducing a new variable name zac
. Its value is obtained by passing the first char on the tape and the hair into lust
.
What does lust
do with this char and this hair? It seems to be returning a new hair, based on its input hair, modified according to its input char. We haven't seen ?:
before, but by analogy to ?@
and ?.
, it's probably a branching construct. It looks like C's ternary expression foo ? bar : baz
; its probably the if-then-else
(as opposed to ?.
which is if-else-then). The condition is =(10 weq)
, i.e. "is this char a newline?" and if so, we're incrementing the hair's first value (with a prefix +
) and setting the second value to 1; otherwise, we're incrementing the second value. (Again, I'm abusing imperative terminology.) So it looks like a hair
must describe a position in the input by line number and column number; lust
moves to the next position.
Maybe it's called hair
because "hair" rhymes with "CHARacter"? Or because it's often part of word "HAIRline" or "crossHAIR"? And what about the name lust
? Perhaps the hoon author is attracted to luscious locks, and thus "lusts after hair". Then again, it's listed in hoon.hoon
right after this:
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: section 2eB, parsing (tracing) ::
::
++ last |= [zyc=hair naz=hair]
^- hair
?: =(p.zyc p.naz)
?:((gth q.zyc q.naz) zyc naz) :: gth == greater-than
?:((gth p.zyc p.naz) zyc naz)
That's essentially a lexicographic comparison between the two number-pairs, i.e., it returns the last of two hairs. (To be pendantic, in proper English we would say "latter", but that's not a four-letter word.) So maybe lust
was just the first four-letter word to come to mind after last
.
Anway, let's return to (next tub)
. We've just finished setting zac
to the next hair in the file, so that we can return this:
[zac [~ i.q.tub [zac t.q.tub]]]
The inner expression, [zac t.q.tub]
looks like an appropriate "next nail": zac
is the next hair after p.tub
and t.q.tub
is the "cdr", or "rest" of the tape, i.e. what remains after popping off the "car", or "head" char which is i.q.tub
. But why is zac
needed a second time, and what is that devilish ~
doing there?
To answer these questions, we need to know how the return value is going to be used. So far, the objects we're thinking of as "parsing-units" haven't really been used as inputs to much except cold
:
++ cold
|* [cus=* sef=_rule]
|= tub=nail
=+ vex=(sef tub)
?@ q.vex
vex
[p=p.vex q=[~ u=[p=cus q=q.u.q.vex]]]
We saw |*
above in the definition of list
; it seems to have something to do with parameterized types. According to our understanding of cold
(remember (cold 62 (just '-'))
?), cus
is the "constant semantic value" to be mapped onto the next character, and sef
is a parsing-unit to filter what the next character can be. So now we know what to call a parsing-unit: the official type is rule
(as in, a RULE on what characters are allowed to come next). Is that a new declaration digraph =_
? Presumably not because %tscb
doesn't appear in the file. Judging by other uses of _
, it seems to be a character you put before a type to get some other type.
The implementation of cold
matches our expectations of its behavior. First it passes the tub
to sef
, the original rule, and stores the result in vex
. If sef
failed, we pass the failure object up the stack. Otherwise, p=p.vex
keeps zac
, the next hair; q=q.u.q.vex
keeps the next nail [zac t.q.tub]
; p=cus
(instead of p=p.u.q.vex
) replaces the top char i.q.tub
with the custom value cus
. The mysterious ~
is untouched.
Let's try to understand this four-element structure better.
++ rule |=(tub=nail `edge`[p.tub ~ ~ tub]) ::
++ edge ,[p=hair q=(unit ,[p=* q=nail])] ::
Apparently a rule
is just a function from a nail
to an edge
. (So how is _rule
a type that can be attached to the input of cold
? Maybe this is a hint towards the "moronic type system" of hoon, wherein every type is actually just the range of a function.) The edge has four parts; the last one is a nail (whichever one was input to rule
) and the first one is a hair (the hair of the nail). The middle two elements are part of something called a unit
; they default to ~
and seem to have no discernable type information. We saw p=*
as the type declaration above for cold
. Maybe it means "any type"?
The definition of unit
looks very similar to list
, which I'll copy again:
++ unit |* a=_,* :: maybe
$|(~ [~ u=a]) ::
++ list |* a=_,*
$|(~ [i=a t=(list a)])
If you're familiar with monads, you might recognize unit
as "the maybe
monad", which is used to either wrap a value or indicate the absence of a value. The implementation again uses $|
to declare an 'atom-or-cell' type, where the only allowed atom is ~
, which indicates "null". The allowed cell is [~ u=a]
, where the first value in the cell is just a placeholder.
So then, looking back at that edge
returned at the end of cold
:
[p=p.vex q=[~ u=[p=cus q=q.u.q.vex]]]
It starts with the hair, then has the non-null, cell-form for the unit<[*,nail]>
, and the value of the unit is [cus, q.u.q.vex]
. So we still don't know why the hair of the nail is duplicated at the top level of the edge. Maybe just for convenience? Or maybe it will sometimes turn out to have a different value. Well, what would make sense to go there? If we think of EDGE as meaning a one-dimensional span, maybe the two hairs (which, remember, are line/column character positions) correspond to the start and end of an expression. Maybe they're always the same so far because we've only been looking at one-character expressions. Let's test this theory by investigating plug
, which we know can make a rule for a two-character expression.
++ plug
|* [vex=edge sab=_rule]
?@ q.vex
vex
=+ yit=(sab q.u.q.vex)
=+ yur=(last p.vex p.yit)
?@ q.yit
[p=yur q=q.yit]
[p=yur q=[~ u=[p=[p.u.q.vex p.u.q.yit] q=q.u.q.yit]]]
Hmmm... every time we've seen plug
it's been followed by a bunch of rules and preceded by the digraph ;~
to make a new rule, as in (cold ~ ;~(plug tis hep))
. But plug
doesn't take a list of rules, it takes an edge (the output of a rule) and a _rule
. there's something fishy going on here between that _, and using |*
instead of |=
, and the use of ;~
.
I have a hunch that if you know anything about monads, you might be way ahead of me right now. I don't know anything about monads. In fact, every time the word "monad" comes out of someone's mouth, I flinch involuntarily like a conditioned dog, because in my experience what generally comes out of that mouth next is going to be an unintelligible explosion of spittle. The one exception I've seen didn't explain why everyone else loves to mouthfoam about monads, but it did sketch some simple function compositions that seemed kinda handy and kinda remind me of what's going on here with plug
.
So without actually trying to understand how this links up to fancy FP monads, let's see if we can reason through what must be happening. In ;~(plug tis hep)
, the ;~
rune operates on a list of functions ([tis hep]
), given a function-combinator (plug
). In this case, each function is a rule, which is a function of type nail -> edge
, and each edge contains a nail. Let's assume this means that the functions listed inside a ;~
expression must all have the type x -> y
while the function-combinator has type y, [x -> y] -> y
, or in our case, edge, rule -> edge
. The meaning of the ;~
expression is then a function which:
- takes in an
x
; - passes it to the first function in the list to get a
y
; - passes that, and the next function in the list, into the function-combinator to get a new
y
; - keeps repeating the previous step until getting to the end of the list;
- returns the final
y
In our case, plug
will be passed the edge resulting from the tis
(which it names vex
), along with the hep
. First it checks if the tis
failed, and returns the failure if so. Then it runs hep
on the remaining nail, storing the resulting edge in yit
. It then compares the hairs of the two edges, putting the latter one in yur
(It seems that this will always be p.yit
; why is this comparison necessary? Can a rule somehow move the hair backwards?). It then checks whether the hep
failed, and returns the failure if so. (Shouldn't this be done before the previous step?) Finally, it composes and returns an edge, which we know has four parts:
- a hair: we use
yur
- a
~
: a placeholder to distinguish a valued-unit from the null-unit - a "semantic value": we use the cell of the semantic value returned by
tis
and the one returned byhep
- a nail: we use the one returned from the
hep
Let's check this understanding of ;~
by looking at the other times we've seen it used. The first was on ;~(pose low hig)
.
++ pose
|* [vex=edge sab=_rule]
?@ q.vex
=+ roq=(sab)
[p=(last p.vex p.roq) q=q.roq]
vex
In this example, vex
is the edge produced by running low
on the tape. If it was a success, we just return it. Otherwise, we set roq
to be (hig)
. But wait, where's the argument in this function call? There must be some kind of a "default subject" here. Clearly what's being inserted is the same nail
that went into low
to produce the edge vex
, but in the scope of this function we don't have access to it. Is this something like perl's default argument $_
? Anyway, however it works, the resulting edge
has the nail-unit from roq
, but the hair is again the latter of our two hairs. This seems fishy. What if you pose
d two rules with different lengths? If the first rule fails, shouldn't we always use the hair from the second rule, even if the first hair was "laster"? Well, perhaps there's a contract about what the hair must be in an edge that failed. So far we've seen shim
, mask
, and just
return (fail tub)
on error.
++ fail |=(tub=nail [p=p.tub q=~])
This is using the "null" value of the unit<[*,nail]>
. (Incidentally, we can now guess the exact meaning of ?@
: if must be "if-atom-then-else". When its argument is a unit
, it will be an atom only if it is the null value, i.e. representing failure.) So if we assume a failed rule never advances the hair, then the comparison in pose
isn't incorrect, but it is extraneous.
We also saw ;~(pfix bas ;~(pose bas soq bix))
:
++ pfix
|* [vex=edge sab=_rule]
?@ q.vex
vex
=+ yit=(sab q.u.q.vex)
[p=(last p.yit p.vex) q=q.yit]
++ sfix
|* [vex=edge sab=_rule]
?@ q.vex
vex
=+ yit=(sab q.u.q.vex)
[p=(last p.vex p.yit) q=?@(q.yit ~ [~ u=[p=p.u.q.vex q=q.u.q.yit]])]
++ ifix
|* [fel=[p=_rule q=_rule] hof=_rule]
;~(pfix p.fel ;~(sfix hof q.fel))
Seemingly pfix
works exactly like plug
, except instead of composing the semantic values into a cell, it returns the last value and drops the others on the floor. Similarly, sfix
returns the first value and drops the others on the floor. ifix
does just what you'd expect with them. Note that ifix
isn't used with ;~
; it does all the lifting inside itself. Using a sufficiently tricky accumulator, it should be theoretically possible to write a version of ifix
that would be called as ;~(ifix soq soq (boss 256 (more gon qit)))
instead of (ifix [soq soq] (boss 256 (more gon qit)))
.
Now's a great time to do some exercises. ++ cook
has nothing new in it; you should be able to write it based on what we already know. Try it out and check your answer. Did you miss anything? (Spoiler: I made one mistake, with the type declaration of the first parameter.)
Let's go back and take a look at (easy ~)
which we saw above.
++ easy
|* huf=*
|= tub=nail
^- (like _huf)
[p=p.tub q=[~ u=[p=huf q=tub]]]
We saw ^- (like char)
above in just, mask, and next. It seems to declare a return type of an edge whose "semantic value" is a char. In this case, easy
can put anything as the semantic value, and does not have any requirements or advance the tape. So it is a zero-length, zero-meaning parsing stub. This would seem to pose a parser problem, since if (easy ~)
is ever a valid rule for the top of the tape, it would always stay a valid rule, and the parser would be stuck in an infinite loop. Fortunately, so far we've only seen it in two ways: the last element of a plug
, where presumably it uses its ~
to terminate a list; and in a pose
construct composing other zero-length parses (gon
and gay
). As long as none of these zero-length rules floats to the top of the parse stack, we should be okay.
Let's take a closer look at like
:
++ like |* a=_,* :: generic edge
|= b=_`*`[(hair) ~] ::
:- p=(hair -.b) ::
^= q ::
?@ +.b ~ ::
:- ~ ::
u=[p=(a +>-.b) q=[p=(hair -.b) q=(tape +.b)]] ::
The first line, |* a=_,*
, we saw before in unit
and list
: it declares a single unrestricted type parameter. Then we enter a lambda which introduces b
. We know that a tec-quoted prefix is used to explicitly cast (remember ++moh
?), but why would you ever cast something to *
? The type of b
before this cast is [(hair) ~]
, which is just a template for the emtpy edge (the emtpy hair followed by the null-unit). (Why write [(hair) ~]
instead of (edge)
?) Remember, (like char)
is not just a type, it's a function which accepts all the things in that type: any edge with a char for a semantic value.
We haven't seen -.b
before, but it looks like it just refers to the head of a cell; this is borne out by typing =+(b=[41 42] -.b)
into the vere
REPL; it spits back 41
. Two lines later we see +.b
which looks like it means the tail of a cell, and indeed the REPL confirms for us that =(42 =+(b=[41 42] +.b))
.
Now what is :-
doing? We expect the output of (like char)
to be an edge<char>
, that is, something like [p=(hair) q=[~ u=[p=(char) q=(nail)]]]
. We see all those pieces in the implementation of like
, but they're spread out. :-
seems to be a way to assemble a cell over multiple lines, doing for []
constructns what %+
did for ()
. Playing in the REPL, it seems that :-(a b)
has the same meaning as [a b]
, but only inside an expression. You can't just type :-(1 2)
the same way you can type [1 2]
at the prompt, but you can type =(|=([p=@ q=^] p) |=(:-(p=@ q=^) p))
or =(=+(:-(p=1 q=2) [p q]) =+([p=1 q=2] [p q]))
.
Again, we haven't seen ^=
before, but ^=(q foo)
seems to be another form of [q=foo]
; that is, ^=
can assemble the "named variable/type declaration" which we've seen over and over using infix =
. So if we had enough columns we could write like
on one line like this:
|*(a=_,* |=(b=_`*`[(hair) ~] [p=(hair -.b) q=?@(+.b ~ [~ u=[p=(a +>-.b) q=[p=(hair -.b) q=(tape +.b)]]])]))
Glyph | Usage |
---|---|
soq...soq | single-quoted text |
pel...per | function application |
pat | atom-type. equals 0. may be odorous. |
tis | type declaration, variable assignment, equality test |
sel...ser | cell; lambda multivariate type declaration |
sig | placeholder; null unit; end of list. equals 0. |
bas...fas | mid-string linebreak escape |
bas | string escaping |
col | backwards namespacing |
com | type quoting??? |
cen | numeric literal prefix |
dot | backwards member addressing |
lus | increment |
tar | any type |
cab | type quoting??? |
tec | casting |
Digraph | Usage |
---|---|
colcol | begin a comment |
luslus | declare/define a funcion |
bartis | start a lambda gate |
tistis | terminate a tall list |
hephep | also terminate a tall list |
kethep | explicit cast |
cenlus | function application |
barcen | union types? |
sigfas | debug trace |
wutpat | if-atom-then-else |
wutdot | if-else-then |
wutpam | boolean and |
bartar | start a lambda gate with type polymorphism |
bucbar | atom-or-cell type |
tislus | declare variable pam add to scope |
wutcol | if-then-else |
semsig | monad-like function combination? |
hepdot | head of cell |
lusdot | tail of cell |
colhep | assemble cell |
kettis | assemble infix = (named variable / type declaration) |
lusgar | ??? |