Skip to content
PhaseMage edited this page Nov 3, 2013 · 9 revisions

About

These are the notes from abliss's attempt to decipher the Hoon language. It is current as of commit 2ea195, but the code samples below are not live (can github wiki do that?), so this document will quickly fall out of date. I still haven't figured it all out, so please contribute if you know more than I do or you have ideas for a better presentation.

I will try to avoid putting in too many spoilers, mostly by writing this as I go. But for maximum enjoyment of the puzzle, you should try to figure it out without any help.

Over in the sidebar I'll be collecting a "cheat sheet" of syntax rules, in the order I figure them out. Each symbol or digraph is linked to the place in this document where I explain what I think it does.

Strategy

Hoon has scarce documentation, so it would seem the best way to learn Hoon is to understand its compiler, hoon.hoon. By examining its form and its content, we will gradually discover Hoon's alien syntax.

Glyphs

The best place to start is probably section 2eF, parsing (ascii), which will teach you the three-letter-name for each ascii symbol. Like the docs say, you just need to memorize these. A set of flashcards would not be a bad idea. In case it helps, I've relisted them below in a different order, with some mnemonics added in the comments (which are clearly supposed to be set off by ::):

++  ace  (just ' ')         :: spACE

++  gal  (just '<')         :: GreAter Left
++  gar  (just '>')         :: GreAter Right
++  pel  (just '(')         :: ParEnthesis Left
++  per  (just ')')         :: ParEnthesis Right
++  kel  (just '{')         :: curly bracKEt Left
++  ker  (just '}')         :: curly bracKEt Right
++  sel  (just '[')         :: SquarE bracket Left
++  ser  (just ']')         :: SquarE bracket Right

++  doq  (just '"')         :: DOuble Quote
++  soq  (just '\'')        :: Single (O) Quote

++  col  (just ':')         :: COLon
++  sem  (just ';')         :: SEMicolon
++  com  (just ',')         :: COMma
++  dot  (just '.')         :: period looks like a DOT

++  tec  (just '`')         :: backT(E)iCk
++  sig  (just '~')         :: ?people use tilde in their SIGnature? or ?SquIGggle?
++  zap  (just '!')         :: onomatopoeia. often followed by bang character, or the word "bang", in comics.
++  pat  (just '@')         :: (P)AT sign
++  hax  (just '#')         :: ?HAsh tag (X)?
++  buc  (just '$')         :: BUCk is american slang for dollar
++  cen  (just '%')         :: perCENt sign
++  ket  (just '^')         :: (K)carET? or looks like a rotated version of what physicists call the bra-KET?
++  pam  (just '&')         :: AMPersand (jumbled so as to be consonant-vowel-consonant)
++  tar  (just '*')         :: sTAR

++  hep  (just '-')         :: HyPhEn (jumbled so as to be consonant-vowel-consonant)
++  cab  (just '_')         :: ???
++  lus  (just '+')         :: pLUS sign
++  tis  (just '=')         :: old english contraction for "it is", connoting equality
++  bar  (just '|')         :: pipe looks like a vertical BAR
++  bas  (just '\\')        :: BAckSlash
++  fas  (just '/')         :: ForwArd Slash
++  wut  (just '?')         :: lolcat misspelling of "what?", connoting interrogative

By the way, if we didn't already know it, we can start to get some hints about basic hoon syntax: ++ is used to define a parsing-unit, whatever that is; just is somehow used to refer to a single character; ' is used to quote text; and () is used as in Lisp to apply a function to an argument.

Character classes

After this section things get confusing, but if we skip down a bit there are some more good hints. Let's take a look at section 2eH, parsing (idioms). Again, I've reordered it and added mnemonics in the comments.

++  hig  (shim 'A' 'Z')                 :: HIGh/upper case letters
++  low  (shim 'a' 'z')                 :: LOWer case letters
++  nud  (shim '0' '9')                 :: NUmerical Digits

It seems that shim defines a range of ascii characters by taking its start and end, inclusive.

++  alf  ;~(pose low hig)               :: AL(F)phabetical letters, any case
++  aln  ;~(pose low hig nud)           :: ALphaNumeric letters/numbers
++  alp  ;~(pose low hig nud hep)       :: ALphanumerics, possibly hyPhenated

Apparently we can use pose to comPOSE multiple parsing-units in a logical union. The ;~ is still a mystery.

++  dit  (cook |=(a=@ (sub a '0')) (shim '0' '9'))  :: Decimal digIT
++  but  (cook |=(a=@ (sub a '0')) (shim '0' '1'))  :: Binary (U) digiT

Here we encounter the cook construct. This is some kind of mapping operation which serves to convert the ascii character range into its semantic values. It uses sub to SUBtract the base ascii value of '0', which is 48, from each ascii value in the range. In these lines, a appears to be a bound variable in something like a lambda construct (perhaps this is what |= is for).

++  hit  ;~  pose                                   :: Hexadecimal digIT
       dit
       (cook |=(a=char (sub a 87)) (shim 'a' 'f'))
       (cook |=(a=char (sub a 55)) (shim 'A' 'F'))
     ==

This is the first multi-line parsing-unit, but we've seen all its parts before. This is the "tall form" of the ;~(pose x y z) construct we met above; == is the terminator for the list.

Combining our previous knowledge, we can see how this builds the parsing-unit for a hexadecimal digit. Like the decimal and binary digits, this converts an ascii character to a semantic numerical value. It uses pose to cover the three possibilities of the character: either it is a decimal digit with its base-10 semantic value (covered by dit), or a lower-case letter in the range a-f with its ascii value minus 87 (which is the ascii value of 'a' minus 10), or an upper-case letter in the range A-F with its ascii value minus 55 (which is the ascii value of 'A' minus 10).

Note that the lambda construct in the cook used here differs slightly from the previous one: it has char instead of @. What could that mean?

Also, note the lovely way these parsing-units fit together. Each parsing-unit neatly serves two purposes: defining exactly which characters it accepts, and assigning each of those characters a semantic value. pose manages to combine both purposes of all its arguments to build a larger parsing-unit.

++  mes  (cook |=([a=@ b=@] (add (mul 16 a) b)) ;~(plug hit hit)) :: ???

I have no clue what mnemonic inspired the name mes. But if hit is a single hexadecimal digit, this must assemble two of them into a byte value using add and mul to do the arithmetic. Also, we see plug, which seems to take two parsing units and join them together in some way that's compatible with the |= lambda construct (which we now know how to write with two arguments using []).

The arguments are back to using @ instead of char. Thinking carefully about the semantics of these last three lambdas, the meaning starts to become clear. We use char to declare that we are accepting a CHARacter and computing from its ascii value. We use @ to declare that we are accepting an ATom and computing from its numerical value (recall that hit has already translated from ascii to a number). So these must be type declarations.

++  vit                                                           :: ???
  ;~  pose
    (cook |=(a=@ (sub a 65)) (shim 'A' 'Z'))
    (cook |=(a=@ (sub a 71)) (shim 'a' 'z'))
    (cook |=(a=@ (add a 4)) (shim '0' '9'))
    (cold 62 (just '-'))
    (cold 63 (just '+'))
  ==

Okay, this is another pose composition of multiple cases, in tall form. The first three cover alphanumerics like before, but now they are cooked into different values. A-Z are mapped to 0-25; a-z are 26-51; 0-9 are 52-61. With two more characters - and + to serve as 62 and 63, it looks like we've got ourselves a base64 digit (but this isn't any of the standard encodings). And what's cold? It appears to take a parsing-unit and simply change its output value. Why "cold"? Just guessing, but, this operator is a bit like the K combinator (K for the word "Constant" in German), which "hOLDs a value for later use"--(but note that it keeps its argument's "range of acceptable inputs" and only throws away its "map to semantic value").

++  bet  ;~(pose (cold 2 hep) (cold 3 lus))                       :: ???

Based on what we know so far, this will accept a + character and map it to 3, or a - character and map it to 2. No clue what "bet" stands for, or why these values were chosen. Perhaps this will be used for parsing the sign of a numerical literal? Indeed, 2 and 3 in the Nock virtual machines are the basic left/right selection operators, and therefore can act like booleans.

++  gul  ;~(pose (cold 2 gal) (cold 3 gar))                       :: ???

Similarly, gul will map a < to a 2, and a > to a 3. That would be useful for parsing XML-like tags.

++  poy  ;~(pfix bas ;~(pose bas soq mes))                        :: ???

The inner pose is going to take a \, a ', or a two-character hex byte; this is preceded by a \. This looks useful for parsing an escaped entity inside a single-quoted string literal. That means pfix must create a PreFIX parse-unit from two parse-unit arguments.

++  qit  ;~(pose (shim 32 38) (shim 40 91) (shim 93 126) (shim 128 255) poy)

Looking at an ascii chart, these ranges define the printable characters minus 39 (') and 92 (\). So these are the Quo(I)Table characters, and with the addition of poy this parsing-unit can represent anything found inside a single-quoted string.

++  bin  (bass 2 (most gon but))            :: BINary number
++  dem  (bass 10 (most gon dit))           :: DEciMal number
++  hex  (bass 16 (most gon hit))           :: HEXadecimal number

Okay, clearly these parsing-units must assemble binary, decimal, and hexadecimal digits into numbers. (most gon but) must mean "take the MOST characters you can which match but, and return a list of their parsed values". bass must take a radix r and a list of numbers in 0..r-1, then convert them from a sequence of base-r digits into the total semantic value. Presumably bass is a pun on "BASS", which in music is pronounced like the word "base". (Why isn't there one for base64?) The purpose of gon is not yet clear.

++  qut  (ifix [soq soq] (boss 256 (more gon qit)))         :: QUoTed string

This probably works similarly to the above. I assume more works like most (what's the difference?) and eats all quotable characters. boss must work like bass (what's the difference?) and converts the string of byte values into a single atom (big-endian or little?) Recalling that soq means ', it's clear that ifix must mean InFIX: it takes a list of two parsing-units to serve as the delimiters, then performs its third argument on the contents. Again, what's that gon for?

++  nix  (boss 256 (star ;~(pose aln cab)))                 :: uNIX ?

This is another way to build an atom out of a string of characters, but it only accepts alphanumerics (aln) and underscores (cab). Maybe this is for restricting to strings that could be safe names on a unix filesystem. What's star? Could that be a reference to regular expressions, requiring zero or more of its argument? Oh, that might explain the difference between more/boss and most/bass: the former should allow for empty lists (presumably assigning the value 0); the latter shouldn't. This means the empty string '' would parse to the atom 0. (I guess the 'o's in "more" and "boss" are supposed to remind us that 0 is a possible value.)

There are only three more parsing-units in this section: sym and ven, which look complicated, and gon, which depends on gay which is in the whitespace section.

Whitespace

Turning our attention to section 2eG, parsing (whitespace), we see some now-familiar cosntructs:

++  dun  (cold ~ ;~(plug hep hep))          :: ???
++  duq  (cold ~ ;~(plug tis hep))          :: ???
++  duz  (cold ~ ;~(plug tis tis))          :: ???

These parse the strings --, =-, and ==, evaluating all of them to the same value which is represented by ~. We don't know what that value is, yet.

++  vul  (cold ~ ;~(plug col col (star (shim 32 126)) (just ^-(@ 10))))

According to our theories, this should match two colons, followed by zero or more printable characters, followed by... well, that's presumably ascii 10, the newline character. This must be for parsing comments, which suggests the mnemonic is VULgate. This also suggests the meaning of cold ~ as giving the characters zero semantic value, or possibly meaning they must come at the end of a line or expression: recall that we saw == before, on a line by itself, as the tall-form list-terminator.

++  gap  (cold ~ (plus ;~(pose vul (mask [^-(@ 10) ' ' ~]))))

The inner pose will match a comment or... well, seemingly either a newline, or a space, or maybe whatever ~ means. So mask can take a list of ascii values and put them together into a parsing-unit, sort of like shim but not necessarily continuous. It's probably named for bitMASK, a quick way to test inclusion against a predetermined set.

If we're right that star is named for the regular expression mark indicating "zero or more of these", perhaps plus is named for the corresponding mark indicating "one or more of these". Whatever is found, the cold zeros out its semantic meaning, so gap seems to represent the GAP between two expressions.

++  dog  ;~(plug dot gay)                       :: ???
++  doh  ;~(plug ;~(plug hep hep) gay)          :: ???
++  gon  ;~(pose ;~(plug bas gay fas) (easy ~)) :: ???
++  gay  ;~(pose gap (easy ~))                  :: ???

So dog matches a . followed by whatever gay matches. doh, similarly, matches -- followed by gay. But note that it isn't written ;~(plug hep hep gay), nor ;~(plug dun gay), both of which ought to match the same things. The latter would throw away the meaning of the --; the former would combine their meaning with the meaning of gay. Whatever doh is used for it, it must require the meaning of -- and the meaning of gay separately.

gon is leftover from the idiom section, but I tossed it in here because it seems to fit. Its plug is a \, a gay, and a /. While scrolling around in the file, I did notice this in section 2cE:

^=  sis
'dozmarbinwansamlitsighidfidlissogdirwacsabwissib\
/rigsoldopmodfoglidhopdardorlorhodfolrintogsilmir\
...
/nimlarfitwalrapsarnalmoslandondanladdovrivbacpol\
/laptalpitnambonrostonfodponsovnocsorlavmatmipfap'

So maybe gon is used to break up a long string literal onto multiple lines, and gay is something like "a newline followed by an indent", which is supported by the presence of gap in its pose. But (easy ~) is still a mystery.

Paths

The next section that looks simple enough to tackle is 2eL, formatting (path). From the existing docs we know a bit about what paths look like, so maybe this won't be too bad.

++  sid  (cook |=(a=@ (sub a '0')) (shim '0' '9'))
++  sib  (cook |=(a=@ (sub a '0')) (shim '0' '1'))

Thes are identical to dit and but above. Why the duplication?

++  sed  (cook |=(a=@ (sub a '0')) (shim '1' '9'))
++  seb  (cold 1 (just '1'))

Decimal/Binary digits Excluding 0. Maybe for the leading character of a numerical literal?

++  sox  (cook |=(a=@ (sub a 87)) (shim 'a' 'f'))
++  six  ;~(pose sid sox)
++  sex  ;~(pose sed sox)

This is just like hit except it only accepts lower-case.

++  sov  (cook |=(a=@ (sub a 87)) (shim 'a' 'v'))
++  siv  ;~(pose sid sov)
++  sev  ;~(pose sed sov)

Looks like a base-32 digit.

++  sow  ;~  pose
         (cook |=(a=@ (sub a 87)) (shim 'a' 'z'))
         (cook |=(a=@ (sub a 29)) (shim 'A' 'Z'))
         (cold 62 (just '-'))
         (cold 63 (just '~'))
       ==
++  siw  ;~(pose sid sow)
++  sew  ;~(pose sed sow)

Apparently even the Martians aren't immune to the law that every software project must contain multiple incompatible base64 encoders.

By now the pattern for these names should be pretty obvious: first character s; second character 'o' for extended digits only, 'i' for all digits, 'e' for digits Excluding 0; last character d for Decimal, b for Binary, x for heX, v for base32 (highest digit is v), w for base64.

++  voy  ;~(pfix bas ;~(pose bas soq bix))

voy is very much like the escape-processor poy above, except instead of using mes (two hex digits encoding a byte), this uses bix:

++  bix  (bass 16 (stun [2 2] six))

So presumably bix uses six like mes uses hix, i.e. it's a lowercase-only two-character hex byte. Instead of manually assembling it with cook, add, and mul, this one makes use of bass and the new stun, which presumably allows a specific number of repeats of its arguments like {2,2} in a regex. Here are a bunch more examples:

++  qib  (bass 2 (stun [4 4] sib))
++  tid  (bass 10 (stun [3 3] sid))
++  qix  (bass 16 (stun [4 4] six))
++  piv  (bass 32 (stun [5 5] siv))
++  piw  (bass 64 (stun [5 5] siw))
++  til  (boss 256 (stun [3 3] low))

++  qeb  (bass 2 ;~(plug seb (stun [0 3] sib)))
++  ted  (bass 10 ;~(plug sed (stun [0 2] sid)))
++  qex  (bass 16 ;~(plug sex (stun [0 3] hit)))
++  pev  (bass 32 ;~(plug sev (stun [0 4] siv)))
++  pew  (bass 64 ;~(plug sew (stun [0 4] siw)))

Again, the last character tells us the base, and the middle character whether it may begin with a zero. The first character seems to say something about the length.

Down a little lower, something odd jumps out: a new definition of qut. Here it is side-by-side with the old one:

++  qut  (ifix [soq soq] (boss 256 (more gon qit)))         :: v1, above
++  qut  %+  ifix  [soq soq]                                :: v2
         %+  boss  256
         %-  star  ;~  pose
                     ;~(pfix bas ;~(pose bas soq bix:ab))
                     ;~(pose (shim 32 38) (shim 40 91) (shim 93 126))
                   ==

There's a new digraph here, %+ -- how does it differ from the ;~ we've been seeing? And there's also bix:ab; how is that different from just bix? v2 seems to differ from v1 in a couple ways: it doesn't allow ascii chars 128-255; and by using bix instead of the mes that appears in the poy used by qit, it allows only lower-case hex in the escape chars.

This section also includes a second sym, which we should compare to the sym we never looked at in the "idiom" section above. But it depends on these two, so let's check them out first:

++  vym  (bass 256 ;~(plug low (star ;~(pose low nud))))
++  vyn  (bass 256 ;~(plug hep vym (easy ~)))

vym is encoded like nix above, but must match the regex [a-z][a-z0-9]*. vym looks like a Negative version of vym since it's preceded by a hyphen, and is terminated by that strange (easy ~) thing. These are only used by the new sym:

++  sym                                                                 ::v1
  %+  cook
    |=(a=tape (rap 3 ^-((list ,@) a)))
  ;~(plug low (star ;~(pose nud low hep)))
++  sym  (cook |=(a=(list ,@) (rap 3 a)) ;~(plug vym:ab (star vyn:ab))) ::v2

This parallelism illustrates the purpose of %+: it creates a function-application-list like (), but in tall form (over multiple lines). We still don't understand :ab or the (easy ~) in the vyn. But if we just ignore those, it would seem that both versions of sym accept the same things: a lowercase letter, followed by zero or more lowercase, numerical, or hyphen characters. The two versions then cook up values in a slightly different way, though they look pretty similar. In v2, the argument to the lambda construct has type (list ,@), which might mean a list of atoms; in v1, the argument is tape, but looks like it's later cast to a (list ,@)--this is reminiscent of java's varargs. In both cases we rap 3 the list of arguments. I wonder what that does? It's used again here:

++  siq  ;~  pose
           (shim 'a' 'z')
           (shim 'A' 'Z')
           (shim '0' '9')
           hep
           (cold 32 dot)
           ;~(pfix sig ;~(pose sig dot bix))
         ==
++  tyq  (cook |=(a=(list ,@) (rap 3 a)) (plus siq:ab))

siq seems pretty straightforward. It accepts the chars [a-zA-Z0-9], plus -, and takes them at their ascii codepoints. It also accepts . but changes its code from 46 to 32 for some reason. Finally, it can use ~ to escape a ., another ~, or a lowercase hex byte. tyq takes a list of one or more siq, with that mysterious :ab caveat, and applies the mysterious rap 3 operation. And here it is one more time:

++  urs  %+  cook
           |=(a=tape (rap 3 ^-((list ,@) a)))
         (star ;~(pose nud low hig hep dot sig cab)) 

This time it's used on a sequence of zero or more of [-0-9a-zA-Z.~_].

The rest of the path functions are too complicated to investigate now, so let's leave them for later and move on.

But first, what about those duplications? How would the parser know which sym is being referred to? Looking carefully, we can notice that they are at different indentation levels. The v1 sym is at the first column, but the v2 ++ sym is indented two spaces:

::                section 2eL, formatting (path)       
++  ab 
  |%
  ++  bix  (bass 16 (stun [2 2] six))
  :: ... more lines at this indent ...
  ++  sym  (cook |=(a=(list ,@) (rap 3 a)) ;~(plug vym:ab (star vyn:ab)))
  :: ... more lines at this indent ...
  --
++  ag
  |%
  ++  ape  |*(fel=_rule ;~(pose (cold 0 (just '0')) fel))
  :: ... more lines at this indent ...
  --
:: ... back to your regular programming ...

So somehow that |% digraph is gathering together all of these parsing-units under the umbrella of ab. Maybe that's why the :ab suffix is needed at a few points in this section. After the end of the ++ ab |% comes a similar ++ ag |%, and we can see :ag used similarly elsewhere in the file.

Note that |% seems to open a tall-form list to be terminated by --, whereas ;~ opened a tall-form list terminated by ==. (When you read that sentence, did you say the digraph sounds "barcen", "hephep", "semsig", and "tistis" to yourself? If not, it's time to go hit those flashcards again.)

There's a lot more to learn from paths, but we've picked the low-hanging fruit. Time to move on.

The form of a rule

We're now familiar with quite a few functions, but we know them by what they're supposed to do, not how they're implemented. Let's look at their definitions and see if we can learn more syntax.

++  shim
  ~/  %shim
  |=  zep=[p=@ q=@]
  ~/  %fun
  |=  tub=nail
  ^-  (like char)
  ?@  q.tub
    (fail tub) 
  ?.  ?&((gte i.q.tub p.zep) (lte i.q.tub q.zep))
    (fail tub)
  (next tub)

Recall that shim takes two chars and creates an inclusive range. The first thing to notice is that ~/. If you search the file, you'll notice it appears many times, and 99% of the time is followed by a % and a tag of 3-4 lowercase letters. Usually, it's the first line under a function definition, with the tag matching the function name. Otherwise, the tag is very often fun or else some part of the previous line. These look like debug messages. My guess is that they're used to create a human-readable execution trace. Assuming that's true, I'm going to remove them from code quotes from now on, for brevity.

The next thing in shim is a |=, which we know introduces a lambda expression. This one seems to take two arguments, named p and q, both of type atom. zep seems like it might be an introduced name for the list of both arguments? The return value starts with |= so it's another lambda, one argument (named tub), type nail. (If the idea of a function of two arguments returning a function of one argument confuses you, you might not have enough functional-programming background for this document.) So, what we've been thinking of as a "parsing-unit" is a "function of nail". What's the return type? Probably something to do with that ^- (like char) line.

Our eye is next drawn to this line:

  ?.  ?&((gte i.q.tub p.zep) (lte i.q.tub q.zep))

In a rare display of prolix, verbose, sesquipedalian loquacity, gte and lte have actually been labelled as "greater-equal" and "less-equal" respectively. This should be expected from what we know shim must do. And what's being compared? i.q.tub with p.zep and q.zep. We haven't seen i declared anywhere before. But we know zep to composed of a p and a q. So this must be backwards member addressing (just like :ab and :ag were used for backwards namespacing, above) . So we can deduce that whatever a nail is, it must have a member q which has a member i which corresponds to an ascii code.

It also looks like ?& is probably a logical AND; ?@ and ?. are branching constructs; fail is some kind of exception-raising or error-flagging code, and (next tub) is what a parsing-unit returns on success. Let's check this understanding by looking at the other simple parsing-unit we know:

++  just
  |=  daf=char
  |=  tub=nail
  ^-  (like char)
  ?@  q.tub
    (fail tub) 
  ?.  =(daf i.q.tub)
    (fail tub)
  (next tub)

Seems legit. Specifically, ?@ x y looks like "unless x is defined/present/valid, execute y" and ?. x y z looks like "if x, then return z, else return y". Of course these are imperative-programming summaries, not functional ones. It's really true what the docs say: Hoon is a functional language that looks imperative.

Also, apparently = can used as a prefix to test equality of two atoms.

Note that just takes an argument of type char whereas shim declares its argument as @. We know that both are used the same way, so these must be compatible types. just also has a ^- (like char). Searching the file for other appearances of this string leads us to mask:

++  mask
  |=  bud=(list char)
  |=  tub=nail
  ^-  (like char)
  ?@  q.tub
    (fail tub) 
  ?.  (lien bud |=(a=char =(i.q.tub a)))
    (fail tub)
  (next tub)

Recall that mask takes a list of characters and checks that its argument, the nail, matches one of them. Its code looks a lot like just. By comparing them, we can basically see the meaning of lien: it takes a list of items and a boolean function on the items, and returns true if the function is true on any of the items. If you know perl, this looks basically like the grep function.

Let's check out what nail actually looks like:

++  nail  ,[p=hair q=tape]
++  hair  ,[p=@ud q=@ud]
++  tape  (list char)
++  char  ,@tD
++  list  |*  a=_,*
          $|(~ [i=a t=(list a)])

A nail is a pair of a hair (a pair of two @uds) and a tape, which is a list of char, which is just a @tD.

@tD is not used anywhere else in hoon.hoon, so we'll assume it's just a fundamental character type. Searching for @ud, we find the best clue to its meaning down in the calendaring code:

++  moh  `(list ,@ud)`[31 28 31 30 31 30 31 31 30 31 30 31 ~]

That looks like the number of days in each month of the year, written as list of numeric literals, then cast (using backticks) to a list of @ud. So @ud is probably some unsigned numeric type.

list is our first recursive data type, and also our first generic one. It appears to use |* to declare a single unrestricted parameter type a, and use $| to compose itself of two possible types: either the atom ~, which seems to serve as list termination, or a cell containing i for the head and t for the tail. That must be the declaration of i which we saw used in shim and mask above: i.q.tub means "in tub, which is of type nail, access the member q, which is of type tape, aka (list char), and therein access the member i, which is of type char". (What's the , for in (list ,@ud)? Some kind of type quoting.)

(Now that we understand ~ as a list terminator, its usage in several places above should make more sense.)

So what gets returned from a parsing-unit upon success? So far it always seems to be (next tub), q.v.:

++  next
  |=  tub=nail
  ^-  (like char)
  ?@  q.tub
    (fail tub) 
  =+  zac=(lust i.q.tub p.tub) 
  [zac [~ i.q.tub [zac t.q.tub]]]
::
++  lust  |=  [weq=char naz=hair]
          ^-  hair
          ?:(=(10 weq) [+(p.naz) 1] [p.naz +(q.naz)])

As expected, next takes a nail as its argument (and again names it tub). Its return type is (like char) just like just, shim, and mask. The new =+ x=y z construct looks like "let x=y in z" since it's introducing a new variable name zac. Its value is obtained by passing the first char on the tape and the hair into lust.

What does lust do with this char and this hair? It seems to be returning a new hair, based on its input hair, modified according to its input char. We haven't seen ?: before, but by analogy to ?@ and ?., it's probably a branching construct. It looks like C's ternary expression foo ? bar : baz; its probably the if-then-else (as opposed to ?. which is if-else-then). The condition is =(10 weq), i.e. "is this char a newline?" and if so, we're incrementing the hair's first value (with a prefix +) and setting the second value to 1; otherwise, we're incrementing the second value. (Again, I'm abusing imperative terminology.) So it looks like a hair must describe a position in the input by line number and column number; lust moves to the next position.

Maybe it's called hair because "hair" rhymes with "CHARacter"? Or because it's often part of word "HAIRline" or "crossHAIR"? And what about the name lust? Perhaps the hoon author is attracted to luscious locks, and thus "lusts after hair". Then again, it's listed in hoon.hoon right after this:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::                section 2eB, parsing (tracing)        ::
::
++  last  |=  [zyc=hair naz=hair]
          ^-  hair
          ?:  =(p.zyc p.naz)
            ?:((gth q.zyc q.naz) zyc naz)    :: gth == greater-than
          ?:((gth p.zyc p.naz) zyc naz)

That's essentially a lexicographic comparison between the two number-pairs, i.e., it returns the last of two hairs. (To be pendantic, in proper English we would say "latter", but that's not a four-letter word.) So maybe lust was just the first four-letter word to come to mind after last.

Anway, let's return to (next tub). We've just finished setting zac to the next hair in the file, so that we can return this:

[zac [~ i.q.tub [zac t.q.tub]]]

The inner expression, [zac t.q.tub] looks like an appropriate "next nail": zac is the next hair after p.tub and t.q.tub is the "cdr", or "rest" of the tape, i.e. what remains after popping off the "car", or "head" char which is i.q.tub. But why is zac needed a second time, and what is that devilish ~ doing there?

To answer these questions, we need to know how the return value is going to be used. So far, the objects we're thinking of as "parsing-units" haven't really been used as inputs to much except cold:

++  cold
  |*  [cus=* sef=_rule]
  |=  tub=nail
  =+  vex=(sef tub)
  ?@  q.vex
    vex
  [p=p.vex q=[~ u=[p=cus q=q.u.q.vex]]]

We saw |* above in the definition of list; it seems to have something to do with parameterized types. According to our understanding of cold (remember (cold 62 (just '-'))?), cus is the "constant semantic value" to be mapped onto the next character, and sef is a parsing-unit to filter what the next character can be. So now we know what to call a parsing-unit: the official type is rule (as in, a RULE on what characters are allowed to come next). Is that a new declaration digraph =_? Presumably not because %tscb doesn't appear in the file. Judging by other uses of _, it seems to be a character you put before a type to get some other type.

The implementation of cold matches our expectations of its behavior. First it passes the tub to sef, the original rule, and stores the result in vex. If sef failed, we pass the failure object up the stack. Otherwise, p=p.vex keeps zac, the next hair; q=q.u.q.vex keeps the next nail [zac t.q.tub]; p=cus (instead of p=p.u.q.vex) replaces the top char i.q.tub with the custom value cus. The mysterious ~ is untouched.

Let's try to understand this four-element structure better.

++  rule  |=(tub=nail `edge`[p.tub ~ ~ tub])            ::
++  edge  ,[p=hair q=(unit ,[p=* q=nail])]              ::

Apparently a rule is just a function from a nail to an edge. (So how is _rule a type that can be attached to the input of cold? Maybe this is a hint towards the "moronic type system" of hoon, wherein every type is actually just the range of a function.) The edge has four parts; the last one is a nail (whichever one was input to rule) and the first one is a hair (the hair of the nail). The middle two elements are part of something called a unit; they default to ~ and seem to have no discernable type information. We saw p=* as the type declaration above for cold. Maybe it means "any type"?

Grokking semsig

The definition of unit looks very similar to list, which I'll copy again:

++  unit  |*  a=_,*                                     ::  maybe
          $|(~ [~ u=a])                                 ::
++  list  |*  a=_,*
          $|(~ [i=a t=(list a)])

If you're familiar with monads, you might recognize unit as "the maybe monad", which is used to either wrap a value or indicate the absence of a value. The implementation again uses $| to declare an 'atom-or-cell' type, where the only allowed atom is ~, which indicates "null". The allowed cell is [~ u=a], where the first value in the cell is just a placeholder.

So then, looking back at that edge returned at the end of cold:

  [p=p.vex q=[~ u=[p=cus q=q.u.q.vex]]]

It starts with the hair, then has the non-null, cell-form for the unit<[*,nail]>, and the value of the unit is [cus, q.u.q.vex]. So we still don't know why the hair of the nail is duplicated at the top level of the edge. Maybe just for convenience? Or maybe it will sometimes turn out to have a different value. Well, what would make sense to go there? If we think of EDGE as meaning a one-dimensional span, maybe the two hairs (which, remember, are line/column character positions) correspond to the start and end of an expression. Maybe they're always the same so far because we've only been looking at one-character expressions. Let's test this theory by investigating plug, which we know can make a rule for a two-character expression.

++  plug
  |*  [vex=edge sab=_rule]
  ?@  q.vex
    vex
  =+  yit=(sab q.u.q.vex)
  =+  yur=(last p.vex p.yit)
  ?@  q.yit
    [p=yur q=q.yit]
  [p=yur q=[~ u=[p=[p.u.q.vex p.u.q.yit] q=q.u.q.yit]]]

Hmmm... every time we've seen plug it's been followed by a bunch of rules and preceded by the digraph ;~ to make a new rule, as in (cold ~ ;~(plug tis hep)). But plug doesn't take a list of rules, it takes an edge (the output of a rule) and a _rule. there's something fishy going on here between that _, and using |* instead of |=, and the use of ;~.

I have a hunch that if you know anything about monads, you might be way ahead of me right now. I don't know anything about monads. In fact, every time the word "monad" comes out of someone's mouth, I flinch involuntarily like a conditioned dog, because in my experience what generally comes out of that mouth next is going to be an unintelligible explosion of spittle. The one exception I've seen didn't explain why everyone else loves to mouthfoam about monads, but it did sketch some simple function compositions that seemed kinda handy and kinda remind me of what's going on here with plug.

So without actually trying to understand how this links up to fancy FP monads, let's see if we can reason through what must be happening. In ;~(plug tis hep), the ;~ rune operates on a list of functions ([tis hep]), given a function-combinator (plug). In this case, each function is a rule, which is a function of type nail -> edge, and each edge contains a nail. Let's assume this means that the functions listed inside a ;~ expression must all have the type x -> y while the function-combinator has type y, [x -> y] -> y, or in our case, edge, rule -> edge. The meaning of the ;~ expression is then a function which:

  1. takes in an x;
  2. passes it to the first function in the list to get a y;
  3. passes that, and the next function in the list, into the function-combinator to get a new y;
  4. keeps repeating the previous step until getting to the end of the list;
  5. returns the final y

In our case, plug will be passed the edge resulting from the tis (which it names vex), along with the hep. First it checks if the tis failed, and returns the failure if so. Then it runs hep on the remaining nail, storing the resulting edge in yit. It then compares the hairs of the two edges, putting the latter one in yur (It seems that this will always be p.yit; why is this comparison necessary? Can a rule somehow move the hair backwards?). It then checks whether the hep failed, and returns the failure if so. (Shouldn't this be done before the previous step?) Finally, it composes and returns an edge, which we know has four parts:

  1. a hair: we use yur
  2. a ~: a placeholder to distinguish a valued-unit from the null-unit
  3. a "semantic value": we use the cell of the semantic value returned by tis and the one returned by hep
  4. a nail: we use the one returned from the hep

Let's check this understanding of ;~ by looking at the other times we've seen it used. The first was on ;~(pose low hig).

++  pose
  |*  [vex=edge sab=_rule]
  ?@  q.vex
    =+  roq=(sab)
    [p=(last p.vex p.roq) q=q.roq]
  vex

In this example, vex is the edge produced by running low on the tape. If it was a success, we just return it. Otherwise, we set roq to be (hig). But wait, where's the argument in this function call? There must be some kind of a "default subject" here. Clearly what's being inserted is the same nail that went into low to produce the edge vex, but in the scope of this function we don't have access to it. Is this something like perl's default argument $_? Anyway, however it works, the resulting edge has the nail-unit from roq, but the hair is again the latter of our two hairs. This seems fishy. What if you posed two rules with different lengths? If the first rule fails, shouldn't we always use the hair from the second rule, even if the first hair was "laster"? Well, perhaps there's a contract about what the hair must be in an edge that failed. So far we've seen shim, mask, and just return (fail tub) on error.

++  fail  |=(tub=nail [p=p.tub q=~])

This is using the "null" value of the unit<[*,nail]>. (Incidentally, we can now guess the exact meaning of ?@: if must be "if-atom-then-else". When its argument is a unit, it will be an atom only if it is the null value, i.e. representing failure.) So if we assume a failed rule never advances the hair, then the comparison in pose isn't incorrect, but it is extraneous.

We also saw ;~(pfix bas ;~(pose bas soq bix)):

++  pfix
  |*  [vex=edge sab=_rule]
  ?@  q.vex
    vex
  =+  yit=(sab q.u.q.vex)
  [p=(last p.yit p.vex) q=q.yit]
++  sfix
  |*  [vex=edge sab=_rule]
  ?@  q.vex
    vex
  =+  yit=(sab q.u.q.vex)
  [p=(last p.vex p.yit) q=?@(q.yit ~ [~ u=[p=p.u.q.vex q=q.u.q.yit]])]
++  ifix
  |*  [fel=[p=_rule q=_rule] hof=_rule]
  ;~(pfix p.fel ;~(sfix hof q.fel))

Seemingly pfix works exactly like plug, except instead of composing the semantic values into a cell, it returns the last value and drops the others on the floor. Similarly, sfix returns the first value and drops the others on the floor. ifix does just what you'd expect with them. Note that ifix isn't used with ;~; it does all the lifting inside itself. Using a sufficiently tricky accumulator, it should be theoretically possible to write a version of ifix that would be called as ;~(ifix soq soq (boss 256 (more gon qit))) instead of (ifix [soq soq] (boss 256 (more gon qit))).

Now's a great time to do some exercises. ++ cook has nothing new in it; you should be able to write it based on what we already know. Try it out and check your answer. Did you miss anything? (Spoiler: I made one mistake, with the type declaration of the first parameter.)

Easy street

Let's go back and take a look at (easy ~) which we saw above.

++  easy
  |*  huf=*
  |=  tub=nail
  ^-  (like _huf)
  [p=p.tub q=[~ u=[p=huf q=tub]]]

We saw ^- (like char) above in just, mask, and next. It seems to declare a return type of an edge whose "semantic value" is a char. In this case, easy can put anything as the semantic value, and does not have any requirements or advance the tape. So it is a zero-length, zero-meaning parsing stub. This would seem to pose a parser problem, since if (easy ~) is ever a valid rule for the top of the tape, it would always stay a valid rule, and the parser would be stuck in an infinite loop. Fortunately, so far we've only seen it in two ways: the last element of a plug, where presumably it uses its ~ to terminate a list; and in a pose construct composing other zero-length parses (gon and gay). As long as none of these zero-length rules floats to the top of the parse stack, we should be okay.

Let's take a closer look at like:

++  like  |*  a=_,*                                     ::  generic edge
          |=  b=_`*`[(hair) ~]                          ::
          :-  p=(hair -.b)                              ::
          ^=  q                                         ::
          ?@  +.b  ~                                    ::
          :-  ~                                         ::
          u=[p=(a +>-.b) q=[p=(hair -.b) q=(tape +.b)]] ::

The first line, |* a=_,*, we saw before in unit and list: it declares a single unrestricted type parameter. Then we enter a lambda which introduces b. We know that a tec-quoted prefix is used to explicitly cast (remember ++moh?), but why would you ever cast something to *? The type of b before this cast is [(hair) ~], which is just a template for the emtpy edge (the emtpy hair followed by the null-unit). (Why write [(hair) ~] instead of (edge)?) Remember, (like char) is not just a type, it's a function which accepts all the things in that type: any edge with a char for a semantic value.

We haven't seen -.b before, but it looks like it just refers to the head of a cell; this is borne out by typing =+(b=[41 42] -.b) into the vere REPL; it spits back 41. Two lines later we see +.b which looks like it means the tail of a cell, and indeed the REPL confirms for us that =(42 =+(b=[41 42] +.b)).

Now what is :- doing? We expect the output of (like char) to be an edge<char>, that is, something like [p=(hair) q=[~ u=[p=(char) q=(nail)]]]. We see all those pieces in the implementation of like, but they're spread out. :- seems to be a way to assemble a cell over multiple lines, doing for [] constructns what %+ did for (). Playing in the REPL, it seems that :-(a b) has the same meaning as [a b], but only inside an expression. You can't just type :-(1 2) the same way you can type [1 2] at the prompt, but you can type =(|=([p=@ q=^] p) |=(:-(p=@ q=^) p)) or =(=+(:-(p=1 q=2) [p q]) =+([p=1 q=2] [p q])).

Again, we haven't seen ^= before, but ^=(q foo) seems to be another form of [q=foo]; that is, ^= can assemble the "named variable/type declaration" which we've seen over and over using infix =. So if we had enough columns we could write like on one line like this:

|*(a=_,* |=(b=_`*`[(hair) ~] [p=(hair -.b) q=?@(+.b ~ [~ u=[p=(a +>-.b) q=[p=(hair -.b) q=(tape +.b)]]])]))

Cheat Sheet:

Glyph Usage
soq...soq single-quoted text
pel...per function application
pat atom-type. equals 0. may be odorous.
tis type declaration, variable assignment, equality test
sel...ser cell; lambda multivariate type declaration
sig placeholder; null unit; end of list. equals 0.
bas...fas mid-string linebreak escape
bas string escaping
col backwards namespacing
com type quoting???
cen numeric literal prefix
dot backwards member addressing
lus increment
tar any type
cab type quoting???
tec casting
Digraph Usage
colcol begin a comment
luslus declare/define a funcion
bartis start a lambda gate
tistis terminate a tall list
hephep also terminate a tall list
kethep explicit cast
cenlus function application
barcen union types?
sigfas debug trace
wutpat if-atom-then-else
wutdot if-else-then
wutpam boolean and
bartar start a lambda gate with type polymorphism
bucbar atom-or-cell type
tislus declare variable pam add to scope
wutcol if-then-else
semsig monad-like function combination?
hepdot head of cell
lusdot tail of cell
colhep assemble cell
kettis assemble infix = (named variable / type declaration)
lusgar ???
Clone this wiki locally