Skip to content

Latest commit

 

History

History
106 lines (68 loc) · 9.81 KB

Getting_Started.md

File metadata and controls

106 lines (68 loc) · 9.81 KB

Getting Started

Scala parser combinators are a powerful way to build parsers that can be used in everyday programs. But it's hard to understand the plumbing pieces and how to get started. After you get the first couple of samples to compile and work, the plumbing starts to make sense. But until then it can be daunting, and the standard documentation isn't much help (some readers may remember the original "Scala By Example" chapter on parser combinators, and how that chapter disappeared from subsequent revisions of the book). So what are the components of a parser? How do those components fit together? What methods do I call? What patterns can be matched? Until those pieces are understood, you can’t begin to work on your grammar or build and process abstract syntax trees. So to minimize complexity, I wanted to start here with the simplest possible language: a lowercase word. Let’s build a parser for that language. We can describe the grammar in a single production rule:

word -> [a-z]+

Here’s what the parser looks like:

import scala.util.parsing.combinator._
class SimpleParser extends RegexParsers {
  def word: Parser[String]    = """[a-z]+""".r ^^ { _.toString }
}

The package scala.util.parsing.combinator contains all of the interesting stuff. Our parser extends RegexParsers because we do some lexical analysis. """[a-z]+""".r is the regular expression. ^^ is documented to be "a parser combinator for function application". Basically, if the parsing on the left of the ^^ succeeds, the function on the right is executed. If you've done yacc parsing, the left hand side of the ^^ corresponds to the grammar rule and the right hand side corresponds to the code generated by the rule. Since the method "word" returns a Parser of type String, the function on the right of ^^ needs to return a String.

So how do we use this parser? Well, if we want to extract a word from string, we can call

SimpleParser.parse(SimpleParser.word(myString))

Here’s a little program to do this.

object TestSimpleParser extends SimpleParser {
  def main(args: Array[String]) = println(parse(word, "johnny come lately"))
}

Two things to notice here:

  • The object extends SimpleParser. That gets us around having to prefix everything with "SimpleParser".
  • When we run this, we don’t get back the "word" we parsed, we get a ParserResult[String] back. The "String" type parameter is needed because the method named "word" returns a result of type Parser[String], and the type parameter carries through to the ParseResult.

When we run the program, we get the following at the console:

[1.7] parsed: johnny 

That says that the first character of the input that matched the parser is position 1, and the first character remaining to be matched is in position 7. This is a good start, but all of this should suggest that we are missing something because we have a ParseResult, but not the thing we want, which is the word. We need to handle the ParserResult better. We could call the "get" method on the ParseResult. That would give us the result, but that would be making the optimistic assumption that everything worked and that parsing was successful. We can't plan on that because we probably can't control the input enough to know that it is valid. The input is given to us and we have to make the best of it. That means detecting and handling errors, which sounds like a job for pattern matching, right? In Scala we use pattern matching to trap exceptions, we use pattern matching (Options) to branch for success and failure, so you would expect to use pattern matching to deal with parsing as well. And in fact you can pattern match on the ParseResult for the various termination states. Here’s a rewrite of the little program that does a better job:

object TestSimpleParser extends SimpleParser {
  def main(args: Array[String]) = {    
    parse(word, "johnny come lately") match {
      case Success(matched,_) => println(matched)
      case Failure(msg,_) => println("FAILURE: " + msg)
      case Error(msg,_) => println("ERROR: " + msg)
    }
  }
}

In comparison to Option, which has two primary cases (Some and None), the ParseResult basically has three cases: 1) Success, 2) Failure, and 3) Error. Each case is matched by a pattern of two items. In the Success case, the first item is the object produced by the parser (a String for us since "word" returns a Parser[String]), and in the Failure and Error cases, the first item is an error message. In all cases, the second item in the match is the remaining unmatched input, which we don’t care about here. But if we were doing fancy error handling or subsequent parsing, we would pay close attention to. The difference between Failure and Error is that on a Failure, parsing will backtrack when parsing continues (this rule didn't work but maybe there is some other grammar rule that will), whereas the Error case is fatal and there will be no backtracking (you have a syntax error, there is no way to match the expression you have provided with the grammar for this language, edit the expression and try again).

This tiny example shows a lot of the necessary parser combinator plumbing. Now let’s look at a slightly more complex (and admittedly contrived) example to bring forward some of the remaining plumbing. Say that what we are really after is a word followed by a number. Pretend that this is data about the frequency count of words in a long document. Of course, there are ways to do this by simple regular expression matching, but let’s take a slightly more abstract approach to show some more combinator plumbing. In addition to words we will also have to match numbers, and we will have to match words and numbers together. So first, let’s add a new type to gather words and counts. Here is a simple case class for that:

case class WordFreq(word: String, count: Int) {
  override def toString = "Word <" + word + "> " +
                          "occurs with frequency " + count
}

Now we want our parser to return instances of this case class rather than instances of String. In the context of traditional parsing, productions that return primitive objects like strings and numbers are performing lexical analysis (aka tokenization, typically using regular expressions) whereas productions that return composite objects correspond to the creation of Abstract Syntax Trees (ASTs). Indeed, in the revised parser class, below, the words and numbers are recognized by regular expressions and the word frequencies use a higher-order pattern. So two of our grammar rules are for tokenization and the third builds the AST:

class SimpleParser extends RegexParsers {
  def word: Parser[String]   = """[a-z]+""".r       ^^ { _.toString }
  def number: Parser[Int]    = """(0|[1-9]\d*)""".r ^^ { _.toInt }
  def freq: Parser[WordFreq] = word ~ number        ^^ { case wd ~ fr => WordFreq(wd,fr) }
}

So what’s to notice here, in this new program? Well, the parser for "number" looks just about like the parser for "word", except that it returns a Parser[Int] rather than a Parser[String], and the conversion function calls toInt rather than toString. But there is a third production rule here, the freq rule. It:

  • Doesn't have a .r because it isn't a regular expression (it's a combinator).
  • Returns instances of Parser[WordFreq], so the function to the right hand side of the ^^ operator had better return instances of the composite type WordFreq.
  • Combines the "word" rule with the "number" rule. It uses the ~ (tilde) combinator to say "you have to match a word first, and then a number". The tilde combinator is the most common combinator for rules that don't involve regular expressions.
  • Uses a pattern match on the right side of the rule. Sometimes these match expressions are complex but many times they are just echoes of the rule on the left hand side. In that case, all it really does is gives names to the different elements of the rule (in this case "wd" and "fr") so that we can operate on those elements. In this case, we use those named elements to construct the object we are interested in. But there are also cases where the pattern match is not an echo of the left hand side. Those cases may arise when parts of the rule are optional, or when there are very specific cases to match. For instance, if we wanted to perform special handling in the case where fr was exactly 0. For that, we could have added the case:
case wd ~ 0

Here is a very slightly modified program to use this parser:

object TestSimpleParser extends SimpleParser {
  def main(args: Array[String]) = {
    parse(freq, "johnny 121") match {
      case Success(matched,_) => println(matched)
      case Failure(msg,_) => println("FAILURE: " + msg)
      case Error(msg,_) => println("ERROR: " + msg)
    }
  }
}

There are only two differences between this little program and the previous one. Both of those differences are on the third line:

  • Instead of using the "word" parser, we use the "freq" parser because those are the kinds of objects we are trying to get from the input, and
  • We changed the input string to match the new language.

Now when we run the program we get:

 Word <johnny> occurs with frequency 121

At this point, we’ve shown enough of the parser combinator plumbing to get started and do something useful. Hopefully, all of that other documentation makes a lot more sense now.