A Tester Learns Rex and Racc, Part 3

This post continues on directly from the second post in this series. Assuming you’ve been following along, you’ve broken your input into a stream of tokens. Now you need some way to recognize higher-level patterns. This is where Racc comes in: Racc lets you describe what you want to do with those tokens. That’s what I’ll be covering here: how the parser works with the lexer.

You already know that you define a lexer specification with a rex file. This generates a lexer class. In a very similar fashion, you define a grammar with a racc file. This generates a parser class. If you want your own parser, you have to write grammar file. In the last post we started creating a new simple language to handle calcuations of numbers. The language will consist of ADD (+), SUBTRACT (-), MULTIPLY (*), and DIVISION (/). Those operators are given as the symbol identifier as well as the actual symbol that maps to that identifier. The language also supports the notion of DIGIT. A DIGIT token must come before and after an operator. So there’s our language. Now we want to try to evaluate the code of this new language.

You have to create a grammar file. The first part of this file, as with the Rex specification file, is a class block. In your project directory, create a file called test_language.y. This will be our grammar file. As with the start of our Rex file, we’ll use our class:

Then you will have a grammar block. In Rex, the grammar block is a way to specify the patterns for specific symbols. In Racc, the grammar block describes grammar that can be understood by the parser. As with Rex specification files, we’ll eventually add a rule clause to signify our grammar block. In fact, this clause will mark the start of what are called the “production rules.” To fill in these rules, you have to understand the Racc grammar. So let’s just look at the basics before we try to fill in our nascent grammar file.

A grammar rule has the form:

some  :  THING  ;

The symbols produced by a lexer are called terminals or tokens. The things assembled from them are called non-terminals. So in this schematic, ‘some’ represents a non-terminal while ‘THING’ is a terminal. So ‘THING’ must have been produced by the lexer and ‘some’ can only be created by assembling it from terminals. Here’s an example:

value : DIGIT ;

This means a construct called a value (a non-terminal) can be built up from a DIGIT (a terminal). Here’s another example:

expression : value '+' value ;

Here an expression can be built up from two values (which we know are DIGITS) and a plus sign. You can specify alternatives as well:

expression : value '+' value | value '-' value ;

Some people like to space all this out for better readability:

expression
  :
    value '+' value
  | value '-' value
;

There’s a lot of terminology coming your way here so, for now, just understand that terminals are all the basic symbols or tokens of which a given language is composed. Non-terminals are the syntactic variables in the grammar which represent a set of strings that the grammar is composed of. The general style used is to have all terminals in uppercase and all non-terminals in lowercase.

As with Rex grammar, you can write actions inside curly braces. These actions are, once again, Ruby code. Since Racc needs tokens upon which to operate, Racc expects to call a method that yields tokens, where each token is a two element array with the first element being the type of token (matching the token declaration) and the second element the actual value of the token, which will usually be the actual text. This is a lot easier to show than talk about, although in reality it’s not so different from what you’ve seen with Rex files. So here’s an example:

expression : value '+' value { puts "Performing addition." }

But I could also do this:

expression : value '+' value { result = val[0] + val[2] }

If you ever read documentation on yacc, you’ll see mention of variables like $$ and $1 and so on. The equivalent of the above in yacc would be:

expression : value '+' value { $$ = $1 + $3; }

Here yacc’s $$ is racc’s ‘result’, while yacc’s $0, $1 would be the array ‘val’. Since Ruby is zero based, that’s why $1 is val[0].

Okay, so let’s put this stuff to use. Change your grammar file so it looks like this:

What I’m basically saying here is that something I want to call an “expression” can be recognized if it’s a DIGIT or a DIGIT ADD DIGIT. From the lexer file, we know that DIGIT is going to correspond to an array like this: [:DIGIT, text.to_i]. We know that ADD is going to correspond to an array like this [:ADD, text]. Further, we know that DIGIT means \d+ (i.e., any group of numbers) and ADD means \+ (i.e., the literal plus sign). I’m also saying that when DIGIT is encountered nothing should happen. There is no action provided for that. But if DIGIT ADD DIGIT is encountered, then I want a return value provided, which is the addition of the first DIGIT with the second DIGIT.

In order to compile this file, you can use the following command:

racc test_language.y -o parser.rb

To make that process more streamlined, let’s add a task to our Rakefile:

Now you can just type ‘rake parser’ to generate the parser just as you can type ‘rake lexer’ to generate the lexer. In fact, you might always want to force both things to happen, in which case you can add this task:

So how do we test this? Let’s create a file in our spec directory called language_parser_spec.rb and add the following to it:

Structurally you can see that this is similar to the test format we established in our language_lexer_spec.rb file. Running this test will tell you that there is no parse method. That’s true. Just as we added a tokenize method to our rex specification, we have to add a parse method to our grammar file. Here’s how you can do that:

Once again, we use an inner section just as we did with our rex specification. Notice a few major differences here, though. First, you must preface the section with four hyphens (—-). Also, notice that the inner section is OUTSIDE of the class block. This is different from the rex specification where the inner section was inside the class block. Make sure to regenerate your files and then run the test again. Now you will be told that the scan_str method is undefined. This method is defined in the lexer and so you have to add another section to your grammar file:

If you regenerate and run your test, the parser test should pass. However, we have not made it do a calculation yet. So add the following test:

If that’s working for you, congratulations! You now have the start of a working language. Granted, it’s just a calculator example but let’s not leave it just doing addition. You can fill out the rest of the grammar file as such:

I’ll leave it as an exercise to fill in the existing tests to prove that the grammar for all calculation types is being read and processed correctly.

This simple example still leaves a lot open. For example, while “2+2” will work “2 + 2” (with spaces) will not. This is because your lexer did not specify any whitespace and thus the parser does not know how to recognize it. Also, what about more complicated expressions? Can I do “2+2+2”? Try it and you’ll find that it does not. And even if I could support that, how would I handle operator precedence? For example would “2+3*2” give me 8, as it should, or would it instead give me 10?

Keep in mind that in order to solve problems like this, you have to account for how your grammar can be constructed. For example, to allow for spaces, you just have to indicate that (1) blank spaces are allowed and (2) that no action occurs when a blank space is matched. In fact, you’ve already seen this in prior examples here. Just add the following to your rex specification file (test_language.rex):

That simple bit of extra information allows your parser to handle “2 + 2” as well as it does “2+2”. The other problems are a little more involved. And with that, perhaps you can see now that even with a simple calculator, it takes some work to get it to do what you want in terms of a language. And yet all programming languages that you work with are ultimately constructed of grammar specified just in the way we have been doing it here.

I’ve learned a lot by trying to figure out how these tools work. One of my goals is to see if I can build an alternative to Gherkin, the language that is used by Cucumber to create what are called feature files. I think this will be a challenging exercise. Beyond that, I can see some good exercises here for constructing test grammars for test description languages. Playing around with concepts like these could, perhaps, lead to some interesting notions about how we communicate about tests.

Share

About Jeff Nyman

Anything I put here is an approximation of the truth. You’re getting a particular view of myself … and it’s the view I’m choosing to present to you. If you’ve never met me before in person, please realize I’m not the same in person as I am in writing. That’s because I can only put part of myself down into words.

If you have met me before in person then I’d ask you to consider that the view you’ve formed that way and the view you come to by reading what I say here may, in fact, both be true. I’d advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.

This entry was posted in Rex and Racc. Bookmark the permalink.

4 Responses to A Tester Learns Rex and Racc, Part 3

  1. Dan says:

    This is such an excellent series of articles, Jeff. Really well written. Thank you so much for writing them.

    I’m trying to teach myself to make a simple language at the moment and these articles probably saved me days of effort and confusion.

    Is there a way I can buy you a beer? Because I will…

    • Jeff Nyman says:

      Thanks, Dan. Your comment is much appreciated. This is an area I want to explore again soon, particularly since I’ve been working on a tool called Lucid that is somewhat like a Cucumber clone. That means it uses Gherkin. I would like to explore writing languages a little more in that context. I made a small start with that in my Lucid TDL project. I plan to document what I did there. I’d also like to explore Ragel a bit more, since it is quite a powerful solution in its own right.

  2. andy says:

    I just read all 3 posts. This is very well done. Thanks for putting the effort into doing this.

  3. Joel says:

    Seriously useful series of articles. This was by far the most useful documentation I have found for either of these tools. Thanks!

Leave a Reply

Your email address will not be published. Required fields are marked *