As a tester, I like to know how things work. So when I started investigating the basis for the tool Cucumber, I came across the concept of Gherkin. That was written with the help of a system called Ragel. What I was curious about was whether I could build one of these languages on my own, perhaps for a customized testing tool. This caused me to stumble upon Rex and Racc, which are two Ruby-based tools that help you build your own languages. The documentation on these tools, however, is on the bad side of awful. This post is purely to document and share what I learned.
I should probably state up front that this will not just be a post where I talk about a lot of stuff. I do have examples that I will be putting together as I learn all this, so if you have Ruby installed, feel free to follow along. If you do so, make sure you have the rexical and racc gems installed. Note that the Ruby gem called rexical is often referred to as Rex.
Investigating this stuff leads you pretty quickly to Lex and Yacc, two very popular and long-standing tools. In the Ruby world, Rex and Racc are basically those tools, but written entirely in Ruby. So let’s talk about what they do.
- Rex (like Lex) breaks down information into sets of “tokens.” Think of these tokens as words.
- Racc (like Yacc) takes sets of tokens and assembles them into higher-level constructs. Think of these “constructs” as sentences. (A sketch of the difference follows below.)
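To make that analogy concrete, here is an illustrative sketch, not tied to either tool’s actual output format, of the two views of the input 5 + 3:

# Illustrative only: a lexer's view versus a parser's view of "5 + 3".
# A Rex-like lexer emits a flat stream of tokens (the "words"):
tokens = [[:NUMBER, 5], [:PLUS, "+"], [:NUMBER, 3]]

# A Racc-like parser assembles those tokens into a construct (a "sentence"):
expression = { operator: :+, left: 5, right: 3 }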
The two tools pretty much go hand-in-hand. Racc is designed to work with the output of Rex, just as Yacc is designed to work with the output of Lex. Similarly, Rex’s output, just like that of Lex, is generally designed to be fed into some kind of parser.
So why would you use these tools? Well, Rex and Racc can be used to parse grammars that are simple and, more importantly, “regular.” This means that there is a structure to the grammar that repeats and acts as an organizational guide to how the grammar can be expressed. Because of this, natural languages are not something you would model with Rex or Racc. However, these tools can serve to provide a structural language that surrounds a natural language. In fact, in that admittedly simplistic formulation, this is how Gherkin and Cucumber work together. Cucumber allows you to use natural language sentences, but the elements that are structured around those sentences, the Given, When, and Then clauses, come from Gherkin.
So let’s get into the tools a little bit here. While you can use Rex and Racc entirely independently of each other, I haven’t yet found too many cases where I would want to. So I’ll start with Rex and then move on to Racc. (Actually, I probably won’t get to Racc until a later post.)
Rex is a lexical analyzer. That means it’s a program that breaks provided input into recognized pieces. As an example, a lexical analyzer might take as input a written document and count the words in that document. That would imply there is a rule that specifies how to recognize a word. It might be a simple rule, like “any bit of text that is separated by spaces.” When using Rex, you will construct a “specification file.” This file will tell Rex how to build a corresponding lexical analyzer. That lexical analyzer will be generated in Ruby code. (Lex does the same thing but outputs C code instead.)
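To ground that word-counting idea before we get to Rex itself, here is a toy version in plain Ruby, using that simple “separated by spaces” rule:

# A toy lexical analysis: the "rule" for recognizing a word is any run
# of non-whitespace characters separated by whitespace.
text = "the quick brown fox"
words = text.split(/\s+/)
puts words.length   # => 4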
So what goes in this specification file? Within this Rex file, you will have a series of rules that Rex translates into the lexical analyzer. Each rule, in turn, consists of a pattern and some code to be run when that pattern is matched. Any text that isn’t matched is simply copied to standard output. The idea is that you want to make sure all possible inputs are caught by the lexer or that the lexer appropriately handles situations where it finds input that it does not know how to match.
Let’s play around a bit. Follow these steps:
- Create a folder called test_language. This will be our project folder.
- Within that folder create a file called test_language.rex. This will be our specification file.
The main responsibility of Rex is to tokenize the code of a language into unit tokens, called terminals. Generally the goal is to allow those terminals to be understood by a parser, like Racc. For now, however, we’ll just use Rex on its own. Enter the following into your file:
class TestLanguage
end
To make sure things work, let’s compile this file. Perform the following command:
rex test_language.rex -o lexer.rb
Here I’m just using the rexical tool to compile the rex file into a file called lexer.rb. If you look at the generated file, you will see that the generated class inherits from Racc::Parser. So let’s keep in mind what this lexer is for. It’s for performing lexical analysis, of course. Lexical analysis, however, is just the first phase: taking various inputs and converting them into a stream of symbols, which are then fed into the second phase, the parser. Any lexer should raise errors for invalid characters or anything it can’t find a symbol for. Take a bit of time to look at the generated lexer.rb file. Don’t worry about understanding all of it. Just get a feel for it.
Also note that any time you make a change to the test_language.rex file, you must rerun the above command to generate a new lexer.rb.
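If you want a quick sanity check of that inheritance claim, something like this should work (assuming the generated lexer.rb sits in your current directory):

require './lexer.rb'

# The generated lexer class should inherit from Racc::Parser.
puts TestLanguage.ancestors.include?(Racc::Parser)   # => true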
In order for this lexer to be in any way useful, you do have to provide some rules. Specifically, you will include a rules section. The rules section associates regular expression patterns with Ruby logic. The idea is that when the lexer sees text that matches a pattern you specified, the lexer will execute the associated Ruby code. Add the following to your test_language.rex file:
class TestLanguage
rule
  u   { puts "Single u." }
  uu  { puts "Double u." }
end
Here you have two rules that specify if a single u is input and matched, a certain bit of logic should execute. If two u’s together are input and matched, a certain other bit of logic should be executed. Think of these lines as rex expressions. These expressions are patterns. Here the pattern is simple: literal values of ‘u’. But the patterns can (and usually will) be specified by regular expressions. The idea is that when provided input, Rexical will search for strings that match your rex expressions.
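The same mechanism works with richer patterns. Purely as a sketch, separate from our running example, a rule that recognizes runs of digits might look like this (the \d+ pattern is a regular expression matching one or more digits):

class TestLanguage
rule
  \d+  { puts "Found a number." }
end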
There are, however, some tricky bits to this matching. To show that, create a file called test_language.rb. Then add the following to it:
require './lexer.rb'

class TestLanguageTester
  @evaluator = TestLanguage.new
  @evaluator.tokenize("u")
end
Now try to run that file. You will be told that there is no tokenize method to execute. Well, that’s true — there isn’t. You have to create one. So add the following to your test_language.rex file:
class TestLanguage
rule
  u   { puts "Single u." }
  uu  { puts "Double u." }

inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end
The call to scan_setup is a call to a method in the lexer.rb file. This sets up the lexical scanner to do its thing. Likewise, a call is made to the next_token method, also defined in lexer.rb. This grabs the next token from the input, based on the rules in the rule section of the file. Notice how the tokenize method that we defined is placed in an inner section? That’s important because anything in an inner section is copied into the class in the lexer.rb file. If you check, you’ll find your tokenize method inside lexer.rb near the bottom of the file.
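As an aside, once you want a lexer like this to feed a parser such as Racc, the convention in rexical’s own samples is for each action to return a two-element array of token type and value; within an action, text holds the matched string. The token names below are just ones I made up, but a version of our rules aimed at a parser might look something like this:

class TestLanguage
rule
  uu  { [:DOUBLE_U, text] }
  u   { [:SINGLE_U, text] }
end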
Make sure to regenerate the lexer.rb file and then run your test_language.rb file again. You should get this output:
Single u.
Okay, now change the test_language.rb file like this:
require './lexer.rb'

class TestLanguageTester
  @evaluator = TestLanguage.new
  @evaluator.tokenize("uu")
end
Run the file again. This time you get:
Single u.
Single u.
Hmm. Is that what you expected? Or did you expect to receive the text “Double u”? What’s happening here is that when given input, Rex tries each rule in the order it appears, rather than automatically preferring the longest possible match. Here’s where I enter into some confusion, though. It seems that in the world of Lex, preference will be given to longer matches even if they are later in the file. That does not seem to be the case in Rex. To test this, change the test_language.rex file as follows:
class TestLanguage
rule
  uu  { puts "Double u." }
  u   { puts "Single u." }

inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end
Here I’ve just switched the ordering of the rules. Now try to run your logic again and you’ll find the following:
- @evaluator.tokenize(“u”) results in “Single u.”
- @evaluator.tokenize(“uu”) results in “Double u.”
So let’s try one more addition to the test_language.rex file:
class TestLanguage
rule
  uu   { puts "Double u." }
  u    { puts "Single u." }
  uuu  { puts "Triple u." }

inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end
Now you should find the following:
- @evaluator.tokenize(“u”) results in “Single u.”
- @evaluator.tokenize(“uu”) results in “Double u.”
- @evaluator.tokenize(“uuu”) results in “Double u.” and “Single u.”
If you were to put the triple u rule first, above all the other rules in the rex file, you would find the following:
- @evaluator.tokenize(“u”) results in “Single u.”
- @evaluator.tokenize(“uu”) results in “Double u.”
- @evaluator.tokenize(“uuu”) results in “Triple u.”
So the trick here is that ordering matters: Rex tries the rules in the order they appear and goes with the first one that matches at each point in the input, so a longer rule listed later never gets a chance. Play around with this simple example to see how things work.
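To make that concrete, here is my reading of how the generated scanner consumes “uuu” when the rules are ordered uu, u, uuu (a walk-through, not authoritative):

# Scanning "uuu" with the rules ordered: uu, u, uuu
#
# position 0: try uu -> matches "uu"  => prints "Double u." (advance past "uu")
# position 2: try uu -> no match
#             try u  -> matches "u"   => prints "Single u."
# input exhausted: the uuu rule never gets a chance to match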
If something doesn’t match any rules at all, Rex will provide an error saying that it could not match that particular token. What some people will do is add a last rule that matches anything and either does nothing at all or prints out some kind of friendly message. Here’s an example of what you could do:
class TestLanguage
rule
  uu   { puts "Double u." }
  u    { puts "Single u." }
  uuu  { puts "Triple u." }
  .    { puts "Could not match." }

inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end
The . is a regular expression that matches any single character (other than a newline), which here amounts to saying match anything. Try @evaluator.tokenize(“y”) and you should get the message “Could not match.” You should find that all the previous examples work as before.
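So, for instance, with the catch-all rule in place (and the lexer regenerated), something like this should print the friendly message rather than raising an error:

require './lexer.rb'

evaluator = TestLanguage.new
evaluator.tokenize("y")   # prints "Could not match."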
Incidentally, the way I’ve been having you do this is probably pretty poor. What would be more effective is setting up a test file that you can use and run as you build up your knowledge. I’ll cover what I did in that regard in a follow up post to this one.
I’ll close here by saying that for someone who knows tools like Lex and Yacc, everything I said here is probably utterly simplistic. I have to say, however, that for me learning this stuff was a slog: I was coming at it from Ruby and, as mentioned, the documentation on Rexical and Racc vacillates between entirely unhelpful and stubbornly nonexistent. So I hope this post may at least be something that shows up in search results for those trying to learn the tools as I was.
More importantly, I do think there is benefit for testers to learn tools like these. The idea of language construction is quite important in a time when tools promoting various forms of domain-specific languages challenge us to figure out how to use a structured but limited language to express concepts. Being able to understand how these things work was important to me. I plan on doing a few more posts on Rex and Racc as I learn more.