Build Your Own Language, Part 3

Following on from the second post in this series, a lexer has now been created that can tokenize some input for a hypothetical language. Given what was accomplished, theoretically I can use my language to create the most minimal sort of class definition. I’d really love to know if I’m completely on the wrong track here. So I’d like to get started on the parser at this point rather than fleshing out the language more.

As with the lexer, we’ll do this test-first. I’ll create a parser_spec.rb file in my spec directory. Here’s what I’ll start with:

Now this gets a little tricky in that with the lexer, I rolled my own lexer (in lexer.rb) and was able to use that. However, the parser is going to be generated. I could roll my own parser if I wanted to … and had the skill, which I don’t. So this goes back to what I talked about regarding using a grammar file and Racc. So first I’ll create a file called grammar.y in my project directory.

This is the file that I will use with Racc to generate a parser.rb file. Will that help me fill in what goes in the “What goes here?” comment? Well, in my grammar.y file I’ll first declare the tokens:

These are the tokens, remember, that my lexer returns. Now I will have to define the production rules. Those are the rules that the parser will be using in order to structure the language. What that means, however, is that I’m going to need some method to parse those rules. So that is presumably what will go in my test method. That said, defining production rules is an exercise onto itself. So let’s focus on that first.

So first I’ll create a Root rule. This is the start of the tree, as it were. In fact, it’s the start of the Abstract Syntax Tree (AST) that I mentioned earlier. So under the token section of grammar.y, add this:

Here I’ve set up a rule section. The general format is that you will provide a rule name and a colon. That designates the start of a production rule. Here that rule is Root. It did not have to be Root. I could have called it whatever I wanted. The capital letter is not necessary either. The most basic format of a rule section for a grammar file is basically this:



So what’s the Expressions part? Well, that’s another rule! So, in fact, the format of a rule section can be this:

    rule_name { }

What I’m saying here is that the Root rule will be matched when an Expressions rule is matched. I have not yet defined the Expressions rule. Any code that I want to execute when the rule applies (or is matched) will go in those curly braces.

Any rules that you state exist must, in fact, exist. To see this, try generating your parser now with this command:

racc grammar.y -o parser.rb

You will be told: “terminal Expressions not declared as terminal.” What this means is that I said there was a terminal (a rule) but I failed to actually create that rule in the rule section. So let’s add that:

The idea of expressions is, of course, a plural of a single expression. So I’ve started off the rule that way. But, of course, now I need to define the rule for an Expression. This is the part where things can get interesting in terms of describing what you’re doing so I’ll take this part slow.

First, let’s add the rule for an expression:

Notice there that this gives us yet another format to consider:

    rule_name { }
  | rule_name { }

The | indicates an alternative way that the rule can be matched. So here I’m saying that an Expression rule is matched when a Class rule is matched or a Constant rule is matched. In my actual code, you may notice there are not action sections — curly braces — for the individual Expression rules. The actions to be taken will be in those specific rules — meaning in Class or Constant. Which means we need those rules. So adding to our growing list:

Note that when I use all capitals it refers to a token as opposed to a rule. So here I’m saying that a Class rule will be matched when there is a CLASS token followed by a CONSTANT token.

You should be able to compile this grammar now without error and that’s because every rule or token mentioned has now been defined.

racc grammar.y -o parser.rb

NOTE: If you are using Ruby 1.9.3. you may get this message when you generate the parser: “Use RbConfig instead of obsolete and deprecated Config”. This is actually okay and your parser.rb will be generated.

A slight problem at this point is that there are no actions. So the grammar, while valid, is useless. So now let’s fill in some of that logic. I’m going to go from the bottom of the list and work backwards. I’ll only show the relevant code to keep thsi simple. So first the Class rule:

If a given rule returns a value, that will be saved in a result variable. If I want my action code to reference anything to the left of the { }, I will use the variable val along with an index. So here I’m saying that when a Class rule is matched — because a CLASS token with a CONSTANT token has been found — then a new object of type ClassNode should be created. Further, the value of the CONSTANT token should be stored in the result of this action. The CONSTANT in this case would be the name of the class. So if this code were parsed:

class Movie

Then Movie would be returned from the action of the parser operating on that input. Put another way, when a Class rule is matched the result of that matching will be the name of the class.

But where is this ClassNode object coming from? Well, nowhere yet. We’ll have to build that logic. But let’s keep filling in actions for now. Next we’ll handle the Constant rule:

Here I’m using val[0] to return the value of the CONSTANT token.

It’s important to see why all this is working. These two rules — Class and Constant — are what tell the parser what to do when those situations are encountered. Situations? By that I mean those tokens. And how do we know the tokens? Because they are given at the start of the parser file. Okay — but then how does the parser know how to recognize when it’s one of those tokens. In other words it’s great that I have a token called CONSTANT. But how does the parser recognize what a CONSTANT actually is? Ah ha! Yes, that is a key part of this.

Remember: the lexer already determined this or, at least, is capable of determining this via its tokenize method. So taking a break from the actions for a second, let’s pause to realize that the parser must have a way to know what the lexer knows. Meaing, the parser must be able to get the tokens from the lexer. In order to do this, we’ll create a header section that calls in the lexer.

Note that the header section goes outside of the class entirely. What the header section does is tell the parser generator that the content in the section must go at at the head of the generated parser file.

Your parser will still compile (or be generated) at this point even though there is no such thing as a ClassNode or a ConstantNode. All that is being generated is the parser itself. This parser would not parse successfully were you in a position to test that. You will be soon enough but we’re not there yet.

Going back to our actions, now let’s add the action for our Expression rule:

What this will do is return the actual type of node that is created. Remember that an Expression rule is itself made up of two other rules: Class and Constant. Those are effectively going to return a symbol (“class”) and a constant value (“Movie”). Those will be the expression in this case.

Finally, what about the action for Expressions? Well, honestly I don’t know yet. But let’s start with this:

All that’s basically saying is that the value of the Expression rule will be returned as part of the Expressions rule. So why have Expressions and Expression? Mainly because this allows your language to have multiple expressions as part of one line. In the case where the line is just a single expression, then Expressions and Expression are essentially equivalent.

At this point, without those nodes we can’t go too much further. And yet I believe I now know enough to at least start on my test. I need an instance of the parser. Just as I need a tokenize method with my lexer, I’m going to need a parse method with my parser. It’s that method that I have to call during my test. The return value to check from a call to that method is whatever gets returned from those nodes I just defined.

So in my parser_spec.rb file, I’ll add the following to my test:

As with the lexer, I’m going to need to define the variable class_test as well as the nodes variable. This is very similar to what I did with the lexer_spec.rb file. Modify parser_spec.rb with the following:

Again, what I did here is very similar to what I did in lexer_spec.rb. I provided some test input and I provided what should happen in order to get me some values in a nodes variable.

How did I know to use that specific logic to populate the nodes variable? Because of the actions I defined in the grammar rules:

  • Expression action: result =
  • Constant action: result =[0])
  • Class action: result =[1])

So I’m going to have expression (and thus create a Node instance) and that expression will be made up of a class (and thus create a ClassNode instance).

What’s missing here is a way for this test to pass because I don’t have any of those nodes defined. Those nodes are part of the Abstract Syntax Tree. Since we’ve come pretty far here, I’ll stop now and cover those details in the next post.


This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.