Build Your Own Language, Part 1

In previous posts I talked about how one of my goals was to come up with an alternative to Gherkin. This is not in the “just to see if I can” category but rather because I believe it could be useful to roll your own Gherkin-like language. This post, however, will be in the “just to see if I can” category. Here I want to take what I learned about lexing and parsing and apply that to creating a simple programming language. Fair warning: I have no idea how this is going to turn out.

This post is definitely the equivalent of a development spike for me in that I plan on anything I do here being throwaway code. So let’s recap some things I talked about in previous posts on this topic.

The lexer is the part of a language implementation that converts the input — the code you want to execute — into tokens that the parser can understand. The lexer splits the input it's given into tokens and tags each token with its type. All of this makes the parser's job easier because the parser doesn't have to worry about whether something is a string or a number or a constant or whatever other constructs your language has.

Tools like Rex and Lex — the former of which I looked at in my previous posts — are really not lexers. They are lexer generators or lexer compilers: you supply these tools with a grammar and they output a lexer.

By themselves, the tokens output by the lexer are really just the building blocks of a budding language. The parser then provides context for those tokens by organizing them into a structure. Whereas a lexer produces an array of tokens, a parser produces a tree of nodes. There are two main kinds of tree: a parse tree is a record of the tokens that were used to match some input text, while a syntax tree records the structure of the input and is independent of the grammar that produced it. The most common parser output is called an Abstract Syntax Tree (AST).
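To make that a bit more concrete, here's a rough sketch of the kind of tree a parser might build for a simple conditional. The node names here are purely illustrative — they aren't from any particular parser generator:

```ruby
# Hypothetical node types a parser might produce. The names are
# my own invention for illustration.
IfNode      = Struct.new(:condition, :body)
CompareNode = Struct.new(:left, :operator, :right)
CallNode    = Struct.new(:name, :arguments)

# A possible AST for: if working == True { print "..." }
tree = IfNode.new(
  CompareNode.new("working", "==", "True"),
  [CallNode.new("print", ["Solidum petit in profundis."])]
)
```

Notice that the tree captures the nesting — the print call lives inside the body of the if — which is exactly the context a flat array of tokens can't express.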

Just as there are lexer generators, as I mentioned above, there are also parser generators. Parser generators are commonly used to accomplish the often tedious task of building a parser from scratch. This in fact is what I used Racc for in some of my previous posts.

So for the purposes of this experiment, I’m going to start simple. Rather than worrying about the structure of a new programming language, I’m just going to see if I can handle one aspect of a programming language. Specifically, I want to be able to have this logic parsed:

if working == True
{
  print "Solidum petit in profundis."
}

From a lexing standpoint, here is what would get generated:

[IDENTIFIER if] [IDENTIFIER working] [EQUAL] [IDENTIFIER True]
[OPENBRACE {]
[IDENTIFIER print] [STRING "Solidum petit in profundis."]
[CLOSEBRACE }]

Although, as you’ll see, it’s actually a bit more complicated than that. But that’s what I want to handle. So what I want is a lexer that will give me that output so that eventually I can feed it to a parser. Now, in my prior post we used Rex to generate a lexer for us. It’s also possible — and, as I’ve found, easier — to roll your own lexer.

First, let’s create the most simple aspects of the lexer. Create a file called lexer.rb.
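The skeleton might look something like this — the class name and the keyword list are assumptions based on the example input we want to handle:

```ruby
# lexer.rb
class Lexer
  # The only keyword the example input needs so far.
  KEYWORDS = ["if"]

  def tokenize(code)
    # Remove any trailing newline from the input.
    code = code.chomp

    # This array will hold the tokens as they are found.
    tokens = []

    # ... token matching will go here ...

    tokens
  end
end
```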

This is just a skeleton of what we need. You can see the only keywords I’ll have in my language so far. As with the lexer we generated with Rex, we have to create a tokenize method to actually break down the input. Here I’ve started the basis of that: I remove any newline from the end of the input string, and then I provide a tokens array that will hold the tokens that are found. The array of tokens will be returned from the method when all is said and done.

So now I’ll put the meat into the lexer:
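Here is a sketch of what that might look like. The specific regular expressions are my own guesses at logic that produces the token output shown later in this post:

```ruby
# lexer.rb
class Lexer
  KEYWORDS = ["if"]

  def tokenize(code)
    code = code.chomp
    tokens = []
    i = 0

    while i < code.size
      chunk = code[i..-1]

      # Identifiers (and keywords) start with a lowercase letter.
      if identifier = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(identifier)
          # Keywords get their own symbol, e.g. [:IF, "if"].
          tokens << [identifier.upcase.to_sym, identifier]
        else
          tokens << [:IDENTIFIER, identifier]
        end
        i += identifier.size
      # Constants start with an uppercase letter.
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        i += constant.size
      # Strings are anything between double quotes.
      elsif string = chunk[/\A"([^"]*)"/, 1]
        tokens << [:STRING, string]
        i += string.size + 2
      # Newlines become their own token.
      elsif chunk.start_with?("\n")
        tokens << [:NEWLINE, "\n"]
        i += 1
      # Spaces are skipped entirely.
      elsif chunk.start_with?(" ")
        i += 1
      # Anything else (operators, braces) passes through as itself,
      # one character at a time.
      else
        value = chunk[0]
        tokens << [value, value]
        i += 1
      end
    end

    tokens
  end
end
```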

Here I’m basically just using regular expressions to parse identifiers, constants, strings, newlines, and spaces. All of these will be added to the tokens array, each with a specific symbol and the specific value of the token.

So now let’s test if this is working. Create a spec directory and within that create a lexer_spec.rb. This is where we’ll test out the logic. However, earlier I had mentioned that a barebones lexer output might be this:

[IDENTIFIER if] [IDENTIFIER working] [EQUAL] [IDENTIFIER True]
[OPENBRACE {]
[IDENTIFIER print] [STRING "Solidum petit in profundis."]
[CLOSEBRACE }]

Given the logic I put in the lexer.rb file, you can perhaps see that instead the lexer output will be this:

[:IF, "if"] [:IDENTIFIER, "working"] ["=", "="] ["=", "="] [:CONSTANT, "True"]
[:NEWLINE, "\n"]
["{", "{"]
[:NEWLINE, "\n"]
[:IDENTIFIER, "print"] [:STRING, "Solidum petit in profundis."]
[:NEWLINE, "\n"]
["}", "}"]

If that is in no way clear to you, I recommend looking at the logic of the tokenize method. Look at how the tokens array is built up and returned. This is a useful exercise in terms of understanding how these things work. We already know our input:

if working == True
{
  print "Solidum petit in profundis."
}

With this information, here’s the start of the logic I’ll use in lexer_spec.rb:
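A sketch of that spec, assuming RSpec and the file layout described above, might look like this:

```ruby
# spec/lexer_spec.rb
require_relative "../lexer"

describe Lexer do
  it "tokenizes a conditional with a block" do
    code = <<~CODE
      if working == True
      {
        print "Solidum petit in profundis."
      }
    CODE

    tokens = [
      [:IF, "if"], [:IDENTIFIER, "working"],
      ["=", "="], ["=", "="], [:CONSTANT, "True"],
      [:NEWLINE, "\n"],
      ["{", "{"],
      [:NEWLINE, "\n"],
      [:IDENTIFIER, "print"], [:STRING, "Solidum petit in profundis."],
      [:NEWLINE, "\n"],
      ["}", "}"]
    ]

    expect(Lexer.new.tokenize(code)).to eq(tokens)
  end
end
```

Note that the expected tokens array mirrors the lexer output shown above, newline tokens and all.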

If you want to test some more isolated bits of functionality to prove to yourself how things work, you can use tests like these:
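For example, inside the same describe block, you could exercise each token type on its own (again, a sketch assuming RSpec):

```ruby
it "tokenizes an identifier" do
  expect(Lexer.new.tokenize("working")).to eq([[:IDENTIFIER, "working"]])
end

it "tokenizes a constant" do
  expect(Lexer.new.tokenize("True")).to eq([[:CONSTANT, "True"]])
end

it "tokenizes a string" do
  expect(Lexer.new.tokenize('"text"')).to eq([[:STRING, "text"]])
end

it "tokenizes a keyword" do
  expect(Lexer.new.tokenize("if")).to eq([[:IF, "if"]])
end
```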

If you are playing around with this stuff, it’s really important to get a feel for why the above logic works. A perfect way to do that is to try to modify aspects of it. As an example of that, let’s say that I want to recognize the braces as a particular symbol rather than just as a string. I can simply make that a specific check in lexer.rb, as such:
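One way to do that is to add explicit branches for the braces ahead of the catch-all single-character branch. The :OPENBRACE and :CLOSEBRACE symbol names are my own choice here:

```ruby
# In the tokenize loop, before the final else branch:
elsif chunk.start_with?("{")
  tokens << [:OPENBRACE, "{"]
  i += 1
elsif chunk.start_with?("}")
  tokens << [:CLOSEBRACE, "}"]
  i += 1
```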

(I’ll assume that you can find the relevant place in the file to add this bit of logic.) That change would in turn require a change to the tokens array in the lexer_spec.rb file as such:
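With the braces tagged that way, the expected array in the spec would become something like:

```ruby
# The brace entries now carry symbols instead of bare strings.
tokens = [
  [:IF, "if"], [:IDENTIFIER, "working"],
  ["=", "="], ["=", "="], [:CONSTANT, "True"],
  [:NEWLINE, "\n"],
  [:OPENBRACE, "{"],
  [:NEWLINE, "\n"],
  [:IDENTIFIER, "print"], [:STRING, "Solidum petit in profundis."],
  [:NEWLINE, "\n"],
  [:CLOSEBRACE, "}"]
]
```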

This is a pretty minimal start but I really recommend playing around with simple examples like I’ve provided here. Maybe it’s just my lack of searching ability, but I rarely find too much useful stuff like this on the web. (Granted, that presumes what I posted here is useful.) I find plenty of information about lexing and compiler construction but the information is often so obtuse — or over my head, if you prefer — that I find I can’t even get started. Hopefully what I’m providing here is a bit different.

I found it’s best to take things slow when learning these concepts and give people something to play with before delving too much into an extended example. What I plan to do next is use the knowledge from this post as well as my three posts on Rex and Racc — see part 1, part 2, part 3 — to give you a full walkthrough of building a language, from the lexer on down to a runtime that can actually execute aspects of the language you are creating.

This article was written by Jeff Nyman
