Build Your Own Language, Part 2

Following on from the first post in this series, I did a development spike that proved to myself I had some clue as to what I was doing. So now I want to get into actually building a language from soup to nuts: meaning I want an interpreted language that executes via a runtime. Granted, this will be a very simple language (and not very unique). It will also be a very simple runtime. Let’s see what happens, shall we?

I am largely indebted to the book How To Create Your Own Freaking Awesome Programming Language. That book definitely did put me on the right track. That said, I have a hard time recommending the book. My problem with it is that I think the material is presented in a somewhat disjointed way, not leading to a very cohesive understanding of the overall set of tasks that you are performing. In fact, these blog posts are part of my attempt to put some order to what is described in the book.

Another book I found really useful in a “read it a lot and you may start to understand way” was Writing Compilers and Interpreters. (I used the third edition which focuses on Java rather than C or C++.) Another book that was only moderately helpful but did fill in some gaps was Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages.

I don’t want to do anything spectacular here since I’m just doing this as an experiment to lead up to my real project: a Gherkin-like alternative. So what I want to do is create a Ruby-like language but one that uses the curly brace syntax of C, C++, and Java. Why would I want to do this? Well, I had to do something and this seemed like a relatively easy thing to get started with.

So here is some sample code that ultimately I would like to get working:

class Movie
{
  def name
  {
    puts "The Amazing Spider-Man"
  }
  
  def year
  {
    puts 2012
  }
}

film = Movie.new
puts film.name
puts film.year

Consider that my requirement spec for the language I need to support and the tool I need to build to support it. And, yes, I do realize this is not good code in that any instance of this class would always return the same movie. The reason I’m starting this way is because it will force me to consider what the lexer has to deal with initially, in terms of common structures like constants, newlines, keywords, strings, and numbers.

Incidentally, the How To Create Your Own Freaking Awesome Programming Language book does a pretty cool thing: it provides a Ruby language with Python indentation. Due to the license of the book and its source, I don’t think I can show that here.

Okay, so we have the code we want to support. Also, we know a few things from my past posts on these topics:

  1. This code has to be provided as input to a lexer.
  2. The lexer will convert that input into tokens.
  3. Those tokens are going to be used by a parser.
  4. The parser will organize those tokens into a structure.
  5. That structure will be a tree of nodes.

Put in a nutshell, a specific grammar will be converted into a parser that can compile lexer tokens into nodes.

Now here’s where I get into some new territory. As with any language, my new language needs to have its data types and its structure represented in memory. So here are some new things I have to explore:

  1. The parser simply determines how the language can be structured.
  2. A runtime is what allows that structure to execute.
  3. In order for a runtime to work, you have to evaluate the nodes.
  4. An interpreter is what evaluates any code that was passed to it.
  5. The interpreter reads the syntax tree (from the parser) and executes each action associated with a node (from the runtime).

Put another way: once you have a runtime, you have to map the nodes you created to the runtime via an interpreter.

So, just one more time, let’s make sure we have the path down:

  • Code (as input) goes into lexer.
  • Lexer produces tokens.
  • Parser takes tokens and creates nodes.
  • Interpreter evaluates nodes and produces objects.
  • Runtime evaluates the objects.

To make all these bullet points a little more specific, let’s say I have the following:

puts "The Amazing-Spider Man"
puts 2012

Passed through the lexer, these expressions become tokens:

[:IDENTIFIER, puts]
[:STRING, "The Amazing Spider-Man"]
[:IDENTIFIER, puts]
[:NUMBER, 2012]

Those tokens become nodes:

[:call, "puts",[:string, "The Amazing Spider-Man"]]
[:call, "puts",[:number, 2012]]

Those nodes become structured into objects and calls that can be evaluated:

Object.call(:puts, "The Amazing Spider-Man")
Object.call(:puts, 2012)

Those structures should then be capable of being executed by a runtime.

Start With Lexing

In a previous post regarding Rex, I showed you how to use a lexer generator to create your lexer. In the first post in this series I gave some idea of how to roll your own lexer rather than using a lexer generator. Since I’ve already beat that topic close to death, I’ll do something a little different here which is show you how I develop the lexer in a TDD fashion. So here’s what I do:

  1. Create a project directory called learn_language.
  2. Create a directory in the project directory called spec.
  3. Create a file in the spec directory called lexer_spec.rb.
  4. Create a file in the project directory called lexer.rb.

Here’s the start of the code in lexer_spec.rb:

To get that to work, I need the minimum in lexer.rb, which is:

There’s no actual test involved yet. So let’s put a test in that will drive how I want the lexer to work:

This means I’ll need a tokenize method in my lexer. I’ll pass that method some input called class_test so I’ll need to define that. I’m going to compare whatever the lexer returns with the contents of class_test_tokens so I’ll need to define that as well. Let’s define class_test and class_test_tokens in lexer_spec.rb:

So here I’m testing a very simple and constrained aspect of my overall requirement spec. I’m just going to make sure that I can create a class definition. Now to make this work, we’ll add some code to lexer.rb. What I will do here that I didn’t do in previous posts is explain a bit more about what I’m doing.

So here’s what I’ll add to get started:

I set up a constant to hold my keywords. Currently the only keyword I’ll have is “class.” From my test, I know I need a tokenize method in my lexer. This method is the work horse of the lexer. First I remove any newline characters with chomp. Chomp removes the last character from a string only if that character is a newline. Then I set up a character position variable (char_pos) and set it to zero since no lexing has taken place yet. Finally, I establish an empty tokens array. This array is what will be returned from the lexer.

What I now need is a loop that goes through the provided input and does the actual tokenizing. The loop is going to break the input up into chunks. Those chunks will then be mapped to tokens or can be “thrown away” if they are not needed. So here’s the next bit of logic:

So here I’m going to be getting a chunk of the input based on the current character position up to the end of the line. The character position will, of course, be increased as the lexer loops through the input. The above, of course, will not work because we have no way actually getting chunks of the input. We don’t want each individual character of the input; we want groupings of the input to become tokens.

So let’s add this:

Here I’m saying if you find a string that is composed solely of lower case letters and no spaces, then that’s a symbol. If that symbol also happens to be found in the KEYWORDS array, then it will be a token with the name of that keyword. Otherwise, it is a generic symbol.

Let’s try it via IRB:

$ ./Projects/learn_language: irb
irb(main):001:0> require './lexer.rb'
=> true
irb(main):002:0> input = <<-INPUT
irb(main):003:0" class Movie
irb(main):004:0" {
irb(main):005:0" }
irb(main):006:0" INPUT
=> "class Movie\n{\n}\n"
irb(main):007:0> lexer = Lexer.new.tokenize(input)
=> [[:CLASS, "class"], [:SYMBOL, "ovie"]]
irb(main):008:0>

Notice that output there? The ‘M’ in ‘Movie’ did not match the regular expression but the ‘ovie’ in ‘Movie’ did. Well, clearly that’s not what we want. We want ‘Movie’ as a whole to be a symbol. But what kind of symbol? Well, since it starts with a capital letter, languages like C#, Java, and Ruby tend to treat those as constants. So let’s add the ability to tokenize a constant:

Now if you were to perform that same IRB session you’d see that you get this:

[[:CLASS, "class"], [:CONSTANT, "Movie"]]

Great! So now let’s run our test. Everything should be perfect, right?

rspec spec\lexer_spec.rb

And the output is ….?

  1) Testing the Lexer should have a class definition
     Failure/Error: @result.should == class_test_tokens
       expected: [[:CLASS, "class"], [:CONSTANT, "Movie"], [:NEWLINE, "\n"], [:NEWLINE, "\n"]]
            got: [[:CLASS, "class"], [:CONSTANT, "Movie"]] (using ==)

Bummer. The problem is we’re not handling those newlines. The newlines matter to me because they are how I indicate that the class definition is starting and ending. In reality, of course, it’s the curly brace characters that are doing that. Those are what delimit the class. But, for now, all that I need to know is that these are newlines. We can add one more bit to our lexer:

With that, your test should pass.

Wow! Was that a slog or what? In many ways, I’ve ended up right where I was in previous posts. BUT … what I did this time is have a better understanding of exactly what I’m doing. That’s going to be key to moving on to the next part, which is getting a parser in place.

I do plan on taking any remaining posts in this “Build Your Own Language” series just about as slow as I did this one. The goal is to learn, not just to have a bunch of stuff that someone may be able to figure out if they’re lucky. I’m doing this as much for myself as for anyone who may follow in my footsteps. Whether these posts serve as useful or a cautionary tale remains to be seen.

About Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
This entry was posted in Language Building. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *