Following on from the first post in this series, I did a development spike that proved to myself I had some clue as to what I was doing. So now I want to get into actually building a language from soup to nuts: meaning I want an interpreted language that executes via a runtime. Granted, this will be a very simple language (and not a particularly original one). It will also be a very simple runtime. Let’s see what happens, shall we?
I am largely indebted to the book How To Create Your Own Freaking Awesome Programming Language. That book definitely did put me on the right track. That said, I have a hard time recommending the book. My problem with it is that I think the material is presented in a somewhat disjointed way, not leading to a very cohesive understanding of the overall set of tasks that you are performing. In fact, these blog posts are part of my attempt to put some order to what is described in the book.
Another book I found really useful in a “read it a lot and you may start to understand” way was Writing Compilers and Interpreters. (I used the third edition, which focuses on Java rather than C or C++.) Another book that was only moderately helpful but did fill in some gaps was Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages.
I don’t want to do anything spectacular here since I’m just doing this as an experiment to lead up to my real project: a Gherkin-like alternative. So what I want to do is create a Ruby-like language but one that uses the curly brace syntax of C, C++, and Java. Why would I want to do this? Well, I had to do something and this seemed like a relatively easy thing to get started with.
So here is some sample code that ultimately I would like to get working:
```ruby
class Movie {
  def name {
    puts "The Amazing Spider-Man"
  }

  def year {
    puts 2012
  }
}

film = Movie.new
puts film.name
puts film.year
```
Consider that my requirement spec for the language I need to support and the tool I need to build to support it. And, yes, I do realize this is not good code in that any instance of this class would always print the same movie. The reason I’m starting this way is that it will force me to consider what the lexer has to deal with initially, in terms of common structures like constants, newlines, keywords, strings, and numbers.
Incidentally, the How To Create Your Own Freaking Awesome Programming Language book does a pretty cool thing: it provides a Ruby language with Python indentation. Due to the license of the book and its source, I don’t think I can show that here.
Okay, so we have the code we want to support. Also, we know a few things from my past posts on these topics:
- This code has to be provided as input to a lexer.
- The lexer will convert that input into tokens.
- Those tokens are going to be used by a parser.
- The parser will organize those tokens into a structure.
- That structure will be a tree of nodes.
Put in a nutshell, a specific grammar will be converted into a parser that can compile lexer tokens into nodes.
Now here’s where I get into some new territory. As with any language, my new language needs to have its data types and its structure represented in memory. So here are some new things I have to explore:
- The parser simply determines how the language can be structured.
- A runtime is what allows that structure to execute.
- In order for a runtime to work, you have to evaluate the nodes.
- An interpreter is what evaluates any code that was passed to it.
- The interpreter reads the syntax tree (from the parser) and executes each action associated with a node (from the runtime).
Put another way: once you have a runtime, you have to map the nodes you created to the runtime via an interpreter.
So, just one more time, let’s make sure we have the path down:
- Code (as input) goes into lexer.
- Lexer produces tokens.
- Parser takes tokens and creates nodes.
- Interpreter evaluates nodes and produces objects.
- Runtime evaluates the objects.
To make all these bullet points a little more specific, let’s say I have the following:
```ruby
puts "The Amazing Spider-Man"
puts 2012
```
Passed through the lexer, these expressions become tokens:
[:IDENTIFIER, "puts"]
[:STRING, "The Amazing Spider-Man"]
[:IDENTIFIER, "puts"]
[:NUMBER, 2012]
Those tokens become nodes:
[:call, "puts", [:string, "The Amazing Spider-Man"]]
[:call, "puts", [:number, 2012]]
Those nodes become structured into objects and calls that can be evaluated:
Object.call(:puts, "The Amazing Spider-Man")
Object.call(:puts, 2012)
Those structures should then be capable of being executed by a runtime.
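To make that last step a little less abstract, here is a rough sketch of my own (not code from the book, and not the final interpreter) showing how an evaluator might walk those node arrays. The `evaluate` method name and its dispatch logic are my invention for illustration:

```ruby
# A minimal, hypothetical evaluator for nodes of the shape shown above.
# Literals evaluate to themselves; a :call node evaluates its argument
# and then invokes the named method (here, just puts via Kernel).
def evaluate(node)
  type, *rest = node
  case type
  when :string, :number
    rest[0]                                   # literal value
  when :call
    name, arg = rest
    Kernel.public_send(name, evaluate(arg))   # e.g. puts
  end
end

evaluate([:call, "puts", [:string, "The Amazing Spider-Man"]])
evaluate([:call, "puts", [:number, 2012]])
```

A real runtime would dispatch on a receiver object rather than hard-wiring everything to Kernel, but the shape of the walk is the same.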
Start With Lexing
In a previous post regarding Rex, I showed you how to use a lexer generator to create your lexer. In the first post in this series I gave some idea of how to roll your own lexer rather than use a generator. Since I’ve already beaten that topic close to death, I’ll do something a little different here: show you how I develop the lexer in a TDD fashion. So here’s what I do:
- Create a project directory called learn_language.
- Create a directory in the project directory called spec.
- Create a file in the spec directory called lexer_spec.rb.
- Create a file in the project directory called lexer.rb.
Here’s the start of the code in lexer_spec.rb:
```ruby
require './lexer.rb'

class LexerTest
  describe 'Testing the Lexer' do
    before do
      @evaluator = Lexer.new
    end
  end
end # class: LexerTest
```
To get that to work, I need the minimum in lexer.rb, which is:
```ruby
class Lexer
  # Code goes here ...
end # class: Lexer
```
There’s no actual test involved yet. So let’s put a test in that will drive how I want the lexer to work:
```ruby
require './lexer.rb'

class LexerTest
  describe 'Testing the Lexer' do
    before do
      @evaluator = Lexer.new
    end

    it 'should parse a class definition' do
      @result = @evaluator.tokenize(class_test)
      @result.should == class_test_tokens
    end
  end
end # class: LexerTest
```
This means I’ll need a tokenize method in my lexer. I’ll pass that method some input called class_test so I’ll need to define that. I’m going to compare whatever the lexer returns with the contents of class_test_tokens so I’ll need to define that as well. Let’s define class_test and class_test_tokens in lexer_spec.rb:
```ruby
class LexerTest
  class_test = <<-INPUT
class Movie
{
}
  INPUT

  class_test_tokens = [
    [:CLASS, "class"],
    [:CONSTANT, "Movie"],
    [:NEWLINE, "\n"],
    [:NEWLINE, "\n"]
  ]

  ...
end # class: LexerTest
```
So here I’m testing a very simple and constrained aspect of my overall requirement spec. I’m just going to make sure that I can create a class definition. Now to make this work, we’ll add some code to lexer.rb. What I will do here that I didn’t do in previous posts is explain a bit more about what I’m doing.
So here’s what I’ll add to get started:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    # Loop goes here...

    tokens
  end # method: tokenize
end # class: Lexer
```
I set up a constant to hold my keywords. Currently the only keyword I’ll have is “class.” From my test, I know I need a tokenize method in my lexer. This method is the workhorse of the lexer. First I remove the trailing newline with chomp!, which strips the last character from a string only if that character is a newline. Then I set up a character position variable (char_pos) and set it to zero since no lexing has taken place yet. Finally, I establish an empty tokens array. This array is what will be returned from the lexer.
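As a quick sanity check on that chomp behavior, here’s an illustration of my own (not from the lexer itself):

```ruby
# chomp (and its in-place variant chomp!) strips a single trailing
# newline and nothing else; interior newlines are left alone.
"class Movie\n".chomp   # => "class Movie"
"class Movie".chomp     # => "class Movie" (no trailing newline, unchanged)
"class\nMovie".chomp    # => "class\nMovie" (interior newline kept)
```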
What I now need is a loop that goes through the provided input and does the actual tokenizing. The loop is going to break the input up into chunks. Those chunks will then be mapped to tokens or can be “thrown away” if they are not needed. So here’s the next bit of logic:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      char_pos += 1
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
So here I’m grabbing a chunk of the input that runs from the current character position to the end of the input. The character position will, of course, be increased as the lexer loops through the input. The above won’t produce any tokens yet, though, because we’re not doing anything with those chunks. We don’t want each individual character of the input; we want groupings of the input to become tokens.
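To make the chunking concrete, here’s what input[char_pos..-1] returns at a couple of positions (my own illustration, not part of the lexer):

```ruby
# Slicing with a range from char_pos to -1 gives "everything from
# this position to the end of the string."
input = "class Movie"
input[0..-1]   # => "class Movie" (nothing consumed yet)
input[6..-1]   # => "Movie"       (after "class" and the space are consumed)
```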
So let’s add this:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      else
        char_pos += 1
      end
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
```
Here I’m saying that if the chunk starts with a lowercase letter followed by any number of word characters, then that match is a symbol. If that symbol also happens to be found in the KEYWORDS array, then it becomes a token named after that keyword. Otherwise, it becomes a generic :SYMBOL token.
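Here’s that pattern in isolation, using the same String#[] regex-with-capture form as the lexer (an illustration of mine):

```ruby
# \A anchors the match to the start of the chunk; ([a-z]\w*) captures
# a lowercase letter followed by zero or more word characters.
pattern = /\A([a-z]\w*)/

"class Movie"[pattern, 1]   # => "class"
"def name"[pattern, 1]      # => "def"
"Movie {"[pattern, 1]       # => nil (starts with a capital, no match)
```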
Let’s try it via IRB:
$ ./Projects/learn_language: irb
irb(main):001:0> require './lexer.rb'
=> true
irb(main):002:0> input = <<-INPUT
irb(main):003:0" class Movie
irb(main):004:0" {
irb(main):005:0" }
irb(main):006:0" INPUT
=> "class Movie\n{\n}\n"
irb(main):007:0> lexer = Lexer.new.tokenize(input)
=> [[:CLASS, "class"], [:SYMBOL, "ovie"]]
irb(main):008:0>
Notice that output there? The ‘M’ in ‘Movie’ did not match the regular expression but the ‘ovie’ in ‘Movie’ did. Well, clearly that’s not what we want. We want ‘Movie’ as a whole to be tokenized. But as what kind of token? Well, since it starts with a capital letter, Ruby treats it as a constant (and in languages like C# and Java, capitalized names are the convention for constants and types). So let’s add the ability to tokenize a constant:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        char_pos += constant.size
      else
        char_pos += 1
      end
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
```
Now if you were to perform that same IRB session you’d see that you get this:
[[:CLASS, "class"], [:CONSTANT, "Movie"]]
Great! So now let’s run our test. Everything should be perfect, right?
rspec spec/lexer_spec.rb
And the output is ….?
1) Testing the Lexer should parse a class definition
   Failure/Error: @result.should == class_test_tokens
   expected: [[:CLASS, "class"], [:CONSTANT, "Movie"], [:NEWLINE, "\n"], [:NEWLINE, "\n"]]
        got: [[:CLASS, "class"], [:CONSTANT, "Movie"]] (using ==)
Bummer. The problem is we’re not handling those newlines. The newlines matter to me because they are how I indicate that the class definition is starting and ending. In reality, of course, it’s the curly brace characters that are doing that. Those are what delimit the class. But, for now, all that I need to know is that these are newlines. We can add one more bit to our lexer:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        char_pos += constant.size
      elsif chunk.match(/\A\n+/)
        tokens << [:NEWLINE, "\n"]
        char_pos += 1
      else
        char_pos += 1
      end
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
```
With that, your test should pass.
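If you want to double-check outside of RSpec, here’s the finished lexer run directly against the spec input (the class body simply repeats the lexer code from above so the snippet runs on its own):

```ruby
# The finished lexer from this post, verified against the spec input.
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!
    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        char_pos += constant.size
      elsif chunk.match(/\A\n+/)
        tokens << [:NEWLINE, "\n"]
        char_pos += 1
      else
        char_pos += 1   # skip spaces, braces, and anything else unrecognized
      end
    end

    tokens
  end
end

p Lexer.new.tokenize("class Movie\n{\n}\n")
# => [[:CLASS, "class"], [:CONSTANT, "Movie"], [:NEWLINE, "\n"], [:NEWLINE, "\n"]]
```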
Wow! Was that a slog or what? In many ways, I’ve ended up right where I was in previous posts. BUT … what I did this time was develop a better understanding of exactly what I’m doing. That’s going to be key to moving on to the next part, which is getting a parser in place.
I do plan on taking any remaining posts in this “Build Your Own Language” series just about as slow as I did this one. The goal is to learn, not just to end up with a bunch of stuff that someone may be able to figure out if they’re lucky. I’m doing this as much for myself as for anyone who may follow in my footsteps. Whether these posts prove useful or serve as a cautionary tale remains to be seen.