Following on from the first post in this series, I did a development spike that proved to myself I had some clue as to what I was doing. So now I want to get into actually building a language from soup to nuts: meaning I want an interpreted language that executes via a runtime. Granted, this will be a very simple language (and not a particularly original one). It will also be a very simple runtime. Let’s see what happens, shall we?
I am largely indebted to the book How To Create Your Own Freaking Awesome Programming Language. That book definitely did put me on the right track. That said, I have a hard time recommending the book. My problem with it is that I think the material is presented in a somewhat disjointed way, not leading to a very cohesive understanding of the overall set of tasks that you are performing. In fact, these blog posts are part of my attempt to put some order to what is described in the book.
Another book I found really useful in a “read it a lot and you may start to understand” way was Writing Compilers and Interpreters. (I used the third edition, which focuses on Java rather than C or C++.) Another book that was only moderately helpful but did fill in some gaps was Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages.
I don’t want to do anything spectacular here since I’m just doing this as an experiment to lead up to my real project: a Gherkin-like alternative. So what I want to do is create a Ruby-like language but one that uses the curly brace syntax of C, C++, and Java. Why would I want to do this? Well, I had to do something and this seemed like a relatively easy thing to get started with.
So here is some sample code that ultimately I would like to get working:
```ruby
class Movie {
  def name {
    puts "The Amazing Spider-Man"
  }

  def year {
    puts 2012
  }
}

film = Movie.new
puts film.name
puts film.year
```
Consider that my requirement spec for the language I need to support and the tool I need to build to support it. And, yes, I do realize this is not good code in that any instance of this class would always print the same movie. The reason I’m starting this way is that it will force me to consider what the lexer has to deal with initially, in terms of common structures like constants, newlines, keywords, strings, and numbers.
Incidentally, the How To Create Your Own Freaking Awesome Programming Language book does a pretty cool thing: it provides a Ruby language with Python indentation. Due to the license of the book and its source, I don’t think I can show that here.
Okay, so we have the code we want to support. Also, we know a few things from my past posts on these topics:
- This code has to be provided as input to a lexer.
- The lexer will convert that input into tokens.
- Those tokens are going to be used by a parser.
- The parser will organize those tokens into a structure.
- That structure will be a tree of nodes.
Put in a nutshell, a specific grammar will be converted into a parser that can compile lexer tokens into nodes.
Now here’s where I get into some new territory. As with any language, my new language needs to have its data types and its structure represented in memory. So here are some new things I have to explore:
- The parser simply determines how the language can be structured.
- A runtime is what allows that structure to execute.
- In order for a runtime to work, you have to evaluate the nodes.
- An interpreter is what evaluates any code that was passed to it.
- The interpreter reads the syntax tree (from the parser) and executes each action associated with a node (from the runtime).
Put another way: once you have a runtime, you have to map the nodes you created to the runtime via an interpreter.
So, just one more time, let’s make sure we have the path down:
- Code (as input) goes into lexer.
- Lexer produces tokens.
- Parser takes tokens and creates nodes.
- Interpreter evaluates nodes and produces objects.
- Runtime evaluates the objects.
To make all these bullet points a little more specific, let’s say I have the following:
```ruby
puts "The Amazing Spider-Man"
puts 2012
```
Passed through the lexer, these expressions become tokens:
[:IDENTIFIER, "puts"]
[:STRING, "The Amazing Spider-Man"]
[:IDENTIFIER, "puts"]
[:NUMBER, 2012]
Those tokens become nodes:
[:call, "puts", [:string, "The Amazing Spider-Man"]]
[:call, "puts", [:number, 2012]]
Those nodes become structured into objects and calls that can be evaluated:
Object.call(:puts, "The Amazing Spider-Man")
Object.call(:puts, 2012)
Those structures should then be capable of being executed by a runtime.
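To make that last step a little less abstract, here is a rough sketch of my own (not code from the book, and not the final interpreter) showing how an evaluator might walk those node arrays. The `evaluate` method name and its dispatch logic are my invention for illustration:

```ruby
# A minimal, hypothetical evaluator for nodes of the shape shown above.
# Literals evaluate to themselves; a :call node evaluates its argument
# and then invokes the named method (here, just puts via Kernel).
def evaluate(node)
  type, *rest = node
  case type
  when :string, :number
    rest[0]                                   # literal value
  when :call
    name, arg = rest
    Kernel.public_send(name, evaluate(arg))   # e.g. puts
  end
end

evaluate([:call, "puts", [:string, "The Amazing Spider-Man"]])
evaluate([:call, "puts", [:number, 2012]])
```

A real runtime would dispatch on a receiver object rather than hard-wiring everything to Kernel, but the shape of the walk is the same.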
Start With Lexing
In a previous post regarding Rex, I showed you how to use a lexer generator to create your lexer. In the first post in this series I gave some idea of how to roll your own lexer rather than use a generator. Since I’ve already beaten that topic close to death, I’ll do something a little different here: show you how I develop the lexer in a TDD fashion. So here’s what I do:
- Create a project directory called learn_language.
- Create a directory in the project directory called spec.
- Create a file in the spec directory called lexer_spec.rb.
- Create a file in the project directory called lexer.rb.
Here’s the start of the code in lexer_spec.rb:
```ruby
require './lexer.rb'

class LexerTest
  describe 'Testing the Lexer' do
    before do
      @evaluator = Lexer.new
    end
  end
end # class: LexerTest
```
To get that to work, I need the minimum in lexer.rb, which is:
```ruby
class Lexer
  # Code goes here ...
end # class: Lexer
```
There’s no actual test involved yet. So let’s put a test in that will drive how I want the lexer to work:
```ruby
require './lexer.rb'

class LexerTest
  describe 'Testing the Lexer' do
    before do
      @evaluator = Lexer.new
    end

    it 'should parse a class definition' do
      @result = @evaluator.tokenize(class_test)
      @result.should == class_test_tokens
    end
  end
end # class: LexerTest
```
This means I’ll need a tokenize method in my lexer. I’ll pass that method some input called class_test so I’ll need to define that. I’m going to compare whatever the lexer returns with the contents of class_test_tokens so I’ll need to define that as well. Let’s define class_test and class_test_tokens in lexer_spec.rb:
```ruby
class LexerTest
  class_test = <<-INPUT
class Movie
{
}
  INPUT

  class_test_tokens = [
    [:CLASS, "class"],
    [:CONSTANT, "Movie"],
    [:NEWLINE, "\n"],
    [:NEWLINE, "\n"]
  ]

  ...
end # class: LexerTest
```
So here I’m testing a very simple and constrained aspect of my overall requirement spec. I’m just going to make sure that I can create a class definition. Now to make this work, we’ll add some code to lexer.rb. What I will do here that I didn’t do in previous posts is explain a bit more about what I’m doing.
So here’s what I’ll add to get started:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    # Loop goes here...

    tokens
  end # method: tokenize
end # class: Lexer
```
I set up a constant to hold my keywords. Currently the only keyword I’ll have is “class.” From my test, I know I need a tokenize method in my lexer. This method is the workhorse of the lexer. First I remove the trailing newline with chomp!, which strips the last character from a string only if that character is a newline. Then I set up a character position variable (char_pos) and set it to zero since no lexing has taken place yet. Finally, I establish an empty tokens array. This array is what will be returned from the lexer.
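As a quick sanity check on that chomp behavior, here’s an illustration of my own (not from the lexer itself):

```ruby
# chomp (and its in-place variant chomp!) strips a single trailing
# newline and nothing else; interior newlines are left alone.
"class Movie\n".chomp   # => "class Movie"
"class Movie".chomp     # => "class Movie" (no trailing newline, unchanged)
"class\nMovie".chomp    # => "class\nMovie" (interior newline kept)
```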
What I now need is a loop that goes through the provided input and does the actual tokenizing. The loop is going to break the input up into chunks. Those chunks will then be mapped to tokens or can be “thrown away” if they are not needed. So here’s the next bit of logic:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      char_pos += 1
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
So here I’m grabbing a chunk of the input that runs from the current character position to the end of the input. The character position will, of course, be increased as the lexer loops through the input. The above won’t produce any tokens yet, though, because we’re not doing anything with those chunks. We don’t want each individual character of the input; we want groupings of the input to become tokens.
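To make the chunking concrete, here’s what input[char_pos..-1] returns at a couple of positions (my own illustration, not part of the lexer):

```ruby
# Slicing with a range from char_pos to -1 gives "everything from
# this position to the end of the string."
input = "class Movie"
input[0..-1]   # => "class Movie" (nothing consumed yet)
input[6..-1]   # => "Movie"       (after "class" and the space are consumed)
```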
So let’s add this:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      else
        char_pos += 1
      end
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
```
Here I’m saying that if the chunk starts with a lowercase letter followed by any number of word characters, then that match is a symbol. If that symbol also happens to be found in the KEYWORDS array, then it becomes a token named after that keyword. Otherwise, it becomes a generic :SYMBOL token.
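Here’s that pattern in isolation, using the same String#[] regex-with-capture form as the lexer (an illustration of mine):

```ruby
# \A anchors the match to the start of the chunk; ([a-z]\w*) captures
# a lowercase letter followed by zero or more word characters.
pattern = /\A([a-z]\w*)/

"class Movie"[pattern, 1]   # => "class"
"def name"[pattern, 1]      # => "def"
"Movie {"[pattern, 1]       # => nil (starts with a capital, no match)
```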
Let’s try it via IRB:
$ ./Projects/learn_language: irb
irb(main):001:0> require './lexer.rb'
=> true
irb(main):002:0> input = <<-INPUT
irb(main):003:0" class Movie
irb(main):004:0" {
irb(main):005:0" }
irb(main):006:0" INPUT
=> "class Movie\n{\n}\n"
irb(main):007:0> lexer = Lexer.new.tokenize(input)
=> [[:CLASS, "class"], [:SYMBOL, "ovie"]]
irb(main):008:0>
Notice that output there? The ‘M’ in ‘Movie’ did not match the regular expression but the ‘ovie’ in ‘Movie’ did. Well, clearly that’s not what we want. We want ‘Movie’ as a whole to be tokenized. But as what kind of token? Well, since it starts with a capital letter, Ruby treats it as a constant (and in languages like C# and Java, capitalized names are the convention for constants and types). So let’s add the ability to tokenize a constant:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        char_pos += constant.size
      else
        char_pos += 1
      end
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
```
Now if you were to perform that same IRB session you’d see that you get this:
[[:CLASS, "class"], [:CONSTANT, "Movie"]]
Great! So now let’s run our test. Everything should be perfect, right?
rspec spec/lexer_spec.rb
And the output is ….?
1) Testing the Lexer should parse a class definition
   Failure/Error: @result.should == class_test_tokens
   expected: [[:CLASS, "class"], [:CONSTANT, "Movie"], [:NEWLINE, "\n"], [:NEWLINE, "\n"]]
        got: [[:CLASS, "class"], [:CONSTANT, "Movie"]] (using ==)
Bummer. The problem is we’re not handling those newlines. The newlines matter to me because they are how I indicate that the class definition is starting and ending. In reality, of course, it’s the curly brace characters that are doing that. Those are what delimit the class. But, for now, all that I need to know is that these are newlines. We can add one more bit to our lexer:
```ruby
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!

    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        char_pos += constant.size
      elsif chunk.match(/\A\n+/)
        tokens << [:NEWLINE, "\n"]
        char_pos += 1
      else
        char_pos += 1
      end
    end # while: lexer loop

    tokens
  end # method: tokenize
end # class: Lexer
```
With that, your test should pass.
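If you want to double-check outside of RSpec, here’s the finished lexer run directly against the spec input (the class body simply repeats the lexer code from above so the snippet runs on its own):

```ruby
# The finished lexer from this post, verified against the spec input.
class Lexer
  KEYWORDS = ["class"]

  def tokenize(input)
    input.chomp!
    char_pos = 0
    tokens = []

    while char_pos < input.size
      chunk = input[char_pos..-1]

      if symbol = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(symbol)
          tokens << [symbol.upcase.to_sym, symbol]
        else
          tokens << [:SYMBOL, symbol]
        end
        char_pos += symbol.size
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        char_pos += constant.size
      elsif chunk.match(/\A\n+/)
        tokens << [:NEWLINE, "\n"]
        char_pos += 1
      else
        char_pos += 1   # skip spaces, braces, and anything else unrecognized
      end
    end

    tokens
  end
end

p Lexer.new.tokenize("class Movie\n{\n}\n")
# => [[:CLASS, "class"], [:CONSTANT, "Movie"], [:NEWLINE, "\n"], [:NEWLINE, "\n"]]
```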
Wow! Was that a slog or what? In many ways, I’ve ended up right where I was in previous posts. BUT … what I did this time was develop a better understanding of exactly what I’m doing. That’s going to be key to moving on to the next part, which is getting a parser in place.
I do plan on taking any remaining posts in this “Build Your Own Language” series just about as slow as I did this one. The goal is to learn, not just to end up with a bunch of stuff that someone may be able to figure out if they’re lucky. I’m doing this as much for myself as for anyone who may follow in my footsteps. Whether these posts prove useful or serve as a cautionary tale remains to be seen.