Build Your Own Language, Part 1

In previous posts I talked about how one of my goals was to come up with an alternative to Gherkin. This is not in the “just to see if I can” category but rather because I believe it could be useful to roll your own Gherkin-like language. This post, however, will be in the “just to see if I can” category. Here I want to take what I learned about lexing and parsing and apply that to creating a simple programming language. Fair warning: I have no idea how this is going to turn out.

This post is definitely the equivalent of a development spike for me in that I plan on anything I do here being throwaway code. So let’s recap some things I talked about in previous posts on this topic.

The lexer is the part of a language implementation that converts the input — the code you want to execute — into tokens that the parser can understand. The lexer splits the input it's given into tokens and tags each token with its type. All of this makes the parser's job easier because the parser doesn't have to worry about whether something is a string or a number or a constant or whatever other constructs your language has.

Tools like Rex and Lex — the former of which I looked at in my previous posts — are really not lexers. They are lexer generators or lexer compilers: you supply these tools with a grammar and they output a lexer.

By themselves, the tokens output by the lexer are really just the building blocks of a budding language. The parser then provides context for those tokens by organizing them into a structure. Whereas a lexer produces an array of tokens, a parser produces a tree of nodes. There are two main kinds of tree: a parse tree is a record of the tokens that were used to match some input text, while a syntax tree records the structure of the input and is independent of the grammar that produced it. The most common parser output is called an Abstract Syntax Tree (AST).
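To make that a bit more concrete, here's a rough sketch of the kind of tree a parser might build for a simple conditional. The node names here are purely illustrative — they aren't from any particular parser generator:

```ruby
# Hypothetical node types a parser might produce. The names are
# my own invention for illustration.
IfNode      = Struct.new(:condition, :body)
CompareNode = Struct.new(:left, :operator, :right)
CallNode    = Struct.new(:name, :arguments)

# A possible AST for: if working == True { print "..." }
tree = IfNode.new(
  CompareNode.new("working", "==", "True"),
  [CallNode.new("print", ["Solidum petit in profundis."])]
)
```

Notice that the tree captures the nesting — the print call lives inside the body of the if — which is exactly the context a flat array of tokens can't express.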

Just as there are lexer generators, as I mentioned above, there are also parser generators. Parser generators are commonly used to accomplish the often tedious task of building a parser from scratch. This in fact is what I used Racc for in some of my previous posts.

So for the purposes of this experiment, I’m going to start simple. Rather than worrying about the structure of a new programming language, I’m just going to see if I can handle one aspect of a programming language. Specifically, I want to be able to have this logic parsed:

if working == True
{
  print "Solidum petit in profundis."
}

From a lexing standpoint, here is what would get generated:

[IDENTIFIER if] [IDENTIFIER working] [EQUAL] [IDENTIFIER True]
[OPENBRACE {]
[IDENTIFIER print] [STRING "Solidum petit in profundis."]
[CLOSEBRACE }]

Although, as you’ll see, it’s actually a bit more complicated than that. But that’s what I want to handle. So what I want is a lexer that will give me that output so that eventually I can feed it to a parser. Now, in my prior post we used Rex to generate a lexer for us. It’s also possible — and, as I’ve found, easier — to roll your own lexer.

First, let’s create the most simple aspects of the lexer. Create a file called lexer.rb.
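The skeleton might look something like this — the class name and the keyword list are assumptions based on the example input we want to handle:

```ruby
# lexer.rb
class Lexer
  # The only keyword the example input needs so far.
  KEYWORDS = ["if"]

  def tokenize(code)
    # Remove any trailing newline from the input.
    code = code.chomp

    # This array will hold the tokens as they are found.
    tokens = []

    # ... token matching will go here ...

    tokens
  end
end
```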

This is just a skeleton of what we need. You can see the only keywords I’ll have in my language so far. As with the lexer we generated with Rex, we have to create a tokenize method to actually break down the input. Here I’ve started the basis of that: I remove any newline from the end of the input string, and then I provide a tokens array that will hold the tokens that are found. The array of tokens will be returned from the method when all is said and done.

So now I’ll put the meat into the lexer:
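Here is a sketch of what that might look like. The specific regular expressions are my own guesses at logic that produces the token output shown later in this post:

```ruby
# lexer.rb
class Lexer
  KEYWORDS = ["if"]

  def tokenize(code)
    code = code.chomp
    tokens = []
    i = 0

    while i < code.size
      chunk = code[i..-1]

      # Identifiers (and keywords) start with a lowercase letter.
      if identifier = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(identifier)
          # Keywords get their own symbol, e.g. [:IF, "if"].
          tokens << [identifier.upcase.to_sym, identifier]
        else
          tokens << [:IDENTIFIER, identifier]
        end
        i += identifier.size
      # Constants start with an uppercase letter.
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        i += constant.size
      # Strings are anything between double quotes.
      elsif string = chunk[/\A"([^"]*)"/, 1]
        tokens << [:STRING, string]
        i += string.size + 2
      # Newlines become their own token.
      elsif chunk.start_with?("\n")
        tokens << [:NEWLINE, "\n"]
        i += 1
      # Spaces are skipped entirely.
      elsif chunk.start_with?(" ")
        i += 1
      # Anything else (operators, braces) passes through as itself,
      # one character at a time.
      else
        value = chunk[0]
        tokens << [value, value]
        i += 1
      end
    end

    tokens
  end
end
```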

Here I’m basically just using regular expressions to parse identifiers, constants, strings, newlines, and spaces. All of these will be added to the tokens array, each with a specific symbol and the specific value of the token.

So now let’s test if this is working. Create a spec directory and within that create a lexer_spec.rb. This is where we’ll test out the logic. However, earlier I had mentioned that a barebones lexer output might be this:

[IDENTIFIER if] [IDENTIFIER working] [EQUAL] [IDENTIFIER True]
[OPENBRACE {]
[IDENTIFIER print] [STRING "Solidum petit in profundis."]
[CLOSEBRACE }]

Given the logic I put in the lexer.rb file, you can perhaps see that instead the lexer output will be this:

[:IF, "if"] [:IDENTIFIER, "working"] ["=", "="] ["=", "="] [:CONSTANT, "True"]
[:NEWLINE, "\n"]
["{", "{"]
[:NEWLINE, "\n"]
[:IDENTIFIER, "print"] [:STRING, "Solidum petit in profundis."]
[:NEWLINE, "\n"]
["}", "}"]

If that is in no way clear to you, I recommend looking at the logic of the tokenize method. Look at how the tokens array is built up and returned. This is a useful exercise in terms of understanding how these things work. We already know our input:

if working == True
{
  print "Solidum petit in profundis."
}

With this information, here’s the start of the logic I’ll use in lexer_spec.rb:
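A sketch of that spec, assuming RSpec and the file layout described above, might look like this:

```ruby
# spec/lexer_spec.rb
require_relative "../lexer"

describe Lexer do
  it "tokenizes a conditional with a block" do
    code = <<~CODE
      if working == True
      {
        print "Solidum petit in profundis."
      }
    CODE

    tokens = [
      [:IF, "if"], [:IDENTIFIER, "working"],
      ["=", "="], ["=", "="], [:CONSTANT, "True"],
      [:NEWLINE, "\n"],
      ["{", "{"],
      [:NEWLINE, "\n"],
      [:IDENTIFIER, "print"], [:STRING, "Solidum petit in profundis."],
      [:NEWLINE, "\n"],
      ["}", "}"]
    ]

    expect(Lexer.new.tokenize(code)).to eq(tokens)
  end
end
```

Note that the expected tokens array mirrors the lexer output shown above, newline tokens and all.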

If you want to test some more isolated bits of functionality to prove to yourself how things work, you can use tests like these:
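For example, inside the same describe block, you could exercise each token type on its own (again, a sketch assuming RSpec):

```ruby
it "tokenizes an identifier" do
  expect(Lexer.new.tokenize("working")).to eq([[:IDENTIFIER, "working"]])
end

it "tokenizes a constant" do
  expect(Lexer.new.tokenize("True")).to eq([[:CONSTANT, "True"]])
end

it "tokenizes a string" do
  expect(Lexer.new.tokenize('"text"')).to eq([[:STRING, "text"]])
end

it "tokenizes a keyword" do
  expect(Lexer.new.tokenize("if")).to eq([[:IF, "if"]])
end
```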

If you are playing around with this stuff, it’s really important to get a feel for why the above logic works. A perfect way to do that is to try to modify aspects of it. As an example of that, let’s say that I want to recognize the braces as a particular symbol rather than just as a string. I can simply make that a specific check in lexer.rb, as such:
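One way to do that is to add explicit branches for the braces ahead of the catch-all single-character branch. The :OPENBRACE and :CLOSEBRACE symbol names are my own choice here:

```ruby
# In the tokenize loop, before the final else branch:
elsif chunk.start_with?("{")
  tokens << [:OPENBRACE, "{"]
  i += 1
elsif chunk.start_with?("}")
  tokens << [:CLOSEBRACE, "}"]
  i += 1
```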

(I’ll assume that you can find the relevant place in the file to add this bit of logic.) That change would in turn require a change to the tokens array in the lexer_spec.rb file as such:
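With the braces tagged that way, the expected array in the spec would become something like:

```ruby
# The brace entries now carry symbols instead of bare strings.
tokens = [
  [:IF, "if"], [:IDENTIFIER, "working"],
  ["=", "="], ["=", "="], [:CONSTANT, "True"],
  [:NEWLINE, "\n"],
  [:OPENBRACE, "{"],
  [:NEWLINE, "\n"],
  [:IDENTIFIER, "print"], [:STRING, "Solidum petit in profundis."],
  [:NEWLINE, "\n"],
  [:CLOSEBRACE, "}"]
]
```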

This is a pretty minimal start but I really recommend playing around with simple examples like I’ve provided here. Maybe it’s just my lack of searching ability, but I rarely find too much useful stuff like this on the web. (Granted, that presumes what I posted here is useful.) I find plenty of information about lexing and compiler construction but the information is often so obtuse — or over my head, if you prefer — that I find I can’t even get started. Hopefully what I’m providing here is a bit different.

I found it’s best to take things slow when learning these concepts and give people something to play with before delving too much into an extended example. What I plan to do next is use the knowledge from this post as well as my three posts on Rex and Racc — see part 1, part 2, part 3 — to give you a full walkthrough of building a language, from the lexer on down to a runtime that can actually execute aspects of the language you are creating.

This article was written by Jeff Nyman
