Lexical Analyzer

A lexical analyzer, also known as a lexer or scanner, is a program that converts a sequence of characters into a sequence of tokens. For example, the input string 1 + 2 might become the token sequence: number, operator, number. Lexing simplifies and speeds up parsing because the parser does not have to recognize low-level patterns such as whitespace, keywords, identifiers, or numbers. Instead, it can focus on the high-level grammar and structure of the input.

A lexer is usually implemented as a deterministic finite automaton, and the DFA is typically specified using regular expressions. This is a natural choice for us too, as JavaScript supports regular expressions natively.

import * as inp from "./input"
import * as utils from "./utils"
import * as err from "./error"

Recognizing a Token

The mapping from regular expressions to tokens is defined using the TokenMatcher<S> interface. A token can be of any type S, although usually an enumeration is used.

interface TokenMatcher<S> {
    regex: string
    token: S
}
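
To make the interface concrete, here is a sketch of a token enumeration and a matcher for it. The Calculator enumeration and the regex are illustrative assumptions, not part of the module.

enum Calculator { Number, Plus, Whitespace, Eof }

const numberMatcher: TokenMatcher<Calculator> = {
    regex: "\\d+(\\.\\d+)?",
    token: Calculator.Number
}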

Representing a Token

When a token is recognized, it is wrapped in a Token<S> object. This object also contains the recognized string, which is used for error reporting and diagnostics.

export class Token<S> {
    constructor(readonly token: S, readonly text: string) { }

We override the toString method so that we can output a token to the screen.

    toString() {
        return this.text ? utils.escapeWhitespace(this.text) : this.token
    }
}
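
As a quick illustration, printing a token outputs its recognized text. Calculator is the hypothetical enumeration sketched above, and we assume that utils.escapeWhitespace leaves ordinary characters untouched.

let num = new Token(Calculator.Number, "3.14")
console.log(num.toString()) // prints "3.14"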

Lexer

The lexer itself is a simple class that contains all the TokenMatchers and recognizes the next token in a string.

export class Lexer<S> {
    private tokens: TokenMatcher<S>[]
    private matcher: RegExp

The constructor initializes a lexer by:

  1. Converting a tokens object into an array of regex-token pairs.
  2. Building a combined regex pattern with named capture groups (g0, g1, etc.) for each token's regex.
  3. Creating a RegExp object matcher to match input strings against all token patterns.

The constructor adds two flags to the regular expression. The y flag makes the search sticky, so that scanning starts from the position indicated by the lastIndex property. The u flag turns on Unicode support.

    constructor(tokens: Record<string, S>) {
        this.tokens = Object.entries(tokens).map(([regex, token]) => ({
            regex, token }))
        let re = this.tokens.map((t, i) => `(?<g${i}>${t.regex})`).join("|")
        this.matcher = new RegExp(re, "yu")
    }
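
As a sketch, a lexer for the hypothetical Calculator tokens above could be constructed like this. Note that the regexes are the keys of the record and the tokens are the values:

const calcLexer = new Lexer<Calculator>({
    "\\d+(\\.\\d+)?": Calculator.Number,
    "\\+": Calculator.Plus,
    "[ \\t\\r\\n]+": Calculator.Whitespace
})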

The matchToken method returns the next token in the input string. It uses the matcher to find a token at the specified position (pos). If a match is found, it checks which token group (g0, g1, ...) matched and returns the corresponding Token. If no match is found, it returns null.

    matchToken(input: string, pos: number): Token<S> | null {
        this.matcher.lastIndex = pos
        let match = this.matcher.exec(input)
        if (match && match.groups) {
            for (let i = 0; i < this.tokens.length; i++) {
                let res = match.groups[`g${i}`]
                if (res)
                    return new Token<S>(this.tokens[i].token, res)
            }
        }
        return null
    }
}
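
For illustration, calling matchToken directly with the hypothetical calcLexer above behaves like this:

let first = calcLexer.matchToken("1 + 2", 0)
// first?.token == Calculator.Number, first?.text == "1"
let second = calcLexer.matchToken("1 + 2", 1)
// second?.token == Calculator.Whitespace, second?.text == " "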

Lexer as Input

We can integrate lexing directly into the parsing process by implementing the ParserInput interface for tokens. We don't expose the LexerInput class outside the module; it can be created with the lexerInput function.

class LexerInput<S> implements inp.ParserInput<Token<S>> {

We define the fields of the ParserInput interface.

    position: number
    current: Token<S>
    state: any

We also need to store the input string and the lexer used. We get these as constructor arguments.

    private input: string
    private lexer: Lexer<S>

To avoid recognizing the same token multiple times as the parser backtracks, we store all the tokens we have matched so far in an array.

    private tokens: Token<S>[]

We also store the token designating the end of the input.

    private eof: Token<S>

The constructor creates an input stream for the given string and lexer, and initializes the instance variables.

    constructor(input: string, lexer: Lexer<S>, eof: Token<S>) {
        this.input = input
        this.lexer = lexer
        this.tokens = new Array<Token<S>>(input.length)
        this.position = -1
        this.eof = eof
        this.current = this.eof
    }

The iterator implementation is fairly straightforward. We need to make sure that the state variables position and current are kept in sync while we advance through the input string. We must also do a lookup in the cache before calling the lexer to recognize the token. If the lexer finds a match, we update the cache. If the lexer cannot recognize the next token, we throw a ParseError.

    next(): Token<S> {
        let pos = this.position
        pos += this.tokens[pos] ? this.tokens[pos].text.length : 1
        if (pos >= this.input.length)
            return this.current = this.eof
        this.position = pos
        let match = this.tokens[pos] || this.lexer.matchToken(this.input, pos)
        if (!match)
            throw new err.ParseError(err.ErrorSource.Lexer, pos,
                this.input.slice(pos, pos + 10) + "...", ["<valid token>"])
        this.tokens[pos] = match
        this.current = match
        return match
    }
}

Finally, the exported lexerInput function creates an input stream for a given text string using the given lexer.

export function lexerInput<S>(text: string, lexer: Lexer<S>, eof: Token<S>):
    inp.ParserInput<Token<S>> {
    return new LexerInput<S>(text, lexer, eof)
}
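
To tie everything together, here is a minimal sketch that tokenizes a short expression, assuming the hypothetical Calculator enumeration and calcLexer from above, and assuming that ParserInput exposes the next method implemented earlier. The Eof member serves as the sentinel returned at the end of the input.

const eofToken = new Token(Calculator.Eof, "")
const stream = lexerInput("1 + 2", calcLexer, eofToken)
let token = stream.next()
while (token.token != Calculator.Eof) {
    console.log(token.toString()) // prints each token, whitespace escaped by toString
    token = stream.next()
}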