A lexical analyzer, also known as a lexer or scanner, is a program that converts a sequence of characters into a sequence of tokens. It simplifies and speeds up parsing because the parser does not have to recognize low-level patterns such as whitespace, keywords, identifiers, or numbers. Instead, the parser can focus on the high-level grammar and structure of the input.
A lexer is usually implemented as a deterministic finite automaton, and the DFA is typically specified with regular expressions. This is a natural choice for us too, as JavaScript supports them natively.
import * as inp from "./input"
import * as utils from "./utils"
import * as err from "./error"
The mapping from regular expressions to tokens is defined using the TokenMatcher<S> interface. A token can be of any type S, although usually an enumeration is used.
interface TokenMatcher<S> {
    regex: RegExp
    token: S
}
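For example, a matcher table for a toy arithmetic language could look like the sketch below (the ExprToken enum and the patterns are hypothetical, shown only to illustrate the shape of the data):
// A hypothetical token enumeration.
enum ExprToken { Number, Plus, Whitespace }

// Each matcher pairs a sticky, Unicode-aware pattern with the token it produces.
const matchers: TokenMatcher<ExprToken>[] = [
    { regex: /\d+/yu, token: ExprToken.Number },
    { regex: /\+/yu, token: ExprToken.Plus },
    { regex: /\s+/yu, token: ExprToken.Whitespace }
]
In practice we don't build these objects by hand; the Lexer constructor below creates them from plain [RegExp, S] pairs.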
When a token is recognized, it is wrapped in a Token<S> object. This also contains the recognized string for error reporting and diagnostics.
export class Token<S> {
    constructor(readonly token: S, readonly text: string) { }
We override the toString function so that we can output a token to the screen.
    toString() {
        return this.text ? utils.escapeWhitespace(this.text) : this.token
    }
}
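For instance, a token wrapping the text "42" prints as that text; for tokens whose text is pure whitespace, escapeWhitespace presumably renders the characters in a visible form (the ExprToken enum is again hypothetical):
enum ExprToken { Number }

let token = new Token<ExprToken>(ExprToken.Number, "42")
console.log(`${token}`) // prints "42"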
The lexer itself is a simple class that contains all the TokenMatchers and recognizes the next token in a string.
export class Lexer<S> {
    private matchers: TokenMatcher<S>[]
The constructor adds two flags to the regular expressions given as arguments. The y flag makes the search sticky, so that it scans the input string from the position indicated by the lastIndex property. The u flag makes the search support Unicode characters.
    constructor(...tokens: [RegExp, S][]) {
        this.matchers = tokens.map(t => ({
            regex: new RegExp(t[0], "yu"),
            token: t[1]
        }))
    }
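To see what the sticky flag does in isolation, consider this small standalone sketch: a sticky regular expression matches only at exactly lastIndex, which is what lets the lexer attempt a match at any given position:
let digits = /\d+/y
digits.lastIndex = 4
console.log(digits.exec("abc 123")?.[0]) // "123", the match starts exactly at index 4
digits.lastIndex = 0
console.log(digits.exec("abc 123"))      // null, index 0 is not a digit and a sticky search does not scan forward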
We check the matchers one by one, in the order they were given, to recognize the token at the given position. If none of the matchers succeed, we return null.
    matchToken(input: string, pos: number): Token<S> | null {
        for (let i = 0; i < this.matchers.length; i++) {
            let matcher = this.matchers[i]
            matcher.regex.lastIndex = pos
            let match = matcher.regex.exec(input)
            if (match != null)
                return new Token<S>(matcher.token, match[0])
        }
        return null
    }
}
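Here is a hypothetical example of the class in use. Since the matchers are tried in the order they were given, more specific patterns such as keywords must precede more general ones such as identifiers:
enum ExprToken { If, Ident, Number, Whitespace }

let lexer = new Lexer<ExprToken>(
    [/if/, ExprToken.If],        // keywords before identifiers
    [/[a-z]+/, ExprToken.Ident],
    [/\d+/, ExprToken.Number],
    [/\s+/, ExprToken.Whitespace])

console.log(lexer.matchToken("if x", 0)?.text) // "if", recognized by the keyword rule
console.log(lexer.matchToken("if x", 3)?.text) // "x", recognized by the identifier rule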
We can integrate lexing directly into the parsing process by implementing the ParserInput interface for any token type. We don't expose the LexerInput class outside the module; it can be created with the lexerInput function.
class LexerInput<S> implements inp.ParserInput<Token<S>> {
We define the fields of the ParserInput interface.
    position: number
    current: Token<S>
    state: any
We also need to store the input string and the lexer used. We get these as arguments.
    private input: string
    private lexer: Lexer<S>
To avoid recognizing the same token multiple times as the parser backtracks, we store all the tokens we have matched so far in an array.
    private tokens: Token<S>[]
We also store the token designating the end of the input.
    private eof: Token<S>
Create an input stream for the given string and lexer, and initialize the instance variables.
    constructor(input: string, lexer: Lexer<S>, eof: Token<S>) {
        this.input = input
        this.lexer = lexer
        this.tokens = new Array<Token<S>>(input.length)
        this.position = -1
        this.eof = eof
        this.current = this.eof
    }
The iterator implementation is fairly straightforward. We need to make sure that the state variables position and current are kept in sync while we advance in the input string. We must also do a lookup in the cache before calling the lexer to recognize the token. If the lexer finds a match, we update the cache. If the lexer cannot recognize the next token, we throw a ParseError.
    next(): Token<S> {
        let pos = this.position
        pos += this.tokens[pos] ? this.tokens[pos].text.length : 1
        if (pos >= this.input.length)
            return this.eof
        this.position = pos
        let match = this.tokens[pos] || this.lexer.matchToken(this.input, pos)
        if (!match)
            throw new err.ParseError(err.ErrorSource.Lexer, pos,
                this.input.substr(pos, 10) + "...", ["<valid token>"])
        this.tokens[pos] = match
        this.current = match
        return match
    }
}
Create an input stream for the given text string using the given lexer.
export function lexerInput<S>(text: string, lexer: Lexer<S>, eof: Token<S>):
    inp.ParserInput<Token<S>> {
    return new LexerInput<S>(text, lexer, eof)
}
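To close, here is a minimal end-to-end sketch. It assumes a hypothetical ExprToken enum, a dedicated EOF token whose empty text cannot occur in the input, and that we drive the stream directly through the next method shown above:
enum ExprToken { Number, Plus, Whitespace, EOF }

let lexer = new Lexer<ExprToken>(
    [/\d+/, ExprToken.Number],
    [/\+/, ExprToken.Plus],
    [/\s+/, ExprToken.Whitespace])

let input = lexerInput("1 + 2", lexer, new Token(ExprToken.EOF, ""))
let token = input.next()
while (token.token != ExprToken.EOF) {
    console.log(`${token}`) // prints the tokens in turn (whitespace rendered by escapeWhitespace)
    token = input.next()
}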