From the ANTLR website: “ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees.”
Unit tests for ANTLR Lexer
ANTLR is a great tool for building a Domain-Specific Language (DSL), or a language of our own. From a text file called the grammar, it generates the classes responsible for parsing text written in that language and walking the resulting parse tree.
The grammar is composed of two types of rules:
- Lexer
- Parser
There are several interesting articles about using ANTLR, but useful material on how to unit test a grammar is hard to find.
As with any software component, it is important to test each part of a language, especially for something as complex as a custom language.
Most articles show only how to test the listener or the integrated solution.
This article shows a way to write unit tests for the Lexer.
Lexer
The Lexer is responsible for converting a stream of characters into tokens. Its rules are applied first, before any parser rule.
Lexer rule names start with an uppercase letter and are conventionally written in UPPER_CASE. They can live in the same file as the parser rules or in a separate file.
For instance, given the rule BOOLEAN : 'true' | 'false' ; the Lexer emits a BOOLEAN token whenever it matches the string true or false in the input text.
More Examples:
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
OPEN_BRACE : '{' ;
CLOSE_BRACE : '}' ;
BOOLEAN : 'true' | 'false' ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
Testing a Lexer therefore means checking that a given text is turned into the expected sequence of tokens.
Reading and Listening to errors
The code samples here are written in Kotlin. Since Kotlin runs on the JVM, they look much like Java and should be easy for Java developers to follow.
First, let’s create a method with some boilerplate code to invoke the Lexer:
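import java.io.ByteArrayInputStream
import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream
import org.antlr.v4.runtime.Token

private fun getTokensFromText(txt: String): List<Token> {
    // Kotlin has no `new` keyword: constructors are called like functions
    val iStream = ByteArrayInputStream(txt.toByteArray())
    val cStream = CharStreams.fromStream(iStream)
    val lex = MyLexer(cStream)
    lex.addErrorListener(errorListener) // reports lexing errors
    val tokenStream = CommonTokenStream(lex)
    tokenStream.fill() // read every token, up to and including EOF
    return tokenStream.tokens
}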
This accepts a string (txt) and feeds it to MyLexer, the Lexer class generated from the grammar.
Then we add an error listener so the test can find out what went wrong. Without it, errors only go to ANTLR's default console listener, which prints them to standard error where a test cannot check them.
This listener is just an implementation of ANTLRErrorListener.
The method returns the list of tokens, including the final EOF token.
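The article does not show the listener itself, so here is a minimal sketch of what it could look like. The class name CollectingErrorListener and its errors property are hypothetical. The sketch extends BaseErrorListener from the ANTLR runtime, which implements ANTLRErrorListener with empty methods, so only syntaxError needs to be overridden.

```kotlin
import org.antlr.v4.runtime.BaseErrorListener
import org.antlr.v4.runtime.RecognitionException
import org.antlr.v4.runtime.Recognizer

// Hypothetical helper (not from the original project): records every error
// reported by the Lexer so that a test can assert on them afterwards.
class CollectingErrorListener : BaseErrorListener() {
    private val collected = mutableListOf<String>()

    // Read-only snapshot of the messages recorded so far.
    val errors: List<String>
        get() = collected.toList()

    override fun syntaxError(
        recognizer: Recognizer<*, *>?,
        offendingSymbol: Any?,
        line: Int,
        charPositionInLine: Int,
        msg: String?,
        e: RecognitionException?
    ) {
        // Keep the position with the message to make failures easy to locate.
        collected += "line $line:$charPositionInLine $msg"
    }
}
```

With something like this assigned to errorListener, a test could finish with a check that errorListener.errors is empty, or assert on a specific message for deliberately invalid input. Calling lex.removeErrorListeners() before adding the listener would also silence the default console output.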
The tests
With the method above in place, we only need to feed it some strings and check the tokens it returns.
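@Test
fun testFunction() {
    // Expects the grammar to skip whitespace (e.g. a WS rule with "-> skip"), which the excerpt above does not show
    val tokens = getTokensFromText("function ()")
    assertEquals(4, tokens.size) // includes EOF
    assertEquals(MyLexer.WORD, tokens[0].type)
    assertEquals(MyLexer.OPEN_PAR, tokens[1].type)
    assertEquals(MyLexer.CLOSE_PAR, tokens[2].type)
    assertEquals(Token.EOF, tokens[3].type)
}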
Final notes
The most common mistake when writing a Lexer grammar is leaving some ambiguity between rules. With tests like these, it is easy to catch that problem.
REMEMBER: the ANTLR Lexer tries to match the longest possible token. When two rules match the same text, the rule defined first in the grammar wins.
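As an illustration of how such an ambiguity can be pinned down, the sketch below tests the overlap between BOOLEAN and WORD. It assumes MyLexer was generated from the rules in the order listed earlier (BOOLEAN before WORD) and that kotlin.test is the test framework. The class name LongestMatchTest and the helper tokenTypes are hypothetical.

```kotlin
import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream
import kotlin.test.Test
import kotlin.test.assertEquals

class LongestMatchTest {
    // Hypothetical helper: lexes the text and returns the token types,
    // dropping the trailing EOF token.
    private fun tokenTypes(txt: String): List<Int> {
        val tokenStream = CommonTokenStream(MyLexer(CharStreams.fromString(txt)))
        tokenStream.fill()
        return tokenStream.tokens.map { it.type }.dropLast(1)
    }

    @Test
    fun keywordAloneIsBoolean() {
        // "true" matches BOOLEAN and WORD with the same length;
        // BOOLEAN is assumed to be declared first, so it should win.
        assertEquals(listOf(MyLexer.BOOLEAN), tokenTypes("true"))
    }

    @Test
    fun keywordPrefixIsWord() {
        // WORD can consume all of "trueValue", which is longer than
        // the "true" that BOOLEAN could match.
        assertEquals(listOf(MyLexer.WORD), tokenTypes("trueValue"))
    }
}
```

If WORD were moved above BOOLEAN in the grammar, the first test would be expected to fail, which is exactly the kind of regression these unit tests are meant to expose.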
This article is based on my open source project Tevim. Check out:
- The Lexer grammar
- The LexerTest