Spring Boot integrates ANTLR for lexical and syntactic analysis

HBLOG
8 min read · Jun 7, 2024


1. What is ANTLR?

ANTLR 4 is a powerful parser generator that can be used to read, process, execute, or translate structured text or binary files. It is probably the most widely used parser generator in the Java ecosystem today. Twitter search uses ANTLR for query parsing, processing more than 2 billion queries per day; Hive, Pig, and other data warehousing and analytics systems in the Hadoop ecosystem use ANTLR for their languages; Lex Machina uses ANTLR to analyze legal texts; Oracle uses ANTLR in the SQL Developer IDE and its migration tools; the NetBeans IDE uses ANTLR to parse C++; and the Hibernate object-relational mapping (ORM) framework uses ANTLR to process HQL.

Basic concepts

A language recognizer consists of two parts: a lexer and a parser. The lexical analysis stage groups the input characters into tokens such as keywords and identifiers (INT, ID, and so on), and the syntactic analysis stage builds a parse tree from those tokens. The general process is illustrated in Ref. 2.

Therefore, for lexical and syntactic analysis to work with ANTLR 4, you need to define a grammar, written in the ANTLR metalanguage.

Programming with ANTLR 4 follows a fixed process that is usually divided into three steps:

  • Write the grammar rules for your custom language in ANTLR 4 syntax, and save them in a file with the .g4 suffix.
  • Run the ANTLR 4 tool on the .g4 file to generate the lexer code, the parser code, and the token (.tokens) files.
  • Write code that extends the generated base Visitor class or implements the generated Listener interface to add your own business logic.

The difference between Listener mode and Visitor mode

[Figures: Listener mode; Visitor mode]

  • In Listener mode, a walker object drives the traversal, so your code does not need to care about the hierarchy of the syntax tree. In Visitor mode, your code controls which child nodes are visited; if a child node is skipped, its entire subtree is never visited.
  • Listener methods have no return value, whereas Visitor methods can return a value of any type you choose.
  • In Listener mode the traversal is managed by the walker, while in Visitor mode it runs on the method call stack of your own visit calls, so a faulty implementation can cause a StackOverflowError.
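The differences listed above can be sketched without ANTLR at all. The following is a minimal illustration on a hand-made tree, not the ANTLR API: the names TraversalSketch, Node, Listener, walk, and visit are all hypothetical, and the walker here only fires an exit callback, whereas a real ANTLR walker also fires enter callbacks.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class TraversalSketch {
    /** Minimal tree node: a leaf holds a number, an inner node sums its children. */
    static final class Node {
        final int number;
        final List<Node> children = new ArrayList<>();
        Node(int number) { this.number = number; }
        Node(Node... kids) { this.number = 0; for (Node k : kids) children.add(k); }
        boolean isLeaf() { return children.isEmpty(); }
    }

    /** Listener style: the callback returns nothing. */
    interface Listener { void exit(Node node); }

    /** The walker owns the traversal and always reaches every child. */
    static void walk(Listener listener, Node node) {
        for (Node child : node.children) walk(listener, child);
        listener.exit(node);
    }

    /** Listener that keeps intermediate results in a side map keyed by node identity. */
    static final class SumListener implements Listener {
        final Map<Node, Integer> values = new IdentityHashMap<>();
        @Override public void exit(Node node) {
            int sum = node.number;
            for (Node child : node.children) sum += values.get(child);
            values.put(node, sum);
        }
    }

    /** Visitor style: each visit returns a value and must recurse explicitly. */
    static int visit(Node node, boolean skipLastChild) {
        if (node.isLeaf()) return node.number;
        int sum = 0;
        int limit = node.children.size() - (skipLastChild ? 1 : 0);
        for (int i = 0; i < limit; i++) sum += visit(node.children.get(i), skipLastChild);
        return sum;
    }

    /** Tree for 1 + (2 + 3). */
    static Node sample() {
        return new Node(new Node(1), new Node(new Node(2), new Node(3)));
    }

    static int sumWithListener() {
        Node root = sample();
        SumListener listener = new SumListener();
        walk(listener, root);
        return listener.values.get(root);
    }

    public static void main(String[] args) {
        System.out.println(sumWithListener());       // 6
        System.out.println(visit(sample(), false));  // 6
        System.out.println(visit(sample(), true));   // 1, the skipped subtree is never seen
    }
}
```

The listener has to park intermediate results in a side map because its callback cannot return anything, while the visitor simply returns them; the last call shows how skipping one child silently drops its whole subtree.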

2. Code engineering

Objective: To implement an ANTLR-based calculator

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>springboot-demo</artifactId>
        <groupId>com.et</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>ANTLR</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <antlr4.version>4.9.1</antlr4.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-runtime</artifactId>
            <version>${antlr4.version}</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.antlr</groupId>
                <artifactId>antlr4-maven-plugin</artifactId>
                <version>${antlr4.version}</version>
                <configuration>
                    <sourceDirectory>src/main/java</sourceDirectory>
                    <outputDirectory>src/main/java</outputDirectory>
                    <arguments>
                        <argument>-visitor</argument>
                        <argument>-listener</argument>
                    </arguments>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>antlr4</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Grammar file LabeledExpr.g4

grammar LabeledExpr; // rename to distinguish from Expr.g4

prog:   stat+ ;

stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;

expr:   expr op=('*'|'/') expr      # MulDiv
    |   expr op=('+'|'-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ')'                # parens
    ;

MUL :   '*' ; // assigns token name to '*' used above in grammar
DIV :   '/' ;
ADD :   '+' ;
SUB :   '-' ;
ID  :   [a-zA-Z]+ ;      // match identifiers
INT :   [0-9]+ ;         // match integers
NEWLINE:'\r'? '\n' ;     // return newlines to parser (is end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out whitespace

Let’s take a quick look at the LabeledExpr.g4 file. ANTLR 4 rules use a notation close to regular expressions. Read the rules from top to bottom; each statement ending in a semicolon is one rule. For example, the first line, grammar LabeledExpr;, declares that the grammar is named LabeledExpr, which must match the file name. Java has a similar convention: a public class name must match its source file name.

  • The rule prog states that a prog consists of one or more stat.
  • The rule stat has three alternatives: an expression expr followed by a newline, an assignment ID '=' expr followed by a newline, and a blank line.
  • The rule expr has five alternatives: multiplication and division, addition and subtraction, integer, ID, and parenthesized expression. Note that this definition is recursive.
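In a left-recursive rule such as expr, ANTLR 4 gives higher precedence to alternatives listed earlier, which is why MulDiv appears before AddSub. The hand-written sketch below reproduces the same precedence and left associativity for integers and parentheses only. It is not the code ANTLR generates; the class PrecedenceSketch and its method names are hypothetical.

```java
class PrecedenceSketch {
    private final String src;
    private int pos = 0;

    PrecedenceSketch(String src) { this.src = src.replace(" ", ""); }

    /** Evaluates an expression made of integers, + - * / and parentheses. */
    static int eval(String text) {
        PrecedenceSketch p = new PrecedenceSketch(text);
        int value = p.addSub();
        if (p.pos != p.src.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return value;
    }

    // Lowest precedence: '+' and '-', left-associative.
    private int addSub() {
        int value = mulDiv();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            char op = src.charAt(pos++);
            int right = mulDiv();
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // Higher precedence: '*' and '/', left-associative.
    private int mulDiv() {
        int value = atom();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            char op = src.charAt(pos++);
            int right = atom();
            value = (op == '*') ? value * right : value / right;
        }
        return value;
    }

    // INT or '(' expr ')'.
    private int atom() {
        if (pos < src.length() && peek() == '(') {
            pos++; // consume '('
            int value = addSub();
            expect(')');
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("expected INT at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() { return src.charAt(pos); }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c)
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(eval("1+2*3"));   // 7
        System.out.println(eval("(1+2)*3")); // 9
        System.out.println(eval("8-3-2"));   // 3
    }
}
```

With a grammar, the two precedence levels collapse into the order of two alternatives, and the recursion is handled by the generated parser.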

Finally come the lexer rules, the basic elements from which the composite rules are built. For example, ID: [a-zA-Z]+ states that an ID consists of one or more uppercase or lowercase letters, and INT: [0-9]+ states that an INT consists of one or more digits from 0 to 9. This definition of INT is not strict; to be stricter, its length should be limited.
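To see why an unbounded INT can be a problem, the sketch below mirrors the two character classes with java.util.regex and then tries to convert a long digit string to an int. The class TokenPatternSketch and the BOUNDED_INT pattern are hypothetical. The {1,9} quantifier is Java regex syntax; as far as I know the g4 grammar syntax has no counted quantifier, so in a real project the length check would more likely live in the visitor or listener.

```java
import java.util.regex.Pattern;

class TokenPatternSketch {
    // Same character classes as the lexer rules ID and INT.
    static final Pattern ID = Pattern.compile("[a-zA-Z]+");
    static final Pattern INT = Pattern.compile("[0-9]+");
    // Hypothetical stricter variant: at most 9 digits always fits in a Java int.
    static final Pattern BOUNDED_INT = Pattern.compile("[0-9]{1,9}");

    /** Returns true if the digit string can be parsed as a Java int. */
    static boolean fitsInInt(String digits) {
        try {
            Integer.parseInt(digits);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(ID.matcher("total").matches());                // true
        System.out.println(ID.matcher("total2").matches());               // false
        System.out.println(INT.matcher("99999999999").matches());         // true
        System.out.println(fitsInInt("99999999999"));                     // false
        System.out.println(BOUNDED_INT.matcher("99999999999").matches()); // false
    }
}
```

The eleven-digit input is a valid INT token, yet converting it with Integer.parseInt or Integer.valueOf, as the evaluators later in this article do, would throw a NumberFormatException.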

If you already understand regular expressions, the g4 grammar rules of ANTLR 4 are fairly easy to follow.

When defining ANTLR 4 rules, you need to consider the case where one string matches several rules at the same time, such as the following two rules:

ID: [a-zA-Z]+;

FROM: 'from';

Obviously, the string "from" satisfies both of the above rules. When two lexer rules match the same text of the same length, ANTLR 4 picks the rule that is defined first. Here ID is defined before FROM, so the string from is matched as an ID; to have it recognized as a keyword, define FROM before ID.
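The tie-break can be imitated in a few lines of plain Java. This is only a sketch of the selection policy (longest match wins, and on a tie the earlier rule wins), built on java.util.regex rather than on the ANTLR lexer; LexerTieBreakSketch, firstToken, and rules are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerTieBreakSketch {
    /** Returns the name of the rule that wins for the token at the start of the input. */
    static String firstToken(Map<String, Pattern> rules, String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() matches a prefix; only a strictly longer match replaces
            // the current winner, so on a tie the earlier rule is kept.
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    /** Builds an ordered rule table from alternating name and regex arguments. */
    static Map<String, Pattern> rules(String... nameThenRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>(); // preserves definition order
        for (int i = 0; i < nameThenRegex.length; i += 2)
            rules.put(nameThenRegex[i], Pattern.compile(nameThenRegex[i + 1]));
        return rules;
    }

    public static void main(String[] args) {
        Map<String, Pattern> idFirst = rules("ID", "[a-zA-Z]+", "FROM", "from");
        Map<String, Pattern> fromFirst = rules("FROM", "from", "ID", "[a-zA-Z]+");
        System.out.println(firstToken(idFirst, "from"));      // ID
        System.out.println(firstToken(fromFirst, "from"));    // FROM
        System.out.println(firstToken(fromFirst, "fromage")); // ID, the longer match wins
    }
}
```

The last call shows that putting FROM first does not break ordinary identifiers that merely start with "from", because a longer match still takes priority over rule order.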

In fact, once the g4 file is written, ANTLR 4 has done about 50% of the work for us: it generates the overall architecture and interfaces, and the remaining development work is to implement the generated interfaces or extend the generated base classes. There are two ways to process the generated syntax tree: Visitor mode and Listener mode.

Generate the lexer and parser

Generate with the Maven plugin

<plugin>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-maven-plugin</artifactId>
    <version>${antlr4.version}</version>
    <configuration>
        <sourceDirectory>src/main/java</sourceDirectory>
        <outputDirectory>src/main/java</outputDirectory>
        <arguments>
            <argument>-visitor</argument>
            <argument>-listener</argument>
        </arguments>
    </configuration>
    <executions>
        <execution>
            <goals>
                <goal>antlr4</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Execute the command

mvn antlr4:antlr4

Generate with the IDEA plugin

Implement the operation logic

The first is based on the visitor

package com.et.antlr;

import java.util.HashMap;
import java.util.Map;

public class EvalVisitor extends LabeledExprBaseVisitor<Integer> {
    // Store variables (for assignment)
    Map<String, Integer> memory = new HashMap<>();

    /** stat : expr NEWLINE */
    @Override
    public Integer visitPrintExpr(LabeledExprParser.PrintExprContext ctx) {
        Integer value = visit(ctx.expr()); // evaluate the expr child
        // System.out.println(value); // print the result
        return value; // return the computed value to the caller
    }

    /** stat : ID '=' expr NEWLINE */
    @Override
    public Integer visitAssign(LabeledExprParser.AssignContext ctx) {
        String id = ctx.ID().getText(); // id is left-hand side of '='
        int value = visit(ctx.expr()); // compute value of expression on right
        memory.put(id, value); // store it in our memory
        return value;
    }

    /** expr : expr op=('*'|'/') expr */
    @Override
    public Integer visitMulDiv(LabeledExprParser.MulDivContext ctx) {
        int left = visit(ctx.expr(0)); // get value of left subexpression
        int right = visit(ctx.expr(1)); // get value of right subexpression
        if (ctx.op.getType() == LabeledExprParser.MUL) return left * right;
        return left / right; // must be DIV
    }

    /** expr : expr op=('+'|'-') expr */
    @Override
    public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
        int left = visit(ctx.expr(0)); // get value of left subexpression
        int right = visit(ctx.expr(1)); // get value of right subexpression
        if (ctx.op.getType() == LabeledExprParser.ADD) return left + right;
        return left - right; // must be SUB
    }

    /** expr : INT */
    @Override
    public Integer visitInt(LabeledExprParser.IntContext ctx) {
        return Integer.valueOf(ctx.INT().getText());
    }

    /** expr : ID */
    @Override
    public Integer visitId(LabeledExprParser.IdContext ctx) {
        String id = ctx.ID().getText();
        if (memory.containsKey(id)) return memory.get(id);
        return 0; // default value if the variable is not found
    }

    /** expr : '(' expr ')' */
    @Override
    public Integer visitParens(LabeledExprParser.ParensContext ctx) {
        return visit(ctx.expr()); // return child expr's value
    }

    /** stat : NEWLINE */
    @Override
    public Integer visitBlank(LabeledExprParser.BlankContext ctx) {
        return 0; // return dummy value
    }
}

The second is based on the listener

package com.et.antlr;

import org.antlr.v4.runtime.tree.ParseTreeProperty;

import java.util.HashMap;
import java.util.Map;

public class EvalListener extends LabeledExprBaseListener {
    // Store variables (for assignment)
    private final Map<String, Integer> memory = new HashMap<>();
    // Store expression results, keyed by parse tree node
    private final ParseTreeProperty<Integer> values = new ParseTreeProperty<>();
    private int result = 0;

    @Override
    public void exitPrintExpr(LabeledExprParser.PrintExprContext ctx) {
        int value = values.get(ctx.expr());
        //System.out.println(value);
        result = value;
    }

    public int getResult() {
        return result;
    }

    @Override
    public void exitAssign(LabeledExprParser.AssignContext ctx) {
        String id = ctx.ID().getText();
        int value = values.get(ctx.expr());
        memory.put(id, value);
    }

    @Override
    public void exitMulDiv(LabeledExprParser.MulDivContext ctx) {
        int left = values.get(ctx.expr(0));
        int right = values.get(ctx.expr(1));
        if (ctx.op.getType() == LabeledExprParser.MUL) {
            values.put(ctx, left * right);
        } else {
            values.put(ctx, left / right);
        }
    }

    @Override
    public void exitAddSub(LabeledExprParser.AddSubContext ctx) {
        int left = values.get(ctx.expr(0));
        int right = values.get(ctx.expr(1));
        if (ctx.op.getType() == LabeledExprParser.ADD) {
            values.put(ctx, left + right);
        } else {
            values.put(ctx, left - right);
        }
    }

    @Override
    public void exitInt(LabeledExprParser.IntContext ctx) {
        int value = Integer.parseInt(ctx.INT().getText());
        values.put(ctx, value);
    }

    @Override
    public void exitId(LabeledExprParser.IdContext ctx) {
        String id = ctx.ID().getText();
        if (memory.containsKey(id)) {
            values.put(ctx, memory.get(id));
        } else {
            values.put(ctx, 0); // default value if the variable is not found
        }
    }

    @Override
    public void exitParens(LabeledExprParser.ParensContext ctx) {
        values.put(ctx, values.get(ctx.expr()));
    }
}

The listings above show only the key code; the complete code can be found in the repository below.

Code repositories

3. Testing

Test the Visitor mode

package com.et.antlr;

/***
 * Excerpted from "The Definitive ANTLR 4 Reference",
 * published by The Pragmatic Bookshelf.
 * Copyrights apply to this code. It may not be used to create training material,
 * courses, books, articles, and the like. Contact us if you are in doubt.
 * We make no guarantees that this code is fit for any purpose.
 * Visit http://www.pragmaticprogrammer.com/titles/tpantlr2 for more book information.
 ***/
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

import java.io.FileInputStream;
import java.io.InputStream;

public class CalcByVisit {
    public static void main(String[] args) throws Exception {
        /* String inputFile = null;
        if ( args.length>0 ) inputFile = args[0];
        InputStream is = System.in;
        if ( inputFile!=null ) is = new FileInputStream(inputFile);*/
        // CharStreams.fromString replaces the deprecated ANTLRInputStream
        CharStream input = CharStreams.fromString("1+2*3\n");
        LabeledExprLexer lexer = new LabeledExprLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        LabeledExprParser parser = new LabeledExprParser(tokens);
        ParseTree tree = parser.prog(); // parse
        EvalVisitor eval = new EvalVisitor();
        int result = eval.visit(tree);
        System.out.println(result);
    }
}

Test the listener mode

package com.et.antlr;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * @author liuhaihua
 * @version 1.0
 * @ClassName CalbyLisener
 * @Description todo
 */
public class CalbyLisener {
    public static void main(String[] args) throws IOException {
        /* String inputFile = null;
        if ( args.length>0 ) inputFile = args[0];
        InputStream is = System.in;
        if ( inputFile!=null ) is = new FileInputStream(inputFile);*/
        // CharStreams.fromString replaces the deprecated ANTLRInputStream
        CharStream input = CharStreams.fromString("1+2*3\n");
        LabeledExprLexer lexer = new LabeledExprLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        LabeledExprParser parser = new LabeledExprParser(tokens);
        ParseTree tree = parser.prog(); // parse
        ParseTreeWalker walker = new ParseTreeWalker();
        EvalListener evalListener = new EvalListener();
        walker.walk(evalListener, tree);
        int result = evalListener.getResult();
        System.out.println(result);
    }
}

Run the two test programs above; for the input 1+2*3 each prints 7, as expected.

4. References
