More than 5 years have passed since last update.

pyparsingをAntlr4で置き換えて性能を5倍にした

Last updated at 2018-02-06Posted at 2017-02-14

検索クエリのparse処理の性能を上げる必要があり、pyparsingを使ってparse処理していたところをAntlr4にすることで処理性能が5倍ほど向上した。

Antlr4は自分としてはとっつきにくいイメージがあり、実際使いこなしに結構ハマったので、使い方を共有します。改善点があれば是非突っ込んでいただけると嬉しいです！

pyparsingを使ってましたが、Antlr4への置き換えでもPythonのコードを生成してpython runtimeで実行しています。

Antlrとは

ANTLR(ANother Tool for Language Recognition)で、parserのgeneratorである
字句解析器、文法解析機を定義すると、Java、Python(2,3)、Goなどいつくかの言語のparserを生成することができる
- ドキュメント的にはPython3.4対応と書いてあるが、3.5でも手元では動いている

セットアップ

» brew install antlrでコマンドラインツールが入る。windowsは分からない。
pythonで使うには» pip install antlr4-python3-runtimeでOK

開発

1. LexerとParserの定義を行う

Query.g4のようなg4ファイルを生成する。この例では、LexerとParserを同一ファイルに定義しているが、分割することもできる。

注) 各ルールの右にコメントがあるが、Javaのコード生成時にはエラーになるので、このコメントを削除する必要がある。Pythonのときはこのコメントに従って、Listenerのイベントに対応したメソッドが生成されるので必要。

Query.g4

grammar Query;

// Parser

query: expr+ ;

expr: '(' expr ')'   # PARENTHESES
  | FIELD expr       # FIELD
  | NOT expr         # NOT
  | expr AND? expr   # AND
  | expr OR expr     # OR
  | REGEXP           # REGEXP
  | TEXT             # TEXT
  | STRING           # STRING
  ;


// Lexer

AND: 'AND' ;
OR: 'OR' ;
NOT: 'NOT' ;
FIELD : TEXT ':' ;

TEXT : ~[/()\n\r" ]+ ;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
REGEXP : '/' (~'/')* '/' ; 
WS  : [ \t\r\n]+ -> skip ;

2. コードを生成

» antlr4 Query.g4を実行するとカレントディレクトリに、以下のような.javaファイルが生成される。

 Query.g4                  QueryLexer.class      QueryListener.java                QueryParser.java
 Query.tokens              QueryLexer.java      'QueryParser$ExprContext.class'
 QueryBaseListener.class   QueryLexer.tokens    'QueryParser$QueryContext.class'
 QueryBaseListener.java    QueryListener.class   QueryParser.class

3. コンパイル

» javac ./*.java

4. 対話的に文字列をパース

これでLexerとParserをデバッグしながら試行錯誤していく。

» grun Query query -tree
A AND B OR C
(query (expr (expr (expr A) AND (expr B)) OR (expr C)))

5. Pythonコードを生成

» antlr4 -Dlanguage=Python3 Query.g4でpythonコードが生成される

確認用コードはこんな感じ。InputStreamだと文字列を引数にできるが、FileStreamだとファイルのパスを渡してファイルを読み込んでの処理もできる。

import ipdb
import sys
from antlr4 import *
from QueryLexer import QueryLexer
from QueryParser import QueryParser

def main(argv):
    input = InputStream(argv[1])
    lexer = QueryLexer(input)
    stream = CommonTokenStream(lexer)
    parser = QueryParser(stream)
    tree = parser.query()
    ipdb.set_trace()
    print(tree)

if __name__ == '__main__':
    main(sys.argv)

toStringTreeメソッドでLISP formatなパース結果を返してくれる。

ipdb> tree.toStringTree()
'([] ([4] A) ([4]   ([12 4] AND ([12 12 4]   ([12 12 12 4] B)))) ([4]   ([12 4] OR ([14 12 4]   ([12 14 12 4] C)))))'

getChildで子要素の情報を取得できる。

ipdb> tree.getChild(0).getText()
'A'
ipdb> tree.getChild(0).toStringTree()
'([4] A)'

6. Parseした結果を取得

QueryListener.pyというファイルが生成される。実際に文字列をparse処理すると、構文ルール(ex: ANDとか)に応じて各メソッドが呼び出される。実際にためして、printデバッグなり、break pointをはるなりして動きをみるのが理解するのに早いと思う。
自分が必要な処理をこのクラスに実装していく。

QueryListener.py

# Generated from Query.g4 by ANTLR 4.6
from antlr4 import *
if __name__ is not None and "." in __name__:
    from .QueryParser import QueryParser
else:
    from QueryParser import QueryParser

# This class defines a complete listener for a parse tree produced by QueryParser.
class QueryListener(ParseTreeListener):

    # Enter a parse tree produced by QueryParser#query.
    def enterQuery(self, ctx:QueryParser.QueryContext):
        pass

    # Exit a parse tree produced by QueryParser#query.
    def exitQuery(self, ctx:QueryParser.QueryContext):
        pass


    # Enter a parse tree produced by QueryParser#NOT.
    def enterNOT(self, ctx:QueryParser.NOTContext):
        pass

    # Exit a parse tree produced by QueryParser#NOT.
    def exitNOT(self, ctx:QueryParser.NOTContext):
        pass


    # Enter a parse tree produced by QueryParser#REGEXP.
    def enterREGEXP(self, ctx:QueryParser.REGEXPContext):
        pass

    # Exit a parse tree produced by QueryParser#REGEXP.
    def exitREGEXP(self, ctx:QueryParser.REGEXPContext):
        pass


    # Enter a parse tree produced by QueryParser#OR.
    def enterOR(self, ctx:QueryParser.ORContext):
        pass

    # Exit a parse tree produced by QueryParser#OR.
    def exitOR(self, ctx:QueryParser.ORContext):
        pass


    # Enter a parse tree produced by QueryParser#PARENTHESES.
    def enterPARENTHESES(self, ctx:QueryParser.PARENTHESESContext):
        pass

    # Exit a parse tree produced by QueryParser#PARENTHESES.
    def exitPARENTHESES(self, ctx:QueryParser.PARENTHESESContext):
        pass


    # Enter a parse tree produced by QueryParser#FIELD.
    def enterFIELD(self, ctx:QueryParser.FIELDContext):
        pass

    # Exit a parse tree produced by QueryParser#FIELD.
    def exitFIELD(self, ctx:QueryParser.FIELDContext):
        pass


    # Enter a parse tree produced by QueryParser#AND.
    def enterAND(self, ctx:QueryParser.ANDContext):
        pass

    # Exit a parse tree produced by QueryParser#AND.
    def exitAND(self, ctx:QueryParser.ANDContext):
        pass


    # Enter a parse tree produced by QueryParser#STRING.
    def enterSTRING(self, ctx:QueryParser.STRINGContext):
        pass

    # Exit a parse tree produced by QueryParser#STRING.
    def exitSTRING(self, ctx:QueryParser.STRINGContext):
        pass


    # Enter a parse tree produced by QueryParser#TEXT.
    def enterTEXT(self, ctx:QueryParser.TEXTContext):
        pass

    # Exit a parse tree produced by QueryParser#TEXT.
    def exitTEXT(self, ctx:QueryParser.TEXTContext):
        pass

全部の処理は載せられないのだが、自分の場合はとりあえず以下のような実装にした。

class QueryListener(ParseTreeListener):

    def __init__(self):
        self.query_0 = []
        self.context = []

    def _enter_context(self, con):
        self.context.append(con)
        setattr(self, 'query_{}'.format(len(self.context) - 1), [])

    def _exit_context(self, flat=False):
        tmp_query = getattr(self, 'query_{}'.format(len(self.context) - 1), None)
        if len(self.context) >= 2:
            parent_query = getattr(self, 'query_{}'.format(len(self.context) - 2), None)
            if self.context[-1] == self.context[-2] or flat:
                for i in tmp_query:
                    parent_query.append(i)
            else:
                parent_query.append(tmp_query)
        self.context.pop()

    def visitTerminal(self, node:TerminalNode):
        if str(node) == '(' or str(node) == ')':
            return
        ctx_length = len(self.context)
        if ctx_length == 0 or ctx_length == 1:
            self.query_0.append(str(node))
        else:
            tmp_query = getattr(self, 'query_{}'.format(len(self.context) - 1))
            tmp_query.append(str(node))

    # Enter a parse tree produced by QueryParser#NOT.
    def enterNOT(self, ctx:QueryParser.NOTContext):
        self._enter_context('NOT')

    # Exit a parse tree produced by QueryParser#NOT.
    def exitNOT(self, ctx:QueryParser.NOTContext):
        self._exit_context()

...

Antlr4とpyparsingのI/Fの違い

pyparsingの場合は、パース結果としてpyparsing.ParseResultsが返ってきて、#asList()を呼ぶと、良い感じのListが返ってくる。

ex: A B OR C => [['A', 'B'], 'OR', 'C']

Antlr4の場合は、パースした結果からこの配列を得る部分を自分で実装しないといけないのが割りと手間であったが、自由度があるとも言える。

Benchmark

benchの結果は以下の通り

» python test.py
## benchmarker:         release 4.0.1 (for python)
## python version:      3.5.1
## python compiler:     GCC 4.2.1 (Apple Inc. build 5666) (dot 3)
## python platform:     Darwin-15.6.0-x86_64-i386-64bit
## python executable:   .../antlr/query/venv.d/bin/python
## cpu model:           Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
## parameters:          loop=1000, cycle=1, extra=0

##                        real    (total    = user    + sys)
antlr                  34.1303   34.0900   34.0500    0.0400
pyparsing             159.7840  159.5700  158.7100    0.8600

## Ranking                real
antlr                  34.1303  (100.0) ********************
pyparsing             159.7840  ( 21.4) ****

## Matrix                 real    [01]    [02]
[01] antlr             34.1303   100.0   468.2
[02] pyparsing        159.7840    21.4   100.0

参考

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up