
Parsing T-SQL with ANTLR #01: Modifying the published T-SQL grammar files to be case-insensitive

Posted at 2020-06-14, last updated at 2020-06-14

Many grammars for ANTLR are publicly available, including one for T-SQL.
This article tries out parsing T-SQL with ANTLR by displaying the result in ANTLR's parse tree viewer.

Preparation

Setting up the antlr4 and grun commands

Set up the antlr4 and grun commands by following Getting Started with ANTLR v4.

Downloading the grammar files

Download the following files from An ANTLR4 grammar for T-SQL.

TSqlLexer.g4
TSqlParser.g4

Compiling

antlr4 TSqlLexer.g4 TSqlParser.g4
javac TSql*.java

Trying a parse

Parsing worked with the following steps.

grun TSql sql_clause -gui
SELECT A,B FROM TAB
^Z

However, using lowercase letters in table or column names causes a parse error.
Using lowercase letters in keywords such as SELECT and FROM also causes a parse error.
SELECT a,b FROM TAB
select A,B FROM tab

Names enclosed in square brackets parse correctly even when they contain lowercase letters.
SELECT [a],[b] FROM TAB
SELECT A,B FROM [tab]

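Case sensitivity in ANTLR

According to Case-Insensitive Lexing, ANTLR offers two ways to make lexing case-insensitive.
Below is a paraphrase of that page (originally a DeepL-based Japanese translation), followed by the original text.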

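Build lexical rules that match both upper and lower case.

Advantage:
No changes to ANTLR are required, and the grammar itself makes it clear that the language is case-insensitive.
Disadvantage:
There may be a small efficiency cost, and the grammar becomes more verbose and more tedious to write.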

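Write the lexical rules in uppercase and use a custom character stream that converts lowercase characters to uppercase before they reach the lexer. Characters inside strings and comments must stay unaffected, so the stream must not simply convert all of its contents to uppercase; the point is only to make the lexer believe that the input is all uppercase.

Advantage:
Depending on the implementation, it can be faster, and the grammar does not need to change.
Disadvantage:
The case-insensitive stream and the grammar must be used together correctly.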

The original text follows.

Build lexical rules that match either upper or lower case.

Advantage:
no changes required to ANTLR, makes it clear in the grammar that the language in this case insensitive.
Disadvantage:
might have a small efficiency cost and grammar is a more verbose and more of a hassle to write.

Build lexical rules that match keywords in all uppercase and then parse with a custom character stream that converts all characters to uppercase before sending them to the lexer (via the LA() method). Care must be taken not to convert all characters in the stream to uppercase because characters within strings and comments should be unaffected. All we really want is to trick the lexer into thinking the input is all uppercase.

Advantage:
Could have a speed advantage depending on implementation, no change required to the grammar.
Disadvantage:
Requires that the case-insensitive stream and grammar are used in correctly in conjunction with each other, makes all characters appear as uppercase/lowercase to the lexer but some grammars are case sensitive outside of keywords, errors new case insensitive streams and language output targets (java, C#, C++, ...).
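
For reference, the second approach can be sketched without the ANTLR runtime. The class and function names below (UpperCaseLookaheadStream, match_keyword) and the EOF value are hypothetical and exist only for this illustration; they are not ANTLR's own classes. The idea is that only the lookahead seen by the lexer is upper-cased, while the text handed back afterwards is the original input.

```python
EOF = -1  # assumed end-of-input marker for this sketch


class UpperCaseLookaheadStream:
    """Hypothetical stand-in for a character stream, not the ANTLR runtime class."""

    def __init__(self, text):
        self.text = text
        self.index = 0

    def LA(self, offset):
        """Return the code point `offset` characters ahead (1-based), upper-cased."""
        pos = self.index + offset - 1
        if pos < 0 or pos >= len(self.text):
            return EOF
        ch = self.text[pos]
        upper = ch.upper()
        # Some characters upper-case to more than one character; leave those as-is.
        return ord(upper) if len(upper) == 1 else ord(ch)

    def consume(self):
        """Advance by one character."""
        self.index += 1

    def get_text(self, start, stop):
        """Return the ORIGINAL text between start and stop (inclusive)."""
        return self.text[start:stop + 1]


def match_keyword(stream, keyword):
    """Consume `keyword` if the lookahead spells it; return the original text or None."""
    for i, ch in enumerate(keyword, start=1):
        if stream.LA(i) != ord(ch):
            return None
    start = stream.index
    for _ in keyword:
        stream.consume()
    return stream.get_text(start, stream.index - 1)


if __name__ == "__main__":
    stream = UpperCaseLookaheadStream("select a")
    # The uppercase keyword matches, but the returned text keeps its original case.
    print(match_keyword(stream, "SELECT"))
```

Because get_text reads from the untouched input, the contents of string literals and comments would keep their case even though the lookahead is upper-cased.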

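Making keywords case-insensitive

Handling comments and strings correctly looks difficult, so I will go with the first approach: building lexical rules that match both upper and lower case.

Following the page mentioned above, first define one fragment per letter: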

TSqlLexer.g4

fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
・・・

Then redefine a keyword rule such as

TSqlLexer.g4

SELECT: 'SELECT';

as

TSqlLexer.g4

SELECT: S E L E C T;

so that each letter of the keyword matches in either case.

Incidentally, a fragment rule does not produce a token on its own; it is a helper that can only be referenced from other lexer rules.

However, a rule named R is already defined:

TSqlLexer.g4

R:                                     'R';

So the fragment for that letter is defined under a different name:

TSqlLexer.g4

fragment F_R : [rR];
・・・
ALTER:                                 A L T E F_R ;

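Also, rules that use the ?, + or * operators, such as

TSqlLexer.g4

EXECUTE:                               'EXEC' 'UTE'?;

need care. Here the ? applies to the whole literal 'UTE', so the rewritten sequence has to be parenthesized as (U T E)? rather than written as U T E?, where the ? would apply only to E.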

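Ideally one would write a grammar for ANTLR grammar files, parse the lexer grammar with it, and transform the result. This time, however, I did the conversion with Ruby scripts and Vim's substitution feature, so there may be omissions or mistakes.
It seems to work without problems, though, so I will use it as-is for now.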
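
The conversion described above can be sketched as a small script. This is not the Ruby script used for the article; it is a minimal Python illustration, and the names RESERVED, fragment_name, expand_literal, convert_rule and fragment_definitions are hypothetical. It works line by line with a regular expression, so it does not understand comments or character sets and would still need a manual review of its output.

```python
import re
import string

# Letters whose single-letter name is already taken by a token rule in the
# grammar; their fragments get an F_ prefix (the article does this for R).
RESERVED = {"R"}


def fragment_name(ch):
    """Return the fragment rule name used for one letter."""
    upper = ch.upper()
    return "F_" + upper if upper in RESERVED else upper


def fragment_definitions():
    """Generate one fragment rule per letter, e.g. 'fragment A : [aA];'."""
    return ["fragment %s : [%s%s];" % (fragment_name(c), c.lower(), c)
            for c in string.ascii_uppercase]


def expand_literal(text):
    """Turn the body of a quoted literal into a sequence of fragment names."""
    parts = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            parts.append(fragment_name(ch))
        else:
            # Keep digits, underscores and so on as one-character literals.
            parts.append("'" + ch + "'")
    return " ".join(parts)


# A quoted literal without escapes, optionally followed by ?, + or *.
LITERAL = re.compile(r"'([^'\\]+)'([?+*]?)")


def convert_rule(line):
    """Rewrite every quoted literal in one lexer rule line."""
    def replace(match):
        body, suffix = match.group(1), match.group(2)
        expanded = expand_literal(body)
        if suffix and len(body) > 1:
            # The operator must apply to the whole literal, not only its last letter.
            return "(" + expanded + ")" + suffix
        return expanded + suffix
    return LITERAL.sub(replace, line)


if __name__ == "__main__":
    for rule in ["SELECT: 'SELECT';", "ALTER: 'ALTER';", "EXECUTE: 'EXEC' 'UTE'?;"]:
        print(convert_rule(rule))
```

Parenthesizing a multi-letter literal before re-attaching its operator is what keeps 'UTE'? optional as a whole.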

Result

..\grun TSql sql_clause -gui
select A,B FROM TAB
^Z

Making table and column names case-insensitive

TSqlLexer.g4

ID:                  ( [A-Z_#] | FullWidthLetter) ( [A-Z_#$@0-9] | FullWidthLetter )*;

Change this as follows.

TSqlLexer.g4

ID:                  ( [A-Za-z_#] | FullWidthLetter) ( [A-Za-z_#$@0-9] | FullWidthLetter )*;
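
The effect of this change can be checked roughly with ordinary regular expressions. The sketch below is only an approximation of the two ID rules: the FullWidthLetter alternative is left out, and the names ORIGINAL_ID, MODIFIED_ID and is_identifier are hypothetical.

```python
import re

# Approximations of the ID rule before and after the change.
# The grammar's FullWidthLetter alternative is omitted in this sketch.
ORIGINAL_ID = re.compile(r"[A-Z_#][A-Z_#$@0-9]*")
MODIFIED_ID = re.compile(r"[A-Za-z_#][A-Za-z_#$@0-9]*")


def is_identifier(pattern, text):
    """Return True when the whole text matches the given ID pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for name in ["TAB", "tab", "Tab", "#temp"]:
        print(name,
              is_identifier(ORIGINAL_ID, name),
              is_identifier(MODIFIED_ID, name))
```

With the original character sets a lowercase name is not an ID at all, which is consistent with the parse errors seen earlier for unbracketed lowercase names.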

Result

..\grun TSql sql_clause -gui
SELECT a,b FROM TAB
^Z
