Group-count-safe regex pattern matching in Scala (regex-refined)

Posted at 2018-12-06, last updated at 2018-12-06

Introduction

Scala lets you pattern match with regular expressions:

val date = raw"(\d{4})-(\d{2})-(\d{2})".r

"2004-01-20" match {
  case date(year, month, day) => s"$year was a good year for PLs."
}

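This is the example from the Regex docs. However, Regex pattern matching goes through unapplySeq, so you can write something like case date(p,q,r,s,t) => ... with a number of binders that differs from the actual number of groups, and nothing complains until runtime. That is not good: the regex string literal was written right there a moment ago, so wouldn't you want this checked at compile time? So I built a pattern match that does exactly that.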
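To make the problem concrete, here is a small sketch (ArityMismatchDemo, describe and failsAtRuntime are names made up for this illustration) in which five binders are matched against the three-group regex. It should compile without complaint, and the mismatch only shows up at runtime.

```scala
import scala.util.Try

object ArityMismatchDemo {
  val date = raw"(\d{4})-(\d{2})-(\d{2})".r

  // Five binders against a three-group regex: accepted by the compiler,
  // but this case can never match, so matching falls through to the next one.
  def describe(input: String): String = input match {
    case date(a, b, c, d, e)    => s"five groups: $a $b $c $d $e"
    case date(year, month, day) => s"three groups: $year-$month-$day"
    case _                      => "no match"
  }

  // Without a fallback case the mismatch surfaces as a scala.MatchError.
  val failsAtRuntime: Boolean =
    Try("2004-01-20" match { case date(a, b, c, d, e) => a }).isFailure
}
```

Nothing in the types distinguishes the first case from the second, which is the gap the rest of this post closes.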

What I built

"2018-11-18" match {
    case r"""(\d{4}$year)-(\d{2}$month)-(\d{2}$day)""" => println(s"year = $year, month = $month, day = $day")
    case _ => println("no!")
}

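r is a string interpolator that also acts as an extractor. If the number of groups in the regex string given to r differs from the number of variables being bound, compilation fails. The example above uses a fancy style, but the regex is simply the string parts concatenated with every variable position skipped, so you can also turn on the comments flag and write r"""(?x) (\d{4})-(\d{2})-(\d{2}) # $year, $month, $day""".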
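The "concatenate the parts and skip the variables" rule can be checked without any macro. The sketch below (PartsDemo and its members are hypothetical names) writes out by hand the string parts that surround the variables in both spellings and counts the groups with java.util.regex.Pattern.

```scala
import java.util.regex.Pattern

object PartsDemo {
  // The string parts around $year, $month and $day in the fancy form.
  val fancyParts = List("""(\d{4}""", """)-(\d{2}""", """)-(\d{2}""", """)""")
  // The string parts of the (?x) form, where the variables sit inside a comment.
  val commentParts = List("""(?x) (\d{4})-(\d{2})-(\d{2}) # """, ", ", ", ", "")

  // Concatenate the parts (the variables are simply dropped) and count the groups.
  def groupCount(parts: List[String]): Int =
    Pattern.compile(parts.mkString).matcher("").groupCount()

  // Check that the concatenated regex matches the whole input.
  def matches(parts: List[String], input: String): Boolean =
    Pattern.compile(parts.mkString).matcher(input).matches()
}
```

Both spellings give three groups, so both need exactly three variables.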

Implementation

string interpolator
whitebox macro
extractor macro
- name-based extractor

Go back and forth among these for a while and... here is what you get:

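import java.util.regex.{Pattern, PatternSyntaxException}

import scala.language.experimental.macros
import scala.reflect.macros.whitebox
import scala.util.Try

package object macros {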

  implicit class RegexContext(sc: StringContext) {
    def r: RegexContextExtractor = new RegexContextExtractor(sc)
  }

  class RegexContextExtractor(sc: StringContext) {
    def unapply(s: String): Any = macro MacroImpls.provideExtractorImitator
  }

  private[macros] object MacroImpls {

    def provideExtractorImitator(c: whitebox.Context)(s: c.Tree): c.Tree = {
      import c.universe.{Try => _, _}

      val regex = c.prefix.tree match {
        case q"$_.RegexContext($_.StringContext.apply(..${rawParts: List[Tree]})).r" =>
          rawParts.map{ case q"${const: String}" => const }.mkString
        case _ => c.abort(c.enclosingPosition, s"Invalid use of regex string extractor")
      }
      val count = Try(Pattern.compile(regex).matcher("").groupCount()).recover {
        case e: PatternSyntaxException => c.abort(c.enclosingPosition, e.getMessage)
      }.get
      val imitationMethods = (1 to count)
        .map(i => q"def ${TermName("_" + i)} = matched.get(${Literal(Constant(i - 1))})")
        .toList

      q"""
       new {
         var matched: Option[Array[String]] = None
         def isEmpty = matched.isEmpty
         def get = this
         ..$imitationMethods
         def unapply(s: String) = {
           matched = ${Literal(Constant(regex))}.r.unapplySeq(s).map(_.toArray)
           this
         }
       }.unapply($s)
     """
    }
  }
}

Let me go through it piece by piece.

RegexContext

  implicit class RegexContext(sc: StringContext) {
    def r: RegexContextExtractor = new RegexContextExtractor(sc)
  }

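First, this is the implicit class for the string interpolator. As the docs describe, a string interpolator is first rewritten like this:

s"Hello, $name" → StringContext("Hello, ", "").s(name)

So if you define an implicit class like the one above and import it, you get

r"Hello, $name" →
  new RegexContext(StringContext("Hello, ", "")).r(name)
  ≈ new RegexContextExtractor(StringContext("Hello, ", ""))(name)

and the arguments end up applied to the extractor. In pattern position this means the compiler calls the extractor's unapply on the value being matched and binds name from the result.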
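The rewrite can be tried without macros. The sketch below is a plain, unchecked stand-in, not the library's code: PlainRegexContext and rx are hypothetical names, and the extractor just delegates to Regex.unapplySeq, so the number of variables is still not verified at compile time.

```scala
object InterpolatorPatternDemo {
  // Hypothetical macro-free interpolator: rx"..." in a pattern is rewritten to
  // StringContext(parts...).rx, which is then used as an extractor.
  implicit class PlainRegexContext(sc: StringContext) {
    object rx {
      def unapplySeq(s: String): Option[Seq[String]] =
        sc.parts.mkString.r.unapplySeq(s)
    }
  }

  def parse(input: String): String = input match {
    case rx"""(\d{4}$year)-(\d{2}$month)""" => s"$year/$month"
    case _                                  => "no match"
  }
}
```

The macro version keeps this overall shape and replaces the unapplySeq with an unapply whose result type encodes the group count.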

RegexContextExtractor

  class RegexContextExtractor(sc: StringContext) {
    def unapply(s: String): Any = macro MacroImpls.provideExtractorImitator
  }

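Since it defines unapply, this is an extractor. The first thing to explain is that the return type of unapply is declared as Any. The real type is supplied by the (whitebox) macro on the right-hand side: a whitebox macro can make the compiler adopt the more specific type of the expanded tree instead of the declared one. Surprising, but true.
So the macro on the right-hand side only has to fetch the regex string literal, count its groups, and return something that looks like an unapply result with that many fields.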

The macro implementation

    def provideExtractorImitator(c: whitebox.Context)(s: c.Tree): c.Tree = {
      import c.universe.{Try => _, _}

      val regex = c.prefix.tree match {
        case q"$_.RegexContext($_.StringContext.apply(..${rawParts: List[Tree]})).r" =>
          rawParts.map{ case q"${const: String}" => const }.mkString
        case _ => c.abort(c.enclosingPosition, s"Invalid use of regex string extractor")
      }
      val count = Try(Pattern.compile(regex).matcher("").groupCount()).recover {
        case e: PatternSyntaxException => c.abort(c.enclosingPosition, e.getMessage)
      }.get
      val imitationMethods = (1 to count)
        .map(i => q"def ${TermName("_" + i)} = matched.get(${Literal(Constant(i - 1))})")
        .toList

      q"""
       new {
         var matched: Option[Array[String]] = None
         def isEmpty = matched.isEmpty
         def get = this
         ..$imitationMethods
         def unapply(s: String) = {
           matched = ${Literal(Constant(regex))}.r.unapplySeq(s).map(_.toArray)
           this
         }
       }.unapply($s)
     """
    }

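First, the function that implements the macro takes Trees and returns a Tree. So the parameter s is the syntax tree of the argument handed to the extractor (the (name) position in the rewrite above). Inside the q (quasiquote) it is passed along again as the argument of unapply, as new {...}.unapply($s).
The new {...} inside q is an instance that looks like the result of unapply. The idea is this: unapply normally returns Option[(T1,T2,...)], so we build an instance that exposes the same methods. For example, if the regex string has 3 groups: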

new {
  def isEmpty = ???
  def get = this
  def _1: String = ???
  def _2: String = ???
  def _3: String = ???
}

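With an instance like this you effectively have an Option[(String, String, String)]. This is not a universal truth; it works because since Scala 2.11 the compiler only looks at the member names (isEmpty, get, _1, _2, ...) of the type returned by unapply. Surprising again. In addition, defining get = this keeps everything in a single instance and avoids needless allocations.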
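As a macro-free illustration of that name-based rule, the sketch below writes by hand roughly the kind of class the macro generates for three groups. DateMatch, Date and describe are hypothetical names, and it assumes Scala 2.11 or later.

```scala
object NameBasedDemo {
  // Not an Option and not a tuple: it only offers isEmpty, get and _1 to _3.
  final class DateMatch(groups: Option[List[String]]) {
    def isEmpty: Boolean = groups.isEmpty
    def get: DateMatch = this // reuse the same instance as the "tuple"
    def _1: String = groups.get(0)
    def _2: String = groups.get(1)
    def _3: String = groups.get(2)
  }

  object Date {
    private val regex = raw"(\d{4})-(\d{2})-(\d{2})".r
    def unapply(s: String): DateMatch = new DateMatch(regex.unapplySeq(s))
  }

  def describe(input: String): String = input match {
    case Date(year, month, day) => s"$year/$month/$day"
    // case Date(year, month) => ... would be rejected at compile time,
    // because DateMatch exposes exactly _1, _2 and _3.
    case _ => "no match"
  }
}
```

Because the arity now lives in the members of the return type, a pattern with the wrong number of variables no longer compiles, which is the property the macro derives from the regex literal.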
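The other part worth explaining is how the regex string literal is obtained. c.prefix.tree is the syntax tree of the prefix of the macro call, that is, the part to the right of new in new RegexContext(StringContext("Hello, ", "")).r. Matching it against q"$_.RegexContext($_.StringContext.apply(..${rawParts: List[Tree]})).r" binds rawParts to the trees for List("Hello, ", ""). Qualifiers such as the path to the class cannot be left out of the pattern, so they are bound to $_ and thrown away. (In other words, this macro implementation only expects to be called through the string interpolator.) When you need a literal value inside a macro, looking at c.prefix.tree usually does the trick.
Written as an ordinary Tree pattern match instead of q, the same match looks roughly like this:

case Select(Apply(_, List(Apply(_, rawParts))), _) => ...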
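Both styles can be tried outside a macro with runtime reflection. The sketch below assumes scala-reflect is on the classpath and uses a hand-built, untyped tree as a stand-in for c.prefix.tree; foo is a placeholder qualifier, PrefixTreeDemo is a hypothetical name, and the tree the compiler actually passes to the macro is typed and may differ in its details.

```scala
import scala.reflect.runtime.universe._

object PrefixTreeDemo {
  // Hand-built stand-in for c.prefix.tree (an assumption, not compiler output).
  val prefix: Tree =
    q"""foo.RegexContext(scala.StringContext.apply("Hello, ", "")).r"""

  // Quasiquote pattern, as in the macro implementation.
  val viaQuasiquote: List[String] = prefix match {
    case q"$_.RegexContext($_.StringContext.apply(..${rawParts: List[Tree]})).r" =>
      rawParts.map { case q"${const: String}" => const }
  }

  // The same match written with plain Tree constructors.
  val viaConstructors: List[String] = prefix match {
    case Select(Apply(_, List(Apply(_, rawParts))), _) =>
      rawParts.map { case Literal(Constant(const: String)) => const }
  }
}
```

The quasiquote form additionally checks the names RegexContext, StringContext, apply and r, whereas the constructor form only checks the shape of the tree.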
This part of the implementation is modeled on Play's sird, which someone brought up in our company Slack.

Conclusion

So we now have regex pattern matching that is safe with respect to the number of groups. It is defined in my library regex-refined, so please give it a try.
