Background

Everyone's favorite: bufio.Scanner
If you're a Gopher, you've surely used it at least once, right?
bufio is the package that provides utilities for processing byte streams and text.

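With the default split function (ScanLines), bufio.Scanner's Scan() recognizes only two line endings, CRLF and LF.
A lone CR is not treated as a line break, so input that uses CR line endings is not split the way you intended.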
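To make the default behavior concrete, here is a minimal sketch. The helper name `scanTokens` is an assumption introduced only for this illustration; everything else is the standard bufio API with no custom split function installed.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scanTokens is a hypothetical helper for this example. It scans input with
// the default split function (bufio.ScanLines) and returns every token.
func scanTokens(input string) []string {
	var tokens []string
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		tokens = append(tokens, scanner.Text())
	}
	return tokens
}

func main() {
	// CRLF and LF are both recognized: prints ["abc" "def" "ghi"]
	fmt.Printf("%q\n", scanTokens("abc\r\ndef\nghi"))
	// CR alone is not, so everything comes back as one token:
	// prints ["abc\rdef\rghi"]
	fmt.Printf("%q\n", scanTokens("abc\rdef\rghi"))
}
```

The second call returns a single token with the CR bytes still embedded, which is exactly the situation the rest of this article works around.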

For cases like this, bufio.Scanner lets you pass in a function that defines whatever splitting logic you want.

func (*Scanner) Split

func (s *Scanner) Split(split SplitFunc)
Split sets the split function for the Scanner. The default split function is ScanLines.

Split panics if it is called after scanning has started.
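The ordering rule in the last sentence is easy to trip over, so here is a small sketch of it. `splitAfterScan` is a hypothetical helper written for this example; it calls Split after Scan has already run and uses recover to report whether the call panicked.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitAfterScan is a hypothetical helper for this example. It calls Split
// after scanning has started and reports whether that call panicked.
func splitAfterScan() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	scanner := bufio.NewScanner(strings.NewReader("a\nb\n"))
	scanner.Scan()                 // scanning has started
	scanner.Split(bufio.ScanWords) // too late: this call panics
	return false
}

func main() {
	fmt.Println("panicked:", splitAfterScan())
}
```

In practice this just means calling Split right after NewScanner and before the first Scan.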

type SplitFunc

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
SplitFunc is the signature of the split function used to tokenize the input. The arguments are an initial substring of the remaining unprocessed data and a flag, atEOF, that reports whether the Reader has no more data to give. The return values are the number of bytes to advance the input and the next token to return to the user, plus an error, if any. If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

If the returned error is non-nil, scanning stops and the error is returned to the client.

The function is never called with an empty data slice unless atEOF is true. If atEOF is true, however, data may be non-empty and, as always, holds unprocessed text.
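As a warm-up before the line-ending case, here is a minimal sketch of a SplitFunc that follows the contract quoted above. The names `scanCommas` and `fields` are assumptions made up for this example; the function splits on commas and shows the three kinds of return value: a complete token, the final token at EOF, and (0, nil, nil) to ask for more data.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanCommas is a hypothetical SplitFunc written for this example.
// It returns each comma-separated field as a token.
func scanCommas(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil // no input left
	}
	if i := bytes.IndexByte(data, ','); i >= 0 {
		return i + 1, data[:i], nil // consume the comma, return the field
	}
	if atEOF {
		return len(data), data, nil // final field without a trailing comma
	}
	return 0, nil, nil // incomplete field: ask the Scanner for more data
}

// fields is a hypothetical helper that scans input with scanCommas and
// joins the resulting tokens with "|" so they are easy to inspect.
func fields(input string) string {
	var tokens []string
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(scanCommas)
	for scanner.Scan() {
		tokens = append(tokens, scanner.Text())
	}
	return strings.Join(tokens, "|")
}

func main() {
	fmt.Println(fields("a,b,,c")) // prints a|b||c
}
```

Note that the empty field between the two commas is returned as an empty token rather than being skipped, the same way ScanLines returns empty lines.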

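Using these, you can pass a user-defined SplitFunc to Split and have the input split by any logic you like.
This time I implemented that user-defined part so that it handles the case in the title: input that mixes CRLF, LF, and CR.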

Source code

First, let's look at the function that serves as the default SplitFunc.

// dropCR drops a terminal \r from the data.
func dropCR(data []byte) []byte {
    if len(data) > 0 && data[len(data)-1] == '\r' {
        return data[0 : len(data)-1]
    }
    return data
}

// ScanLines is a split function for a Scanner that returns each line of
// text, stripped of any trailing end-of-line marker. The returned line may
// be empty. The end-of-line marker is one optional carriage return followed
// by one mandatory newline. In regular expression notation, it is `\r?\n`.
// The last non-empty line of input will be returned even if it has no
// newline.
func ScanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.IndexByte(data, '\n'); i >= 0 {
        // We have a full newline-terminated line.
        return i + 1, dropCR(data[0:i]), nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), dropCR(data), nil
    }
    // Request more data.
    return 0, nil, nil
}

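This implementation simply waits until it reads an LF (\n), strips a CR immediately before it if there is one, and returns the result as one line.
Naturally, a line terminated only by CR (\r) is never split at that point.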

So let's replace the splitting logic with the following.

// isLF reports whether b is LF.
func isLF(b byte) bool {
    return b == '\n'
}

// scanLinesCustom is a SplitFunc that treats CRLF, LF, and CR all as line terminators.
func scanLinesCustom(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    // i is the index of the first LF, j the index of the first CR (-1 if absent).
    i, j := bytes.IndexByte(data, '\n'), bytes.IndexByte(data, '\r')
    switch {
    case i >= 0 && (j < 0 || i < j):
        // LF comes first (or there is no CR at all).
        return i + 1, data[0:i], nil
    case j >= 0:
        // CR comes first (or there is no LF at all).
        if j == len(data)-1 && !atEOF {
            // The CR is the last byte read so far, so the next read may
            // begin with the LF of a CRLF. Request more data before deciding.
            return 0, nil, nil
        }
        if j < len(data)-1 && isLF(data[j+1]) {
            // CRLF: skip both bytes.
            return j + 2, data[0:j], nil
        }
        // CR only.
        return j + 1, data[0:j], nil
    }
    // Neither LF nor CR was found (i == -1 && j == -1).
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), data, nil
    }
    // Request more data.
    return 0, nil, nil
}

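This implementation always looks up the first index of LF (\n) and the first index of CR (\r), then compares the two to decide where to split.
Because it is written on the assumption that LF and CR can each show up at any time, it works without problems even when the line endings are mixed.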

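As a result, there is no longer any need to strip a leftover CR with dropCR(). Instead, when a CR is found, the next byte is examined to decide whether it is part of a CRLF.
For a CRLF we want to skip the LF that follows the CR, so advance moves forward by two bytes.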

It isn't especially elegant code, but it passes the test patterns below, so it should be fine for practical use. (If I've missed a pattern, please let me know.)

func TestScanLinesCustom(t *testing.T) {
    type tp struct {
        input string
        want  string
        got   string
    }
    tests := []tp{
        // 1~4 : no line break, or one at the end
        tp{
            input: "abcdefghijkl",
            want:  "abcdefghijkl\n",
        },
        tp{
            input: "abcdefghijkl\r\n",
            want:  "abcdefghijkl\n",
        },
        tp{
            input: "abcdefghijkl\n",
            want:  "abcdefghijkl\n",
        },
        tp{
            input: "abcdefghijkl\r",
            want:  "abcdefghijkl\n",
        },
        // 5~7 : top
        tp{
            input: "\r\nabcdefghijkl",
            want:  "\nabcdefghijkl\n",
        },
        tp{
            input: "\nabcdefghijkl",
            want:  "\nabcdefghijkl\n",
        },
        tp{
            input: "\rabcdefghijkl",
            want:  "\nabcdefghijkl\n",
        },
        // 8~11 : top and bottom
        tp{
            input: "\r\nabcdefghijkl\r\n",
            want:  "\nabcdefghijkl\n",
        },
        tp{
            input: "\nabcdefghijkl\n",
            want:  "\nabcdefghijkl\n",
        },
        tp{
            input: "\rabcdefghijkl\r",
            want:  "\nabcdefghijkl\n",
        },
        tp{
            input: "\r\nabcdefghijkl\n",
            want:  "\nabcdefghijkl\n",
        },
        // 12 : only crlf
        tp{
            input: "abc\r\ndef\r\nghi\r\njkl\r\n",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 13: only lf
        tp{
            input: "abc\ndef\nghi\njkl\n",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 14 : only cr
        tp{
            input: "abc\rdef\rghi\rjkl\r",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 15 :  lf in crlf
        tp{
            input: "abc\r\ndef\nghi\r\njkl\r\n",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 16 : cr in crlf
        tp{
            input: "abc\r\ndef\rghi\r\njkl\r\n",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 17 : crlf in lf
        tp{
            input: "abc\ndef\nghi\r\njkl\n",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 18 : cr in lf
        tp{
            input: "abc\ndef\nghi\rjkl\n",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 19 : crlf in cr
        tp{
            input: "abc\rdef\rghi\r\njkl\r",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 20 : lf in cr
        tp{
            input: "abc\rdef\rghi\njkl\r",
            want:  "abc\ndef\nghi\njkl\n",
        },
        // 21 : crlf duplicate
        tp{
            input: "abc\r\ndef\r\nghi\r\n\r\njkl\r\n",
            want:  "abc\ndef\nghi\n\njkl\n",
        },
        // 22 : lf duplicate
        tp{
            input: "abc\ndef\nghi\n\njkl\n",
            want:  "abc\ndef\nghi\n\njkl\n",
        },
        // 23 : cr duplicate
        tp{
            input: "abc\rdef\rghi\r\rjkl\r",
            want:  "abc\ndef\nghi\n\njkl\n",
        },

        // 24 : cr duplicate in crlf
        tp{
            input: "abc\r\ndef\r\nghi\r\r\njkl\r\n",
            want:  "abc\ndef\nghi\n\njkl\n",
        },
        // 25 : lf duplicate in crlf
        tp{
            input: "abc\r\ndef\r\nghi\r\n\njkl\r\n",
            want:  "abc\ndef\nghi\n\njkl\n",
        },
        // 26 : crlf duplicate in crlf
        tp{
            input: "abc\r\ndef\r\nghi\r\n\r\njkl\r\n",
            want:  "abc\ndef\nghi\n\njkl\n",
        },
    }
    for i, e := range tests {
        e.got = ""
        scanner := bufio.NewScanner(strings.NewReader(e.input))
        scanner.Split(scanLinesCustom)
        for scanner.Scan() {
            e.got += scanner.Text() + "\n"
        }
        if e.got != e.want {
            t.Errorf("case %d got :\n%s\nwant :\n%s\n", i+1, e.got, e.want)
        }
    }
}

By the way, plugging in the SplitFunc looks like this.
You just add Split(SplitFunc) to the usual flow.

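func main() {
    if len(os.Args) == 1 {
        log.Fatal("require input")
    }
    fp, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer fp.Close()
    scanner := bufio.NewScanner(fp)
    // Install the custom split function before the first call to Scan.
    scanner.Split(scanLinesCustom)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    // Scan returns false on both EOF and error, so check which one it was.
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}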

Nobody uses CR for line breaks these days

Yes, that may well be true.

I just figured that more generalized logic is the better choice.

Go is fun, isn't it?

Making Go's bufio.Scanner handle mixed CRLF, LF, and CR line endings
