スタンバイAdvent Calendar 2022

xmllintのエラーにPythonで立ち向かう

Last updated at 2024-06-28Posted at 2022-12-01

この記事はスタンバイ Advent Calendar2022の2日目の記事です。

動機

業務ではmacでxml形式のファイルを触る機会が多く、xmllintを使って検査したりしてます。
しかし、xmllintでエラーになってもエディタでは目視できない謎の存在がいたりします。
本記事ではPythonでこの謎の存在Xを退治したいと思います。

動作環境

macOS Monterey
Python 3.10.6

方針

全体を一気に読むと大変なのでとりあえずxmlファイルを1行ずつ読み込んでいく方針で行きます。

テストデータの用意

まずはそれっぽいxmlファイルを適当なディレクトリに用意します。
改行が続いてる部分に謎の存在Xがいます。

path/to/sample.xml

<?xml version='1.0' encoding='UTF-8'?>
<source>
    <section>
        <title><![CDATA[タイトル1]]></title>
        <paragraph><![CDATA[段落1







]]></paragraph>
    </section>
    <section>
        <title><![CDATA[タイトル2]]></title>
        <paragraph><![CDATA[段落2]]></paragraph>
    </section>
    <section>
        <title><![CDATA[タイトル3]]></title>
        <paragraph><![CDATA[段落3]]></paragraph>
    </section>
    <section>
        <title><![CDATA[タイトル4]]></title>
        <paragraph><![CDATA[段落4]]></paragraph>
    </section>
    <section>
        <title><![CDATA[タイトル5]]></title>
        <paragraph><![CDATA[段落5]]></paragraph>
    </section>
    <section>
        <title><![CDATA[タイトル6]]></title>
        <paragraph><![CDATA[段落6]]></paragraph>
    </section>
</source>

xmllintにかけてみる

xmllint

xmllint --noout path/to/sample.xml

sample.xmlをxmllintにかけた結果

結果

path/to/sample.xml:5: parser error : CData section not finished
段
        <paragraph><![CDATA[段落1
                                   ^
path/to/sample.xml:5: parser error : PCDATA invalid Char value 8
        <paragraph><![CDATA[段落1
                                   ^
path/to/sample.xml:6: parser error : PCDATA invalid Char value 11


^
path/to/sample.xml:7: parser error : PCDATA invalid Char value 6

^
path/to/sample.xml:8: parser error : PCDATA invalid Char value 4

^
path/to/sample.xml:9: parser error : PCDATA invalid Char value 2

^
path/to/sample.xml:10: parser error : PCDATA invalid Char value 1

^
path/to/sample.xml:11: parser error : PCDATA invalid Char value 16

^
path/to/sample.xml:12: parser error : PCDATA invalid Char value 12


^
path/to/sample.xml:13: parser error : PCDATA invalid Char value 3
]]></paragraph>
^
path/to/sample.xml:13: parser error : Sequence ']]>' not allowed in content
]]></paragraph>
 ^
path/to/sample.xml:13: parser error : internal error: detected an error in element content
]]></paragraph>
 ^

コードにしてみる

1行ずつ読み込む

with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        # lineの型も併せて確認
        print(type(line))

謎の存在Xを探す

文字で探してみる

とりあえず謎の存在Xをコピペして正規表現にしてみる。

main.py

# 空白に見えるけど複数あるよ
NG_CHARS = '<(||||||||)>'

with open(input_path, "r", encoding="utf-8") as f:
    for line in f:
        if re.search(NG_CHARS, line):
            print('invalid!')
        else:
            print(line)

しかし、引っかからない・・・
どうやら文字で探すのは無理なようなので、別のアプローチを考える。

10進数にして探してみる

こちらのサイトで対象の文字を10進数に変換する。

main.py

# xml内の文字列を10進数に変換
def string_to_ord(str):
    ord_list = []
    for char in str:
        ord_list.append(ord(char))
    return ord_list

謎の存在Xを削除する

main.py

NG_LIST = [1,2,3,4,6,8,11,12,15,16,17,18,19,20,29]
def remove_invalid_chars(NG_LIST, ord_list):
    clean_chars = []
    clean_chars = list(filter(lambda item: item not in NG_LIST,ord_list))
    return clean_chars

10進数から元に戻す

謎の存在を削除したリストを文字列に戻します。

main.py

def ord_to_string(ord_list):
    clean_string = ""
    string_list = []
    for item in ord_list:
        string_list.append(chr(item))
    clean_string = "".join(string_list)
    return clean_string

1行ずつ別ファイルに保存していく

main.py

def create_new_file():
    if os.path.exists(new_file_path):
        os.remove(new_file_path)
    output_file = pathlib.Path(new_file_path)
    output_file.touch()


def line_writer(file_name, line):
    with codecs.open(file_name, 'a', 'utf-8') as f:
        f.write(line)

main.py

create_new_file()

with open(input_path, "r", encoding="utf-8") as f:
    for line in f:
        ord_list = string_to_ord(line)
        removed_list = remove_invalid_chars(NG_LIST, ord_list)
        clean_string = ord_to_string(removed_list)
        line_writer(new_file_path, clean_string)

実際に動かしてみる

$ py path/to/main.py path/to/sample.xml

結果を検証

再度xmllintにかけてみる

xmllint

xmllint --noout path/to/clean_sample.xml; echo $?

再度xmllintにかけた結果

結果

# 0ならば正常終了
0

差分をチェック

謎の存在X以外に影響がないことを確認できました。

後書き

すでにお気づきだと思いますが、謎の存在Xとは『制御文字』と呼ばれるものです。
xmllintではこれが含まれているとエラーとなります。
このエラーを回避する方法が見当たらず、今回のようにスクリプトを用意して頑張ってみるという結果になりました。
NG_LISTにしたのは改行などの残したい記号があったりするので、ブラックリスト方式になっております。

実はVSCodeは設定を変更すれば『制御文字』を可視化できます。
ファイルサイズが小さければVSCodeで編集してしまえばいいのですが、さすがに 1GB を超えるファイルは開くには重過ぎました・・・。

スクリプトを書いてみた感想は、正規表現でサクッと探して除去するだけの簡単な処理かなと思っていましたがそんなことありませんでした。

参考記事

参考記事の執筆者の方、ありがとうございます。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up