
Tokenizing in Python

Last updated at 2023-05-16, posted at 2023-05-16

I wanted to tokenize Python code using Python itself. This post gives a quick tour of the standard library's tokenize module, focusing on basic tokenizing with tokenize.tokenize.

Environment

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

$ python3 --version
Python 3.11.3

Source code

Here is the source code I used for a quick analysis. It tokenizes a Python file named test.py.

tokenizer.py

import tokenize

with open("test.py", "rb") as f:
    tokens = tokenize.tokenize(f.readline)

    for tnumber, tvalue, start, end, phisical_line in tokens:
        srow, scol = start
        erow, ecol = end
        print(f"<{srow},{scol}>-<{erow},{ecol}>\t`{tvalue}`({tnumber})\t{phisical_line}")

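Key points

The file has to be read as bytes here, so open is called with rb.
Pass the file's readline method as the argument to tokenize.tokenize.
- It needs a callable equivalent to a file object's io.IOBase.readline(), so to pass a string directly you would write something like BytesIO(s.encode('utf-8')).readline.
tokenize.tokenize returns a generator that reads lines lazily, so the for loop needs the file to still be open and cannot be moved outside the with block.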
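To make the string case concrete, here is a minimal sketch of tokenizing a string instead of a file. The sample `source` string is just an assumption for illustration. It also shows tokenize.generate_tokens, the standard-library sibling that takes a readline returning str rather than bytes, so no encoding step is needed.

```python
import io
import tokenize

source = "x = 1 + 2\n"  # hypothetical sample input

# tokenize.tokenize expects a readline callable that returns bytes.
byte_tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))

# tokenize.generate_tokens expects a readline callable that returns str.
str_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in byte_tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

The bytes version starts with an extra ENCODING token; apart from that, both calls yield the same token strings.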

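Format of the return value

This is mainly about the for tnumber, tvalue, start, end, phisical_line in tokens: part.

1st value (tnumber): a number representing the token's type. I couldn't find a list of these at first and ended up looking at the CPython source; the constants are defined in the standard token module, and token.tok_name maps each number back to its name. The numeric values can differ between Python versions, so it is safer to compare against the named constants than against raw numbers.
- Tokens such as DEDENT have an empty string as their value, so you probably can't tell them apart without looking at this number.
2nd value (tvalue): the string of the token itself.
3rd value (start): a tuple giving the token's start position, in (row, column) form. Rows start at 1 and columns at 0.
4th value (end): a tuple giving the token's end position. Think of it as the counterpart of start.
5th value (phisical_line): the reference calls this the physical line, but it is simply the full text of the line that contains the token, including its trailing newline. That newline is why the output below contains blank lines.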
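As a rough sketch of reading these values without memorizing the numbers: each token is a named tuple with the fields type, string, start, end and line, and it also has an exact_type attribute that tells operators apart. The sample `source` below is an assumption for illustration.

```python
import io
import tokenize

source = "if x:\n    y = 1\n"  # hypothetical sample input

rows = []
for tok in tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline):
    type_name = tokenize.tok_name[tok.type]         # e.g. NAME, OP, DEDENT
    exact_name = tokenize.tok_name[tok.exact_type]  # e.g. COLON instead of OP
    rows.append((type_name, exact_name, tok.string))
    print(f"{tok.start}-{tok.end}\t{type_name}\t{exact_name}\t{tok.string!r}")
```

Printing the names this way makes tokens like DEDENT easy to spot even though their string is empty.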

Result

test.py (the file being analyzed)

I put together a fairly simple piece of source code.

test.py

# test.py
i = int(input())
if i % 2 != 0:
    print("Odd")

Result

$ python3 tokenizer.py 
<0,0>-<0,0>     `utf-8`(63)
<1,0>-<1,9>     `# test.py`(61) # test.py

<1,9>-<1,10>    `
`(62)   # test.py

<2,0>-<2,1>     `i`(1)  i = int(input())

<2,2>-<2,3>     `=`(54) i = int(input())

<2,4>-<2,7>     `int`(1)        i = int(input())

<2,7>-<2,8>     `(`(54) i = int(input())

<2,8>-<2,13>    `input`(1)      i = int(input())

<2,13>-<2,14>   `(`(54) i = int(input())

<2,14>-<2,15>   `)`(54) i = int(input())

<2,15>-<2,16>   `)`(54) i = int(input())

<2,16>-<2,17>   `
`(4)    i = int(input())

<3,0>-<3,2>     `if`(1) if i % 2 != 0:

<3,3>-<3,4>     `i`(1)  if i % 2 != 0:

<3,5>-<3,6>     `%`(54) if i % 2 != 0:

<3,7>-<3,8>     `2`(2)  if i % 2 != 0:

<3,9>-<3,11>    `!=`(54)        if i % 2 != 0:

<3,12>-<3,13>   `0`(2)  if i % 2 != 0:

<3,13>-<3,14>   `:`(54) if i % 2 != 0:

<3,14>-<3,15>   `
`(4)    if i % 2 != 0:

<4,0>-<4,4>     `    `(5)           print("Odd")

<4,4>-<4,9>     `print`(1)          print("Odd")

<4,9>-<4,10>    `(`(54)     print("Odd")

<4,10>-<4,15>   `"Odd"`(3)          print("Odd")

<4,15>-<4,16>   `)`(54)     print("Odd")

<4,16>-<4,17>   `
`(4)        print("Odd")

<5,0>-<5,0>     ``(6)
<5,0>-<5,0>     ``(0)

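Applications

The examples in the official tokenize documentation include one that replaces float literals with Decimal. It is quoted below.

Combining tokenizing with tokenize.untokenize like this looks useful for all sorts of things.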

untokenize

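from io import BytesIO
from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP

def decistmt(s):
    """Substitute Decimals for floats in a string of statements."""
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')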
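The same pattern works for other rewrites. As a minimal sketch (the helper `rename_identifier` and its inputs are my own illustration, not something from the official documentation), this renames an identifier by swapping NAME tokens and rebuilding the source with untokenize:

```python
import io
import tokenize

def rename_identifier(source, old, new):
    """Return source with every NAME token equal to `old` replaced by `new`."""
    result = []
    for tok in tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline):
        if tok.type == tokenize.NAME and tok.string == old:
            result.append((tok.type, new))  # swap only identifier tokens
        else:
            result.append((tok.type, tok.string))
    # With (type, string) pairs, untokenize may change spacing between tokens.
    return tokenize.untokenize(result).decode("utf-8")

rewritten = rename_identifier("total = 1 + 2\n", "total", "answer")
namespace = {}
exec(rewritten, namespace)
print(namespace["answer"])
```

Because the check is done on NAME tokens, the word total inside a string literal or a comment is left alone, which a plain text replace would not guarantee. The spacing of the rebuilt source can differ from the original, so the sketch checks the result by executing it rather than by comparing text.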

Summary

There seems to be plenty to play with here. It's also quietly nice that comments are kept as tokens rather than being thrown away.
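As a small illustration of that last point, here is a sketch that pulls out only the comments from a piece of source. The sample `source` is an assumption made up for this example.

```python
import io
import tokenize

source = (
    "# test.py\n"
    "i = 1  # a trailing comment\n"
    'text = "# not a comment"\n'
)

# Keep (row, text) for every COMMENT token; '#' inside a string is a STRING token.
comments = [
    (tok.start[0], tok.string)
    for tok in tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline)
    if tok.type == tokenize.COMMENT
]
print(comments)
```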
