
Command Prompt Programming: Send-off Party

Last updated at 2018-12-22, posted at 2018-12-17

CP932/UTF-8 mutual conversion

The second half: UTF-8 -> CP932 conversion.

How do we pass UTF-8 to nkf?

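Following the same idea as last time,
we would use echo to pipe the string to nkf.
However, that command line is interpreted by cmd.exe (probably),
so wrapping the string in " is required to keep metacharacters from being interpreted.
Also, as we saw before, echo discards everything after a line break.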

Another option is to output a text file with type,
but I would rather not go through a slow device if I can avoid it.
So, what now?

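It is enough if the string handed to echo is not an ordinary string.
We pass some encoded form and have it decoded on the way back.
I can handle the encoding myself,
but since I brought in nkf precisely because character code conversion is a hassle,
I want nkf to do this decoding as well.
Looking through nkf's options, the candidates came down to the following three.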

Base64
Percent-encoding
Character references

Character references are clearly the easiest to encode.

Numeric character references

Roughly speaking, this is a convention that writes a Unicode code point in one of these forms:

&#<decimal>;
&#x<hexadecimal>; <- the one used here
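As a small illustration of the two forms, the sketch below formats one code point both ways. It assumes Lua 5.3 or later (for the `utf8` library and `"\u{...}"` escapes); the variable names are only for illustration.

```lua
-- Assumes Lua 5.3+ (utf8 library and "\u{...}" string escapes).
-- The character is written as an escape so the file encoding does not matter.
local a = "\u{3042}"            -- HIRAGANA LETTER A
local code = utf8.codepoint(a)  -- 0x3042

local dec = string.format("&#%d;", code)   -- decimal form
local hex = string.format("&#x%X;", code)  -- hexadecimal form

print(dec)  --> &#12354;
print(hex)  --> &#x3042;
```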

Let's write the encoding function right away.
For now, it is simply added to the definitions in cp932.lua from the previous article.

cp932.lua

-- add a function to the standard library table utf8
function utf8.numchar(s)
  if s == "" then return s end
  local numchars = {}
  for _, code in utf8.codes(s) do
    table.insert(numchars, string.format("&#x%X;", code))
  end
  return table.concat(numchars, "")
end

lua.exe

> dofile "cp932.lua"
> utf8.numchar(cp932.utf8 "あいうえお")
&#x3042;&#x3044;&#x3046;&#x3048;&#x304A;
> utf8.numchar(cp932.utf8 "あ\nい\nう")
&#x3042;&#xA;&#x3044;&#xA;&#x3046;

Apparently, passing this to nkf --numchar-input makes nkf decode it.
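To make the expected round trip concrete, here is a pure-Lua sketch of the inverse operation. It is not how nkf works internally, and `decode` is a hypothetical helper that only handles the hexadecimal form; nkf additionally converts the result to the requested output encoding, which this sketch does not do.

```lua
-- Assumes Lua 5.3+. Both functions are local illustrations.

-- Same logic as utf8.numchar above, written as a local function.
local function encode(s)
  local out = {}
  for _, code in utf8.codes(s) do
    out[#out + 1] = string.format("&#x%X;", code)
  end
  return table.concat(out)
end

-- Hypothetical decoder: turns every &#x<hex>; back into UTF-8 bytes.
-- The outer parentheses drop gsub's second return value (the match count).
local function decode(s)
  return (s:gsub("&#x(%x+);", function(hex)
    return utf8.char(tonumber(hex, 16))
  end))
end

local text = "\u{3042}\n\u{3044}"
local encoded = encode(text)
print(encoded)                   --> &#x3042;&#xA;&#x3044;
assert(decode(encoded) == text)  -- the round trip restores the original
```

Because the encoded form contains only ASCII letters, digits, `&`, `#` and `;`, the line break inside `text` no longer appears as a raw newline on the command line.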

`utf8.cp932(string)`

cp932.lua

function utf8.cp932(s)
  if s == "" then return s end
  local numchar = utf8.numchar(s)
  -- specifying the output line ending seems to have no effect with '--numchar-input'
  local form = 'echo "%s"| nkf.exe --numchar-input --oc=CP932'
  local file = io.popen(string.format(form, numchar))
  local reply = file:read("a")
  file:close()
  -- strip the surrounding quotes and the trailing LF, then convert every line break to CRLF
  -- n receives the number of matches gsub found
  local str, n = reply:sub(2, -3):gsub("%c", "\r\n")
  return str
end

lua.exe

> dofile "cp932.lua"
> win1 = [[
あ
い
]]
> table.each(win1:bytes(), print)
1       130
2       160
3       10
4       130
5       162
6       10
> unix = cp932.utf8(win1)
> table.each(unix:bytes(), print)
1       227
2       129
3       130
4       10
5       227
6       129
7       132
8       10
> win2 = utf8.cp932(unix)
> table.each(win2:bytes(), print)
1       130
2       160
3       13
4       10
5       130
6       162
7       13
8       10
> win2
あ
い

Looks good to me.
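The trimming step in `utf8.cp932` can be checked without nkf. The sketch below uses a simulated reply; its exact shape (opening quote, decoded text with LF line ends, closing quote, final LF) is an assumption about what `io.popen` returns in text mode, not captured output.

```lua
-- Simulated reply (assumed shape, not real nkf output).
local reply = '"a\nb\n"\n'

local body = reply:sub(2, -3)           -- drop the first byte and the last two
local str, n = body:gsub("%c", "\r\n")  -- each control character becomes CRLF

assert(body == "a\nb\n")
assert(str == "a\r\nb\r\n")
assert(n == 2)                          -- two LFs were replaced
```

Note that `%c` matches every control character, so a tab in the text would also be turned into CRLF; a narrower pattern such as `"\n"` would avoid that if tabs need to survive.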

Fixing `cp932.utf8(string)`

The previous definition has a bug.
The echo %s inside it suffers from the same problem described above.

lua.exe

> cp932.utf8 '"'
"| nkf.exe --ic=CP932 --oc=UTF-8 -Lu

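To keep cmd.exe from interpreting metacharacters,
let's encode this side with numeric character references as well.
However, since this input is CP932, only part of it is encoded.
In addition, nkf used to be called once per input line; it is now called only once.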

cp932.lua

cp932 = {}

local function meta2numchar(line)
  return line:reduce("", function(acc, byte)
    -- in CP932 a byte < 0x40 is never the second byte of a double-byte
    -- character, so replacing it is safe
    -- only '"' and '%' are encoded
    if byte == 0x22 or byte == 0x25 then
      return acc .. string.format("&#x%X;", byte)
    end
    return acc .. string.char(byte)
  end)
end

local function win2unix(encoded)
  if encoded == "" then return encoded end
  local form =
    'echo "%s"| nkf.exe --numchar-input --ic=CP932 --oc=UTF-8 -Lu'
  local file = io.popen(string.format(form, encoded))
  local reply = file:read("a")
  file:close()
  return reply:sub(2, -3)
end

function cp932.utf8(s)
  if s == "" then return s end
  local numchars = {}
  for _, line in ipairs(s:lines()) do
    table.insert(numchars, meta2numchar(line))
  end
  return win2unix(table.concat(numchars, "&#x0A;"))
end

lua.exe

> dofile "cp932.lua"
> cp932.utf8 '"!path!%path%"'
"!path!%path%"
> utf8.cp932(cp932.utf8 '"!path!%path%"^()&|<>=')
"!path!%path%"^()&|<>=

Well, that should be fine.
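`meta2numchar` relies on the `reduce` helper added to strings in the earlier articles. As an alternative sketch that needs only the standard string library, the partial encoding can also be written with `gsub`; `encode_meta` is a hypothetical name, not part of cp932.lua. The two `print` calls show the command line that would be built with and without the encoding.

```lua
-- Alternative to meta2numchar using only the standard string library.
-- '"' (0x22) and '%' (0x25) are single bytes below 0x40, so a byte-wise
-- gsub cannot hit the second byte of a CP932 double-byte character.
local function encode_meta(line)
  return (line:gsub('["%%]', function(c)
    return string.format("&#x%X;", c:byte())
  end))
end

local form = 'echo "%s"| nkf.exe --numchar-input --ic=CP932 --oc=UTF-8 -Lu'
local raw = '"!path!%path%"'

-- Without encoding, the quotes and percent signs reach cmd.exe as they are.
print(string.format(form, raw))
-- With encoding, only '&', '#', 'x', digits and ';' stand in for them.
print(string.format(form, encode_meta(raw)))
```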

Closing ceremony

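In ordinary programming, this closing point would be the opening.
After all, you could simply produce an exe with a compiled language that handles CP932/UTF-8.

But I find this kind of seemingly pointless journey more fun.
Rather than be used, I want to use.

Even if you learn how to use something, does that mean you are really using it?
If you can define what it is you want, the method emerges by itself.
Those are the moments when I feel I am actually using it.

I want the computer to be a tool for creating.

"A tool for using"

However you look at it, that is an odd phrase, isn't it?

The end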
