More than 1 year has passed since last update.

Base 64 を推測で実装してみる

Last updated at 2024-12-04Posted at 2021-12-13

バイナリデータをテキストデータに変換して扱うには？

人が接する多くのソフトウェアがテキスト形式のデータを採用しています。それらのソフトウェア（ S ）にバイト列のデータやファイルを送ると、入出力では制御コードだったりして都合の悪い事が発生します。そこで、バイト列のデータ（ B ）を（ S ）でも扱える形式が欲しくなり、（ B ）を（ S ）で扱えるように変換したテキスト形式（ A ）を考えます。とりあえず

　　数字： 0 から 9 までの 10 種類
　　アルファベット： A から Z までの 26 種類

を使うことを考えます。36 進数までは可能なので、32 種類の文字を選び、5ビット区切り（32進数）のビット列とみなした形式に変換できます。アルファベットの大文字と小文字を区別しない環境（小文字を自動的に大文字に変換する MS-DOS のファイル名やテキストフィルタなど）では、小(または大)文字の使用が躊躇われます。そうでない（アルファベットの大文字と小文字を区別する＝余計な処理をしない）環境では

　　アルファベット： A から Z および a から z までの 52 種類

を使うことができます。文字コード順に並べると 0〜9：A〜Z：a〜z の計 62 種類となりますが、6ビット区切り（64進数）の形式にするには 2 種類の文字が足りません。残り 2 文字は記号文字になるので、どれにするかが悩ましいところです。

Base 64 (RFC 4648) の表を見る

RFC 4648 には採用された文字の一覧があるだろうから、その表を探します。Page 6 に一覧がありました。（文章は面倒臭いので読みません ← ちゃんと読みましょう？）

値	文字	値	文字	値	文字	値	文字	値	文字
0	A	13	N	26	a	39	n	52	0
1	B	14	O	27	b	40	o	53	1
2	C	15	P	28	c	41	p	54	2
3	D	16	Q	29	d	42	q	55	3
4	E	17	R	30	e	43	r	56	4
5	F	18	S	31	f	44	s	57	5
6	G	19	T	32	g	45	t	58	6
7	H	20	U	33	h	46	u	59	7
8	I	21	V	34	i	47	v	60	8
9	J	22	W	35	j	48	w	61	9
10	K	23	X	36	k	49	x	62	+
11	L	24	Y	37	l	50	y	63	/
12	M	25	Z	38	m	51	z	(pad)	=

値と文字の割り当ては分かりました。(pad) が謎ですが、とりあえず単純な変換を考えることにします。

ビット列を 6 ビット区切りにする

この手の規格は、ネットワークバイトオーダーを前提に考えられていることが多いので、ビッグエンディアンと仮定します。間違っていたら、リトルエンディアンを試すことにします。

ASCII 文字列 "Base64" の変換を試みます。文字の値とビット列は

文字	B	a	s	e	6	4
16進	42	61	73	65	36	34
2進	01000010	01100001	01110011	01100101	00110110	00110100

となっていて、これを 6 ビット区切りにし、上記の表を使って変換すると

文字	Q	m	F	z	Z	T	Y	0
2進	010000	100110	000101	110011	011001	010011	011000	110100
16進	10	26	05	33	19	13	18	34
10進	16	38	5	51	25	19	24	52

文字列 "QmFzZTY0" が得られました。macOS 12.0.1 のターミナル上で base64 コマンドを使って確認すると

ターミナル

$ echo -n "Base64" | base64
QmFzZTY0
$ echo -n "QmFzZTY0" | base64 -d
Base64

となり、"Base64" の変換はうまくいきました。ビット並びはビッグエンディアンでよさそうです。

Base64 形式への変換は、3 バイトから 4 文字へ、逆変換は 4 文字から 3 バイトになるようです。文字列 "Base64" は、たまたま 6 バイトだったので、3 で割り切れるバイト数でしたが、割り切れないバイト数だった場合はどうなるのでしょう？

バイト数が 3 の倍数でない場合

ASCII 文字列 "Test" （4バイト）を考えてまみす。 "Base64" のときと同様に

| 文字 | T | e | s | t | ? | ? |
|--:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 16進 | 54 | 65 | 73 | 74 | ? | ? |
| 2進 | 01010100 | 01100101 | 01110011 | 01110100 | ???????? | ???????? |

として、不明な部分 "?" を "0" に置き換えて変換してみると

文字	V	G	V	z	d	A	A	A
2進	010101	000110	010101	110011	011101	000000	000000	000000
16進	15	06	15	33	1D	00	00	00
10進	21	6	21	51	29	0	0	0

文字列 "VGVzdAAA" となりました。これをそのまま元に戻すと "Test" に NULL(00) バイトが 2 つ追加されてしまうでしょう。そこで、バイト数を 3 で割ったときの余りの表現を考えます。

	余り 1	余り 2	余りなし
"t" の 2進	01110100	????????	????????
"t1" の 2進	01110100	00110001	????????

これを 6 ビット区切りにすると

	1 文字目	2 文字目	3 文字目	4 文字目
"t"	011101	00????	??????	??????
16進	1D	00	??	??
10進	29	0	?	?
文字	d	A	?	?
"t1"	011101	000011	0001??	??????
16進	1D	03	04	??
10進	29	3	4	?
文字	d	D	E	?

となるので、余り 1 バイトでは、末尾 2 文字に何か、余り 2 バイトでは、末尾 1 文字に何かを詰めれば良さそうです。詰め物には (pad:=) が指定されていするようです。これで、(pad) の謎が解けました。また、ビットの "?" は "0" とするのが順当でしょう。

ASCII 文字列 "Test" と "Test1" の変換は

　　"Test" → "VGVzdA=="
　　"Test1" → "VGVzdDE="

となります。 base64 コマンドで確認すると

ターミナル

$ echo -n "Test" | base64
VGVzdA==
$ echo -n "VGVzdA==" | base64 -d
Test
$ echo -n "Test1" | base64
VGVzdDE=
$ echo -n "VGVzdDE=" | base64 -d
Test1

となって、(pad:=) の考え方も間違っていないようです。これで基本的な変換方法としては、これ以上もこれ以下もないでしょうから、ここまでの推測でプログラムが書けそうです。

符号・復号プログラム

処理は基本的な変換のみです。

コマンドラインで使用する場合、CRLF 問題が発生することがあります。
(Windows のコマンドプロンプトなど)

Python 版

mybase64.py

#!/usr/bin/env python3

ENCODE_TABLE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/='
DECODE_TABLE = [((chr(n) not in ENCODE_TABLE) and -1 or ENCODE_TABLE.index(chr(n))) for n in range(256)]
DECODE_TABLE[0x2d] = 62  # '-'
DECODE_TABLE[0x5f] = 63  # '_'


def encode(data):
    padding = -len(data) % 3
    padded = data + (b'\x00' * padding)
    cells = [int.from_bytes(padded[dpos:dpos+3], 'big') for dpos in range(0, len(padded), 3)]
    base64 = ''.join([ENCODE_TABLE[(cell >> shift) & 0x3f] for cell in cells for shift in (18, 12, 6, 0)])
    return (base64 if padding == 0 else (base64[:-padding] + ('=' * padding)))


def decode(data):
    if len(data) == 0:
        return b''
    ndata = tuple(filter((lambda d: (d >= 0)), [DECODE_TABLE[ord(c)] for c in data]))
    cells = [ndata[dpos:dpos+4] for dpos in range(0, len(ndata), 4)]
    binary = b''.join([sum([(cell[i] & 63) << ((i ^ 3) * 6) for i in range(4)]).to_bytes(3, 'big') for cell in cells])
    padding = sum([(c >> 6) for c in ndata[-2:]])
    return (binary if padding == 0 else binary[:-padding])


if __name__ == '__main__':
    import os
    import sys

    def usage():
        print('Usage: %s [-d]' % program)
        sys.exit(1)

    def read_binary(): return sys.stdin.buffer.read()
    def read_text(): return sys.stdin.read()
    def write_binary(s): sys.stdout.buffer.write(s)
    def write_text(s): sys.stdout.write(s)

    argv = sys.argv
    program = argv.pop(0)
    encode = True
    if len(argv) == 1:
        if argv[0] == '-d':
            encode = False
        else:
            usage()
    elif len(argv) != 0:
        usage()

    if encode:
        write_text(encode(read_binary()) + '\n')
    else:
        write_binary(decode(read_text()))

C 版

mybase64.c

#include <stdio.h>

/*
 * Base64 encode/decode
 */

static int encode()
{
    const char table[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
    FILE *inp = stdin;
    FILE *out = stdout;
    int data, c, clen;
    int e0, e1, e2, e3;

    for (;;)
    {
        data = 0;
        for (clen = 0; clen < 3; clen++)
        {
            c = fgetc(inp);
            if (c == EOF)
                break;
            data = (data << 8) | (c & 0xff);
        }
        if (clen != 3)
            break;

        e0 = table[(data >> 18) & 0x3f];
        e1 = table[(data >> 12) & 0x3f];
        e2 = table[(data >>  6) & 0x3f];
        e3 = table[(data >>  0) & 0x3f];

        if (fputc(e0, out) == EOF) return 2;
        if (fputc(e1, out) == EOF) return 2;
        if (fputc(e2, out) == EOF) return 2;
        if (fputc(e3, out) == EOF) return 2;

    }
    if (clen == 0)
        return 0;
    if (clen < 3)
        data <<= ((3 - clen) << 3);

    e0 = table[(data >> 18) & 0x3f];
    e1 = table[(data >> 12) & 0x3f];
    e2 = (clen == 1) ? '=' : table[(data >>  6) & 0x3f];
    e3 = '=';

    if (fputc(e0, out) == EOF) return 2;
    if (fputc(e1, out) == EOF) return 2;
    if (fputc(e2, out) == EOF) return 2;
    if (fputc(e3, out) == EOF) return 2;

    return 0;
}

static int decode()
{
    static char table[] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63,
        52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1,  0, -1, -1,
        -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
        15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1,
        -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
        41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    };

    FILE *inp = stdin;
    FILE *out = stdout;
    int data, d, e, c, clen;
    int c0, c1, c2;

    for (;;)
    {
        d = e = data = 0;
        for (clen = 0; clen < 4;)
        {
            c = fgetc(inp);
            if (c == EOF)
                break;
            d = table[c & 0xff];
            if (d < 0)
                continue;
            e += ((c == '=') ? 1 : 0);
            data = (data << 6) | d;
            clen++;
        }
        if (clen < 4)
            return (clen == 0 ? 0 : 2);

        c0 = (data >> 16) & 0xff;
        c1 = (data >>  8) & 0xff;
        c2 = (data >>  0) & 0xff;

        if (fputc(c0, out) == EOF) return 2; if (e == 2) continue;
        if (fputc(c1, out) == EOF) return 2; if (e == 1) continue;
        if (fputc(c2, out) == EOF) return 2;
    }

    return 0;
}

/*
 * flags and options
 */
static const char *program = NULL;
static int flag_decode = 0;

/*
 * main / usage
 */
static int usage()
{
    printf("Usage: %s [options]\n"
           "\n"
           "options are:\n"
           "    -d    decode mode.\n"
           "    -e    encode mode. (default)\n"
           "\n"
           , program);
    return 1;
}

static char *shift_arg(int *argc, char ***argv)
{
    char *arg = NULL;

    if ((*argc) > 0)
    {
        arg = *(*argv);
        --(*argc);
        ++(*argv);
    }
    return arg;
}

#define shift()  (shift_arg(&argc, &argv))

int main(int argc, char **argv)
{
    int new_argc = 0;
    char **new_argv = argv;
    char *arg;
    int s_opt;

    program = shift();
    while (argc > 0)
    {
        arg = shift();
        if ((arg[0] != '-') || (arg[1] == 0))
        {
            new_argv[new_argc++] = arg;
            continue;
        }

        while ((s_opt = *(++arg)) != 0)
        {
            switch (s_opt)
            {
            case 'd':
                flag_decode = !0;
                continue;
            case 'e':
                flag_decode = !!0;
                continue;
            default:
                return usage();
            }
        }
    }

    argc = new_argc;
    argv = new_argv;

    if (argc != 0) return usage();
    return (flag_decode ? decode() : encode());
}

JavaScript 版

mybase64.html

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
    <title>Base64 テスト</title>
  </head>
  <body>
    <h1>Base64 テスト</h1>
    <hr>
    <div>テスト文字の入力領域：</div>
    <blockquote>
      <textarea id="inp" rows="10" cols="80" oninput="onInput()"></textarea>
    </blockquote>
    <table>
      <tr><td>符号化</td><td>：</td><td id="enc"></td></tr>
      <tr><td>復号</td><td>：</td><td id="dec"></td></tr>
    </table>
    <div>ブラウザによる復号：</div>
    <blockquote>
      <iframe id="chk"></iframe>
    </blockquote>
    <hr>
    <script type="text/javascript">
    <!--

      function String2UTF8ByteString(s) {
          let bs = [];
          for (let i = 0; i < s.length; i++) {
              let c = s.charCodeAt(i);
              if (c < 0)
                  throw ('unknown char: ' + c)
              if (c < 0x80) {
                  bs.push(c);
                  continue;
              }
              if (c < 0x800) {
                  bs.push(0xc0 | (c >> 6));
                  bs.push(0x80 | (c & 0x3f));
                  continue;
              }
              if (c < 0x10000) {
                  bs.push(0xe0 | (c >> 12));
                  bs.push(0x80 | ((c >> 6) & 0x3f));
                  bs.push(0x80 | (c & 0x3f));
                  continue;
              }
              throw ('unknown char: ' + c)
          }
          return bs;
      }

      function UTF8ByteString2String(b) {
          let s = '';
          for (let i = 0; i < b.length; i++) {
              let d0 = b[i];
              if (d0 < 0xc0) {
                  s += String.fromCharCode(d0);
                  continue;
              }
              let d1 = b[++i];
              if (d0 < 0xe0) {
                  s += String.fromCharCode(((d0 & 0x3f) << 6) | (d1 & 0x3f));
                  continue;
              }
              let d2 = b[++i];
              if (d0 < 0xf0) {
                  s += String.fromCharCode(((d0 & 0x1f) << 12) | ((d1 & 0x3f) << 6) | (d2 & 0x3f));
                  continue;
              }
              throw ('unknown utf8...');
          }
          return s;
      }

      const EncodeTable = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=';
      const DecodeTable = new Map();
      for (let i = 0; i < EncodeTable.length; i++)
          DecodeTable.set(EncodeTable[i], (i & 0x3f));

      function encodeBase64(b) {
          let nb = b.length;
          let rb = nb % 3;
          let lb = nb - rb;
          let s = '';
          for (let i = 0; i < lb; i += 3) {
              let d0 = b[i];
              let d1 = b[i + 1];
              let d2 = b[i + 2];
              let d = (d0 << 16) | (d1 << 8) | d2;
              s += EncodeTable[(d >> 18) & 0x3f];
              s += EncodeTable[(d >> 12) & 0x3f];
              s += EncodeTable[(d >>  6) & 0x3f];
              s += EncodeTable[d & 0x3f];
          }
          if (rb != 0) {
              let d0 = b[lb];
              let d1 = ((rb > 1) ? b[lb + 1] : 0);
              let d = (d0 << 16) | (d1 << 8);
              s += EncodeTable[(d >> 18) & 0x3f];
              s += EncodeTable[(d >> 12) & 0x3f];
              if (rb == 2)
                  s += EncodeTable[(d >>  6) & 0x3f];
              else
                  s += '=';
              s += '=';
          }
          return s;
      }

      function decodeBase64(s) {
          let table = DecodeTable;
          let t = [];
          for (let c of s) {
              let v = table.get(c);
              if (v != undefined)
                  t.push(v);
          }
          let ns = t.length;
          let b = []
          for (let i = 0; i < ns; i += 4) {
              let d0 = t[i];
              let d1 = t[i + 1];
              let d2 = t[i + 2];
              let d3 = t[i + 3];
              let d = (d0 << 18) | (d1 << 12) | (d2 << 6) | d3;
              b.push((d >> 16) & 0xff);
              b.push((d >> 8) & 0xff);
              b.push(d & 0xff);
          }
          if (s[-1] == '=') {
              b.pop()
              if (s[-2] == '=')
                  b.pop();
          }
          return b;
      }

      function onInput() {
          let inp = document.getElementById("inp");
          let enc = document.getElementById("enc");
          let dec = document.getElementById("dec");
          let chk = document.getElementById("chk");

          let s = inp.value;
          let u = String2UTF8ByteString(s);
          let e = encodeBase64(u);
          let d = decodeBase64(e);
          let t = UTF8ByteString2String(d);

          chk.setAttribute('src', 'data:text/plain;charset=utf-8;base64,' + e)
          enc.innerText = e;
          dec.innerText = t;
      }

      // -->
    </script>
  </body>
</html>

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up