More than 5 years have passed since last update.

ブロック化ハフマン符号を実装してみた（JavaScript）

Posted at 2019-01-08

勉強したので習作。計算効率は完全に無視。ハフマン符号がわかっている人向け。

参考: ハフマン符号を実装してみた（JavaScript）

1. 概念理解

解きたい課題（ハフマン符号と同じ）: N種類の値を取りうる事象があり、それぞれの値の生成確率がわかっている。できるだけ圧縮効率をあげて 0/1で構成されるビット列にエンコード/デコードしたい。

アプローチ: ハフマン符号は、「1事象を 1符号語に割り当てる」という枠組みで最大の圧縮効率。ブロック化ハフマン符号では、「N事象を 1符号語に割り当てる」という枠組みを使って、より圧縮率を高める。

イメージ図:

ABAACBBAC --[ブロック化]--> ABA ACB BAC --[エンコード]--> 010010010111000111000111
        --[デコード]--> ABA ACB BAC --[ブロック化解除]--> ABAACBBAC

2. 拡大情報源という概念

A が 90% の確率で生成され、 B が 10% の確率で生成される情報源を考える。二つの事象をブロック化したとすると、以下のような確率分布が得られる:

2次に拡大した情報源.txt


事象    生成確率
--------------------
AA      0.81
AB      0.09
BA      0.09
BB      0.01

これを「4種類の事象が上の表の生成確率で与えられる情報源」とみなすことが可能で、そうするとこの新たに作った情報源（拡大情報源）に対してハフマン符号を使ってみようという気になる。

同様に、三つの事象をブロック化すると、以下のような確率分布が得られる:

3次に拡大した情報源.txt

事象    生成確率
--------------------
AAA    0.729
AAB    0.081
...
(中略)
...
BBB    0.001

これを「8種類の事象が上の表の生成確率で与えられる情報源」とみなすことが可能で、そうするとこの新たに作った情報源（拡大情報源）に対してハフマン符号を使ってみようという気になる。

3. 拡大情報源の確率分布

やるべきことはただ一つ。上記のように、与えられた整数 N に対して、N次の拡大情報源の統計情報を自動的に作ること。

const DATA = [['A', 0.9], ['B', 0.1]];

function statsOf(stat, num) {
  const expandOne = (before) => {
    const result = [];
    for (const [id, prob] of before) {
      stat.forEach(([appendID, p]) => result.push([id + appendID, prob * p]));
    }
    return result;
  }
  // do work
  let result = stat;
  for (let i = 0; i < num - 1; i++) {
    result = expandOne(result);
  }
  return result;
}

console.log(statsOf(DATA, 2));
/*
[ [ 'AA', 0.81 ],
  [ 'AB', 0.09000000000000001 ],
  [ 'BA', 0.09000000000000001 ],
  [ 'BB', 0.010000000000000002 ] ]
*/

console.log(statsOf(DATA, 3));
/*
[ [ 'AAA', 0.7290000000000001 ],
  [ 'AAB', 0.08100000000000002 ],
  [ 'ABA', 0.08100000000000002 ],
  [ 'ABB', 0.009000000000000001 ],
  [ 'BAA', 0.08100000000000002 ],
  [ 'BAB', 0.009000000000000001 ],
  [ 'BBA', 0.009000000000000003 ],
  [ 'BBB', 0.0010000000000000002 ] ]
*/

4. エンコーダ・デコーダの実装

統計情報が得られれば、統計情報→ハフマン木→エンコーダ・デコーダとサクサク作っていける。ハフマン符号を実装してみた（JavaScript）を少しいじるだけ。

// ブロック化ハフマン符号の各種関数の生成関数
function blockHuffman (stat, n) {
  const stats = statsOf(stat, n);
  const tree = treeOf(stats);
  // encode
  const _encode = encoder(dictOf(tree));
  const encode = xs => {
    const splitted = xs.reduce((a, x, i) => {
      if (i % n === 0) { a.push(x) ; return a; }
      else { a[a.length - 1] += x; return a; }
    }, []); // [A, A, A, A] -> [AA, AA]
    return _encode(splitted);
  };
  // decode
  const _decode = decoder(tree);
  const decode = bits => _decode(bits).reduce((a, x) => a.concat([...x]), []);
  return { encode, decode, stats, tree, dict: dictOf(tree) };
}

const bh2 = blockHuffman(DATA, 2);
console.log(bh2.dict); // -> { AA: '0', BA: '100', BB: '101', AB: '11' }
console.log(bh2.encode([...'AAABBABB'])); // -> 011100101 
console.log(bh2.decode(bh2.encode([...'AAABBABB'])).join('')); // -> AAABBABB

5. 遊んでみる

１事象あたりの平均符号長が、N (=拡大の次数)を大きくしていくとどう変化していくかを見てみる。圧縮の理論限界＝情報源のエントロピー 0.4690 に近づいていくのがわかる。

const avLengths = [1, 2, 3, 4, 5, 6].map(n => {
  const { stats, dict } = blockHuffman(DATA, n);
  return stats.reduce((a, x) => a + x[1] * dict[x[0]].length, 0) / n;
});
console.log(avLengths);
/*
[ 1,
  0.6450000000000001,
  0.5326666666666667,
  0.49255000000000004,
  0.480194,
  0.4701568333333332 ]
*/

6 コード

// 確率分布からハフマン木を構築
function treeOf (stats) {
  const arr = stats.map(x => [...x]);
  while (arr.length !== 1) {
    arr.sort((a, b) => a[1] < b[1]); // sort by frequency
    const [a, b] = arr.splice(-2); // pop last 2 nodes
    arr.push([[a[0], b[0]], a[1] + b[1]]); // push 1 node
  }
  return arr[0][0];
}

// ハフマン木からエンコード用の辞書を作成
function dictOf(tree) {
  const result = {};
  const rfn = (obj, ctx) => {
    if (typeof obj === 'string') {
      result[obj] = ctx;
      return;
    }
    rfn(obj[0], ctx + '0');
    rfn(obj[1], ctx + '1');
  };
  rfn(tree, '');
  return result;
}

// encode 関数の生成関数。辞書を渡すと encode 関数を返す
const encoder = dict => ids => ids.map(id => dict[id]).join('');

// decode 関数の生成関数。ハフマン木を渡すと decode 関数を返す
const decoder = tree => bits => {
  let current = tree;
  // define 1-bit decoder
  const _decode = bit => {
    current = bit === '0' ? current[0] : current[1];
    if (typeof current === 'string') { // decoded!
      const result = current;
      current = tree;
      return result;
    }
    return null; // needs more to be decoded
  };
  // do the work
  const result = [];
  for (const bit of bits) {
    const val = _decode(bit);
    if (val) {
      result.push(val);
    }
  }
  return result;
};

// num 次に拡大した情報源の確率分布を計算
function statsOf(stat, num) {
  const expandOne = (before) => {
    const result = [];
    for (const [id, prob] of before) {
      stat.forEach(([appendID, p]) => result.push([id + appendID, prob * p]));
    }
    return result;
  }
  // do work
  let result = stat;
  for (let i = 0; i < num - 1; i++) {
    result = expandOne(result);
  }
  return result;
}

// ブロック化ハフマン符号の各種関数の生成関数
function blockHuffman (stat, n) {
  const stats = statsOf(stat, n);
  const tree = treeOf(stats);
  // encode
  const _encode = encoder(dictOf(tree)); // internal encode
  const encode = xs => {
    const splitted = xs.reduce((a, x, i) => {
      if (i % n === 0) { a.push(x) ; return a; }
      else { a[a.length - 1] += x; return a; }
    }, []); // [A, A, A, A] -> [AA, AA]
    return _encode(splitted);
  };
  // decode
  const _decode = decoder(tree);
  const decode = bits => _decode(bits).reduce((a, x) => a.concat([...x]), []);
  return { encode, decode, stats, tree, dict: dictOf(tree) };
}

// 事象生成ジェネレーター。確率分布に従って事象列を生成
const observe = num => {
  const result = [];
  [...Array(num).keys()].forEach(() => { // num times
    const rval = Math.random();
    const total = DATA.reduce((a, x) => a + x[1], 0);
    let acc = 0;
    for (const [id, freq] of DATA) {
      acc += freq / total;
      if (acc > rval) {
        result.push(id);
        return;
      }
    }
  });
  return result;
};


const DATA = [['A', 0.9], ['B', 0.1]];
const avLengths = [1, 2, 3, 4, 5, 6].map(n => {
  const { stats, dict } = blockHuffman(DATA, n);
  return stats.reduce((a, x) => a + x[1] * dict[x[0]].length, 0) / n;
});
console.log(avLengths);

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up