
Converting between strings and byte sequences in JavaScript and Node.js

Last updated at 2023-04-05. Posted at 2017-09-28.

Overview

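This article looks at converting between strings and byte sequences with the Encoding API. A byte sequence is represented as a Uint8Array, and the Fetch API supports sending and receiving Uint8Array data.
In Node.js, use Buffer. When this article was first written (2017), using the Encoding API in Node.js required a polyfill; current Node.js releases provide TextEncoder and TextDecoder as globals.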

Converting between strings and Uint8Array

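TextEncoder converts a string to a Uint8Array. The encoding is utf-8; current implementations ignore an argument passed to the constructor, as in the second call below.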

> (new TextEncoder).encode('あ')
Uint8Array(3) [227, 129, 130]
> (new TextEncoder('utf-8')).encode('あ')
Uint8Array(3) [227, 129, 130]
> (new TextEncoder).encoding
"utf-8"

To get hexadecimal strings, first convert the Uint8Array to a plain array.

> Array.from((new TextEncoder('utf-8')).encode('あ')).map(v => v.toString(16))
(3) ["e3", "81", "82"]
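
The conversion to a plain array matters because a typed array's own map returns another typed array, which cannot hold strings. The following is a small sketch of the difference, assuming a runtime with a global TextEncoder; the variable names are only illustrative.

```javascript
const bytes = new TextEncoder().encode('あ');

// Uint8Array.prototype.map returns another Uint8Array, so each hex string
// is converted back to a number: 'e3' becomes NaN and is stored as 0,
// while '81' and '82' are read as decimal 81 and 82.
const viaTypedArray = bytes.map((v) => v.toString(16));

// Array.from yields a plain array, so the strings survive.
const viaArray = Array.from(bytes, (v) => v.toString(16));

console.log(viaTypedArray); // holds 0, 81, 82
console.log(viaArray);      // holds 'e3', '81', '82'
```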

TextDecoder converts a Uint8Array to a string.

> (new TextDecoder).decode(Uint8Array.of(0xe3, 0x81, 0x82))
"あ"
> (new TextDecoder).decode(Uint8Array.from([0xe3, 0x81, 0x82]))
"あ"
> (new TextDecoder).decode(new Uint8Array([0xe3, 0x81, 0x82]))
"あ"

The default encoding is utf-8.

> (new TextDecoder('utf-8')).decode(Uint8Array.of(0xe3, 0x81, 0x82))
"あ"

You can also pass an array to Uint8Array.of by using spread syntax.

> (new TextDecoder).decode(Uint8Array.of(...[0xe3, 0x81, 0x82]))
"あ"
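
Putting the two classes together, a round trip can be sketched as follows. The helper names toBytes and fromBytes are hypothetical and not part of the Encoding API; the sketch assumes global TextEncoder and TextDecoder.

```javascript
// toBytes and fromBytes are hypothetical helper names for this sketch.
const encoder = new TextEncoder();
const decoder = new TextDecoder(); // utf-8 by default

function toBytes(text) {
  return encoder.encode(text); // string -> Uint8Array (UTF-8)
}

function fromBytes(bytes) {
  return decoder.decode(bytes); // Uint8Array -> string
}

const bytes = toBytes('あい');
console.log(bytes.length);     // 6 (three bytes per character)
console.log(fromBytes(bytes)); // あい
```

Reusing one encoder and one decoder avoids constructing a new object for every call.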

Converting between Uint8Array and hexadecimal strings

Convert the Uint8Array to an array and join the elements.

> Uint8Array.of(0xe3, 0x81, 0x82)
Uint8Array(3) [227, 129, 130]
> Array.from(Uint8Array.of(0xe3, 0x81, 0x82)).map(v => v.toString(16).padStart(2, '0')).join('')
"e38182"

Next, convert a hexadecimal string back to a Uint8Array.

> str = 'e38182'
"e38182"
> arr = new Uint8Array(str.match(/.{1,2}/g).map(v => parseInt(v, 16)))
Uint8Array(3) [227, 129, 130]

Another approach splits the string into two-character chunks with Array.from.

> str = 'e38182'
"e38182"
> Array.from({length: Math.ceil(str.length/2)}, (v, i) => str.substr(i * 2, 2))
(3) ["e3", "81", "82"]

The same split can be written with a generator as follows.

> str = 'e38182'
'e38182'
> Array.from((function*(index, max, step) {
  while (index < max) {
    yield str.substr(index, step);
    index += step;
  }
})(0, str.length, 2));
[ 'e3', '81', '82' ]
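
The snippets above assume well-behaved input. As a hedged sketch, the two directions could be wrapped in helpers that pad single-digit bytes and reject malformed hexadecimal strings. The names bytesToHex and hexToBytes are hypothetical, and throwing a RangeError is just one possible design choice.

```javascript
// bytesToHex and hexToBytes are hypothetical helper names for this sketch.
function bytesToHex(bytes) {
  // padStart keeps the leading zero, so 0x0a becomes "0a" rather than "a".
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

function hexToBytes(hex) {
  // Reject odd lengths and non-hex characters up front; otherwise parseInt
  // would return NaN, which a Uint8Array silently stores as 0.
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new RangeError('expected an even-length hexadecimal string');
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

console.log(bytesToHex(Uint8Array.of(0xe3, 0x81, 0x82))); // e38182
console.log(bytesToHex(Uint8Array.of(0, 10, 255)));       // 000aff
console.log(hexToBytes('e38182').length);                 // 3
```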

Byte size

The byteLength property of a Uint8Array gives the size in bytes.

> (new TextEncoder).encode('あ').byteLength
3

You can also get the byte size by counting one byte at a time in a for-of loop.

> size = 0; for(v of (new TextEncoder).encode('あ')) { size++; }; size
3
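
The byte size generally differs from the string's length property, which counts UTF-16 code units. A minimal sketch of the comparison follows, assuming a global TextEncoder; utf8ByteLength is a hypothetical helper name.

```javascript
// utf8ByteLength is a hypothetical helper name for this sketch.
const encoder = new TextEncoder();

function utf8ByteLength(text) {
  return encoder.encode(text).byteLength;
}

for (const text of ['a', 'あ', '😀']) {
  // length counts UTF-16 code units; utf8ByteLength counts UTF-8 bytes.
  console.log(text, text.length, utf8ByteLength(text));
}
// a  -> length 1, 1 byte
// あ -> length 1, 3 bytes
// 😀 -> length 2, 4 bytes
```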

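Converting ill-formed code unit sequences and byte sequences (added April 2023)

In UTF-16, a character in the range U+10000 to U+10FFFF is represented internally by a "surrogate pair" that combines two 16-bit code units. The leading and trailing units are called the "high surrogate" (U+D800 to U+DBFF) and the "low surrogate" (U+DC00 to U+DFFF), respectively.

The problem is how to handle lone surrogates. When the data is sent to a server or a database, the receiving program may fail to handle them correctly, or they may be used as an attack vector, so they need to be replaced with the replacement character (U+FFFD).

First, check whether an ordinary character is well-formed as data.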

> "あ".isWellFormed()
true

Next, confirm that a lone high surrogate is reported as ill-formed.

> "\u{D800}".isWellFormed()
false

Now replace the high surrogate with the replacement character (U+FFFD).

> "\u{D800}".toWellFormed()
'�'
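
isWellFormed and toWellFormed are recent additions, so older runtimes may lack them. As a hedged sketch, a regular-expression fallback could look like the following. The names LONE_SURROGATE, isWellFormedCompat, and toWellFormedCompat are hypothetical, and the pattern relies on lookbehind support in the regular expression engine.

```javascript
// Hypothetical names; a sketch, not a complete polyfill.
// Without the "u" flag the pattern works on UTF-16 code units: it matches a
// high surrogate not followed by a low one, or a low surrogate not preceded
// by a high one.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

function isWellFormedCompat(text) {
  if (typeof text.isWellFormed === 'function') return text.isWellFormed();
  // search() always scans from index 0, regardless of the g flag.
  return text.search(LONE_SURROGATE) === -1;
}

function toWellFormedCompat(text) {
  if (typeof text.toWellFormed === 'function') return text.toWellFormed();
  return text.replace(LONE_SURROGATE, '\uFFFD');
}

console.log(isWellFormedCompat('あ'));      // true
console.log(isWellFormedCompat('\uD800')); // false
console.log(toWellFormedCompat('at\uD800tack') === 'at\uFFFDtack'); // true
```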

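Carelessly deleting lone surrogates can introduce a vulnerability: the fragments on either side join into a string that an earlier check may not have seen.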

> "at\u{D800}tack".replace("\u{D800}", "")
'attack'
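
The Encoding API itself replaces rather than deletes. The sketch below, which assumes global TextEncoder and TextDecoder, shows that encode turns a lone surrogate into the bytes of U+FFFD, and that an ill-formed byte sequence is either replaced or rejected depending on the fatal option. decodeStrict is a hypothetical helper name.

```javascript
// String side: encode() emits U+FFFD (ef bf bd) for a lone surrogate
// instead of dropping it, so "at" and "tack" stay separated.
const encoded = new TextEncoder().encode('at\uD800tack');
console.log(Array.from(encoded, (b) => b.toString(16)).join(' '));
// 61 74 ef bf bd 74 61 63 6b

// Byte side: 0xff never appears in well-formed UTF-8.
const invalid = Uint8Array.of(0x61, 0xff, 0x62);

// By default the decoder substitutes U+FFFD for the bad byte.
const lenient = new TextDecoder().decode(invalid);
console.log(lenient === 'a\uFFFDb'); // true

// decodeStrict is a hypothetical helper: with fatal set to true,
// decode() throws a TypeError instead of substituting.
function decodeStrict(bytes) {
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

try {
  decodeStrict(invalid);
} catch (e) {
  console.log('rejected:', e instanceof TypeError);
}
```

Whether to replace or reject depends on the application; rejecting is the stricter choice when the bytes come from an untrusted source.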

Node.js

Installing text-encoding

On a Node.js version without a built-in Encoding API, install the text-encoding package.

yarn add text-encoding

To load the classes in CommonJS form, write the following.

const textEncoding = require('text-encoding');
const TextEncoder = textEncoding.TextEncoder;
const TextDecoder = textEncoding.TextDecoder;

To load them in ES2015 module form, Node.js had to be started with --experimental-modules at the time of writing. As of September 2017, the imports had to be written as follows.

import * as textEncoding from 'text-encoding';
const TextEncoder = textEncoding.default.TextEncoder;
const TextDecoder = textEncoding.default.TextDecoder;
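
Where the runtime already ships the Encoding API, the polyfill may be unnecessary. The sketch below assumes a Node.js release whose built-in util module exports TextEncoder and TextDecoder, and a CommonJS script; the names Encoder and Decoder are only illustrative.

```javascript
// Assumption: a Node.js release whose built-in util module exports the
// Encoding API classes. Encoder and Decoder are illustrative names.
const util = require('util');

// Prefer the globals when present, otherwise fall back to util.
const Encoder = globalThis.TextEncoder || util.TextEncoder;
const Decoder = globalThis.TextDecoder || util.TextDecoder;

const bytes = new Encoder().encode('あ');
console.log(bytes.length);                // 3
console.log(new Decoder().decode(bytes)); // あ
```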

Converting between strings, Buffer, and Uint8Array

To convert a string to a Buffer, use the from method. The default encoding is utf-8.

> Buffer.from('あ')
<Buffer e3 81 82>
> Buffer.from('あ', 'utf-8')
<Buffer e3 81 82>

To convert a Buffer to a string, use toString.

> Buffer.from('あ').toString()
'あ'
> Buffer.from('あ').toString('utf-8')
'あ'

Uint8Array, introduced in ES2015, can read a Buffer.

> Uint8Array.from(Buffer.from('あ'))
Uint8Array [ 227, 129, 130 ]

Conversely, a Buffer can read a Uint8Array.

> Buffer.from(Uint8Array.from(Buffer.from('あ'))).toString()
'あ'
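
Both conversions above copy the bytes. Since Buffer is a subclass of Uint8Array, a copy is often not needed at all, and a Buffer can also be created as a view over existing memory. The following sketch illustrates the difference under Node.js; the variable names are only illustrative.

```javascript
const buf = Buffer.from('あ');

// A Buffer is already a Uint8Array, so APIs that take a Uint8Array accept it.
console.log(buf instanceof Uint8Array); // true
const text = new TextDecoder().decode(buf);
console.log(text); // あ

// Uint8Array.from copies: changing the copy leaves the Buffer untouched.
const copy = Uint8Array.from(buf);
copy[0] = 0;
console.log(buf[0] === 0xe3); // true

// Passing the underlying ArrayBuffer (with offset and length) creates a
// Buffer that shares memory with the Uint8Array instead of copying it.
const view = Buffer.from(copy.buffer, copy.byteOffset, copy.byteLength);
view[0] = 0xe3;
console.log(copy[0] === 0xe3); // true
```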

Converting between Buffer and hexadecimal strings

Specify 'hex' as the encoding.

> Buffer.from('あ').toString('hex')
'e38182'
> Buffer.from([0xe3, 0x81, 0x82]).toString('hex')
'e38182'
> Buffer.from('e38182', 'hex')
<Buffer e3 81 82>
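
One caveat is that Buffer.from with 'hex' does not throw on malformed input; decoding stops at the first pair that is not valid hexadecimal. A hedged sketch of a stricter wrapper follows; hexToBuffer is a hypothetical name, and throwing a RangeError is one possible design choice.

```javascript
// Decoding stops silently at the first invalid pair.
const truncated = Buffer.from('e381zz', 'hex');
console.log(truncated.length); // 2, not 3

// hexToBuffer is a hypothetical helper that validates before decoding.
function hexToBuffer(hex) {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new RangeError('expected an even-length hexadecimal string');
  }
  return Buffer.from(hex, 'hex');
}

console.log(hexToBuffer('e38182').toString()); // あ
```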

Converting between Buffer, Uint8Array, and arrays

Array.from can be used to convert a Buffer or a Uint8Array to an array.

> Buffer.from([0xe3, 0x81, 0x82])
<Buffer e3 81 82>
> Uint8Array.from([0xe3, 0x81, 0x82])
Uint8Array [ 227, 129, 130 ]
> Array.from(Buffer.from([0xe3, 0x81, 0x82]))
[ 227, 129, 130 ]
> Array.from(Uint8Array.from([0xe3, 0x81, 0x82]))
[ 227, 129, 130 ]

Array.from also works on iterators (values and entries).

> Array.from(Buffer.from([0xe3, 0x81, 0x82]).values())
[ 227, 129, 130 ]
> Array.from(Uint8Array.from([0xe3, 0x81, 0x82]).values())
[ 227, 129, 130 ]
> Array.from(Buffer.from([0xe3, 0x81, 0x82]).entries())
[ [ 0, 227 ], [ 1, 129 ], [ 2, 130 ] ]
> Array.from(Uint8Array.from([0xe3, 0x81, 0x82]).entries())
[ [ 0, 227 ], [ 1, 129 ], [ 2, 130 ] ]

The ... operator (spread syntax) can be used as well.

> [...Buffer.from([0xe3, 0x81, 0x82]).values()]
[ 227, 129, 130 ]
> [...Buffer.from([0xe3, 0x81, 0x82]).entries()]
[ [ 0, 227 ], [ 1, 129 ], [ 2, 130 ] ]

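Byte size

Buffer.byteLength returns the number of bytes a string occupies in the given encoding (utf-8 by default).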

> Buffer.byteLength('あいうえお')
15
> Buffer.byteLength('あいうえお', 'utf-8')
15
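
The result depends on the encoding argument. A brief sketch comparing a few cases under Node.js follows; the variable name text is only illustrative.

```javascript
const text = 'あいうえお';

console.log(text.length);                         // 5 UTF-16 code units
console.log(Buffer.byteLength(text));             // 15 bytes as UTF-8
console.log(Buffer.byteLength(text, 'utf16le'));  // 10 bytes as UTF-16LE
console.log(Buffer.byteLength('😀'));             // 4 bytes as UTF-8

// The UTF-8 result matches what TextEncoder produces.
console.log(new TextEncoder().encode(text).byteLength); // 15
```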
