
A small tool: getting Unicode surrogate codes

Last updated at 2022-02-11, posted at 2022-02-11

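This is a small tool for when you want to write surrogate codes directly.

Where one element of a Unicode string is 16 bits wide (for example JavaScript, or wchar_t in MSVC), a character whose code point is 0x10000 (65536) or above cannot be stored directly, so it is stored as a pair of surrogate codes instead.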
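As a worked illustration of that encoding, here is a minimal sketch of the usual UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The helper names `to_surrogate_pair` and `utf16_units` are made up for this example and are not part of the tool below; the second one only cross-checks the result against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary code point: %#x' % code_point)
    offset = code_point - 0x10000       # what remains fits in 20 bits
    high = 0xD800 | (offset >> 10)      # upper 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)     # lower 10 bits -> low surrogate
    return high, low


def utf16_units(text):
    """Return the 16-bit code units that Python's UTF-16 codec produces."""
    data = text.encode('utf-16-be')
    return [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]


high, low = to_surrogate_pair(0x1F0CF)
print('%04x %04x' % (high, low))                    # d83c dccf
print(utf16_units(chr(0x1F0CF)) == [high, low])     # True
```

For 0x1f0cf the offset is 0xf0cf, whose upper 10 bits are 0x3c and lower 10 bits are 0xcf, which gives 0xd83c and 0xdccf.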

surrogatefy.py

#!/usr/bin/env python3

import argparse


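def surrogatefy(code):
    # Code points below 0x10000 fit in one 16-bit unit; others become a surrogate pair.
    # (code >> 10) - 64 equals (code - 0x10000) >> 10, because 0x10000 >> 10 == 64.
    return [code] if code < 0x10000 else [(0xd800 | ((code >> 10) - 64)), (0xdc00 | (code & 0x03ff))]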


def print_code(msg, prefix, code, separator):
    print('\'%s\':0x%x -> %s' % (msg, code, separator.join(['%s%04x' % (prefix, c) for c in surrogatefy(code)])))


parser = argparse.ArgumentParser()
parser.add_argument('-c', '--code')
parser.add_argument('-m', '--message')
parser.add_argument('-p', '--prefix', default='0x')
parser.add_argument('-s', '--separator', default=',')
args = parser.parse_args()

if args.code or args.message:
    if args.code:
        print_code(args.code, args.prefix, int(args.code, 0), args.separator)
    if args.message:
        for c in args.message:
            print_code(c, args.prefix, ord(c), args.separator)
else:
    parser.print_help()

Output

$ python3 surrogatefy.py -m '🃏🇯🇵'
'🃏':0x1f0cf -> 0xd83c,0xdccf
'🇯':0x1f1ef -> 0xd83c,0xddef
'🇵':0x1f1f5 -> 0xd83c,0xddf5
$ python3 surrogatefy.py -c 0x1f0cf
'0x1f0cf':0x1f0cf -> 0xd83c,0xdccf
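To sanity-check output like this, the conversion can be run in reverse. The following is a small sketch, not part of surrogatefy.py; `from_surrogate_pair` is a hypothetical helper name, and it assumes the input really is a high surrogate followed by a low surrogate.

```python
def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a surrogate pair: %#x, %#x' % (high, low))
    # Undo the split: restore the two 10-bit halves, then add back 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


for high, low in [(0xd83c, 0xdccf), (0xd83c, 0xddef), (0xd83c, 0xddf5)]:
    print('0x%04x,0x%04x -> 0x%x' % (high, low, from_surrogate_pair(high, low)))
```

Feeding in the three pairs printed above should give back 0x1f0cf, 0x1f1ef and 0x1f1f5.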

Testing in Node.js

$ node
Welcome to Node.js v16.13.1.
Type ".help" for more information.
> s = String.fromCodePoint(0x1f0cf, 0x1f1ef, 0x1f1f5)
'🃏🇯🇵'
> s.length
6
> [s.charCodeAt(0).toString(16), s.charCodeAt(1).toString(16)]
[ 'd83c', 'dccf' ]
> [s.charCodeAt(2).toString(16), s.charCodeAt(3).toString(16)]
[ 'd83c', 'ddef' ]
> [s.charCodeAt(4).toString(16), s.charCodeAt(5).toString(16)]
[ 'd83c', 'ddf5' ]
> s.codePointAt(0).toString(16)
'1f0cf'
> s.codePointAt(2).toString(16)
'1f1ef'
> s.codePointAt(4).toString(16)
'1f1f5'
> for (const c of s) console.log(c, c.codePointAt(0).toString(16));
🃏 1f0cf
🇯 1f1ef
🇵 1f1f5
undefined
> '\u{1f0cf},\ud83c\udccf,\u{1f1ef}\u{1f1f5},\ud83c\uddef\ud83c\uddf5'
'🃏,🃏,🇯🇵,🇯🇵'
> '🇯🇵'.length
4

The Japanese flag 🇯🇵 is a combination of the two code points 0x1f1ef and 0x1f1f5, and each of them is stored as a surrogate pair, so '🇯🇵'.length is 4.
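The same count can be reproduced from Python, whose `len()` counts code points rather than 16-bit units. This is only a sketch; `utf16_length` is a hypothetical helper that counts UTF-16 code units by encoding the string, which is what JavaScript's `length` reports.

```python
def utf16_length(text):
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode('utf-16-le')) // 2


flag = '\U0001F1EF\U0001F1F5'   # regional indicators J and P, shown as the flag
print(len(flag))                # 2 code points
print(utf16_length(flag))       # 4 UTF-16 code units, two per code point
```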
