것 = Hangul: detecting the language of a string
I want to use the kinds of characters appearing in a string (Hangul, Arabic, Thai, etc.) as one ingredient, i.e. one machine-learning feature, for tasks such as spam detection.
Even if full language detection is out of reach, can I at least find out whether the characters in a string are Hangul, Hiragana, Katakana, and so on?
TL;DR (the Unicode Character Database, a code-point allocation table)
Unicode gives a name to each block of characters belonging to a script, so we can look up the name of the block a character falls in.
- Block table of the Unicode Character Database:
- https://www.unicode.org/Public/UNIDATA/Blocks.txt @ Unicode.org
In other words, using the names attached to the blocks, we can report which blocks (ranges) of characters a string uses and feed that in as one ingredient of language detection.
- The blocks are defined by Unicode.org, and the mapping table is called the Unicode Character Database.
- Look up the code point of the (UTF-8) character, then use this table to find the block name of the matching range; that identifies which script the character belongs to. (See the TS;DR section for what a "code point" is.)
🐒 Note: this article only implements character-type detection, not the final language-detection step.
[For copy-pasting] Unicode Character Database block list in JSON
The key-value relation is { "starting code point (START ADDRESS)": "corresponding block name" }. A block's ending code point (END ADDRESS) is the next block's starting code point (START ADDRESS) minus 1.
For example, for "Basic Latin" the next block, "Latin-1 Supplement", starts at "0080", so the END is "0x0080 - 0x1" = "0x007F". That is, the code-point range of "Basic Latin" is "U+0000" to "U+007F".
{
"0000": "Basic Latin",
"0080": "Latin-1 Supplement",
"0100": "Latin Extended-A",
"0180": "Latin Extended-B",
"0250": "IPA Extensions",
"02B0": "Spacing Modifier Letters",
"0300": "Combining Diacritical Marks",
"0370": "Greek and Coptic",
"0400": "Cyrillic",
"0500": "Cyrillic Supplement",
"0530": "Armenian",
"0590": "Hebrew",
"0600": "Arabic",
"0700": "Syriac",
"0750": "Arabic Supplement",
"0780": "Thaana",
"07C0": "NKo",
"0800": "Samaritan",
"0840": "Mandaic",
"0860": "Syriac Supplement",
"08A0": "Arabic Extended-A",
"0900": "Devanagari",
"0980": "Bengali",
"0A00": "Gurmukhi",
"0A80": "Gujarati",
"0B00": "Oriya",
"0B80": "Tamil",
"0C00": "Telugu",
"0C80": "Kannada",
"0D00": "Malayalam",
"0D80": "Sinhala",
"0E00": "Thai",
"0E80": "Lao",
"0F00": "Tibetan",
"1000": "Myanmar",
"10A0": "Georgian",
"1100": "Hangul Jamo",
"1200": "Ethiopic",
"1380": "Ethiopic Supplement",
"13A0": "Cherokee",
"1400": "Unified Canadian Aboriginal Syllabics",
"1680": "Ogham",
"16A0": "Runic",
"1700": "Tagalog",
"1720": "Hanunoo",
"1740": "Buhid",
"1760": "Tagbanwa",
"1780": "Khmer",
"1800": "Mongolian",
"18B0": "Unified Canadian Aboriginal Syllabics Extended",
"1900": "Limbu",
"1950": "Tai Le",
"1980": "New Tai Lue",
"19E0": "Khmer Symbols",
"1A00": "Buginese",
"1A20": "Tai Tham",
"1AB0": "Combining Diacritical Marks Extended",
"1B00": "Balinese",
"1B80": "Sundanese",
"1BC0": "Batak",
"1C00": "Lepcha",
"1C50": "Ol Chiki",
"1C80": "Cyrillic Extended-C",
"1C90": "Georgian Extended",
"1CC0": "Sundanese Supplement",
"1CD0": "Vedic Extensions",
"1D00": "Phonetic Extensions",
"1D80": "Phonetic Extensions Supplement",
"1DC0": "Combining Diacritical Marks Supplement",
"1E00": "Latin Extended Additional",
"1F00": "Greek Extended",
"2000": "General Punctuation",
"2070": "Superscripts and Subscripts",
"20A0": "Currency Symbols",
"20D0": "Combining Diacritical Marks for Symbols",
"2100": "Letterlike Symbols",
"2150": "Number Forms",
"2190": "Arrows",
"2200": "Mathematical Operators",
"2300": "Miscellaneous Technical",
"2400": "Control Pictures",
"2440": "Optical Character Recognition",
"2460": "Enclosed Alphanumerics",
"2500": "Box Drawing",
"2580": "Block Elements",
"25A0": "Geometric Shapes",
"2600": "Miscellaneous Symbols",
"2700": "Dingbats",
"27C0": "Miscellaneous Mathematical Symbols-A",
"27F0": "Supplemental Arrows-A",
"2800": "Braille Patterns",
"2900": "Supplemental Arrows-B",
"2980": "Miscellaneous Mathematical Symbols-B",
"2A00": "Supplemental Mathematical Operators",
"2B00": "Miscellaneous Symbols and Arrows",
"2C00": "Glagolitic",
"2C60": "Latin Extended-C",
"2C80": "Coptic",
"2D00": "Georgian Supplement",
"2D30": "Tifinagh",
"2D80": "Ethiopic Extended",
"2DE0": "Cyrillic Extended-A",
"2E00": "Supplemental Punctuation",
"2E80": "CJK Radicals Supplement",
"2F00": "Kangxi Radicals",
"2FF0": "Ideographic Description Characters",
"3000": "CJK Symbols and Punctuation",
"3040": "Hiragana",
"30A0": "Katakana",
"3100": "Bopomofo",
"3130": "Hangul Compatibility Jamo",
"3190": "Kanbun",
"31A0": "Bopomofo Extended",
"31C0": "CJK Strokes",
"31F0": "Katakana Phonetic Extensions",
"3200": "Enclosed CJK Letters and Months",
"3300": "CJK Compatibility",
"3400": "CJK Unified Ideographs Extension A",
"4DC0": "Yijing Hexagram Symbols",
"4E00": "CJK Unified Ideographs",
"A000": "Yi Syllables",
"A490": "Yi Radicals",
"A4D0": "Lisu",
"A500": "Vai",
"A640": "Cyrillic Extended-B",
"A6A0": "Bamum",
"A700": "Modifier Tone Letters",
"A720": "Latin Extended-D",
"A800": "Syloti Nagri",
"A830": "Common Indic Number Forms",
"A840": "Phags-pa",
"A880": "Saurashtra",
"A8E0": "Devanagari Extended",
"A900": "Kayah Li",
"A930": "Rejang",
"A960": "Hangul Jamo Extended-A",
"A980": "Javanese",
"A9E0": "Myanmar Extended-B",
"AA00": "Cham",
"AA60": "Myanmar Extended-A",
"AA80": "Tai Viet",
"AAE0": "Meetei Mayek Extensions",
"AB00": "Ethiopic Extended-A",
"AB30": "Latin Extended-E",
"AB70": "Cherokee Supplement",
"ABC0": "Meetei Mayek",
"AC00": "Hangul Syllables",
"D7B0": "Hangul Jamo Extended-B",
"D800": "High Surrogates",
"DB80": "High Private Use Surrogates",
"DC00": "Low Surrogates",
"E000": "Private Use Area",
"F900": "CJK Compatibility Ideographs",
"FB00": "Alphabetic Presentation Forms",
"FB50": "Arabic Presentation Forms-A",
"FE00": "Variation Selectors",
"FE10": "Vertical Forms",
"FE20": "Combining Half Marks",
"FE30": "CJK Compatibility Forms",
"FE50": "Small Form Variants",
"FE70": "Arabic Presentation Forms-B",
"FF00": "Halfwidth and Fullwidth Forms",
"FFF0": "Specials",
"10000": "Linear B Syllabary",
"10080": "Linear B Ideograms",
"10100": "Aegean Numbers",
"10140": "Ancient Greek Numbers",
"10190": "Ancient Symbols",
"101D0": "Phaistos Disc",
"10280": "Lycian",
"102A0": "Carian",
"102E0": "Coptic Epact Numbers",
"10300": "Old Italic",
"10330": "Gothic",
"10350": "Old Permic",
"10380": "Ugaritic",
"103A0": "Old Persian",
"10400": "Deseret",
"10450": "Shavian",
"10480": "Osmanya",
"104B0": "Osage",
"10500": "Elbasan",
"10530": "Caucasian Albanian",
"10600": "Linear A",
"10800": "Cypriot Syllabary",
"10840": "Imperial Aramaic",
"10860": "Palmyrene",
"10880": "Nabataean",
"108E0": "Hatran",
"10900": "Phoenician",
"10920": "Lydian",
"10980": "Meroitic Hieroglyphs",
"109A0": "Meroitic Cursive",
"10A00": "Kharoshthi",
"10A60": "Old South Arabian",
"10A80": "Old North Arabian",
"10AC0": "Manichaean",
"10B00": "Avestan",
"10B40": "Inscriptional Parthian",
"10B60": "Inscriptional Pahlavi",
"10B80": "Psalter Pahlavi",
"10C00": "Old Turkic",
"10C80": "Old Hungarian",
"10D00": "Hanifi Rohingya",
"10E60": "Rumi Numeral Symbols",
"10F00": "Old Sogdian",
"10F30": "Sogdian",
"10FE0": "Elymaic",
"11000": "Brahmi",
"11080": "Kaithi",
"110D0": "Sora Sompeng",
"11100": "Chakma",
"11150": "Mahajani",
"11180": "Sharada",
"111E0": "Sinhala Archaic Numbers",
"11200": "Khojki",
"11280": "Multani",
"112B0": "Khudawadi",
"11300": "Grantha",
"11400": "Newa",
"11480": "Tirhuta",
"11580": "Siddham",
"11600": "Modi",
"11660": "Mongolian Supplement",
"11680": "Takri",
"11700": "Ahom",
"11800": "Dogra",
"118A0": "Warang Citi",
"119A0": "Nandinagari",
"11A00": "Zanabazar Square",
"11A50": "Soyombo",
"11AC0": "Pau Cin Hau",
"11C00": "Bhaiksuki",
"11C70": "Marchen",
"11D00": "Masaram Gondi",
"11D60": "Gunjala Gondi",
"11EE0": "Makasar",
"11FC0": "Tamil Supplement",
"12000": "Cuneiform",
"12400": "Cuneiform Numbers and Punctuation",
"12480": "Early Dynastic Cuneiform",
"13000": "Egyptian Hieroglyphs",
"13430": "Egyptian Hieroglyph Format Controls",
"14400": "Anatolian Hieroglyphs",
"16800": "Bamum Supplement",
"16A40": "Mro",
"16AD0": "Bassa Vah",
"16B00": "Pahawh Hmong",
"16E40": "Medefaidrin",
"16F00": "Miao",
"16FE0": "Ideographic Symbols and Punctuation",
"17000": "Tangut",
"18800": "Tangut Components",
"1B000": "Kana Supplement",
"1B100": "Kana Extended-A",
"1B130": "Small Kana Extension",
"1B170": "Nushu",
"1BC00": "Duployan",
"1BCA0": "Shorthand Format Controls",
"1D000": "Byzantine Musical Symbols",
"1D100": "Musical Symbols",
"1D200": "Ancient Greek Musical Notation",
"1D2E0": "Mayan Numerals",
"1D300": "Tai Xuan Jing Symbols",
"1D360": "Counting Rod Numerals",
"1D400": "Mathematical Alphanumeric Symbols",
"1D800": "Sutton SignWriting",
"1E000": "Glagolitic Supplement",
"1E100": "Nyiakeng Puachue Hmong",
"1E2C0": "Wancho",
"1E800": "Mende Kikakui",
"1E900": "Adlam",
"1EC70": "Indic Siyaq Numbers",
"1ED00": "Ottoman Siyaq Numbers",
"1EE00": "Arabic Mathematical Alphabetic Symbols",
"1F000": "Mahjong Tiles",
"1F030": "Domino Tiles",
"1F0A0": "Playing Cards",
"1F100": "Enclosed Alphanumeric Supplement",
"1F200": "Enclosed Ideographic Supplement",
"1F300": "Miscellaneous Symbols and Pictographs",
"1F600": "Emoticons",
"1F650": "Ornamental Dingbats",
"1F680": "Transport and Map Symbols",
"1F700": "Alchemical Symbols",
"1F780": "Geometric Shapes Extended",
"1F800": "Supplemental Arrows-C",
"1F900": "Supplemental Symbols and Pictographs",
"1FA00": "Chess Symbols",
"1FA70": "Symbols and Pictographs Extended-A",
"20000": "CJK Unified Ideographs Extension B",
"2A700": "CJK Unified Ideographs Extension C",
"2B740": "CJK Unified Ideographs Extension D",
"2B820": "CJK Unified Ideographs Extension E",
"2CEB0": "CJK Unified Ideographs Extension F",
"2F800": "CJK Compatibility Ideographs Supplement",
"E0000": "Tags",
"E0100": "Variation Selectors Supplement",
"F0000": "Supplementary Private Use Area-A",
"100000": "Supplementary Private Use Area-B"
}
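The START/END relation described above can be computed mechanically from the table. A minimal sketch (the function name `buildBlockRanges` is hypothetical, not part of the article's code):

```php
<?php
// Derive each block's END code point from the next block's START address,
// as described above. buildBlockRanges() is a hypothetical helper name.
function buildBlockRanges(array $blocks): array
{
    $starts = array_keys($blocks);
    $ranges = [];
    foreach ($starts as $i => $start) {
        $end = isset($starts[$i + 1])
            ? hexdec($starts[$i + 1]) - 1 // next block's START - 1
            : 0x10FFFF;                   // the last block ends at U+10FFFF
        $ranges[$blocks[$start]] = [hexdec($start), $end];
    }
    return $ranges;
}

$ranges = buildBlockRanges([
    '0000' => 'Basic Latin',
    '0080' => 'Latin-1 Supplement',
    '0100' => 'Latin Extended-A',
]);
printf("U+%04X..U+%04X\n", ...$ranges['Basic Latin']); // U+0000..U+007F
```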
Example usage
$ php sample.php 'これは迷惑な記事です。'
Array
(
[Hiragana] => 6
[CJK Unified Ideographs] => 4
[CJK Symbols and Punctuation] => 1
)
$ php sample.php '이것은 성가신 기사입니다.'
Array
(
[Hangul Syllables] => 11
[Basic Latin] => 3
)
In the Japanese example above, 'これは迷惑な記事です。', you can see the string's makeup (its mix of ingredients): "Hiragana = 6 characters", "kanji (CJK Unified Ideographs) = 4 characters", and "punctuation = 1 character".
This exploits the fact that the range assigned to Hiragana is "3040..309F; Hiragana": if a character's code point (U+XXXX) falls within "U+3040 to U+309F", the character can be identified as Hiragana.
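As a minimal illustration of that range check (assuming PHP 7.2 or later for mb_ord(); the helper name `isHiragana` is made up for this example):

```php
<?php
// Classify a single character as Hiragana by testing whether its code point
// falls in the block range 3040..309F. Requires PHP >= 7.2 for mb_ord().
function isHiragana(string $char): bool
{
    $cp = mb_ord($char, 'UTF-8');
    return $cp >= 0x3040 && $cp <= 0x309F;
}

var_dump(isHiragana('こ')); // bool(true)
var_dump(isHiragana('漢')); // bool(false): CJK Unified Ideographs
```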
Sample PHP code that analyzes the character-type makeup of a string
<?php
require_once('functions.php');
$string = trim(fgets(STDIN));
print_r(getNamesBlock($string));
The functions imported above (functions.php) are shown below.
function getNamesBlock($string): array
{
    if (! is_string($string)) {
        // Returning false here would violate the :array return type
        return [];
    }
    // Convert the input into code points (the XXXX part of U+XXXX)
    $array_chars = convertStrToCodePoints($string);
    // For each character, look up the block name for its code point and count it
    $result = [];
    foreach ($array_chars as $codepoint_char) {
        if (empty(trim($codepoint_char))) {
            continue;
        }
        $name_block = findNameBlock($codepoint_char);
        $result[$name_block] = isset($result[$name_block]) ? ++$result[$name_block] : 1;
    }
    return $result;
}
function convertStrToCodePoints($string): array
{
    // HTML-entity-encode everything to obtain the code points (the XXXX of U+XXXX).
    // For 'あ' (U+3042) the entity is &#x3042;, and trimming yields 3042.
    $entity_num = mb_encode_numericentity($string, [0x0, 0x10ffff, 0, 0xffffff], 'UTF-8', true);
    $entity_num = str_replace(';', '', $entity_num);
    $results = explode('&#x', $entity_num);
    return array_filter($results);
}
function findNameBlock($code_target): string
{
    $list_block = getListBlockName();
    $name_current = reset($list_block);
    foreach ($list_block as $code_start => $name_block) {
        if (hexdec($code_target) < hexdec($code_start)) {
            return $name_current;
        }
        $name_current = $name_block;
    }
    // Code points at or beyond the last START address belong to the last block
    return $name_current;
}
function getListBlockName():array
{
/**
* Code-point block table.
* Keys are the START address of each code-point block; the END address is the
* next block's START address minus 1.
*
* @REF: http://www.unicode.org/Public/UNIDATA/Blocks.txt
**/
return [
"0000" => "Basic Latin",
"0080" => "Latin-1 Supplement",
"0100" => "Latin Extended-A",
"0180" => "Latin Extended-B",
"0250" => "IPA Extensions",
"02B0" => "Spacing Modifier Letters",
"0300" => "Combining Diacritical Marks",
"0370" => "Greek and Coptic",
"0400" => "Cyrillic",
"0500" => "Cyrillic Supplement",
"0530" => "Armenian",
"0590" => "Hebrew",
"0600" => "Arabic",
"0700" => "Syriac",
"0750" => "Arabic Supplement",
"0780" => "Thaana",
"07C0" => "NKo",
"0800" => "Samaritan",
"0840" => "Mandaic",
"0860" => "Syriac Supplement",
"08A0" => "Arabic Extended-A",
"0900" => "Devanagari",
"0980" => "Bengali",
"0A00" => "Gurmukhi",
"0A80" => "Gujarati",
"0B00" => "Oriya",
"0B80" => "Tamil",
"0C00" => "Telugu",
"0C80" => "Kannada",
"0D00" => "Malayalam",
"0D80" => "Sinhala",
"0E00" => "Thai",
"0E80" => "Lao",
"0F00" => "Tibetan",
"1000" => "Myanmar",
"10A0" => "Georgian",
"1100" => "Hangul Jamo",
"1200" => "Ethiopic",
"1380" => "Ethiopic Supplement",
"13A0" => "Cherokee",
"1400" => "Unified Canadian Aboriginal Syllabics",
"1680" => "Ogham",
"16A0" => "Runic",
"1700" => "Tagalog",
"1720" => "Hanunoo",
"1740" => "Buhid",
"1760" => "Tagbanwa",
"1780" => "Khmer",
"1800" => "Mongolian",
"18B0" => "Unified Canadian Aboriginal Syllabics Extended",
"1900" => "Limbu",
"1950" => "Tai Le",
"1980" => "New Tai Lue",
"19E0" => "Khmer Symbols",
"1A00" => "Buginese",
"1A20" => "Tai Tham",
"1AB0" => "Combining Diacritical Marks Extended",
"1B00" => "Balinese",
"1B80" => "Sundanese",
"1BC0" => "Batak",
"1C00" => "Lepcha",
"1C50" => "Ol Chiki",
"1C80" => "Cyrillic Extended-C",
"1C90" => "Georgian Extended",
"1CC0" => "Sundanese Supplement",
"1CD0" => "Vedic Extensions",
"1D00" => "Phonetic Extensions",
"1D80" => "Phonetic Extensions Supplement",
"1DC0" => "Combining Diacritical Marks Supplement",
"1E00" => "Latin Extended Additional",
"1F00" => "Greek Extended",
"2000" => "General Punctuation",
"2070" => "Superscripts and Subscripts",
"20A0" => "Currency Symbols",
"20D0" => "Combining Diacritical Marks for Symbols",
"2100" => "Letterlike Symbols",
"2150" => "Number Forms",
"2190" => "Arrows",
"2200" => "Mathematical Operators",
"2300" => "Miscellaneous Technical",
"2400" => "Control Pictures",
"2440" => "Optical Character Recognition",
"2460" => "Enclosed Alphanumerics",
"2500" => "Box Drawing",
"2580" => "Block Elements",
"25A0" => "Geometric Shapes",
"2600" => "Miscellaneous Symbols",
"2700" => "Dingbats",
"27C0" => "Miscellaneous Mathematical Symbols-A",
"27F0" => "Supplemental Arrows-A",
"2800" => "Braille Patterns",
"2900" => "Supplemental Arrows-B",
"2980" => "Miscellaneous Mathematical Symbols-B",
"2A00" => "Supplemental Mathematical Operators",
"2B00" => "Miscellaneous Symbols and Arrows",
"2C00" => "Glagolitic",
"2C60" => "Latin Extended-C",
"2C80" => "Coptic",
"2D00" => "Georgian Supplement",
"2D30" => "Tifinagh",
"2D80" => "Ethiopic Extended",
"2DE0" => "Cyrillic Extended-A",
"2E00" => "Supplemental Punctuation",
"2E80" => "CJK Radicals Supplement",
"2F00" => "Kangxi Radicals",
"2FF0" => "Ideographic Description Characters",
"3000" => "CJK Symbols and Punctuation",
"3040" => "Hiragana",
"30A0" => "Katakana",
"3100" => "Bopomofo",
"3130" => "Hangul Compatibility Jamo",
"3190" => "Kanbun",
"31A0" => "Bopomofo Extended",
"31C0" => "CJK Strokes",
"31F0" => "Katakana Phonetic Extensions",
"3200" => "Enclosed CJK Letters and Months",
"3300" => "CJK Compatibility",
"3400" => "CJK Unified Ideographs Extension A",
"4DC0" => "Yijing Hexagram Symbols",
"4E00" => "CJK Unified Ideographs",
"A000" => "Yi Syllables",
"A490" => "Yi Radicals",
"A4D0" => "Lisu",
"A500" => "Vai",
"A640" => "Cyrillic Extended-B",
"A6A0" => "Bamum",
"A700" => "Modifier Tone Letters",
"A720" => "Latin Extended-D",
"A800" => "Syloti Nagri",
"A830" => "Common Indic Number Forms",
"A840" => "Phags-pa",
"A880" => "Saurashtra",
"A8E0" => "Devanagari Extended",
"A900" => "Kayah Li",
"A930" => "Rejang",
"A960" => "Hangul Jamo Extended-A",
"A980" => "Javanese",
"A9E0" => "Myanmar Extended-B",
"AA00" => "Cham",
"AA60" => "Myanmar Extended-A",
"AA80" => "Tai Viet",
"AAE0" => "Meetei Mayek Extensions",
"AB00" => "Ethiopic Extended-A",
"AB30" => "Latin Extended-E",
"AB70" => "Cherokee Supplement",
"ABC0" => "Meetei Mayek",
"AC00" => "Hangul Syllables",
"D7B0" => "Hangul Jamo Extended-B",
"D800" => "High Surrogates",
"DB80" => "High Private Use Surrogates",
"DC00" => "Low Surrogates",
"E000" => "Private Use Area",
"F900" => "CJK Compatibility Ideographs",
"FB00" => "Alphabetic Presentation Forms",
"FB50" => "Arabic Presentation Forms-A",
"FE00" => "Variation Selectors",
"FE10" => "Vertical Forms",
"FE20" => "Combining Half Marks",
"FE30" => "CJK Compatibility Forms",
"FE50" => "Small Form Variants",
"FE70" => "Arabic Presentation Forms-B",
"FF00" => "Halfwidth and Fullwidth Forms",
"FFF0" => "Specials",
"10000" => "Linear B Syllabary",
"10080" => "Linear B Ideograms",
"10100" => "Aegean Numbers",
"10140" => "Ancient Greek Numbers",
"10190" => "Ancient Symbols",
"101D0" => "Phaistos Disc",
"10280" => "Lycian",
"102A0" => "Carian",
"102E0" => "Coptic Epact Numbers",
"10300" => "Old Italic",
"10330" => "Gothic",
"10350" => "Old Permic",
"10380" => "Ugaritic",
"103A0" => "Old Persian",
"10400" => "Deseret",
"10450" => "Shavian",
"10480" => "Osmanya",
"104B0" => "Osage",
"10500" => "Elbasan",
"10530" => "Caucasian Albanian",
"10600" => "Linear A",
"10800" => "Cypriot Syllabary",
"10840" => "Imperial Aramaic",
"10860" => "Palmyrene",
"10880" => "Nabataean",
"108E0" => "Hatran",
"10900" => "Phoenician",
"10920" => "Lydian",
"10980" => "Meroitic Hieroglyphs",
"109A0" => "Meroitic Cursive",
"10A00" => "Kharoshthi",
"10A60" => "Old South Arabian",
"10A80" => "Old North Arabian",
"10AC0" => "Manichaean",
"10B00" => "Avestan",
"10B40" => "Inscriptional Parthian",
"10B60" => "Inscriptional Pahlavi",
"10B80" => "Psalter Pahlavi",
"10C00" => "Old Turkic",
"10C80" => "Old Hungarian",
"10D00" => "Hanifi Rohingya",
"10E60" => "Rumi Numeral Symbols",
"10F00" => "Old Sogdian",
"10F30" => "Sogdian",
"10FE0" => "Elymaic",
"11000" => "Brahmi",
"11080" => "Kaithi",
"110D0" => "Sora Sompeng",
"11100" => "Chakma",
"11150" => "Mahajani",
"11180" => "Sharada",
"111E0" => "Sinhala Archaic Numbers",
"11200" => "Khojki",
"11280" => "Multani",
"112B0" => "Khudawadi",
"11300" => "Grantha",
"11400" => "Newa",
"11480" => "Tirhuta",
"11580" => "Siddham",
"11600" => "Modi",
"11660" => "Mongolian Supplement",
"11680" => "Takri",
"11700" => "Ahom",
"11800" => "Dogra",
"118A0" => "Warang Citi",
"119A0" => "Nandinagari",
"11A00" => "Zanabazar Square",
"11A50" => "Soyombo",
"11AC0" => "Pau Cin Hau",
"11C00" => "Bhaiksuki",
"11C70" => "Marchen",
"11D00" => "Masaram Gondi",
"11D60" => "Gunjala Gondi",
"11EE0" => "Makasar",
"11FC0" => "Tamil Supplement",
"12000" => "Cuneiform",
"12400" => "Cuneiform Numbers and Punctuation",
"12480" => "Early Dynastic Cuneiform",
"13000" => "Egyptian Hieroglyphs",
"13430" => "Egyptian Hieroglyph Format Controls",
"14400" => "Anatolian Hieroglyphs",
"16800" => "Bamum Supplement",
"16A40" => "Mro",
"16AD0" => "Bassa Vah",
"16B00" => "Pahawh Hmong",
"16E40" => "Medefaidrin",
"16F00" => "Miao",
"16FE0" => "Ideographic Symbols and Punctuation",
"17000" => "Tangut",
"18800" => "Tangut Components",
"1B000" => "Kana Supplement",
"1B100" => "Kana Extended-A",
"1B130" => "Small Kana Extension",
"1B170" => "Nushu",
"1BC00" => "Duployan",
"1BCA0" => "Shorthand Format Controls",
"1D000" => "Byzantine Musical Symbols",
"1D100" => "Musical Symbols",
"1D200" => "Ancient Greek Musical Notation",
"1D2E0" => "Mayan Numerals",
"1D300" => "Tai Xuan Jing Symbols",
"1D360" => "Counting Rod Numerals",
"1D400" => "Mathematical Alphanumeric Symbols",
"1D800" => "Sutton SignWriting",
"1E000" => "Glagolitic Supplement",
"1E100" => "Nyiakeng Puachue Hmong",
"1E2C0" => "Wancho",
"1E800" => "Mende Kikakui",
"1E900" => "Adlam",
"1EC70" => "Indic Siyaq Numbers",
"1ED00" => "Ottoman Siyaq Numbers",
"1EE00" => "Arabic Mathematical Alphabetic Symbols",
"1F000" => "Mahjong Tiles",
"1F030" => "Domino Tiles",
"1F0A0" => "Playing Cards",
"1F100" => "Enclosed Alphanumeric Supplement",
"1F200" => "Enclosed Ideographic Supplement",
"1F300" => "Miscellaneous Symbols and Pictographs",
"1F600" => "Emoticons",
"1F650" => "Ornamental Dingbats",
"1F680" => "Transport and Map Symbols",
"1F700" => "Alchemical Symbols",
"1F780" => "Geometric Shapes Extended",
"1F800" => "Supplemental Arrows-C",
"1F900" => "Supplemental Symbols and Pictographs",
"1FA00" => "Chess Symbols",
"1FA70" => "Symbols and Pictographs Extended-A",
"20000" => "CJK Unified Ideographs Extension B",
"2A700" => "CJK Unified Ideographs Extension C",
"2B740" => "CJK Unified Ideographs Extension D",
"2B820" => "CJK Unified Ideographs Extension E",
"2CEB0" => "CJK Unified Ideographs Extension F",
"2F800" => "CJK Compatibility Ideographs Supplement",
"E0000" => "Tags",
"E0100" => "Variation Selectors Supplement",
"F0000" => "Supplementary Private Use Area-A",
"100000" => "Supplementary Private Use Area-B",
];
}
- See it run online @ paiza.IO
TS;DR
The U+XXXX notation is called a "code point"
While reading about Unicode characters, you will come across the U+XXXX character-code notation; for example, the character 'あ' is U+3042.
Unicode calls this U+XXXX-style character code a "code point" (or, in stiffer Japanese terminology, "Unicode 符号位置", a Unicode code position).
Normally, a Unicode code point is referred to by writing "U+" followed by its hexadecimal number.
(From "Architecture and terminology", Unicode @ English Wikipedia)
To identify a character, its Unicode code position or its uniquely assigned name is used: for example, "a" is U+0061 (LATIN SMALL LETTER A) and "♪" is U+266A (EIGHTH NOTE). When a Unicode code position is written in running text, it is expressed as "U+" followed by four to six hexadecimal digits.
(From "Character set", Unicode @ Japanese Wikipedia)
Besides "code point" and "Unicode code position", the U+XXXX form is also sometimes called a UNICODE ESCAPED CHARACTER (a character "escaped to Unicode").
In practice, though, that term mostly appears in the context of *not* escaping to \uXXXX, as with the JSON_UNESCAPED_UNICODE flag option of PHP's json_encode().
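A quick sketch of that flag's effect:

```php
<?php
// Without the flag, json_encode() escapes non-ASCII characters to \uXXXX;
// with JSON_UNESCAPED_UNICODE it emits the raw UTF-8 character instead.
echo json_encode('あ'), PHP_EOL;                         // "\u3042"
echo json_encode('あ', JSON_UNESCAPED_UNICODE), PHP_EOL; // "あ"
```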
Is a code point a character code?
Is the "code point" of Unicode (and of UTF-8 in particular) a character code? Not exactly, although it certainly is "a code for a character".
Unlike ASCII, for example, the encoded character's binary value is not equal to its code point:
- bin2hex('a') gives 0x61, which matches U+0061, but
- bin2hex('あ') gives 0xe38182, which does not match U+3042.
For "a" (lowercase A), the ASCII character code is "0x61". In PHP you can confirm this with the bin2hex() function.
$ php -r "echo bin2hex('a'), PHP_EOL;"
61
This matches U+0061, the code point of "a". For the UTF-8 character "あ", however, bin2hex() does not give U+3042.
$ php -r "echo bin2hex('あ'), PHP_EOL;"
e38182
Why they differ is beyond the scope of this article; what we want is the 3042 of U+3042. Incidentally, the following articles, though about JavaScript, helped me greatly in understanding code points.
- "Character encodings and how to count 'characters' in JavaScript" @ blog.jxck.io
- "My struggles handling emoji, surrogate pairs, and combining characters correctly in JavaScript" @ Qiita
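As a side note (not used in this article's implementation, and assuming PHP 7.2 or later): if all you need is the code point itself, mb_ord() returns it directly, skipping the bin2hex() detour.

```php
<?php
// mb_ord() (PHP >= 7.2) returns the Unicode code point of a character directly.
printf("U+%04X\n", mb_ord('あ', 'UTF-8')); // U+3042
printf("U+%04X\n", mb_ord('a', 'UTF-8'));  // U+0061
```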
Unicode-escaping a UTF-8 character (converting it to a U+XXXX-style code point)
In PHP, the standard functions offer two ways to do this conversion.
- Encode with json_encode() (json_encode @ php.net: converts a value to JSON)
$ php -r "echo json_encode('あ'), PHP_EOL;"
"\u3042"
- Encode with mb_encode_numericentity() (mb_encode_numericentity @ php.net: encodes all characters as HTML numeric entities)
$ php -r "echo mb_encode_numericentity('あ', [0x0, 0x10ffff, 0, 0xffffff], 'UTF-8', true), PHP_EOL;"
&#x3042;
I recommend the latter, mb_encode_numericentity().
The reason: with a string that mixes in [a-zA-Z], such as 'あaいi', json_encode() makes the token boundaries ambiguous. Encoded and unencoded characters sit side by side, so the trimming needed to pull out the numeric codes becomes a chore.
$ php -r "echo json_encode('あaいi'), PHP_EOL;"
"\u3042a\u3044i"
As shown above, it is hard to decide between \u3042a and \u3042, because the XXXX of U+XXXX is variable-length (2 to 6 digits), not necessarily 4. (You can in fact tell the length by inspecting the leading bits, but implementing that branching is tedious; looping over the string and JSON-encoding one character at a time would also work, at a small extra cost.)
By contrast, mb_encode_numericentity() converts everything into the &#x...; format, where ... is the code point. (This format is called an "HTML numeric entity".)
$ php -r "echo mb_encode_numericentity('あaいi', [0x0, 0x10ffff, 0, 0xffffff], 'UTF-8', true), PHP_EOL;"
&#x3042;&#x61;&#x3044;&#x69;
From this output we can obtain the code point of each character in the string. Below is a sample user-defined function in PHP.
function convertStrToCodePoints($string): array
{
    // Specify the code range to convert
    $map_conv = [
        0x0,      // Start code
        0x10ffff, // End code
        0,        // Offset
        0xffffff  // Mask
    ];
    $is_hex = true;
    // Encode everything as HTML numeric entities
    $entity_num = mb_encode_numericentity($string, $map_conv, 'UTF-8', $is_hex);
    // Extract only the numbers between &#x and ; and return them as an array
    $entity_num = str_replace(';', '', $entity_num);
    $results = explode('&#x', $entity_num);
    return array_filter($results);
}
<?php
require_once('functions.php');
print_r( convertStrToCodePoints('あa') );
$ # あ => U+3042, a => U+0061
$ php sample2.php
Array
(
[1] => 3042
[2] => 61
)
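On PHP 7.2 or later, the same result can be had without the entity round trip: split the string into characters with a Unicode-aware regex and call mb_ord() on each. A sketch (the function name `strToCodePoints` is hypothetical, and note it returns integers rather than hex strings):

```php
<?php
// Alternative to convertStrToCodePoints(): split into characters with a
// Unicode-aware regex, then map each to its integer code point.
// Requires PHP >= 7.4 (arrow functions) and PHP >= 7.2 (mb_ord()).
function strToCodePoints(string $string): array
{
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(fn ($c) => mb_ord($c, 'UTF-8'), $chars);
}

print_r(strToCodePoints('あa')); // [0] => 12354 (0x3042), [1] => 97 (0x61)
```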
Once we know the code point of the target character, we compare it against the table to get the block name.
The user function below returns the block name of the matching range. The table itself comes from getListBlockName(), which returns an array of the form [start address of block => block name, ...].
function findNameBlock($code_target): string
{
    $list_block = getListBlockName();
    $name_current = reset($list_block);
    foreach ($list_block as $code_start => $name_block) {
        if (hexdec($code_target) < hexdec($code_start)) {
            return $name_current;
        }
        $name_current = $name_block;
    }
    // Code points at or beyond the last START address belong to the last block
    return $name_current;
}
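A design note: findNameBlock() rescans the whole table (roughly 300 entries) for every character. For long strings, you could split the table once into parallel arrays and binary-search the start addresses instead. A minimal sketch under that assumption (function and variable names are hypothetical):

```php
<?php
// Binary search over the sorted block START addresses: find the largest
// start <= codepoint and return the block name at that index.
function findNameBlockFast(int $codepoint, array $starts, array $names): string
{
    $lo = 0;
    $hi = count($starts) - 1;
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi + 1, 2); // bias upward so $lo always advances
        if ($starts[$mid] <= $codepoint) {
            $lo = $mid;
        } else {
            $hi = $mid - 1;
        }
    }
    return $names[$lo];
}

// Tiny excerpt of the block table, for illustration only
$starts = [0x0000, 0x0080, 0x3040, 0x30A0];
$names  = ['Basic Latin', 'Latin-1 Supplement', 'Hiragana', 'Katakana'];
echo findNameBlockFast(0x3042, $starts, $names), PHP_EOL; // Hiragana
```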
<?php
require_once('functions.php');
// Target string
$string = 'あa';
// Convert the input into code points (the XXXX of U+XXXX), one per character
$codepoints = convertStrToCodePoints($string);
// For each character, look up the block name for its code point and count it
$result = [];
foreach ($codepoints as $codepoint) {
$name_block = findNameBlock($codepoint);
$result[$name_block] = isset($result[$name_block]) ? ++$result[$name_block] : 1;
}
print_r($result);
$ # The makeup of 'あa': Hiragana = 1 character, Basic Latin = 1 character
$ php sample3.php
Array
(
[Hiragana] => 1
[Basic Latin] => 1
)
If we convert data like the above into a machine-learning-friendly format and attach a "spam" / "not spam" flag (2 classes), it looks usable for training.
Concretely, first convert the data to CSV, then replace strings such as yes/no with numeric label IDs so the table contains only numbers.
|   | Basic Latin | Hiragana | Katakana | Hangul Syllables | ... | Is Spam |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | ... | 0 |
| 2 | 14 | 120 | 0 | 0 | ... | 0 |
| 3 | 3 | 0 | 0 | 113 | ... | 1 |
| n | ... | ... | ... | ... | ... | ... |
Next, split the table into the measured data ($x) and the answer data ($y), drop the header row, and turn each into an array.
# Data (Basic latin, Hiragana, Katakana, Hangul Syllables, ...)
$x = [
[ 1, 1, 0, 0, ... ],
[ 14, 120, 0, 0, ... ],
[ 3, 0, 0,113, ... ],
...
];
# Target (Is Spam)
$y = [
[ 0 ],
[ 0 ],
[ 1 ],
...
];
Instead of the 2 classes of Is Spam above (not spam = 0, spam = 1), assigning an ID to each language ("unknown = 0, Japanese = 1, English = 2, Korean = 3, ...") and increasing the class count could probably make this usable for language detection as well.
Incidentally, spam articles on Qiita are archived by volunteers, so I hope to use that archive as a corpus (raw data for machine learning).
I am still learning, but once I have something working I would like to make it run in Docker and publish a follow-up article.