概要
現代は、小中学生のタブレットやPCなどの利用率は、なんと9割にも上るとのこと。そのため、WEBのコンテンツも小中学生向けに作っていくのは、とても需要がありますよね!
今回は、形態素解析(igo-php
)を利用して、小学生の学年にあわせて習得漢字以外をひらがなに変換にトライしたいと思います。
メリットしては、以下のような感じかなと思います。
- Webやアプリのコンテンツの漢字を小学校の学年に合わせた内容に表示変更できる
- 学年別にバラバラに作成していたテキストを、1つ作成して変換するだけで横展開できる。
使用する言語やライブラリ
- PHP: 8.1.17
- igo-php: 0.2.1
使用する漢字について
文部科学省の「別表 学年別漢字配当表」に掲載されている漢字を使わせていただきました。
プログラム
「漢字取得ロジック」と「形態素解析でひらがな変換ロジック」に分けて説明します。
漢字取得ロジック
とりあえず、文部科学省に掲載されている漢字を学年毎に配列
で分けました。
// 1年生(80字)
$grade_1 = ["一","右","雨","円","王","音","下","火","花","貝","学","気","九","休","玉","金","空","月","犬","見","五","口","校","左","三","山","子","四","糸","字","耳","七","車","手","十","出","女","小","上","森","人","水","正","生","青","夕","石","赤","千","川","先","早","草","足","村","大","男","竹","中","虫","町","天","田","土","二","日","入","年","白","八","百","文","木","本","名","目","立","力","林","六"];
// 2年生(160字)
$grade_2 = ["引","羽","雲","園","遠","何","科","夏","家","歌","画","回","会","海","絵","外","角","楽","活","間","丸","岩","顔","汽","記","帰","弓","牛","魚","京","強","教","近","兄","形","計","元","言","原","戸","古","午","後","語","工","公","広","交","光","考","行","高","黄","合","谷","国","黒","今","才","細","作","算","止","市","矢","姉","思","紙","寺","自","時","室","社","弱","首","秋","週","春","書","少","場","色","食","心","新","親","図","数","西","声","星","晴","切","雪","船","線","前","組","走","多","太","体","台","地","池","知","茶","昼","長","鳥","朝","直","通","弟","店","点","電","刀","冬","当","東","答","頭","同","道","読","内","南","肉","馬","売","買","麦","半","番","父","風","分","聞","米","歩","母","方","北","毎","妹","万","明","鳴","毛","門","夜","野","友","用","曜","来","里","理","話"];
// 3年生(200字)
$grade_3 = ["悪","安","暗","医","委","意","育","員","院","飲","運","泳","駅","央","横","屋","温","化","荷","界","開","階","寒","感","漢","館","岸","起","期","客","究","急","級","宮","球","去","橋","業","曲","局","銀","区","苦","具","君","係","軽","血","決","研","県","庫","湖","向","幸","港","号","根","祭","皿","仕","死","使","始","指","歯","詩","次","事","持","式","実","写","者","主","守","取","酒","受","州","拾","終","習","集","住","重","宿","所","暑","助","昭","消","商","章","勝","乗","植","申","身","神","真","深","進","世","整","昔","全","相","送","想","息","速","族","他","打","対","待","代","第","題","炭","短","談","着","注","柱","丁","帳","調","追","定","庭","笛","鉄","転","都","度","投","豆","島","湯","登","等","動","童","農","波","配","倍","箱","畑","発","反","坂","板","皮","悲","美","鼻","筆","氷","表","秒","病","品","負","部","服","福","物","平","返","勉","放","味","命","面","問","役","薬","由","油","有","遊","予","羊","洋","葉","陽","様","落","流","旅","両","緑","礼","列","練","路","和"];
// 4年生(200字)
$grade_4 = ["愛","案","以","衣","位","囲","胃","印","英","栄","塩","億","加","果","貨","課","芽","改","械","害","街","各","覚","完","官","管","関","観","願","希","季","紀","喜","旗","器","機","議","求","泣","救","給","挙","漁","共","協","鏡","競","極","訓","軍","郡","径","型","景","芸","欠","結","建","健","験","固","功","好","候","航","康","告","差","菜","最","材","昨","札","刷","殺","察","参","産","散","残","士","氏","史","司","試","児","治","辞","失","借","種","周","祝","順","初","松","笑","唱","焼","象","照","賞","臣","信","成","省","清","静","席","積","折","節","説","浅","戦","選","然","争","倉","巣","束","側","続","卒","孫","帯","隊","達","単","置","仲","貯","兆","腸","低","底","停","的","典","伝","徒","努","灯","堂","働","特","得","毒","熱","念","敗","梅","博","飯","飛","費","必","票","標","不","夫","付","府","副","粉","兵","別","辺","変","便","包","法","望","牧","末","満","未","脈","民","無","約","勇","要","養","浴","利","陸","良","料","量","輪","類","令","冷","例","歴","連","老","労","録"];
// 5年生(185字)
$grade_5 = ["圧","移","因","永","営","衛","易","益","液","演","応","往","桜","恩","可","仮","価","河","過","賀","快","解","格","確","額","刊","幹","慣","眼","基","寄","規","技","義","逆","久","旧","居","許","境","均","禁","句","群","経","潔","件","券","険","検","限","現","減","故","個","護","効","厚","耕","鉱","構","興","講","混","査","再","災","妻","採","際","在","財","罪","雑","酸","賛","支","志","枝","師","資","飼","示","似","識","質","舎","謝","授","修","述","術","準","序","招","承","証","条","状","常","情","織","職","制","性","政","勢","精","製","税","責","績","接","設","舌","絶","銭","祖","素","総","造","像","増","則","測","属","率","損","退","貸","態","団","断","築","張","提","程","適","敵","統","銅","導","徳","独","任","燃","能","破","犯","判","版","比","肥","非","備","俵","評","貧","布","婦","富","武","復","複","仏","編","弁","保","墓","報","豊","防","貿","暴","務","夢","迷","綿","輸","余","預","容","略","留","領"];
// 6年生(181字)
$grade_6 = ["異","遺","域","宇","映","延","沿","我","灰","拡","革","閣","割","株","干","巻","看","簡","危","机","揮","貴","疑","吸","供","胸","郷","勤","筋","系","敬","警","劇","激","穴","絹","権","憲","源","厳","己","呼","誤","后","孝","皇","紅","降","鋼","刻","穀","骨","困","砂","座","済","裁","策","冊","蚕","至","私","姿","視","詞","誌","磁","射","捨","尺","若","樹","収","宗","就","衆","従","縦","縮","熟","純","処","署","諸","除","将","傷","障","城","蒸","針","仁","垂","推","寸","盛","聖","誠","宣","専","泉","洗","染","善","奏","窓","創","装","層","操","蔵","臓","存","尊","宅","担","探","誕","段","暖","値","宙","忠","著","庁","頂","潮","賃","痛","展","討","党","糖","届","難","乳","認","納","脳","派","拝","背","肺","俳","班","晩","否","批","秘","腹","奮","並","陛","閉","片","補","暮","宝","訪","亡","忘","棒","枚","幕","密","盟","模","訳","郵","優","幼","欲","翌","乱","卵","覧","裏","律","臨","朗","論"];
下記が学年を指定をすると、習得した漢字を配列で返してくれます。
4年生を指定すると、1年生から4年生までを返します。
(雑に作成してしまったので、もっと最適化できる気がします)
/*
* 学年指定で習得漢字の一覧を返す
*
* @param int $grade 学年指定
* @return array 習得漢字
*/
function getKanji($grade) {
global $grade_1, $grade_2, $grade_3, $grade_4, $grade_5, $grade_6;
$result = array();
if ($grade === 1) {
$result = array_merge($result,$grade_1);
} else if ($grade === 2) {
$result = array_merge($result,$grade_1);
$result = array_merge($result,$grade_2);
} else if ($grade === 3) {
$result = array_merge($result,$grade_1);
$result = array_merge($result,$grade_2);
$result = array_merge($result,$grade_3);
} else if ($grade === 4) {
$result = array_merge($result,$grade_1);
$result = array_merge($result,$grade_2);
$result = array_merge($result,$grade_3);
$result = array_merge($result,$grade_4);
} else if ($grade === 5) {
$result = array_merge($result,$grade_1);
$result = array_merge($result,$grade_2);
$result = array_merge($result,$grade_3);
$result = array_merge($result,$grade_4);
$result = array_merge($result,$grade_5);
} else if ($grade === 6) {
$result = array_merge($result,$grade_1);
$result = array_merge($result,$grade_2);
$result = array_merge($result,$grade_3);
$result = array_merge($result,$grade_4);
$result = array_merge($result,$grade_5);
$result = array_merge($result,$grade_6);
}
return $result;
}
形態素解析でひらがな変換ロジック
igo-php
で分割した文字単位で、すべての文字が習得済みの場合に漢字で表示。一つでも未収得の漢字があれば、ひらがなにしています。
igo-php
で分割した文字を一つずつ判定する必要があるのですが、送りがなが付くような文字を判定するために、漢字の判定する関数を作成しています。
require 'vendor/autoload.php';
$igo = new Igo\Tagger();
/*
* 漢字の判定
*
* @param str $s 文字1文字
* @return boolern 漢字:true 漢字以外:false
*/
function isKanji($s) {
return preg_match( '/^[一-龠]$/u', $s) === 1 ? true : false;
}
// 文章指定
$text = "祇園精舎の鐘の声、諸行無常の響きあり。娑羅双樹の花の色、盛者必衰の理をあらはす。";
// igo-phpで文章を分割
$igoData = $igo->parse($text);
// 学年指定で漢字取得
$kanjiArray = getKanji(1);
//分割された文字単位で修得漢字を判定し、文章の生成
$resultText = "";
foreach ($igoData as $value) {
$feature = $value->feature;
$learningFlg = true;
// 1文字ずづ取り出し
foreach (mb_str_split($feature[6]) as $str) {
//漢字かどうかを判別
if (isKanji($str)) {
//習得漢字に含まれているかどうかをチェック
if (array_search($str, $kanjiArray) === false) {
$learningFlg = false;//1単語の中に未習得の漢字が含まれていたらfalseを指定
}
}
}
// テキストの結合
if ($learningFlg) {
$resultText .= $value->surface; //漢字
} else {
$resultText .= $feature[7]; //カナ
}
}
//カタカナをひらがなに変換して表示
print_r(mb_convert_kana($resultText, "c", "utf-8"));
結果
漢字が多めサンプルとして、「平家物語」の冒頭を使いました。(「平家物語」は、中学二年生で習うそうです)
問題点がいくつかありますが、2年生や6年生で漢字が増えていて楽しいですね。
学年 | 文章 |
---|---|
原文 | 祇園精舎の鐘の声、諸行無常の響きあり。娑羅双樹の花の色、盛者必衰の理をあらはす。 |
1年生 | ぎおんしょうじゃのかねのこえ、しょぎょうむじょうのひびきあり。娑羅そうじゅの花のいろ、じょうしゃ必衰のりをあらはす。 |
2年生 | ぎおんしょうじゃのかねの声、しょぎょうむじょうのひびきあり。娑羅そうじゅの花の色、じょうしゃ必衰の理をあらはす。 |
3年生 | ぎおんしょうじゃのかねの声、しょぎょうむじょうのひびきあり。娑羅そうじゅの花の色、じょうしゃ必衰の理をあらはす。 |
4年生 | ぎおんしょうじゃのかねの声、しょぎょうむじょうのひびきあり。娑羅そうじゅの花の色、じょうしゃ必衰の理をあらはす。 |
5年生 | ぎおん精舎のかねの声、しょぎょうむじょうのひびきあり。娑羅そうじゅの花の色、じょうしゃ必衰の理をあらはす。 |
6年生 | ぎおん精舎のかねの声、諸行無常のひびきあり。娑羅そう樹の花の色、盛者必衰の理をあらはす。 |
課題
結果からお気付きかと思いますが、いくつか課題があります。
課題1. 「娑羅」や「必衰」がひらがなになっていない
igo-php
のオブジェクトを確認してみると、feature[7]
が*
になっており、igo-php
で対応していない漢字のようです。ここは違う実現方法を探す必要がありそうです。
[0] => Igo\Morpheme Object
(
[surface] => 祇園
[feature] => Array
(
[0] => 名詞
[1] => 固有名詞
[2] => 地域
[3] => 一般
[4] => *
[5] => *
[6] => 祇園
[7] => ギオン
[8] => ギオン
)
[start] => 0
)
[12] => Igo\Morpheme Object
(
[surface] => 娑羅
[feature] => Array
(
[0] => 名詞
[1] => 一般
[2] => *
[3] => *
[4] => *
[5] => *
[6] => *
)
[start] => 19
)
課題2. 「理」を「ことわり」と読めていない
こちらも課題1と同様にigo-php
で対応していないのが原因のようです。
igo-php
のオブジェクトを確認してみると、「ことわり」ではなく「り」になっていました。
違う実現方法を探す必要がありそうです。
[23] => Igo\Morpheme Object
(
[surface] => 理
[feature] => Array
(
[0] => 名詞
[1] => 一般
[2] => *
[3] => *
[4] => *
[5] => *
[6] => 理
[7] => リ
[8] => リ
)
[start] => 33
)
課題3. 実は、カタカナがひらがなに変換されてしまう
処理の最後に、下記のように「カタカナをひらがなに変換」しているのですが、「コンピューター」が「こんぴゅーたー」になります。カタカナのチェック処理を入れると何とかなりそうです。
//カタカナをひらがなに変換して表示
print_r(mb_convert_kana($resultText, "c", "utf-8"));
おわりに
今回は、「形態素解析を利用して、小学生の習得漢字以外をひらがなに変換」について説明しました。
「PHP強化月間」の記事投稿キャンペーンということもあってPHPで実装してみました。
また、半日程度の開発だったので課題も沢山あるのと応用すると色々なサービスが作れそうなので、時間があるときにでも更新していこうかと思います。
Githubに公開してますので、興味があれば見てみてください。(プルリクもお待ちしております)
以上。ここまで読んでいただきありがとうございました。