[PowerShell] Getting the reading (yomigana) of a string

Posted at 2020-09-19, last updated at 2020-09-19

What I built

> "納豆（遺伝子組み換えでない）"|Get-ReadingWithSudachi|fl

Line     : 納豆（遺伝子組み換えでない）
Reading  : ナットウ（イデンシクミカエデナイ）
Tokenize : 納豆(ナットウ)/（/遺伝子(イデンシ)/組み換え(クミカエ)/で/ない/）
Markup   : <p><ruby>納豆<rt>ナットウ</rt></ruby>（<ruby>遺伝子<rt>イデンシ</rt></ruby>
           <ruby>組み換え<rt>クミカエ</rt></ruby>でない）</p>

Code

Environment:

> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

The function calls the SudachiPy-based morphological analysis that I wrote about earlier (see "[PowerShell] Morphological analysis with SudachiPy").

function Get-ReadingWithSudachi {
    param (
        [switch]$readingOnly,
        [switch]$ignoreParen
    )
    $ret = New-Object System.Collections.ArrayList
    $tokenizedResults = $input | Invoke-SudachiTokenizer -ignoreParen:$ignoreParen
    foreach ($result in $tokenizedResults) {
        $reading = New-Object System.Text.StringBuilder
        $tokenize = New-Object System.Collections.ArrayList
        $markup = New-Object System.Collections.ArrayList

        foreach ($token in $result.parsed) {

            $tokenSurface = $token.surface
            # Symbols, whitespace, katakana, and alphanumerics need no reading: use the surface as-is.
            if ($token.pos -match "記号|空白" -or $tokenSurface -match "^([ァ-ヴ・ー]|[a-zA-Zａ-ｚＡ-Ｚ]|[0-9０-９]|[\W\s])+$") {
                $tokenReading = $tokenSurface
                $tokenInfo = $tokenSurface
                $tokenMarkup = $tokenSurface
            }
            elseif (-not $token.reading) {
                # No reading available: keep the surface and flag it with "(?)" in Tokenize.
                $tokenReading = $tokenSurface
                $tokenInfo = "$($tokenSurface)(?)"
                $tokenMarkup = $tokenSurface
            }
            else {
                # Hiragana-only tokens stay unannotated; all others get the reading attached.
                $tokenReading = $token.reading
                $tokenInfo = ($tokenSurface -match "^[ぁ-ん]+$")?
                    $tokenSurface :
                    "$($tokenSurface)($tokenReading)"
                $tokenMarkup = ($tokenSurface -match "^[ぁ-ん]+$")?
                    $tokenSurface :
                    "<ruby>{0}<rt>{1}</rt></ruby>" -f $tokenSurface, $tokenReading
            }
            $reading.Append($tokenReading) > $null
            $tokenize.Add($tokenInfo) > $null
            $markup.Add($tokenMarkup) > $null
        }

        $ret.Add([PSCustomObject]@{
            Line = $result.line
            Reading = $reading.ToString()
            Tokenize = $tokenize -join "/"
            Markup = "<p>{0}</p>" -f ($markup -join "")
        }) > $null

    }

    return ($readingOnly)? $ret.reading : $ret
}
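
The branching inside the loop depends mostly on two regular expressions applied to the token surface. As a rough sketch for experimenting with them in isolation, the snippet below reuses the same two patterns in a small helper. `Get-SurfaceKind` and its return labels are hypothetical names introduced here for illustration; the helper ignores the part-of-speech check and the missing-reading case, so it only approximates what the function above does.

```powershell
# Hypothetical helper (not part of the original function): reports which
# branch a surface string would take, judging by the surface alone.
# The $token.pos check and the "no reading" case are deliberately ignored.
function Get-SurfaceKind {
    param ([string]$Surface)

    # Same pattern as in Get-ReadingWithSudachi: the surface consists only of
    # katakana, Latin letters, digits, or non-word characters.
    if ($Surface -match "^([ァ-ヴ・ー]|[a-zA-Zａ-ｚＡ-Ｚ]|[0-9０-９]|[\W\s])+$") {
        return "PassThrough"
    }
    # Hiragana-only surfaces are left bare in Tokenize and Markup.
    if ($Surface -match "^[ぁ-ん]+$") {
        return "HiraganaOnly"
    }
    # Everything else (typically a surface containing kanji) is annotated.
    return "Annotated"
}

# Try a few surfaces taken from the sample sentence.
"納豆", "（", "で", "組み換え", "PowerShell" | ForEach-Object {
    "{0} -> {1}" -f $_, (Get-SurfaceKind $_)
}
```

Note that a hiragana-only token still contributes its katakana reading to the Reading property in the real function; only Tokenize and Markup leave it unannotated.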

HTML markup

Occasionally the analyzer fails on a technical term.

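With one or two lines I can check the result by eye, but that stops being practical when processing several hundred lines, so I made the function emit HTML markup in a property named Markup.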

(cat hogehoge.txt |Get-ReadingWithSudachi).markup|Out-File hogehoge.html

Converting the output to HTML as above and checking it in a browser should reduce the number of oversights somewhat... or so I like to believe.
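
The file written above contains only a sequence of `<p>` fragments. In PowerShell 7, `Out-File` writes UTF-8 without a BOM by default, so a browser may have to guess the encoding. One possible refinement, sketched here under that assumption, is to wrap the fragments in a minimal page that declares its charset. `ConvertTo-RubyHtmlPage` and its parameters are hypothetical names for this sketch, not part of the original script.

```powershell
# Hypothetical helper: wraps <p>...</p> fragments in a minimal HTML page
# that declares UTF-8, so the browser does not have to guess the encoding.
function ConvertTo-RubyHtmlPage {
    param (
        [string[]]$Fragment,
        [string]$Title = "Reading check"
    )
    # Escape the title so characters such as < or & cannot break the page.
    $safeTitle = [System.Net.WebUtility]::HtmlEncode($Title)
    $lines = @(
        '<!DOCTYPE html>'
        '<html lang="ja">'
        '<head>'
        '<meta charset="utf-8">'
        "<title>$safeTitle</title>"
        '</head>'
        '<body>'
    ) + $Fragment + @(
        '</body>'
        '</html>'
    )
    return $lines -join "`n"
}

# Demo with a literal fragment in the shape of the Markup property.
$sample = '<p><ruby>納豆<rt>ナットウ</rt></ruby></p>'
ConvertTo-RubyHtmlPage -Fragment $sample
```

With the real function available, the Markup values could be passed as `-Fragment` and the result piped to `Out-File hogehoge.html`.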