A PGM (program) that extracts the text portion of Garu-chan (がるちゃん) pages

Posted at 2024-11-02

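I wrote a program (PGM) in PowerShell that extracts the text portion of Garu-chan pages. To use it, download the source code (HTML) of the page you want, for example a Garu-chan thread, then set the paths in the PowerShell script below as appropriate and run it.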
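Before running the script, it can help to confirm that the saved page source is readable and contains the markers the script in this article searches for (`</title>` and `body lv`). The following is a minimal sketch, not part of the original program; the function name `Test-PageSource` and the example path are assumptions made for illustration.

```powershell
# Hypothetical helper (not in the original script): inspects a saved page source.
function Test-PageSource {
    param(
        [Parameter(Mandatory)] [string] $Path
    )

    if (-not (Test-Path -LiteralPath $Path)) {
        throw "Input file not found: $Path"
    }

    # Read the file the same way the main script does (UTF-8, one string per line).
    $lines = @(Get-Content -LiteralPath $Path -Encoding UTF8)

    # Report how many lines carry the markers the main script looks for.
    [pscustomobject]@{
        LineCount    = $lines.Count
        HasTitle     = @($lines | Where-Object { $_.Contains('</title>') }).Count -gt 0
        CommentCount = @($lines | Where-Object { $_.Contains('body lv') }).Count
    }
}

# Example call (the path is a placeholder):
# Test-PageSource -Path 'C:\work\garuChan1\text1.txt'
```

If `CommentCount` comes back as 0, the file is probably not the raw page source, or the page layout differs from what the script expects.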

2024.11.03: Added a version modified for VoiceVox.


Program
# Replace the placeholder below with the folder that holds the saved page source.
$lines = Get-Content "path-to-the-saved-page-source\garuChan1\text1.txt" -Encoding UTF8

# Write-Host $lines

$cmtFlg = $false;
$outFilePath = "path-to-the-output-folder\output1.txt"


for($i=0; $i -lt $lines.Length; $i++){
    $line1 = $lines[$i]

    #write-host $line1
    #Read-Host "aaa"


    if($line1.IndexOf("</title>") -ne -1){
        $line2 = $line1
        write-host $line2
        Write-Output $line2 | Out-File  $outFilePath -Encoding utf8
        continue

    }

    if($line1.IndexOf("body lv") -ne -1){
        #Write-Host $lines[$i]
        #Read-Host "bbb"


        $line2 = $lines[$i+1];
        for($j=$i+1; $line2.IndexOf("</div>") -eq -1; $j++){
            $line2 = $lines[$j]

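            # Strip HTML tags from the line.
            $line3 = $line2 -replace '<[^>]+>',''

            # Skip lines that contain '&gt;' (presumably quoted-reply anchors).
            # Single quotes keep the text literal; the original "$gt;" expanded the
            # undefined variable $gt, so it matched any ';' instead.
            while($line3.IndexOf('&gt;') -ne -1){
                $line2 = $lines[$j]
                $line3 = $line2 -replace '<[^>]+>',''
                $j++
            }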

            
            
            $line2 = $lines[$j]
            $line4 = $line2 -replace '<[^>]+>',''
            $line4 = $line4 -replace ' ', ''

            if($line4 -match '[+-][0-9]+'){
                $line4 = ''
            }
            # '返信' means "Reply": drop the reply-link line.
            if($line4.IndexOf('返信') -ne -1){
                $line4 = ''
            }

            if($line4 -ne ''){
                if($cmtFlg -eq $false){
                    $cmtFlg = $true
                    #Write-Host "---"

                    Write-Output "---" | Out-File  $outFilePath -Append -Encoding utf8
                }
                #Write-Host $line4
                Write-Output $line4 | Out-File  $outFilePath -Append -Encoding utf8

            }


           # Read-Host "ccc"
        }

        $cmtFlg = $false
        $i = $j+1
    }
}
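
The per-line cleanup in the program above (strip tags, remove spaces, drop vote counts and reply links) can be isolated into one small function, which makes it easier to try on sample lines. This is a sketch under the same rules as the script, not a replacement for it; the function name `Convert-CommentLine` is an assumption made for illustration.

```powershell
# Hypothetical helper: applies the same cleanup rules as the script to one line.
function Convert-CommentLine {
    param(
        [string] $Line
    )

    $text = $Line -replace '<[^>]+>', ''   # strip HTML tags
    $text = $text -replace ' ', ''          # remove half-width spaces

    # Drop lines that look like vote counts such as +12 or -3.
    if ($text -match '[+-][0-9]+') { return '' }

    # Drop the reply-link line ('返信' means "Reply").
    if ($text.Contains('返信')) { return '' }

    return $text
}

# Example calls:
# Convert-CommentLine '<p>こんにちは <b>世界</b></p>'
# Convert-CommentLine '<span>+12</span>'
```

Note that the vote-count rule is a substring match, so an ordinary comment that happens to contain text such as `-5` is dropped as well. That matches the behavior of the script as posted.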

Program 2 (modified for VoiceVox)

# Replace the placeholder below with the folder that holds the saved page source.
$lines = Get-Content "path-to-the-saved-page-source\garuChan1\text1.txt" -Encoding UTF8

# Write-Host $lines

$cmtFlg = $false;
$outFilePath = "path-to-the-output-folder\output1.txt"


$turn = 0;

for($i=0; $i -lt $lines.Length; $i++){
    $line1 = $lines[$i]

    #write-host $line1
    #Read-Host "aaa"


    if($line1.IndexOf("</title>") -ne -1){
        $line2 = $line1
        write-host $line2
        Write-Output $line2 | Out-File  $outFilePath -Encoding utf8
        continue

    }

    if($line1.IndexOf("body lv") -ne -1){
        #Write-Host $lines[$i]
        #Read-Host "bbb"


        $line2 = $lines[$i+1];
        for($j=$i+1; $line2.IndexOf("</div>") -eq -1; $j++){
            $line2 = $lines[$j]

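            # Strip HTML tags from the line.
            $line3 = $line2 -replace '<[^>]+>',''

            # Skip lines that contain '&gt;' (presumably quoted-reply anchors).
            # Single quotes keep the text literal; the original "$gt;" expanded the
            # undefined variable $gt, so it matched any ';' instead.
            while($line3.IndexOf('&gt;') -ne -1){
                $line2 = $lines[$j]
                $line3 = $line2 -replace '<[^>]+>',''
                $j++
            }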

            
            
            $line2 = $lines[$j]
            $line4 = $line2 -replace '<[^>]+>',''
            $line4 = $line4 -replace ' ', ''

            if($line4 -match '[+-][0-9]+'){
                $line4 = ''
            }
            # '返信' means "Reply": drop the reply-link line.
            if($line4.IndexOf('返信') -ne -1){
                $line4 = ''
            }

            if($line4 -ne ''){
                if($cmtFlg -eq $false){
                    $cmtFlg = $true
                    #Write-Host "---"

                    Write-Output "---" | Out-File  $outFilePath -Append -Encoding utf8
                    $turn++
                }
                #Write-Host $line4

                if( ($turn % 2) -eq 0){
                    $head = "ずんだもん,"
                }else{
                    $head = "四国めたん,"
                }

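                # Split on each of the characters 、 , 。 individually. The [char[]] cast keeps
                # this behavior on PowerShell 7, where a plain string argument is treated as
                # one multi-character separator.
                $lines5 = $line4.Split([char[]]'、,。')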
                for($k=0; $k -lt $lines5.Length; $k++){
                    if($lines5[$k] -ne ''){
                        $line6 = $head + $lines5[$k]
                        Write-Output $line6 | Out-File  $outFilePath -Append -Encoding utf8
                    }
                }
            }


           # Read-Host "ccc"
        }

        $cmtFlg = $false
        $i = $j+1
    }
}
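
The VoiceVox-oriented part of Program 2 does two things: it alternates the speaker name per comment using `$turn`, and it splits each comment into short phrases at `、`, `,` and `。`. The sketch below isolates that step so it can be tried on a single string. The function name `Format-VoiceLine` is an assumption made for illustration; the speaker names and the `name,text` line shape are taken from the script above.

```powershell
# Hypothetical helper: turns one cleaned comment into "speaker,phrase" lines.
function Format-VoiceLine {
    param(
        [string] $Text,
        [int]    $Turn
    )

    # Even turns use ずんだもん, odd turns use 四国めたん (same rule as the script).
    $head = if (($Turn % 2) -eq 0) { 'ずんだもん,' } else { '四国めたん,' }

    # Split on each punctuation character, skip empty pieces, prefix the speaker.
    $Text.Split([char[]]'、,。') |
        Where-Object { $_ -ne '' } |
        ForEach-Object { $head + $_ }
}

# Example call:
# Format-VoiceLine -Text 'おはよう、今日は晴れ。' -Turn 1
```

Because `$turn` starts at 0 and is incremented when the first line of a comment is written, the first comment in the output is spoken by 四国めたん and the speakers alternate from there.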

Reference site: Powershell入門 - 14.正規表現② #初心者向け - Qiita (Introduction to PowerShell, 14: Regular Expressions, part 2; for beginners)