
Collecting song lyrics with Python and running a light analysis

Posted at 2022-05-18

Why I made this

・I was simply curious about which words appear most often in song lyrics
・I thought the results might help me write some good lyrics of my own

Code

I'll show the code first. It's rough, though…

Example code

import os

import requests
from bs4 import BeautifulSoup
import sentencepiece as spm

# Song IDs to try on Uta-Net
net = range(1, 318500)

for i in net:
    path = f"歌詞{i}.txt"
    # Skip songs that are already saved, so a rerun does not fetch them again
    if os.path.isfile(path):
        continue
    load_url = "https://www.uta-net.com/song/" + str(i) + "/"
    html = requests.get(load_url)
    soup = BeautifulSoup(html.content, "html.parser")
    # The lyrics are in the element with id="kashi_area"; stays empty if there is none
    memory = ""
    for element in soup.find_all(id="kashi_area"):
        memory = element.text
    # Turn full-width and half-width spaces into line breaks
    kasi = memory.replace("　", "\n").replace(" ", "\n")
    with open(path, "w", encoding="utf-8") as file:
        file.write(kasi)
    print(i)

# Load the trained SentencePiece model once, before the loop
sp = spm.SentencePieceProcessor()
sp.load('trained_model.model')

count = {}
for a in net:
    path = f"歌詞{a}.txt"
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as readfile:
            kasi = readfile.read()
        # Skip files that contain no lyrics
        if kasi and kasi != "[]":
            # Split the lyrics into pieces and show the result
            result = sp.EncodeAsPieces(kasi)
            print(result)
            # Count how many times each piece appears
            for c in result:
                count[c] = count.get(c, 0) + 1

# Keep only pieces that are not ASCII-only and are at least 3 characters long
new_count = {}
point = 0  # pieces that appear this many times or fewer are dropped
for d in count:
    if count[d] <= point or str(d).isascii() or len(str(d)) <= 2:
        continue
    new_count[d] = count[d]

# Sort by count, least frequent first
new_count = sorted(new_count.items(), key=lambda x: x[1])
print(new_count)

# Write one "piece: count" line per entry
with open("結果メモ.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(f"{word}: {n}" for word, n in new_count))

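How it works (?)

1 Collect lyrics from Uta-Net
2 Save them to txt files
3 Once that is done, read the files back in
4 Have an AI (a trained SentencePiece model) split the text into words
5 Count the pieces that come out
6 Output the result

That is the overall flow.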
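Steps 3 to 6 above can be sketched compactly with `collections.Counter`. This is a minimal sketch, not the code used for this article: the function name `count_pieces` and its `tokenize` parameter are hypothetical, and the whitespace tokenizer in the usage lines is only a stand-in for `sp.EncodeAsPieces` from the trained SentencePiece model.

```python
from collections import Counter
from typing import Callable, Iterable


def count_pieces(texts: Iterable[str],
                 tokenize: Callable[[str], list[str]]) -> Counter:
    """Count how often each piece appears across all lyric texts."""
    counts: Counter = Counter()
    for text in texts:
        if not text:  # skip songs whose lyrics came back empty
            continue
        counts.update(tokenize(text))
    return counts


# Stand-in tokenizer (assumption): split on whitespace.
# With the real model you would pass sp.EncodeAsPieces instead.
sample = ["あなた に 会いたい", "", "あなた なんて きっと"]
counts = count_pieces(sample, str.split)
print(counts.most_common(1))  # [('あなた', 2)]
```

Passing the tokenizer in as an argument keeps the counting logic separate from the model, so it can be tried out without `trained_model.model` on hand.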

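Current status

So far I have collected lyrics for close to 10,000 songs.

I have the word frequency ranking from when there were about 4,000 to 5,000 songs, so I'll share it here.
(Only phrases of 3 or more characters are listed.)
Rank Phrase: Number of occurrences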
5 じゃない: 1991
4 きっと: 2007
3 なんて: 2356
2 れない: 2907
1 あなた: 5923

…That's roughly how it looks.
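A ranking like the one above could be produced from the counts roughly as follows. This is a sketch under assumptions: `rank_phrases` and its parameters are hypothetical names, and the filter mirrors the rule in the code (drop ASCII-only pieces and pieces of two characters or fewer). The usage lines reuse the five figures quoted above, plus two invented entries that exist only to show the filter at work.

```python
def rank_phrases(counts: dict[str, int], top_n: int = 5,
                 min_len: int = 3) -> list[tuple[int, str, int]]:
    """Return (rank, phrase, count) rows, most frequent first."""
    kept = {
        phrase: n for phrase, n in counts.items()
        if not phrase.isascii() and len(phrase) >= min_len
    }
    ordered = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [(rank, phrase, n)
            for rank, (phrase, n) in enumerate(ordered[:top_n], start=1)]


counts = {
    "あなた": 5923, "れない": 2907, "なんて": 2356,
    "きっと": 2007, "じゃない": 1991,
    "la": 9999,  # invented entry: ASCII-only, so it is filtered out
    "君": 3000,   # invented entry: too short, so it is filtered out
}
rows = rank_phrases(counts)
for rank, phrase, n in rows:
    print(f"{rank} {phrase}: {n}")
```

Sorting with `reverse=True` puts rank 1 at the top; the listing above runs the other way because the original code sorts in ascending order.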

I'll write another article when there is a status update or a change to the program.
Thank you for reading.
