More than 5 years have passed since last update.

AV女優の特徴ってなんだろう？　作品名から推測してみた！(^_^)/~~

Last updated at 2020-07-13Posted at 2020-07-08

はじめに

みなさん、AVの作品名って気にしたことありますか？

私はふとした瞬間にある疑問が浮かびました。

「AVの作品名って、AV女優の特徴を表してるんじゃね？」
「もしそうなら、その特徴から自分のAV癖が分かるんじゃないかな？」

そう思ったら、いざ行動！
やっていきましょう

今回はワードクラウドという手法を用いて、仮説を立証していきます。
(私の好きな七沢みあさんに協力してもらいます。)

ワードクラウドとは？

「ワードクラウド」とは、文章中に現れる出現頻度の高い単語を抽出し、1枚の絵にしたものです。
ある文章がどんな傾向なのか視覚的に”パッと見”で分かるので、手っ取り早く、かつ取っつきやすい方法のひとつです。

HTML取得

import requests #webページを取得するライブラリ
from bs4 import BeautifulSoup #取得したHTMLのデータの中から、タグを読み取り、操作できるライブラリ

url = "https://ja.wikipedia.org/wiki/%E4%B8%83%E6%B2%A2%E3%81%BF%E3%81%82" #七沢みあのwikiURL
response = requests.get(url)
response.encoding = response.apparent_encoding #response.apparent_encoding に、正しい文字コードである SHIFT_JISが格納されている(文字化けを防げます)
soup = BeautifulSoup(response.text, "html.parser") #BeautifulSoup(解析対象のHTML/XML, 利用するパーサー(解析器))

# HTMLをインデントできる
print(soup.prettify())

正しくHTMLが取得できました。

作品名取得

span_list1=soup.findAll("td")
titles=[]
for i in span_list1:
    tmp=i.find("b")
    if tmp==None:
        continue
    else:
        print(tmp.text)
        titles.append(tmp.text)

上記の出力から、「!」マークや「-」マークなど今回の分析に必要ない要素が含まれているため、これから取り除きます。

クレイジング

changed_titles2=''.join(titles)
# print(changed_titles2)

# 2.前処理(英語や記号を正規表現で削除)←文字列に対して有効
import re
changed_titles2=re.sub("[a-xA-Z0-9_]","",changed_titles2)#英数字の削除
changed_titles2=re.sub("[!-/:-@[-`{-~]","",changed_titles2)#記号の削除
changed_titles2=re.sub(u"\n\n","\n",changed_titles2)#改行の削除
changed_titles2=re.sub(u"\r","",changed_titles2)#空白の削除
changed_titles2=re.sub(u"\u3000","",changed_titles2)#全角の空白を削除

これで不必要な文字が取り除けましたね。
ここから、形態素解析に入っていきます。

形態素解析

import MeCab

changed_titles2=''.join(changed_titles1) #リストから文字列にする必要があります
text = changed_titles2
m = MeCab.Tagger("-Ochasen")#テキストをパースするためのTaggerインスタンス生成

# 名詞のみを取り除いてみます
nouns = [line for line in m.parse(text).splitlines()#Taggerクラスのparseメソッドを使うと、テキストを形態素解析した結果が返る
               if "名詞" in line.split()[-1]]

for str in nouns:
    print(str.split())

nouns = [line.split()[0] for line in m.parse(text).splitlines()
               if "名詞" in line.split()[-1]]
print(nouns)

tomo=[]
dictionary={}

add_dictionary={"女子大生":4,"ノーパン":2,"レイプ":1}#正しく形態素解析できない部分を訂正
dictionary.update(add_dictionary)

for word in nouns:
    if word in dictionary:
        dictionary[word]+=1
    else:
        dictionary[word]=1
                
dictionary = sorted(dictionary.items(), key=lambda x:x[1],reverse=True)
for key,value in dictionary:
    print(key,value)
    tomo.append(key)

結果は！？

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text_new=""
for i in tomo:
    text_new = text_new + " " + i

stopwords=["七沢","何","度","日","生","ノー","パン","レ","イ","プ","〜","中","女子大"]

word_cloud=WordCloud(background_color='white',font_path=r"C:\Users\tomoh\機械学習 able\ワードクラウド\meiryo.ttc",
                     min_font_size=3,prefer_horizontal=1,stopwords=stopwords)
word_cloud.generate(text_new)

plt.figure(figsize=(10,8))
plt.imshow(word_cloud)
plt.axis("off")
plt.show()