
Scraping the Nogizaka46, Keyakizaka46 & Hinatazaka46 blogs


Introduction

  • I scraped the official blogs of Nogizaka46, Keyakizaka46, and Hinatazaka46.
  • The code used for scraping is published below; feel free to use it if you are interested.
  • The language is Python, using Beautiful Soup.

Code

Python script
import pandas as pd
import re, requests
from bs4 import BeautifulSoup

import datetime
from dateutil.relativedelta import relativedelta

# nogizaka

def get_month_list( start, end ):

    start_dt = datetime.datetime.strptime(start, "%Y%m")
    end_dt = datetime.datetime.strptime(end, "%Y%m")

    lst = []
    t = start_dt
    while t <= end_dt:
        lst.append(t)
        t += relativedelta(months=1)

    return [x.strftime("%Y%m") for x in lst]

def get_nogi_articles_from_single_page( page_soup ):

    titles = [ i.find('span', class_='entrytitle').text for i in page_soup.find_all('h1', class_='clearfix') ]
    authors = [ i.text for i in page_soup.find_all('span', class_='author') ]
    datetimes = [ i.text.split('|')[0].replace(' ','').replace('\n','') for i in page_soup.find_all('div', class_='entrybottom') ]   
    texts = []
    images = []
    for i, j in enumerate( page_soup.find_all('div', class_='entrybody') ): # extract image URLs ... note: quite a few ad-hoc rules here
        image_urls = []
        for img in j.find_all('img', src=re.compile("^http(s)?://img.nogizaka46.com/blog/")):
            if img.get('src')[-4:] == '.gif':  # skip gif images
                continue
            if img.get('class') is not None and img.get('class')[0] == 'image-embed':  # skip 'image-embed' images
                continue
            image_urls.append( img.get('src') )
        image_str = ''
        for img in image_urls: image_str += '%s\t'% img
        texts.append( j.text )
        images.append( image_str )

    articles = []
    for i in range( len( titles ) ):
        articles.append( [authors[i], datetimes[i], titles[i], texts[i], images[i]] )

    return articles

headers = {'User-Agent':'Mozilla/5.0'}

start = '201111'
end = '201901'
month_list = get_month_list( start, end )

all_articles = []
for date in month_list:
    print( date )
    target_url = 'http://blog.nogizaka46.com/?p=1&d=%s' % date  # fetch the first page just to read the pagination
    r = requests.get(target_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    max_page_number = int( soup.find('div', class_='paginate').find_all('a')[-2].text.replace(u"\xa0",u"") ) # is this really robust? (second-to-last pagination link)
    for page_idx in range( 1, max_page_number+1, 1 ):
        target_url = 'http://blog.nogizaka46.com/?p=%d&d=%s' % (page_idx, date)
        r = requests.get(target_url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        page_articles = get_nogi_articles_from_single_page( soup )
        all_articles.extend( page_articles )

df = pd.DataFrame( all_articles, columns=['author', 'datetime', 'title', 'text', 'images'] )
df.to_csv( '../data/nogizaka46_blog.csv', index=0 )
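The max_page_number line above takes the second-to-last link in the paginate block, and the inline comment itself questions whether that is reliable. A sketch of a slightly more defensive alternative, tried here on a hand-written HTML fragment (the real markup may differ):

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the paginate block; the real markup may differ.
html = '''
<div class="paginate">
  <a href="?p=1">1</a> <a href="?p=2">2</a> <a href="?p=3">3</a> <a href="?p=2">&gt;</a>
</div>
'''

def get_max_page_number(soup):
    # Take the max over all numeric pagination links instead of relying on link order.
    numbers = []
    for a in soup.find('div', class_='paginate').find_all('a'):
        text = a.text.replace(u'\xa0', u'').strip()
        if text.isdigit():
            numbers.append(int(text))
    return max(numbers) if numbers else 1

print(get_max_page_number(BeautifulSoup(html, 'html.parser')))  # 3
```

This way, non-numeric navigation links ("next", ">") are ignored regardless of where they sit in the list.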


# keyakizaka

def get_keyaki_articles_from_single_page( page_soup ):

    articles = []
    for i, j in enumerate( page_soup.select('article') ):
        author = j.p.text.replace(' ','').replace('\n','')
        posted_at = j.find('div', class_="box-bottom").li.text.replace(' ','').replace('\n','')  # renamed to avoid shadowing the datetime module
        title = j.find('div', class_="box-ttl").a.text
        body = j.find('div', class_="box-article")
        text = body.text
        images = ''
        for img in body.find_all('img'): images += '%s\t' % img.get('src')
        articles.append( [author, posted_at, title, text, images] )

    return articles

max_page_number = 814  # number of member-diary list pages at the time of writing

all_articles = []
for page_idx in range( max_page_number, -1, -1 ):
    if page_idx % 100 == 0: print(page_idx)
    target_url = 'http://www.keyakizaka46.com/s/k46o/diary/member/list?ima=0000&page=%d&cd=member' % page_idx
    r = requests.get(target_url)
    soup = BeautifulSoup(r.text, 'lxml')
    page_articles = get_keyaki_articles_from_single_page( soup )
    all_articles.extend( page_articles )

df = pd.DataFrame( all_articles, columns=['author', 'datetime', 'title', 'text', 'images'] )
df.to_csv( '../data/keyakizaka46_blog.csv', index=0 )
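The selectors in get_keyaki_articles_from_single_page can be sanity-checked offline on a hand-written fragment; everything about the markup here is an assumption made to mirror those selectors:

```python
from bs4 import BeautifulSoup

# Toy fragment imitating one article block of the member-diary list page.
html = '''
<article>
  <p> member name </p>
  <div class="box-ttl"><a href="#">Hello</a></div>
  <div class="box-article">body text <img src="http://example.com/a.jpg"></div>
  <div class="box-bottom"><ul><li> 2019/01/14 12:34 </li></ul></div>
</article>
'''

art = BeautifulSoup(html, 'html.parser').select('article')[0]
author = art.p.text.replace(' ', '').replace('\n', '')
posted = art.find('div', class_='box-bottom').li.text.replace(' ', '').replace('\n', '')
title = art.find('div', class_='box-ttl').a.text
images = '\t'.join(img.get('src') for img in art.find('div', class_='box-article').find_all('img'))
print(author, posted, title, images)  # membername 2019/01/1412:34 Hello http://example.com/a.jpg
```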

A quick look at the data

Checking the contents with pandas

(screenshot: the scraped articles loaded into a pandas DataFrame)
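The screenshot did not survive this capture; a minimal sketch of that kind of check on a toy DataFrame with the same five columns (the real data would come from pd.read_csv on the csv saved above):

```python
import pandas as pd

# Toy rows with the same schema as the scraped csv.
df = pd.DataFrame(
    [['member_a', '2019/01/1412:34', 'Hello', 'body text', 'http://example.com/a.jpg\t'],
     ['member_b', '2019/01/1508:00', 'Hi', 'more text', '']],
    columns=['author', 'datetime', 'title', 'text', 'images'],
)

print(df.shape)                     # (2, 5)
print(df.head())                    # first rows, one article per row
print(df['author'].value_counts())  # how many posts each member has
```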

Plotting the number of posts per member

Nogizaka46

(plot: posts per Nogizaka46 member)

Keyakizaka46 & Hiragana Keyakizaka46

(plot: posts per Keyakizaka46 / Hiragana Keyakizaka46 member)
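A sketch of how such a posts-per-member bar chart can be produced, assuming matplotlib is installed (toy data here; in practice df would be read back from the saved csv):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs without a display
import pandas as pd

# Toy data; in practice: df = pd.read_csv('../data/nogizaka46_blog.csv')
df = pd.DataFrame({'author': ['a', 'a', 'b', 'a', 'c', 'b']})

counts = df['author'].value_counts()  # posts per member, descending
ax = counts.plot(kind='barh')
ax.set_xlabel('number of posts')
ax.figure.savefig('posts_per_member.png')
print(counts.to_dict())  # {'a': 3, 'b': 2, 'c': 1}
```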

Analyses using the collected blogs

Building a relationship chart of Keyakizaka46 members by analyzing their blogs

  • I analyzed the Keyakizaka46 members' official blogs, extracted relationships between members, and built a relationship chart.
  • (image: Keyakizaka46 relationship chart)

Building a relationship chart of Hinatazaka46 members by analyzing their blogs

  • (image: Hinatazaka46 relationship chart)

Learning word embeddings from the Nogizaka46, Keyakizaka46, and Hiragana Keyakizaka46 blogs

  • I trained word embeddings (distributed word representations) on the Nogizaka46, Keyakizaka46, and Hiragana Keyakizaka46 blogs.
  • (image: visualization of the learned word embeddings)

Revision history

  • 2019/01/14: removed the csv files from the GitHub repository and removed the download links.
myaun
Into social media analysis and natural language processing / Kaggle Expert https://kaggle.com/haradataman
https://www.notion.so/myaun/Blog-75da3e178e924838b56c90e2148760ae