
Scraping works from Aozora Bunko (青空文庫)

Posted at 2019-05-19

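Motivation

I decided to try automatic text generation as a way to put what I have learned about machine learning into practice.
Collecting the data for it by scraping Aozora Bunko took a fair amount of time, so I am leaving these notes.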

Tools used

### Code
Save the data as hoge.html.

book_id = "937"
# Notebook shell command; the URL is quoted so the shell does not treat "?" as a glob character
!wget "http://pubserver2.herokuapp.com/api/v0.1/books/{book_id}/content?format=html" -O hoge.html
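Outside a notebook the `!wget` line is not available, so the same download can be sketched with the standard library. This is only an illustration: the endpoint is the one used above, while `API_ROOT`, `content_url`, and `download` are hypothetical names introduced here, and the sketch does not check whether the service is still reachable.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Endpoint taken from the wget command in the article.
API_ROOT = "http://pubserver2.herokuapp.com/api/v0.1/books"


def content_url(book_id: str, fmt: str = "html") -> str:
    """Build the content URL for one work ID (hypothetical helper)."""
    return f"{API_ROOT}/{book_id}/content?{urlencode({'format': fmt})}"


def download(book_id: str, dest: str = "hoge.html") -> str:
    """Save one work to dest and return the path. Needs network access."""
    path, _headers = urlretrieve(content_url(book_id), dest)
    return path


# Only the URL is built here; call download("937") to actually fetch the file.
print(content_url("937"))
```

Keeping the URL construction separate from the download makes it easy to loop over many work IDs later.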

Load the file with BeautifulSoup.
Note that opening it with any encoding other than shift_jis raises an error.

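# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open as shift_jis and name the parser explicitly to avoid the "no parser specified" warning
with open("hoge.html", encoding="shift_jis") as f:
    soup = BeautifulSoup(f, "html.parser")

# find(tag name, attributes, ...): a string as the second argument matches the CSS class
# Extract the <div> that holds the body text
main_text = soup.find("div", "main_text").text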

print(main_text)
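The encoding caveat above can be made a little more forgiving by decoding the raw bytes yourself. The following is a minimal sketch, assuming that `cp932` (the Microsoft superset of Shift_JIS) is an acceptable second attempt; `decode_html` is a hypothetical helper, not part of BeautifulSoup.

```python
def decode_html(raw: bytes, encodings=("shift_jis", "cp932")):
    """Return (text, encoding) for the first encoding that decodes without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"could not decode input with any of {encodings}")


# Round trip on a small sample instead of a downloaded file.
text, used = decode_html("とんぼ".encode("shift_jis"))
print(used, text)
```

For the file saved earlier, the bytes would come from `open("hoge.html", "rb").read()`, and the decoded text can be handed to `BeautifulSoup` in place of the file object.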

Remove ruby annotations (e.g. the reading in 蜻蛉(とんぼ)) and full-width spaces.

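import re
# Remove parenthesised readings; both ASCII () and full-width （） parentheses are matched
result = re.sub(r'[(（][0-9a-zぁ-ん]*[)）]', "", main_text)
# Remove full-width spaces (U+3000)
result2 = re.sub('\u3000', "", result)
print(result2)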
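To see what the clean-up does without downloading anything, here is a small self-contained sketch on sample strings. It keeps the article's character class (digits, lowercase ASCII letters, hiragana); `strip_ruby` and the sample sentences are made up for illustration.

```python
import re

# A reading enclosed in ASCII or full-width parentheses.
RUBY_PATTERN = re.compile(r"[(（][0-9a-zぁ-ん]*[)）]")


def strip_ruby(text: str) -> str:
    """Drop parenthesised readings, then full-width spaces (U+3000)."""
    without_ruby = RUBY_PATTERN.sub("", text)
    return without_ruby.replace("\u3000", "")


print(strip_ruby("\u3000蜻蛉（とんぼ）が飛ぶ。"))  # 蜻蛉が飛ぶ。
print(strip_ruby("愛（アイ）"))  # unchanged: katakana is outside ぁ-ん
```

As the second call shows, readings written in katakana are left in place by this character class, so it may need widening depending on the text.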

### Next steps
Fetch the data using the work IDs.

# Goal: scrape the work IDs of Yumeno Kyusaku (夢野久作)
# 154 works
from bs4 import BeautifulSoup
!wget https://www.aozora.gr.jp/index_pages/person96.html -O index.html
with open("index.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
# Text of the first <ol> on the page
ol = soup.find("ol").text

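import re
# Parenthesised fragments that end in digits (exploratory; not used below)
num1 = re.findall(r'(\(.+[0-9]+\))+', ol)
# Every run of exactly three digits. A longer digit run yields only three-digit chunks,
# so IDs with four or five digits lose their trailing digits; this likely explains
# the repeated values in the output.
num2 = re.findall(r'[0-9]{3}', ol)
print(num2)
print(len(num2))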

#['937', '238', '467', '914', '531', '237', '467', '919', '467', '466', '466', '214', '467', '210', '211', '212', '467', '928', '467', '929', '930', '932', '466', '466', '212', '449', '209', '466', '468', '940', '921', '213', '467', '922', '112', '923', '223', '466', '467', '467', '456', '107', '212', '211', '214', '108', '924', '230', '237', '437', '213', '209', '210', '213', '477', '934', '466', '238', '210', '212', '209', '920', '935', '210', '211', '212', '936', '112', '212', '230', '211', '213', '466', '210', '230', '466', '213', '210', '467', '467', '467', '938', '213', '213', '213', '214', '209', '466', '467', '467', '939', '466', '209', '467', '467', '209', '212', '211', '467', '210', '211', '467', '444', '214', '467', '467', '210', '467', '214', '467', '211', '918', '915', '238', '210', '105', '230', '917', '916', '467', '467', '467', '467', '467', '467', '237', '926', '211', '913', '467', '211', '927', '110', '212', '212', '209', '467', '467', '111', '467', '210', '467', '942', '214', '933', '467', '212', '925', '112', '211', '213', '941', '213', '931']
#154
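The many repeated values in this output (for example '467' and '212') suggest that longer IDs are being cut to three digits. Below is a minimal sketch of one way to capture whole IDs instead, assuming each list entry carries a label of the form `作品ID：<digits>`. That label format, the placeholder titles, and the IDs 2093 and 46712 are assumptions made for illustration; only 937 comes from the article.

```python
import re

# Hypothetical list text in the assumed "作品ID：<digits>" format.
sample = (
    "作品A（新字新仮名、作品ID：937）\n"
    "作品B（新字新仮名、作品ID：2093）\n"
    "作品C（新字新仮名、作品ID：46712）\n"
)

# The article's pattern: every run of exactly three digits.
three_digit = re.findall(r"[0-9]{3}", sample)

# Alternative: capture the full digit run that follows the label.
full_ids = re.findall(r"作品ID[:：]\s*([0-9]+)", sample)

print(three_digit)  # ['937', '209', '467']
print(full_ids)     # ['937', '2093', '46712']
```

Anchoring on the label also avoids picking up unrelated numbers that happen to appear in titles.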