
Web Scraping with Excel: Saving Pages as Text Data

Last updated at 2021-09-28, posted at 2021-09-26

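Nice to meet you, I'm Gan-chan.
This is my first post on Qiita!

Let me start with a short self-introduction.
My nickname is Gan-chan, and I'm a university student who loves making things and anime.
(Student as of 2021.)
I'm practically a beginner at programming, so please go easy on me.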

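Overview

I recently had a chance to study web scraping, so I am posting part of what I built.

The URLs to scrape are stored in an Excel file, and the script scrapes them one by one. The content of each URL is written out as a text file. The language is Python.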

Argument handling

Before scraping, the script reads a command-line argument so that it knows which Excel file to load.

> python hp_scraping.py List_URLs

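The third string on the command line (List_URLs here) is the name of the Excel file to use, given without the .xlsx extension. The script reads it as args[1].

The code that handles the command-line argument is shown below.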

hp_scraping.py (excerpt)

import sys

args = sys.argv

# The Excel file name is required as the first argument.
if len(args) < 2:
    print("Missing argument!")
    print("Please pass the Excel file name as args[1]!!")
    sys.exit(1)

excel_name = args[1]
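
As an aside, the same check could also be written with the standard library's argparse, which prints a usage message on its own when the argument is missing. This is only a minimal sketch, not part of the original script: the function name `parse_args` and the argument name `excel_name` are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Scrape the URLs listed in an Excel workbook."
    )
    # Positional argument: the workbook name without the .xlsx extension,
    # because the full script appends '.xlsx' itself.
    parser.add_argument("excel_name", help="Excel file name without .xlsx")
    return parser.parse_args(argv)


# Passing an explicit list stands in for: python hp_scraping.py List_URLs
args = parse_args(["List_URLs"])
print(args.excel_name)
```

With this version, running the script with no argument exits with a usage error instead of the hand-written messages.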

Web scraping with Excel

The full processing is shown below.
This time, the script scrapes the text data inside the "body" element of each page.

hp_scraping.py

import requests 
from bs4 import BeautifulSoup
import sys
import openpyxl

args = sys.argv

# The Excel file name is required as the first argument.
if len(args) < 2:
    print("Missing argument!")
    print("Please pass the Excel file name as args[1]!!")
    sys.exit(1)

excel_name = args[1]

# Load the workbook
book = openpyxl.load_workbook(excel_name + '.xlsx')
# Get the sheet
sheet = book['Sheet_list']

# Counter used to tell name cells from URL cells
num = 1

for rows in sheet.iter_rows(min_row=2):
    for cell in rows:
        if num % 2 != 0:
            # Name used for the output file
            name = cell.value
            #print(name)
            num += 1
        elif num % 2 == 0:
            # Get the URL
            url = cell.value
            if 'http' not in str(url):
                sys.exit()
            # Run the scraping
            html = requests.get(url)
            soup = BeautifulSoup(html.content, "html.parser")
            a = soup.find("body").text
            # Write the result to a text file
            with open("res_" + name + ".txt", mode='wb') as f:
                b = a.encode('cp932', 'ignore')
                f.write(b)
            num -= 1
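
The loop above tells names and URLs apart with the counter `num`, and it stops the whole run as soon as a cell does not contain 'http'. One possible alternative, sketched here under assumptions, is to treat each row as a (name, URL) pair and skip rows that do not hold a usable URL. The helper names `is_http_url` and `iter_targets` and the sample rows are hypothetical, and skipping instead of exiting is a deliberate change from the script's behavior.

```python
from urllib.parse import urlparse


def is_http_url(value):
    """Return True when value looks like an http(s) URL with a host part."""
    parsed = urlparse(str(value))
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def iter_targets(rows):
    """Yield (name, url) pairs from rows of cell values, skipping bad rows."""
    for row in rows:
        if len(row) < 2:
            continue
        name, url = row[0], row[1]
        if name is None or not is_http_url(url):
            continue
        yield str(name), str(url)


# Hypothetical rows of plain cell values, one tuple per spreadsheet row.
sample_rows = [
    ("top", "https://scraping-for-beginner.herokuapp.com/"),
    ("memo", "not a url"),
    (None, None),
    ("ranking", "https://scraping-for-beginner.herokuapp.com/ranking/"),
]
for name, url in iter_targets(sample_rows):
    print(name, url)
```

Checking the parsed scheme is a little stricter than the substring test, since a cell such as "see http notes" would pass `'http' in str(url)` but is not a URL.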

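The Excel file is laid out like this: on the sheet named Sheet_list, each row from row 2 onward holds the output name in the first column and the URL in the second (the script skips row 1).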
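
To make the expected layout concrete, here is a small sketch that builds a workbook in memory with openpyxl and reads it back as (name, URL) pairs. The header text in row 1 and the names "top" and "ranking" are assumptions for illustration; the script only requires that the data starts in row 2 of a sheet called Sheet_list.

```python
import openpyxl

# Build a small in-memory workbook shaped like the assumed layout.
book = openpyxl.Workbook()
sheet = book.active
sheet.title = "Sheet_list"
sheet.append(["name", "URL"])  # header row (assumed), skipped via min_row=2
sheet.append(["top", "https://scraping-for-beginner.herokuapp.com/"])
sheet.append(["ranking", "https://scraping-for-beginner.herokuapp.com/ranking/"])

# values_only=True yields plain tuples of cell values instead of Cell objects.
pairs = [
    (name, url)
    for name, url in sheet.iter_rows(min_row=2, max_col=2, values_only=True)
]
print(pairs)
```

Calling `book.save("List_URLs.xlsx")` afterwards would write a file that matches the command line shown earlier.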

The URLs scraped this time come from the following site, which is provided for scraping practice.
https://scraping-for-beginner.herokuapp.com/
https://scraping-for-beginner.herokuapp.com/ranking/

After the scraping has run, a text file named "res_<name>.txt" is created in the folder for each URL.
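
The output step can also be written in text mode, letting `open` handle the encoding. The sketch below is an assumption-laden variant rather than the script's own code: the helper name `save_text` and the `out_dir` parameter are hypothetical, while the "res_" prefix, the cp932 encoding, and the choice to ignore unencodable characters follow the script.

```python
import tempfile
from pathlib import Path


def save_text(name, text, out_dir="."):
    """Write text to res_<name>.txt as cp932, dropping unencodable characters."""
    path = Path(out_dir) / f"res_{name}.txt"
    # errors="ignore" mirrors a.encode('cp932', 'ignore') in the script;
    # newline="" keeps line endings exactly as they appear in the text.
    with open(path, "w", encoding="cp932", errors="ignore", newline="") as f:
        f.write(text)
    return path


# Demo in a temporary folder; the emoji cannot be encoded as cp932 and is dropped.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("top", "scraping 😀 test", out_dir=tmp)
    print(saved.name, saved.read_bytes())
```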

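Closing

The text data scraped from each URL ends up with extra line breaks in it, which still needs to be improved. Also, since there is a technique called "crawling" that walks through a website automatically, an approach like this one may not be necessary.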
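
One way the extra line breaks might be reduced is to post-process the extracted text before writing it: strip each line and drop the ones that end up empty. This is a minimal sketch under that assumption; the function name `tidy_lines` and the sample string are hypothetical.

```python
def tidy_lines(text):
    """Strip each line and drop the ones that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical text with the kind of stray blank lines body.text tends to contain.
raw = "\n\n  Title \n\n\n   first line\n\t\nsecond line  \n"
print(tidy_lines(raw))
```

Beautiful Soup's `get_text` also accepts `separator` and `strip` arguments, for example `soup.find("body").get_text("\n", strip=True)`, which may give a similar result without a separate helper.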

Reference links

Scraping (BeautifulSoup)

Understanding Beautiful Soup in 10 minutes:
https://qiita.com/Chanmoro/items/db51658b073acddea4ac
Try parsing HTML and collecting data with Python? Scraping from the very beginning with "Python 2年生":
https://codezine.jp/article/detail/12230

Command-line arguments

Receiving command-line arguments in Python:
https://qiita.com/taashi/items/07bf75201a074e208ae5

Excel handling

[Python] Getting an Excel cell range (numbers and letters):
https://pg-chain.com/python-excel-cell-range
How to use openpyxl to read and write Excel files (xlsx) in Python:
https://note.nkmk.me/python-openpyxl-usage/
Summary of how to work with Excel files using openpyxl:
https://gammasoft.jp/support/how-to-use-openpyxl-for-excel-file/
