
Downloading files with Python

Posted at 2021-01-07

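In NLP 100 Exercise 2020 (Rev 2) (言語処理100本ノック), you often have to download a file before you can process it.

There are many ways to do that, so I collected a few here.

The file used is jawiki-country.json.gz from Chapter 3: Regular Expressions.

Everything was run on Google Colaboratory.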

requests

requests.py

import requests

url='https://nlp100.github.io/data/jawiki-country.json.gz'
filename='jawiki-country.json.gz'

urlData = requests.get(url).content

with open(filename, mode='wb') as f:  # 'wb' opens the file for writing bytes
  f.write(urlData)

This is how I usually do it.
Requests is nice because it can be used in much the same way as wget.

Writing the file is the standard approach.
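
One thing the short version above does not do is check whether the request succeeded: on a 404, the body of the error page would be written to disk under the target name. Below is a minimal sketch that adds a status check and a timeout. The function name `download` and the 30-second timeout are my own choices for illustration, not part of the original snippet.

```python
import requests


def download(url, filename, timeout=30):
    """Fetch url in one go and write the response body to filename."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of saving an error page.
    response.raise_for_status()
    with open(filename, mode='wb') as f:
        f.write(response.content)
    return filename


# Example call:
# download('https://nlp100.github.io/data/jawiki-country.json.gz',
#          'jawiki-country.json.gz')
```

Like the original, this holds the whole body in memory before writing it, so it suits small to medium files.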

As shown in the article "Python、Requestsを使ったダウンロード" (Downloading with Python and Requests), it should be able to handle large files as well.

requests2.py

import requests
import os

url='https://nlp100.github.io/data/jawiki-country.json.gz'
filename=os.path.basename(url)

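# stream=True defers downloading the body until it is iterated over
r = requests.get(url, stream=True)
with open(filename, 'wb') as f:
  # write the body 1024 bytes at a time instead of holding it all in memory
  for chunk in r.iter_content(chunk_size=1024):
    if chunk:
      f.write(chunk)
      f.flush()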

Use this version to write straight to the file as the data arrives.
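
`os.path.basename(url)` works for this URL because it ends in the file name. For a URL that carries a query string, the `?...` part would end up in the name, so a slightly more defensive sketch parses the URL first. The function name `filename_from_url` and the fallback name `download.bin` are hypothetical, chosen only for this illustration.

```python
import os
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default='download.bin'):
    """Return the last path component of url, ignoring query and fragment."""
    path = urlsplit(url).path
    name = os.path.basename(unquote(path))
    # A URL ending in '/' has no file name, so fall back to a placeholder.
    return name or default
```

For the URL used in this article the result is the same as before, `jawiki-country.json.gz`.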

urllib.request

urllib_request.py

import urllib.request

url='https://nlp100.github.io/data/jawiki-country.json.gz'
save_name='jawiki-country.json.gz'

urllib.request.urlretrieve(url, save_name)

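This one turned up while I was researching for this post.
See "pythonでwebからのファイルのダウンロード" (Downloading files from the web with Python).
It even saves the file for you, which is handy.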
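
The Python documentation lists `urlretrieve` under a legacy interface that might become deprecated at some point. If that is a concern, roughly the same result can be had with `urlopen` and `shutil.copyfileobj`. This is only a minimal sketch, and the function name `save_url` is made up for illustration.

```python
import shutil
import urllib.request


def save_url(url, save_name):
    """Stream the response body of url into the file save_name."""
    with urllib.request.urlopen(url) as response, open(save_name, 'wb') as f:
        # copyfileobj copies in fixed-size blocks, so the whole body
        # is never held in memory at once.
        shutil.copyfileobj(response, f)
    return save_name


# Example call:
# save_url('https://nlp100.github.io/data/jawiki-country.json.gz',
#          'jawiki-country.json.gz')
```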

pandas.read_X

read_X.py

import pandas as pd

url='https://nlp100.github.io/data/jawiki-country.json.gz'

df=pd.read_json(url, lines=True)

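pandas, which did so much of the work in Chapter 2.

The functions listed under Input/output in the pandas documentation can read straight from a URL and detect the compression automatically, which is very handy.

The result is always a DataFrame, so it does not fit every situation, but if you are going to process the data in that form anyway, this works too.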
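
When a DataFrame is not what you want, a file that has already been downloaded with one of the earlier methods can be read with the standard library instead. Here is a minimal sketch, assuming the file is gzip-compressed JSON Lines (one JSON object per line), which is what `lines=True` above implies. The function name `read_jsonl_gz` is hypothetical.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Return a list with one dict per line of a gzip-compressed JSON Lines file."""
    records = []
    # Mode 'rt' decompresses and decodes to text in one step.
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return records


# Example call:
# records = read_jsonl_gz('jawiki-country.json.gz')
```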

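Summary

On the command line, wget just works without any thought, but doing the same thing from a program takes a bit more consideration.

Any of the methods above should do the job.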
