
Preparing to use a GitHub Jupyter notebook in Google Colaboratory

Last updated at 2021-05-20, posted at 2021-05-08

Introduction

It was cheap, so I have been working through it a little at a time.

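For the Python material I open the GitHub notebooks from Colaboratory, but the data files cannot be used as they are.

I managed to mostly automate fetching the data, so here it is for reference.

It should also be useful when working with other notebooks published on GitHub.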

Code (the URL layout has changed, so this no longer works.)

dl_python.ipynb

# Create the directory
import os

os.makedirs('data', exist_ok=True)
# The notebook cell is split here, just in case.
#
from bs4 import BeautifulSoup
import requests

# https://www.amazon.co.jp/Hands-Data-Analysis-Pandas-visualization-ebook/dp/B08R67H7F5
# Hands-On Data Analysis with Pandas: A Python data science handbook for data collection
# Chapter 2: Working with Pandas DataFrames

url = 'https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/tree/master/ch_02/data'

r = requests.get(url).content

soup =  BeautifulSoup(r, "html.parser")

file_l = soup.find('div', role="grid").find_all(class_='js-navigation-open Link--primary')
# file_l=divs1.find_all(class_='js-navigation-open Link--primary')
file_names = [ sp.get('title') for sp in file_l]
url_path = url.replace('tree','raw')+'/'
url_path

os.chdir('data')

for filename in file_names:
    url = url_path + filename
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
os.chdir('..')
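
The scraper above depends on GitHub's page markup (the `role="grid"` div and the `js-navigation-open` class), so it breaks whenever that markup changes. As one possible alternative, here is a minimal sketch that lists a directory through the contents endpoint of GitHub's REST API instead, using only the standard library. The helper names `contents_api_url`, `file_download_urls`, and `list_files` are hypothetical, and `sample` is made-up data shaped like the API response so that the example runs offline. Unauthenticated API requests are rate-limited, so treat this as a sketch rather than a drop-in replacement.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def contents_api_url(owner, repo, path="", ref=None):
    """Build the REST API URL that lists one directory of a repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents"
    if path:
        url += "/" + quote(path.strip("/"))
    if ref:
        url += "?ref=" + quote(ref, safe="")
    return url


def file_download_urls(listing):
    """Map file name -> download URL, skipping subdirectories."""
    return {
        item["name"]: item["download_url"]
        for item in listing
        if item.get("type") == "file" and item.get("download_url")
    }


def list_files(owner, repo, path="", ref=None):
    """Fetch the listing over the network (not called in this sketch)."""
    with urlopen(contents_api_url(owner, repo, path, ref)) as resp:
        return file_download_urls(json.load(resp))


# Made-up sample shaped like the API response (assumption, for offline use).
sample = [
    {"type": "file", "name": "example.csv",
     "download_url": "https://example.com/raw/example.csv"},
    {"type": "dir", "name": "subdir", "download_url": None},
]
urls = file_download_urls(sample)
api_url = contents_api_url(
    "stefmolin", "Hands-On-Data-Analysis-with-Pandas", "ch_02/data", "master"
)
```

In a notebook, `list_files(...)` would return the name-to-URL mapping, and each URL could then be fetched with the same `requests.get(url, stream=True)` loop used above.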

Usage

After opening the notebook, save it to your own Google Drive (Save a copy in Drive), then add the code above and run it; it works as is.

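Explanation

os.makedirs() creates the directory. With exist_ok=True you do not have to care whether the directory already exists.
The url is copied and pasted from the GitHub page. This part is manual.
All access to web pages is done with requests.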
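
As a small illustration of the `exist_ok=True` behavior, the sketch below calls `os.makedirs()` twice on the same path and then once more without the flag. It runs inside a temporary directory (an assumption made only so that the example leaves nothing behind); in the notebook the path is simply `'data'`.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as base:
    data_dir = os.path.join(base, "data")

    os.makedirs(data_dir, exist_ok=True)  # creates the directory
    os.makedirs(data_dir, exist_ok=True)  # already exists: no error raised

    created = os.path.isdir(data_dir)

    # Without exist_ok=True, an existing directory raises FileExistsError.
    try:
        os.makedirs(data_dir)
        raised = False
    except FileExistsError:
        raised = True
```

This is why the cell can be re-run safely in Colaboratory without first checking for the directory.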

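This is taken straight from the reference.

For BeautifulSoup, rather than reading the reference, I got by with web searches whenever I was stuck.

The file-writing part is adapted from the code in an earlier article.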
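
The download loop changes the working directory with `os.chdir()`, so if a download fails partway the notebook is left inside the data directory. A possible variation, sketched below, writes to a joined path instead. The function name `save_chunks` is hypothetical, and `fake_chunks` is an offline stand-in for `r.iter_content(chunk_size=1024)`.

```python
import os
import tempfile


def save_chunks(chunks, directory, filename):
    """Write an iterable of byte chunks to directory/filename; return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
    return path


# Offline stand-in for r.iter_content(chunk_size=1024) (assumption).
fake_chunks = [b"a,b\n", b"", b"1,2\n"]
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chunks(fake_chunks, os.path.join(tmp, "data"), "sample.csv")
    saved_name = os.path.basename(saved)
    with open(saved, "rb") as f:
        content = f.read()
```

With requests this would be called as `save_chunks(r.iter_content(chunk_size=1024), 'data', filename)`. The explicit `f.flush()` is dropped here because the `with` block flushes and closes the file on exit.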

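When the files are directly under the repository root (please use this one.)

The markup of the file table and a few other details were different.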

repogitory_file_dl.ipynb

# Create the directory
import os

os.makedirs('Data', exist_ok=True)
# The notebook cell is split here, just in case.
#
from bs4 import BeautifulSoup
import requests

# https://github.com/devrimgunduz/pagila
# Pagila is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. It is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, etc.
# 

url = 'https://github.com/devrimgunduz/pagila'

r = requests.get(url).content

soup =  BeautifulSoup(r, "html.parser")

file_l = soup.find('div', role="grid").find_all(class_='js-navigation-open Link--primary')
url_path = "https://github.com"+os.path.commonprefix([sp.get('href') for sp in file_l])
file_names = [ sp.get('title') for sp in file_l]

os.chdir('Data')

for filename in file_names:
    url = url_path + filename + "?raw=True" 
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
os.chdir('..')

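Explanation

Looking at the page source, each href was a path starting from the user name, so os.path.commonprefix is used to get the part without the file name.
os.path.commonpath would drop the trailing /.
I learned that appending ?raw=True to the URL downloads the file, so the replace call was removed.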
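
The difference between the two functions can be checked with a few made-up href values. The paths below are hypothetical (the `blob/master` part in particular is an assumption about how the links looked), and `posixpath` is used for `commonpath` because URL paths always use `/`; on Linux and Colaboratory it is the same module as `os.path`.

```python
import os
import posixpath

# Hypothetical href values in the style described above (assumption).
hrefs = [
    "/devrimgunduz/pagila/blob/master/README.md",
    "/devrimgunduz/pagila/blob/master/pagila-data.sql",
    "/devrimgunduz/pagila/blob/master/pagila-schema.sql",
]

prefix = os.path.commonprefix(hrefs)  # character-wise, keeps the trailing "/"
path = posixpath.commonpath(hrefs)    # component-wise, no trailing "/"

# Pitfall: commonprefix compares characters, so when every file name starts
# with the same letters, those letters leak into the prefix.
similar = [
    "/devrimgunduz/pagila/blob/master/pagila-data.sql",
    "/devrimgunduz/pagila/blob/master/pagila-schema.sql",
]
leaky = os.path.commonprefix(similar)

# Component-wise alternative that always ends on a directory boundary.
safe = posixpath.commonpath(similar) + "/"
```

Here `prefix` ends with `/` while `path` does not, which matches the note above. `leaky` ends in `pagila-`, so `url_path + filename` would produce a wrong URL for a directory whose files all share a leading string; adding `/` to the `commonpath` result avoids that case.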

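Summary

Packt does this, and Manning and other online training providers also publish their notebooks on GitHub, so looking for them and working through them is a good way to learn.

Cloning the GitHub repository and working locally in VSCode or Jupyter Notebook seems to be the standard approach.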
