This post summarizes how to read and write data between Colaboratory and Google Drive or Google Sheets.

The code introduced in this post is also collected in the following notebook.
https://drive.google.com/file/d/1CsApp0TJx-qTkuuOyK_PjzdBZUfbPGV7/view?usp=sharing

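Introduction

I assumed Colaboratory would integrate easily with Google Drive and Google Sheets, since they are all Google products,
but Colaboratory does not provide any special built-in mechanism for this kind of input and output.

The official documentation does show input/output examples that use the general-purpose Google Drive access modules, but it still took me longer than expected to get things working.
I wrote this post in the hope that it helps anyone who is about to start doing data analysis with Colaboratory.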

Reading data

Uploading a file from the browser

Running the code displays a file selection form, which lets you upload a local file.

from google.colab import files

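# files.upload() returns a dict mapping each file name to its contents (bytes).
# Use a separate name so the imported files module is not overwritten.
uploaded = files.upload()
file_name = list(uploaded.keys())[0]
file_string = uploaded[file_name].decode()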

file_string
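
If several files are selected at once, or a file is not UTF-8 text, decoding only the first entry may not be enough. The following is a minimal sketch, assuming the upload result is a dict of file name to bytes as in the code above. The helper `decode_uploads` and its `encoding` parameter are hypothetical names introduced here for illustration, and the sample dicts stand in for a real upload.

```python
def decode_uploads(uploaded, encoding="utf-8"):
    """Decode every uploaded file (name -> bytes) into text (name -> str)."""
    return {name: data.decode(encoding) for name, data in uploaded.items()}


# Stand-ins for the dict returned by files.upload() (assumed shape: name -> bytes)
utf8_upload = {"a.csv": "id,name\n1,foo\n".encode("utf-8")}
sjis_upload = {"b.txt": "こんにちは".encode("shift_jis")}

texts = decode_uploads(utf8_upload)
sjis_texts = decode_uploads(sjis_upload, encoding="shift_jis")
```

Passing the encoding explicitly is useful for files exported from tools that do not write UTF-8.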

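Reading files from Google Drive

First, import the modules to use and grant access to Google Drive.
Running the code displays an authorization URL and a form for entering a code.
Open the link, obtain the code, paste it into the form, and press Enter.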

!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

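Finding a file ID from a Google Drive folder ID

Google Drive appears to be organized as a hierarchy of folders, but it is actually managed on a tag basis, so individual files are accessed through the unique ID assigned to each file.
A folder's ID is shown in the browser URL, but a file's ID is not visible in the same way while browsing, so we list the files in the folder to find their IDs.

# When a folder is open in the browser, the part of the URL after folders/ is the folder ID
# https://drive.google.com/drive/folders/<ID>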
dir_id = "1FhrOPpDmWtSu7UDaCtM703d2KEJvwhlc"

# Print the name and ID of each file in the folder
file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % dir_id}).GetList()
for f in file_list:
  print("name: " + f["title"] + ", id: " + f["id"])

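Finding the ID from a file's sharing URL

If sharing is enabled for the file, the sharing URL contains the ID, so you can also identify it from there.

Finding folder IDs from a path

Although I wrote above that Google Drive is not managed as a hierarchy,
there are probably cases where you want to access a file by specifying something like a file path.
It takes more code, but this can be done as follows.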
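
Pulling the ID out of a sharing URL can be sketched with a regular expression. This is a minimal sketch, not an official API: `extract_file_id` is a hypothetical helper, and the URL shapes it recognizes (`/file/d/<ID>`, `/folders/<ID>`, and an `id=<ID>` query parameter) are assumptions based on the URLs shown in this post and commonly seen link formats.

```python
import re

# URL shapes assumed for illustration; Drive may use other formats as well
_ID_PATTERNS = [
    r"/file/d/([\w-]+)",   # e.g. https://drive.google.com/file/d/<ID>/view
    r"/folders/([\w-]+)",  # e.g. https://drive.google.com/drive/folders/<ID>
    r"[?&]id=([\w-]+)",    # e.g. ...?id=<ID>
]


def extract_file_id(share_url):
    """Return the ID portion of a Drive URL, or None if no pattern matches."""
    for pattern in _ID_PATTERNS:
        match = re.search(pattern, share_url)
        if match:
            return match.group(1)
    return None


notebook_id = extract_file_id(
    "https://drive.google.com/file/d/1CsApp0TJx-qTkuuOyK_PjzdBZUfbPGV7/view?usp=sharing"
)
```

The returned string can then be passed as the `id` value wherever a file ID is needed.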

# Look up IDs from a file path
def get_id_list(path_str):
  path_str_list = path_str.strip("/").split("/")
  path_id_list = ["root"]

  # Search by the parent's ID and return the ID of the folder or file whose name matches
  def find_id(parent_id, path_name):
    file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % parent_id}).GetList()
    for f in file_list:
      if f['title'] == path_name:
        return f['id']

    # Raise an error if nothing under the parent matches this path component
    raise FileNotFoundError(path_name)

  for i, s in enumerate(path_str_list):
    path_id_list.append(find_id(path_id_list[i], s))

  # Pair each path component with its ID (skipping the leading "root")
  return [{"title": title, "id": file_id}
          for title, file_id in zip(path_str_list, path_id_list[1:])]

# Get the IDs with a path-like specification
path_string = "/foo/bar/baz"
get_id_list(path_string)
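
The lookup above lists every child of a folder and compares titles in Python. As a possible refinement, the name could be included in the search query itself. The sketch below only builds the query string. `build_child_query` is a hypothetical helper, and it assumes the Drive API v2 query syntax (the `title` field, with single quotes and backslashes escaped by a backslash), which is my understanding of what PyDrive passes through.

```python
def build_child_query(parent_id, title):
    """Build a query for non-trashed items named `title` under `parent_id`.

    Assumes Drive API v2 search syntax; quotes in the title are backslash-escaped.
    """
    escaped = title.replace("\\", "\\\\").replace("'", "\\'")
    return "'%s' in parents and title = '%s' and trashed=false" % (parent_id, escaped)


query = build_child_query("root", "foo")
# The string would be passed as drive.ListFile({'q': query}).GetList()
```

Filtering on the server side can reduce the amount of data returned for large folders.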

Downloading a file from Google Drive to Colaboratory's local storage

Once you know the file ID, you can fetch the file with short code like the following.

# File ID found with the method above
file_id = "1MJlHF-9G74CNjEjeZTTzarE5C0LEFwOT"
drive_file = drive.CreateFile({'id': file_id})

# Fetch the file
drive_file.GetContentFile("ring_dev_analytics_data.csv")

with open("ring_dev_analytics_data.csv", "r") as f:
  file_string = f.read()

file_string

Reading a file's contents from Google Drive into a Python variable

You can also read just the file contents directly into a variable.

# File ID found with the method above
file_id = "1MJlHF-9G74CNjEjeZTTzarE5C0LEFwOT"
drive_file = drive.CreateFile({'id': file_id})

# Get the file contents
drive_file.GetContentString()
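
When the file is a CSV, the string obtained this way can be parsed without writing it to disk first. This is a minimal sketch using only the standard library. `parse_csv_text` is a hypothetical helper, and the sample text is made-up data standing in for the real file contents.

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV text into a list of row dicts keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample standing in for the string returned from Google Drive
sample_text = "date,users\n2018-01-01,10\n2018-01-02,12\n"
rows = parse_csv_text(sample_text)
```

All values come back as strings, so numeric columns need an explicit conversion such as `int(row["users"])`.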

Reading data from Google Sheets

Google Sheets also requires authorization setup.

!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

The following example reads all data in a spreadsheet into a two-dimensional array.

# Open the spreadsheet by file name
sht = gc.open('test_sheet')
worksheet = sht.get_worksheet(0)

# Specify the range of cells to read (adjust as needed if you know the range you want)
row_cnt = worksheet.row_count
col_cnt = worksheet.col_count

cells = worksheet.range(1, 1, row_cnt, col_cnt)

table_data = []
cols = []

for i, cell in enumerate(cells):
  cols.append(cell.value)
  if (i + 1) % col_cnt == 0:
    table_data.append(cols)
    cols = []

table_data
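
Because the range covers the whole sheet grid, the resulting array usually ends with many empty rows and columns. The following is a minimal sketch of trimming them, assuming empty cells are read as empty strings. `trim_empty` is a hypothetical helper and the sample table is made-up data. gspread also offers `worksheet.get_all_values()`, which may be a simpler way to get the rows directly.

```python
def trim_empty(table):
    """Drop trailing rows and trailing columns that contain only empty strings."""
    rows = [list(r) for r in table]
    # Remove rows from the bottom while they are entirely empty
    while rows and all(v == "" for v in rows[-1]):
        rows.pop()
    # Width is the position just past the last non-empty cell in any row
    width = max(
        (max((i + 1 for i, v in enumerate(r) if v != ""), default=0) for r in rows),
        default=0,
    )
    return [r[:width] for r in rows]


sample_table = [["a", "b", ""], ["1", "", ""], ["", "", ""]]
trimmed = trim_empty(sample_table)
```

Empty cells in the middle of the data are kept, so the rows stay aligned with their columns.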

Writing data

Downloading a file through the browser

Running the following code starts a download in the browser.

from google.colab import files

with open('example.txt', 'w') as f:
  f.write('some content')

files.download('example.txt')

Writing files to Google Drive

Writing requires the same setup as reading.

!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Writing a Colaboratory local file to Google Drive

An example of file output.

with open("upload_file_1.txt", "w") as f:
  f.write("output string 1")

# Upload the local file written above
upload_file_1 = drive.CreateFile()
upload_file_1.SetContentFile("upload_file_1.txt")
upload_file_1.Upload()

Writing from a Python variable to Google Drive

As with reading, you can also write directly from a variable.

# Create a Drive file named upload_file_2.txt directly from a string
upload_file_2 = drive.CreateFile({'title': 'upload_file_2.txt'})
upload_file_2.SetContentString("output string 2")
upload_file_2.Upload()
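
The examples above do not specify a destination folder. To place the file in a particular folder, the metadata passed to `CreateFile` can, as far as I know, include a `parents` entry with the folder ID. The sketch below only builds that metadata dict. `build_upload_metadata` is a hypothetical helper, and the `parents` format is an assumption based on the Drive API v2 file resource.

```python
def build_upload_metadata(title, folder_id=None):
    """Build the metadata dict for drive.CreateFile, optionally targeting a folder."""
    metadata = {"title": title}
    if folder_id is not None:
        # Assumed Drive API v2 format: a list of parent references by ID
        metadata["parents"] = [{"id": folder_id}]
    return metadata


metadata = build_upload_metadata("upload_file_3.txt", "1FhrOPpDmWtSu7UDaCtM703d2KEJvwhlc")
# The dict would be passed as drive.CreateFile(metadata) before SetContentString() and Upload()
```

The folder ID here is the one used in the listing example earlier in this post.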

Writing to Google Sheets