
Getting the redirect targets of DOIs and similar URLs

Posted at 2022-08-04

Introduction

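Given a pandas Series of URLs such as journal DOIs (doi.org) or handle.net handles, this post retrieves the redirect target (the provider's page) for each URL and returns the results as a Series.
The URLs do not have to be DOIs; anything that gets redirected, for whatever reason, will work.
The retrieval logic itself is borrowed from the following article. Many thanks to its author.
Python で HTTP リダイレクト先の URL を取得する (Getting the redirect target URL of an HTTP request in Python)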

Source data

import pandas as pd
df = pd.Series(['https://doi.org/10.20730/100240356', 'http://hdl.handle.net/2324/4245',
               'https://repository.dl.itc.u-tokyo.ac.jp/search?search_type=2&q=6092'], index=['1', '2', '3'])

Conversion

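import urllib.error
import urllib.request
from time import sleep

# Handler class that does not follow redirects
class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Override HTTPRedirectHandler.redirect_request
    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        self.newurl = newurl  # remember the redirect target URL
        # Return None so the redirect is not followed and an HTTPError is raised.
        # (Returning req would re-request the same URL until urllib's loop detection trips.)
        return None

# Return the redirect target of src_url, or src_url itself if there is none
def get_redirect_url(src_url):
    # Build an opener that uses the no-redirect handler
    no_redirect_handler = NoRedirectHandler()
    opener = urllib.request.build_opener(no_redirect_handler)
    try:
        sleep(0.1)  # short pause between requests to go easy on the server
        with opener.open(src_url):
            return src_url  # no redirect happened
    except urllib.error.HTTPError:
        if hasattr(no_redirect_handler, "newurl"):
            return no_redirect_handler.newurl  # return the redirect target URL
        else:
            print('else:', src_url)  # HTTP error other than a redirect
            return src_url
    except Exception:
        print('except Exception:', src_url)  # e.g. connection failure or malformed URL
        return src_url

df_redirect = df.map(get_redirect_url)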
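
Because the handler above stops at the first redirect, what it records is the first hop, which may differ from the page you finally land on when a URL is redirected more than once. The following is a minimal sketch of that difference, not part of the original article: it assumes a throwaway local server (`ChainHandler`, hypothetical) that redirects /a to /b to /c, and the helper names `first_hop` and `final_url` are likewise made up for illustration. `ProxyHandler({})` is there only so that the local demo is not routed through a proxy.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ChainHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server: /a -> /b -> /c, and /c answers 200."""
    hops = {"/a": "/b", "/b": "/c"}

    def do_GET(self):
        target = self.hops.get(self.path)
        if target is None:
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(302)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


class FirstHopHandler(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.newurl = newurl  # absolute URL of the first redirect target
        return None           # do not follow the redirect


def first_hop(url):
    """Return the target of the first redirect, or url if there is none."""
    handler = FirstHopHandler()
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}), handler)
    try:
        with opener.open(url, timeout=5):
            return url
    except urllib.error.HTTPError:
        return getattr(handler, "newurl", url)


def final_url(url):
    """Follow every redirect (urllib's default behaviour) and return the last URL."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as res:
        return res.geturl()


def demo():
    server = HTTPServer(("127.0.0.1", 0), ChainHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        return first_hop(base + "/a"), final_url(base + "/a"), base
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    first, final, base = demo()
    print(first.replace(base, ""), final.replace(base, ""))
```

Which of the two is appropriate depends on the goal: the first hop shows where the resolver points, while the final URL shows where a browser would end up.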

Result

pd.set_option("display.max_colwidth", 200)
print(df_redirect)

1                           http://kotenseki.nijl.ac.jp/biblio/100240356
2               https://www.lib.kyushu-u.ac.jp/publications_kyushu/jagri
3    https://repository.dl.itc.u-tokyo.ac.jp/search?search_type=2&q=6092
dtype: object
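
In the output above, entries 1 and 2 were replaced by a different URL while entry 3 came back unchanged. To make that visible at a glance, the source and the result can be placed side by side with a flag column. This is only a sketch: the values are copied from the listing above rather than fetched again, and the function name `compare` and the column names are arbitrary choices for illustration.

```python
import pandas as pd

# Source URLs and results, copied from the article instead of re-fetched
src = pd.Series(
    ['https://doi.org/10.20730/100240356',
     'http://hdl.handle.net/2324/4245',
     'https://repository.dl.itc.u-tokyo.ac.jp/search?search_type=2&q=6092'],
    index=['1', '2', '3'])
redirected = pd.Series(
    ['http://kotenseki.nijl.ac.jp/biblio/100240356',
     'https://www.lib.kyushu-u.ac.jp/publications_kyushu/jagri',
     'https://repository.dl.itc.u-tokyo.ac.jp/search?search_type=2&q=6092'],
    index=['1', '2', '3'])


def compare(source, target):
    """Align both Series on their index and flag rows whose URL changed."""
    out = pd.DataFrame({"source": source, "target": target})
    out["redirected"] = out["source"] != out["target"]
    return out


result = compare(src, redirected)
print(result["redirected"].tolist())  # [True, True, False]
```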
