
I made a GAS script that saves the latest research from arXiv to Notion on a schedule

Posted at 2023-11-13

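arXiv receives many new research papers every day. There are plenty of guides for automatically posting this paper information to Slack, but on the free plan Slack only retains data for 90 days. So I wanted to store it in Notion instead, which is better suited to keeping and displaying records.
This article explains how to use Google Apps Script (GAS) to periodically fetch the latest physics-related papers from arXiv and automatically save them to a Notion database.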

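What can it do?

Search for the latest papers posted to arXiv in the categories you specify.
Retrieve each paper's title, authors, abstract, publication date, URL, and categories.
Summarize each abstract in Japanese using OpenAI's ChatGPT and save the result to a Notion database.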

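How do I use it?

Step 1: Prepare the Notion database

First, create a new database in Notion and add the following columns. The property names must match exactly, because the script refers to them by name.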

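Property name	Description	Property type
Title	Stores the paper's title	Title
Authors	Stores the author names	Rich text
Abstract	Stores the paper's abstract	Rich text
Publication	Stores the paper's publication date	Date
DOI	Stores the paper's URL	URL
Categories	Stores the categories the paper belongs to	Multi-select
Summary	Stores the summary of the paper	Rich text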
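One thing to watch with the rich text columns: as far as I know, the Notion API accepts at most 2,000 characters in a single text.content value, and a long author list could exceed that. The helper below is a minimal sketch of one way around it, not part of the original script; toRichText and NOTION_TEXT_LIMIT are hypothetical names, and the 2,000-character figure is an assumption you should check against the current Notion documentation.

```javascript
// Assumed limit: Notion accepts at most 2,000 characters per text.content value.
const NOTION_TEXT_LIMIT = 2000;

/**
 * Hypothetical helper: splits a string into rich text objects whose content
 * is at most NOTION_TEXT_LIMIT characters each. An empty string gives [].
 * @param {string} text - The text to store in a rich text property
 * @returns {Object[]} - Array usable as a "rich_text" value
 */
function toRichText(text) {
  const chunks = [];
  for (let i = 0; i < text.length; i += NOTION_TEXT_LIMIT) {
    chunks.push({ text: { content: text.slice(i, i + NOTION_TEXT_LIMIT) } });
  }
  return chunks;
}
```

In the payload, this would be used as "Authors": { "rich_text": toRichText(article.authors) }. Splitting by character count can cut a word in half, which Notion displays without a visible gap because the pieces are adjacent.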

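Step 2: Obtain and store the required API keys and IDs

Obtaining them

First, prepare the following.

A Notion Integration Token. See the Notion documentation for details.
The ID of the Notion database you created
- This is the <database ID> part of the database link https://www.notion.so/<database ID>?v=xxxxxxx.
An OpenAI API key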
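If you would rather not pick the database ID out of the link by hand, a small helper can do it. This is only a sketch: extractDatabaseId is a hypothetical name, and it assumes the ID is the 32 hexadecimal characters at the end of the link's path, just before the ?v= part.

```javascript
/**
 * Hypothetical helper: pulls the database ID out of a Notion database link.
 * Assumes the ID is the last 32 hexadecimal characters of the path.
 * @param {string} url - The database link copied from Notion
 * @returns {string|null} - The ID, or null when none is found
 */
function extractDatabaseId(url) {
  const path = url.split('?')[0];                   // drop the ?v=... view parameter
  const match = path.match(/([0-9a-f]{32})\/?$/i);  // 32 hex characters at the end of the path
  return match ? match[1] : null;
}
```

The function uses only string operations, so it behaves the same in the Apps Script editor and in any other JavaScript runtime.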

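Where to store them

Store the Notion API token, the database ID, and the OpenAI API key in the properties of your Google Apps Script project. This keeps them out of the source code while still letting the script read them.

In the Google script editor, choose "File" > "Project properties", select the "Script properties" tab, and enter each item as a key-value pair. (In the current editor, the same setting is under "Project Settings" > "Script Properties".) For this script, the key for the Notion token is NOTION_TOKEN, the key for the database ID is DATABASE_ID, and the key for the OpenAI API key is OPENAI_API_KEY. Each value is the token, ID, or key exactly as you obtained it.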
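A typo in a key name only shows up later as a failed API request, so it can help to check the properties once after saving them. The following is a minimal sketch, not part of the original script; findMissingKeys and checkProperties are hypothetical names, and the key list simply mirrors the three keys described above.

```javascript
const REQUIRED_KEYS = ['NOTION_TOKEN', 'DATABASE_ID', 'OPENAI_API_KEY'];

/**
 * Hypothetical helper: returns the required keys that are absent or empty
 * in a plain object of script properties.
 * @param {Object} props - Key-value pairs of script properties
 * @returns {string[]} - The missing keys
 */
function findMissingKeys(props) {
  return REQUIRED_KEYS.filter(key => !props[key]);
}

/** Hypothetical entry point: run once from the Apps Script editor. */
function checkProperties() {
  // getProperties() returns all script properties as a plain object.
  const props = PropertiesService.getScriptProperties().getProperties();
  const missing = findMissingKeys(props);
  Logger.log(missing.length ? `Missing: ${missing.join(', ')}` : 'All properties are set.');
}
```

Only checkProperties touches the Apps Script services; findMissingKeys is plain JavaScript and can be tried anywhere.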

Step 3: Set up the script

Open the Google script editor and paste in the entire script below.

code.gs

// Constants
const ARXIV_API_URL = "http://export.arxiv.org/api/query";
const NOTION_API_URL = "https://api.notion.com/v1/pages";
const NOTION_VERSION = '2022-06-28';
const NOTION_TOKEN = PropertiesService.getScriptProperties().getProperty('NOTION_TOKEN');
const NOTION_DATABASE_ID = PropertiesService.getScriptProperties().getProperty('DATABASE_ID');
const OPENAI_API_KEY = PropertiesService.getScriptProperties().getProperty('OPENAI_API_KEY');

/**
 * Helper function to encode URL parameters
 * @param {Object} params - The parameters to encode
 * @returns {string} - Encoded URL parameters
 */
function encodeParams(params) {
  return Object.keys(params).map(key => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`).join('&');
}

// Fetches articles already present in the Notion database to prevent duplicates
function getExistingTitlesInNotion() {
  const headers = {
    "Authorization": `Bearer ${NOTION_TOKEN}`,
    "Content-Type": "application/json",
    "Notion-Version": NOTION_VERSION
  };
  
  const response = UrlFetchApp.fetch(`https://api.notion.com/v1/databases/${NOTION_DATABASE_ID}/query`, {
    "method": "post",
    "headers": headers,
    "payload": JSON.stringify({}),
    "muteHttpExceptions": true
  });

  if (response.getResponseCode() !== 200) {
    Logger.log(`Notion query failed with response: ${response.getContentText()}`);
    return [];
  }
  
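  // Note: a single query returns at most 100 pages
  const results = JSON.parse(response.getContentText()).results;
  // Join all title fragments; pages with an empty title yield "" and are filtered out
  return results
    .map(page => page.properties.Title.title.map(part => part.plain_text).join(''))
    .filter(title => title);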
}

/**
 * Fetches articles from the arXiv API for the specified categories
 * @returns {Object[]} - The list of fetched articles
 */
function fetchArxivArticles() {
  // 'physics.chem-ph', 'physics.comp-ph'
  const categories = ['cond-mat.mtrl-sci', 'physics.comp-ph'];
  let allArticles = [];

  categories.forEach(category => {
    const searchParams = {
      search_query: `cat:${category}`,
      sortBy: 'submittedDate',
      sortOrder: 'descending',
      max_results: 5
    };

    const queryUrl = `${ARXIV_API_URL}?${encodeParams(searchParams)}`;
    const xml = UrlFetchApp.fetch(queryUrl).getContentText();
    allArticles = allArticles.concat(parseArxivXml(xml));
  });

  return allArticles;
}

/**
 * Parses an XML response from arXiv and extracts article information
 * @param {string} xml - The XML string to parse
 * @returns {Object[]} - The list of article objects
 */
function parseArxivXml(xml) {
  const document = XmlService.parse(xml);
  const entries = document.getRootElement().getChildren('entry', XmlService.getNamespace('http://www.w3.org/2005/Atom'));
  return entries.map(entry => {
    const title = entry.getChild('title', XmlService.getNamespace('http://www.w3.org/2005/Atom')).getText().trim();
    const authors = entry.getChildren('author', XmlService.getNamespace('http://www.w3.org/2005/Atom'))
                      .map(author => author.getChild('name', XmlService.getNamespace('http://www.w3.org/2005/Atom')).getText())
                      .join(', ');
    const abstract = entry.getChild('summary', XmlService.getNamespace('http://www.w3.org/2005/Atom')).getText()
                          .replace(/(\r\n|\n|\r)/gm, " ")
                          .replace(/\s+/g, " ");
    const publicationDate = entry.getChild('published', XmlService.getNamespace('http://www.w3.org/2005/Atom')).getText();
    const arxivId = entry.getChild('id', XmlService.getNamespace('http://www.w3.org/2005/Atom')).getText().trim();
    const doi = arxivId;
    const categoryTerms = entry.getChildren('category', XmlService.getNamespace('http://www.w3.org/2005/Atom'))
                               .map(cat => cat.getAttribute('term').getValue())
                               .join(', ');

    return { title, authors, abstract, publicationDate, doi, categories: categoryTerms };
  });
}

/**
 * Saves a single article's information to a Notion database
 * @param {Object} article - The article object to save
 * @param {string[]} existingTitles - Titles already stored in Notion
 */
function saveArticleToNotion(article, existingTitles) {
  // Check if the article already exists in Notion to prevent duplicates
  if (existingTitles.includes(article.title)) {
    Logger.log(`Skipping save since article already exists: ${article.title}`);
    return;
  }
  // Reuse the summary generated in main(); call OpenAI only if none is attached
  const summary = article.summary || summarizeAbstract(article.abstract);
  if (!summary) {
    Logger.log(`Failed to summarize: ${article.title}`);
    return;
  }

  const headers = {
    "Authorization": `Bearer ${NOTION_TOKEN}`,
    "Content-Type": "application/json",
    "Notion-Version": NOTION_VERSION
  };

  const payload = {
    "parent": { "database_id": NOTION_DATABASE_ID },
    "properties": {
      "Title": { "title": [{ "text": { "content": article.title } }] },
      "Authors": { "rich_text": [{ "text": { "content": article.authors } }] },
      "Abstract": { "rich_text": [{ "text": { "content": article.abstract } }] },
      "Publication": { "date": { "start": article.publicationDate } },
      "DOI": { "url": article.doi },
      "Categories": { "multi_select": article.categories.split(', ').map(cat => ({ "name": cat })) },
      "Summary": { "rich_text": [{ "text": { "content": summary } }] }
    }
  };

  const response = UrlFetchApp.fetch(NOTION_API_URL, {
    "method": "post",
    "headers": headers,
    "payload": JSON.stringify(payload),
    "muteHttpExceptions": true
  });

  const responseCode = response.getResponseCode();
  if (responseCode === 200) {
    Logger.log(`Saved: '${article.title}'`);
  } else {
    Logger.log(`Failed to save: '${article.title}' - Response code: ${responseCode}`);
  }
}

/**
 * Summarizes the abstract using OpenAI's Chat API
 * @param {string} abstract - The abstract to summarize
 * @returns {string} - The summarized abstract
 */
function summarizeAbstract(abstract) {
  // Pre-process the abstract text to remove excessive newlines and spaces
  abstract = abstract.replace(/(\r\n|\n|\r)/gm," ").replace(/\s+/g," ");
  
  // Construct the prompt for OpenAI
  const messages = [
    {
      "role": "system",
      "content": "You are a researcher with a strong background in chemistry and excel at concisely summarizing papers. Please read the abstract of the following paper and explain the two main points in Japanese, each with a line break for each. The main points should be summarized in 2 lines with bullet points. lang: ja"
    },
    {
      "role": "user",
      "content": abstract
    }
  ];
  
  // Call OpenAI's Chat API
  const response = UrlFetchApp.fetch("https://api.openai.com/v1/chat/completions", {
    "method": "post",
    "headers": {
      "Authorization": "Bearer " + OPENAI_API_KEY,
      "Content-Type": "application/json"
    },
    "payload": JSON.stringify({ model: "gpt-3.5-turbo", messages: messages }),
    "muteHttpExceptions": true
  });
  
  const responseCode = response.getResponseCode();
  const responseBody = JSON.parse(response.getContentText());
  
  if (responseCode === 200 && responseBody.choices && responseBody.choices.length > 0) {
    const choice = responseBody.choices[0];
    if (choice.message && choice.message.content) {
      // Return the summary provided by OpenAI
      return choice.message.content.trim();
    }
  } else {
    // Log the error and return null
    Logger.log(`OpenAI request failed with response: ${response.getContentText()}`);
    return null;
  }
}

function main() {
  const existingTitles = getExistingTitlesInNotion(); // Get existing titles from Notion
  const articles = fetchArxivArticles(); // Fetch articles from arXiv

  articles.forEach(article => {
    if (!existingTitles.includes(article.title)) { // Check if the article title is not in the existing titles list
      const summary = summarizeAbstract(article.abstract); // Generate a summary for the abstract
      if (summary) { // If a summary is generated, continue with saving the article
        article.summary = summary; // Append the summary to the article object
        saveArticleToNotion(article, existingTitles); // Pass existingTitles to the function for a final check
        existingTitles.push(article.title); // Add the saved article title to the list to prevent future duplicates
      } else {
        Logger.log(`Failed to summarize: ${article.title}`); // Log any failures in summary generation
      }
    } else {
      Logger.log(`Skipping duplicate: ${article.title}`); // Log skipped duplicates
    }
  });
}

function run() {
  main(); // Start the process
}
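One limitation of getExistingTitlesInNotion in the script above: the Notion API returns at most 100 pages per query request, so once the database grows past that, older papers are no longer seen by the duplicate check. The response carries has_more and next_cursor fields, and sending next_cursor back as start_cursor fetches the following page. Below is a minimal sketch of that loop, assuming you want to read the whole database on every run; collectAllResults and queryNotionPage are hypothetical names that are not in the original script.

```javascript
/**
 * Hypothetical helper: collects the results of every page of a paginated
 * Notion query. queryPage(cursor) must return the parsed response body of
 * one request; cursor is undefined for the first request.
 * @param {function} queryPage - Performs one query request
 * @returns {Object[]} - The results of all pages, in order
 */
function collectAllResults(queryPage) {
  let results = [];
  let cursor;
  do {
    const body = queryPage(cursor);
    results = results.concat(body.results);
    // Continue only while Notion reports more pages and supplies a cursor
    cursor = body.has_more ? body.next_cursor : undefined;
  } while (cursor);
  return results;
}

/**
 * Sketch of a queryPage implementation for Apps Script. It relies on
 * UrlFetchApp and on the NOTION_* constants defined in code.gs.
 */
function queryNotionPage(cursor) {
  const response = UrlFetchApp.fetch(`https://api.notion.com/v1/databases/${NOTION_DATABASE_ID}/query`, {
    method: 'post',
    headers: {
      'Authorization': `Bearer ${NOTION_TOKEN}`,
      'Content-Type': 'application/json',
      'Notion-Version': NOTION_VERSION
    },
    // start_cursor is sent only when continuing from a previous page
    payload: JSON.stringify(cursor ? { start_cursor: cursor } : {})
  });
  return JSON.parse(response.getContentText());
}
```

Inside getExistingTitlesInNotion, collectAllResults(queryNotionPage) would then take the place of the single request. Reading every page costs one request per 100 papers, so for a large database it may be cheaper to filter the query by date instead.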

Step 4: Run the script

Once setup is complete, run the "run" function to start the script. It fetches paper information from arXiv and saves it to Notion automatically.

Step 5: Check the results

After the run, open your Notion database and check that the new papers have been added.

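Setting up automatic execution

By setting up a Google Apps Script trigger, you can run the script periodically. With a daily time-based trigger, the script fetches paper information at a set time each day and updates the Notion database.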
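Triggers can be created from the editor's trigger screen, but they can also be created from code. The following is a minimal sketch of the latter, not something the original article did; installDailyTrigger and setup are hypothetical names, and the hour 7 is an arbitrary example. The ScriptApp object is passed in as a parameter only so that the function can be exercised outside Apps Script.

```javascript
/**
 * Hypothetical helper: (re)creates a daily time-based trigger for `run`.
 * @param {number} hour - Hour of the day (0-23) in the script's time zone
 * @param {Object} scriptApp - Pass the global ScriptApp when calling from Apps Script
 */
function installDailyTrigger(hour, scriptApp) {
  // Remove earlier triggers for the same handler so the script does not run twice.
  scriptApp.getProjectTriggers()
    .filter(trigger => trigger.getHandlerFunction() === 'run')
    .forEach(trigger => scriptApp.deleteTrigger(trigger));

  // Fires once a day, at some point within the hour that starts at `hour`.
  return scriptApp.newTrigger('run')
    .timeBased()
    .everyDays(1)
    .atHour(hour)
    .create();
}

/** Hypothetical entry point: run once from the Apps Script editor. */
function setup() {
  installDailyTrigger(7, ScriptApp);
}
```

Running setup once is enough; running it again replaces the existing trigger instead of adding a second one.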

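Notes

On each run, the script fetches the 5 most recent papers from each category. If you want to fetch more, change the value of max_results.
Abstracts are summarized with the OpenAI API, but depending on API limits or the response, a summary may not be generated. In that case the paper is skipped and not saved to Notion.
The prompt can probably still be improved a lot.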
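To make the effect of max_results concrete, here is a small sketch that builds the same arXiv query URL as fetchArxivArticles, with the number of results as an argument. buildArxivQueryUrl is a hypothetical name; the parameters and the endpoint are the ones the script already uses.

```javascript
/**
 * Hypothetical helper: builds the arXiv API query URL for one category.
 * @param {string} category - arXiv category, e.g. 'cond-mat.mtrl-sci'
 * @param {number} maxResults - Number of papers to request
 * @returns {string} - The full query URL
 */
function buildArxivQueryUrl(category, maxResults) {
  const params = {
    search_query: `cat:${category}`,
    sortBy: 'submittedDate',
    sortOrder: 'descending',
    max_results: maxResults
  };
  // Percent-encode each key and value, then join them with '&'
  const query = Object.keys(params)
    .map(key => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
  return `http://export.arxiv.org/api/query?${query}`;
}
```

For example, buildArxivQueryUrl('cond-mat.mtrl-sci', 20) requests the 20 most recent papers in that category. Keep in mind that every additional new paper also means one more OpenAI request per run.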
