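Introduction

arXiv is a site operated by the Cornell University Library that hosts papers from a wide range of fields, and the PDFs can be read for free.

I figured that if I analyzed the information I wanted to see and posted only the promising items to Twitter, I could save myself the effort of searching. As a first step, though, I decided simply to tweet the arXiv feed.

The target this time is cs.CV, a category I am interested in.

I chose a Raspberry Pi because the bot runs around the clock, but any machine that can run Python will do.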

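How the arXiv RSS feed works

Two pages in the arXiv help explain how this works.

The key point is the statement in section 3.3.1.1 of the arXiv API User's Manual that updates happen once a day. Accessing the feed more often does not yield new information, so the manual suggests designing the call frequency and a caching mechanism with that in mind.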

Because the arXiv submission process works on a 24 hour submission cycle, new articles are only available to the API on the midnight after the articles were processed. The <updated> tag thus reflects the midnight of the day that you are calling the API. This is very important - search results do not change until new articles are added. Therefore there is no need to call the API more than once in a day for the same query. Please cache your results. This primarily applies to production systems, and of course you are free to play around with the API while you are developing your program!
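The "cache your results" advice quoted above could be followed with something like the sketch below. It is only an illustration under my own assumptions: the function names, the JSON cache layout, and the idea of keying the cache on a calendar date are hypothetical and are not part of the program in this article. The caller passes the current date as a datetime.date.

```python
import json
import os


def read_cache(path, today):
    """Return the feed text cached for `today` (a datetime.date), or None."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        cached = json.load(f)
    # Only reuse the cached body if it was stored on the same day.
    if cached.get("date") == today.isoformat():
        return cached.get("body")
    return None


def write_cache(path, today, body):
    """Store the feed text together with the date on which it was fetched."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"date": today.isoformat(), "body": body}, f)
```

A caller would try read_cache first and access the network only when it returns None, which keeps requests to at most one per day per feed.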

The feed XML for any category can be fetched by swapping the category name in the URL below.

http://export.arxiv.org/rss/cs.CV/rss.xml

A list of the category names is available on arXiv.
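As a small illustration of swapping the category name, the URL can be built from the category string. The helper name feed_url is my own and does not appear in the program below.

```python
def feed_url(category):
    """Build the RSS feed URL for an arXiv category such as "cs.CV"."""
    return "http://export.arxiv.org/rss/{}/rss.xml".format(category)


print(feed_url("cs.RO"))  # http://export.arxiv.org/rss/cs.RO/rss.xml
```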

Python program

References

I wrote the program with reference to the following.

Getting started with the Twitter API
Building a news bot with Raspberry Pi2 + Twython

Implementation notes

Libraries

The program imports the following libraries.

twython
feedparser

Auth key information

Following the examples, the Twitter auth keys are first collected in auth.py.

auth.py

consumer_key        = 'ABCDEFGHIJKLKMNOPQRSTUVWXYZ'
consumer_secret     = '1234567890ABCDEFGHIJKLMNOPQRSTUVXYZ'
access_token        = 'ZYXWVUTSRQPONMLKJIHFEDCBA'
access_token_secret = '0987654321ZYXWVUTSRQPONMLKJIHFEDCBA'

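Fetching the feed

Set RSS_URL to the URL of the feed XML you want to fetch. The update record (the date and time in updated) is kept in the file specified by PUBDATE_LOG.

I wanted the program itself to check whether the file specified by PUBDATE_LOG exists, but I have not implemented that yet, so an empty file has to be created beforehand:

$ touch cs.CV.log

your LOG dir is the directory containing this program. If you run the program automatically from cron, write it as an absolute path.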

RSS_URL = "http://export.arxiv.org/rss/cs.CV/rss.xml"
PUBDATE_LOG = "/your LOG dir/cs.CV.log"
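The file check that was left unimplemented could look roughly like the sketch below. The helper name read_last_pub_id is hypothetical; it treats a missing log file the same as an empty one, so the touch step would no longer be needed.

```python
import os


def read_last_pub_id(path):
    """Return the pubID saved by the previous run, or "" if no log exists yet."""
    if not os.path.exists(path):
        return ""
    with open(path, "r") as rf:
        return rf.readline().rstrip("\n")
```

On the first run this returns an empty string, which never equals the feed's updated value, so the program would go on to post and create the log file itself.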

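The feed contents are stored in news_dic in dictionary form, and the necessary information is posted to Twitter with twython. The contents of the arXiv feed at the time of writing are listed below; I left them in the program as a comment.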

news_dic = feedparser.parse(RSS_URL)

"""
news_dic.* : 
updated_parsed
etag
encoding
version
updated
headers
entries
namespaces
bozo
href
status
feed

print(news_dic.updated_parsed)  #time.struct_time(tm_year=2017, tm_mon=8, tm_mday=16, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=2, tm_yday=228, tm_isdst=0)
print(news_dic.etag          )  #"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"
print(news_dic.encoding      )  #us-ascii
print(news_dic.version       )  #rss10
print(news_dic.updated       )  #Wed, 16 Aug 2017 00:30:00 GMT
print(news_dic.headers       )  #{'Expires': 'Thu, 17 Aug 2017 00:00:00 GMT', 'Connection': 'close', 'ETag': '"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"', 'Server': 'Apache', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Type': 'text/xml', 'Content-Length': '15724', 'Date': 'Wed, 16 Aug 2017 06:43:57 GMT', 'Last-Modified': 'Wed, 16 Aug 2017 00:30:00 GMT', 'Content-Encoding': 'gzip'}
print(news_dic.entries       )  #CONTENTS OF RSS FEED!!
print(news_dic.namespaces    )  #{'': 'http://purl.org/rss/1.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'content': 'http://purl.org/rss/1.0/modules/content/', 'sy': 'http://purl.org/rss/1.0/modules/syndication/', 'dc': 'http://purl.org/dc/elements/1.1/', 'admin': 'http://webns.net/mvcb/', 'taxo': 'http://purl.org/rss/1.0/modules/taxonomy/'}
print(news_dic.bozo          )  #0
print(news_dic.href          )  #http://export.arxiv.org/rss/cs.CV/rss.xml
print(news_dic.status        )  #200
print(news_dic.feed          )  
"""

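The program compares pubID with lastPubID to check whether the feed has been updated, and exits if it has not. If the feed has been updated, the file that PUBDATE_LOG points to is overwritten with the new value.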

pubID = news_dic.updated

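# Read the pubID recorded by the previous run
with open(PUBDATE_LOG, "r") as rf:
    lastPubID = rf.readline().rstrip("\n")

# Exit if the feed has not been updated since the previous run
if pubID == lastPubID:
    print("Feed has not been updated; nothing to post.")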
    sys.exit()
else:
    with open(PUBDATE_LOG, "w") as f:
        f.write(pubID + "\n")

Posting to Twitter

Each item in news_dic.entries has the following fields.

title
link
description (abstract)
creator

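The information posted is the title and link of each item in news_dic.entries. Titles can be long, so the title length is capped to make sure the tweet stays within 140 characters and still has room for the URL.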

for i in news_dic.entries:
    # Cap long titles at 100 characters so the link still fits in the tweet
    if len(i.title) > 100:
        message = i.title[0:100] + "......\n" + i.link
    else:
        message = i.title + "\n" + i.link
    #print(len(message))
    #print(message)

    try:
        twitter.update_status(status=message)
    except TwythonError as e:
        print(e)
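The length handling can also be pulled out into a small function, which makes it easy to check without posting anything. This is a sketch rather than part of the final program: build_message, MAX_TITLE, and ELLIPSIS are names I introduce here, and the link in the usage line is a made-up placeholder.

```python
MAX_TITLE = 100      # same cap on the title as in the loop above
ELLIPSIS = "......"  # marker appended to a shortened title


def build_message(title, link):
    """Join title and link with a newline, shortening the title if needed."""
    if len(title) > MAX_TITLE:
        title = title[:MAX_TITLE] + ELLIPSIS
    return title + "\n" + link


# Hypothetical link, used only to show the call.
print(build_message("A short title", "http://arxiv.org/abs/0000.00000"))
```

With this cap, the message is never longer than 107 characters plus the length of the link.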

Result

After some trial and error, the final program is as follows.

twitter_feed_arxiv_cs.CV.py

# coding: utf-8
from twython import Twython, TwythonError
import feedparser
import sys

from auth import (
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

twitter = Twython(
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

RSS_URL = "http://export.arxiv.org/rss/cs.CV/rss.xml"
PUBDATE_LOG = "/<your LOG dir>/cs.CV.log"
"""
Create an empty log file beforehand:
touch cs.CV.log
When running from cron, write PUBDATE_LOG as an absolute path.
"""

news_dic = feedparser.parse(RSS_URL)

"""
news_dic.* : 
updated_parsed
etag
encoding
version
updated
headers
entries
namespaces
bozo
href
status
feed

print(news_dic.updated_parsed)  #time.struct_time(tm_year=2017, tm_mon=8, tm_mday=16, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=2, tm_yday=228, tm_isdst=0)
print(news_dic.etag          )  #"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"
print(news_dic.encoding      )  #us-ascii
print(news_dic.version       )  #rss10
print(news_dic.updated       )  #Wed, 16 Aug 2017 00:30:00 GMT
print(news_dic.headers       )  #{'Expires': 'Thu, 17 Aug 2017 00:00:00 GMT', 'Connection': 'close', 'ETag': '"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"', 'Server': 'Apache', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Type': 'text/xml', 'Content-Length': '15724', 'Date': 'Wed, 16 Aug 2017 06:43:57 GMT', 'Last-Modified': 'Wed, 16 Aug 2017 00:30:00 GMT', 'Content-Encoding': 'gzip'}
print(news_dic.entries       )  #CONTENTS OF RSS FEED!!
print(news_dic.namespaces    )  #{'': 'http://purl.org/rss/1.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'content': 'http://purl.org/rss/1.0/modules/content/', 'sy': 'http://purl.org/rss/1.0/modules/syndication/', 'dc': 'http://purl.org/dc/elements/1.1/', 'admin': 'http://webns.net/mvcb/', 'taxo': 'http://purl.org/rss/1.0/modules/taxonomy/'}
print(news_dic.bozo          )  #0
print(news_dic.href          )  #http://export.arxiv.org/rss/cs.CV/rss.xml
print(news_dic.status        )  #200
print(news_dic.feed          )  
"""

pubID = news_dic.updated

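# Read the pubID recorded by the previous run
with open(PUBDATE_LOG, "r") as rf:
    lastPubID = rf.readline().rstrip("\n")

# Exit if the feed has not been updated since the previous run
if pubID == lastPubID:
    print("Feed has not been updated; nothing to post.")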
    sys.exit()
else:
    with open(PUBDATE_LOG, "w") as f:
        f.write(pubID + "\n")

for i in news_dic.entries:
    # Cap long titles at 100 characters so the link still fits in the tweet
    if len(i.title) > 100:
        message = i.title[0:100] + "......\n" + i.link
    else:
        message = i.title + "\n" + i.link
    #print(len(message))
    #print(message)

    try:
        twitter.update_status(status=message)
    except TwythonError as e:
        print(e)

Tweeting

I ran the following and confirmed that the tweets were posted to my own Twitter account.

$ python3 twitter_feed_arxiv_cs.CV.py

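I also created a separate log file and program for cs.RO, and that worked as well.

Automating the tweets

cron makes the bot tweet once a day. The feed appears to be updated at 00:30:00 GMT (09:30 JST), so I set it to check the feed at 10:00 JST every day.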

$ crontab -e

When the editor opens, add an entry that checks the feed at 10:00 every day. your LOG dir is the directory containing this program.

00 10 * * * python3 /your LOG dir/twitter_feed_arxiv_cs.CV.py >/dev/null 2>&1
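The choice of 10:00 JST can be sanity-checked by converting the feed's updated string into Japan time. The sketch below uses only the standard library; to_jst is a hypothetical helper name, and the sample string is the updated value recorded in the comment earlier in this article.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

JST = timezone(timedelta(hours=9))


def to_jst(updated):
    """Convert an RFC 2822 date string such as the feed's updated value to JST."""
    return parsedate_to_datetime(updated).astimezone(JST)


print(to_jst("Wed, 16 Aug 2017 00:30:00 GMT").strftime("%H:%M"))  # 09:30
```

An update at 00:30 GMT corresponds to 09:30 JST, so a cron entry at 10:00 on a machine set to JST runs about half an hour after the feed changes.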

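Conclusion

The bot can now tweet, but cs.CV and cs.RO together seem to get more than 50 submissions a day, so finding the papers I care about efficiently will require narrowing the posts down further.

Analyzing the title and description strings should make that possible. It might also make a good machine learning exercise.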
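As a starting point well short of machine learning, the posts could be narrowed down with a plain keyword match on the title and description. This is only a sketch of one possible approach: matches_keywords and the keyword list are hypothetical and would need tuning to your own interests.

```python
def matches_keywords(title, description, keywords):
    """Return True if any keyword appears in the title or description."""
    text = (title + " " + description).lower()
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical keywords; replace them with your own interests.
KEYWORDS = ["segmentation", "optical flow"]
print(matches_keywords("Deep Optical Flow Estimation", "", KEYWORDS))  # True
```

In the posting loop, an entry would be tweeted only when this function returns True for its title and description.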

Tweeting the arXiv RSS feed from a Raspberry Pi with Python