
Formatting the Cora dataset

python2.7

Posted at 2016-07-22

The Cora dataset

The Cora dataset is a machine learning dataset that contains paper titles, authors, citations, publication years, and related metadata.
https://sites.google.com/site/semanticbasedregularization/home/software/experiments_on_cora
Below, we format it using Python.

Extracting the title, authors, and publication year

import numpy as np
import pandas as pd
import bs4
import re

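# Read the tab-separated papers file; it has no header row.
content=pd.read_table('./cora/papers', header=None)
# Drop rows that contain missing values.
content_NoNan=content.dropna()
# Column 2 holds the tagged metadata string, column 0 the paper ID.
tag=content_NoNan[2]
paperID=content_NoNan[0]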
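
To make the loading step concrete, here is a minimal sketch on an in-memory sample, written for Python 3. The three-column layout (ID, file name, tagged metadata) and the sample rows are assumptions for illustration, not lines taken from the actual Cora file.

```python
import io

import pandas as pd

# Hypothetical sample: paper ID, file name, tagged metadata (tab-separated).
sample = (
    "1\tfile1.ps\t<author>A. Author</author> <title>First Paper</title> <year>1995</year>\n"
    "2\tfile2.ps\t\n"  # the metadata field is empty, so pandas reads it as NaN
    "3\tfile3.ps\t<author>B. Writer</author> <title>Third Paper</title> <year>1997</year>\n"
)

# header=None makes pandas label the columns 0, 1, 2.
content = pd.read_table(io.StringIO(sample), header=None)
content_no_nan = content.dropna()  # drops the row for paper 2

paper_ids = content_no_nan[0].tolist()
tags = content_no_nan[2].tolist()
print(paper_ids)
```

Because `dropna()` removes a row when any column is missing, records without metadata never reach the parsing step.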

Here, we collect all of the <title>, <author>, and <year> tags.

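title_pre=[]
authors_pre=[]
year_pre=[]
for line in tag:
    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
    soup=bs4.BeautifulSoup(line, 'html.parser')
    # find_all returns a (possibly empty) list of matching tags.
    title_pre.append(soup.find_all('title'))
    authors_pre.append(soup.find_all('author'))
    year_pre.append(soup.find_all('year'))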
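
As a small illustration of what the loop above collects, the sketch below parses a single metadata string. The sample string and the `volume` tag name are hypothetical and chosen only to show one match and one miss.

```python
import bs4

# Hypothetical metadata string in the tagged style the loop above expects.
line = "<author>A. Author</author> <title>First Paper</title> <year>1995</year>"

soup = bs4.BeautifulSoup(line, "html.parser")

titles = soup.find_all("title")    # list-like ResultSet of matching tags
volumes = soup.find_all("volume")  # no such tag in this string, so it is empty

first_title = titles[0].get_text()  # text content without the surrounding tag
print(first_title, len(volumes))
```

Since `find_all` can return an empty result, any later code that indexes `[0]` should check the length first.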

Remove the tags.

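paper_info={}  # paperID -> [title, authors, year]
p = re.compile(r"<[^>]*?>")  # matches any single tag
for ID,i,j,k in zip(paperID, title_pre, authors_pre, year_pre):
    # Skip records that lack a title, author, or year tag.
    if len(i) != 0 and len(j) != 0 and len(k) != 0:
        title=p.sub("", str(i[0]))
        authors=p.sub("", str(j[0]))
        year=p.sub("", str(k[0]))
        paper_info[ID]=[title, authors, year]

paper_info is now a dictionary that maps each paperID to its title, authors, and year.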
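
The tag-stripping step and the resulting lookup can be tried in isolation with only the standard library. This is a minimal sketch; the records and the names `tag_pattern`, `records`, and `cleaned` are made up for illustration.

```python
import re

tag_pattern = re.compile(r"<[^>]*?>")  # same pattern as in the article

# Hypothetical records: paper ID -> (title markup, author markup, year markup).
records = {
    1: ("<title>First Paper</title>", "<author>A. Author</author>", "<year>1995</year>"),
    3: ("<title>Third Paper</title>", "<author>B. Writer</author>", "<year>1997</year>"),
}

cleaned = {}
for paper_id, fields in records.items():
    # Replace every tag with the empty string, keeping only the text.
    cleaned[paper_id] = [tag_pattern.sub("", field) for field in fields]

print(cleaned[1])
```

Looking up a paper ID returns its three fields as plain strings. Note that the year stays a string unless it is converted explicitly.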
