

Cleaning Up the Cora Dataset


The Cora dataset

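The Cora dataset is a machine-learning dataset of research papers that includes each paper's title, authors, citations, publication year, and so on.
https://sites.google.com/site/semanticbasedregularization/home/software/experiments_on_cora
Below, we clean it up with Python.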

Extracting the title, authors, and publication year

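import re

import bs4
import pandas as pd

# ./cora/papers is tab-separated and has no header row.
content = pd.read_table('./cora/papers', header=None)
# Drop records that have a missing field.
content_NoNan = content.dropna()
tag = content_NoNan[2]      # tagged text containing <title>, <author>, <year>
paperID = content_NoNan[0]  # paper identifier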
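
To see what the loading step does, the following minimal sketch runs the same calls on two made-up tab-separated lines instead of the real file. The sample text, the file names in it, and the three-column layout (ID, file name, tagged text) are assumptions for illustration, inferred only from the fact that the code above reads columns 0 and 2.

```python
import io

import pandas as pd

# Hypothetical sample: the second record has an empty third field.
sample = (
    "1\tpaper1.ps\t<author> A. Author </author> "
    "<title> A hypothetical title </title> <year> 1995 </year>\n"
    "2\tpaper2.ps\t\n"
)

# header=None because the data has no header row; columns are numbered 0, 1, 2.
content = pd.read_table(io.StringIO(sample), header=None)

# The empty field is read as a missing value, so dropna() removes record 2.
content_no_nan = content.dropna()

print(content.shape)
print(content_no_nan.shape)
print(list(content_no_nan[0]))
```

Because dropna() removes a row when any column is missing, a record with an empty file-name column would also be dropped, even if its tagged text is intact.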

Here we collect every <title>, <author>, and <year> element from each record.

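title_pre = []
authors_pre = []
year_pre = []
for line in tag:
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = bs4.BeautifulSoup(line, 'html.parser')
    # find_all returns a list of matches, which is empty when the tag is absent.
    title_pre.append(soup.find_all('title'))
    authors_pre.append(soup.find_all('author'))
    year_pre.append(soup.find_all('year'))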
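
The later code indexes each result with [0] because find_all returns a list-like result even when there is only one match. The sketch below shows this on a single made-up line; the sample text and the 'venue' tag are hypothetical, and 'html.parser' is chosen here only so that the example does not depend on an optional parser being installed.

```python
import bs4

# Hypothetical record in the same tagged style as the third column.
line = (
    "<author> A. Author </author> "
    "<title> A hypothetical title </title> <year> 1995 </year>"
)

soup = bs4.BeautifulSoup(line, "html.parser")
titles = soup.find_all("title")    # list-like result with one element
missing = soup.find_all("venue")   # a tag that does not occur in this line

print(len(titles), len(missing))
print(str(titles[0]))         # the element rendered with its tags
print(titles[0].get_text())   # the text content only
```

An empty result is falsy, so a check such as `if titles:` is enough to skip records in which a tag is absent.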

Strip the tags.

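# Map paperID -> [title, authors, year]; the name papers avoids shadowing the built-in dict.
papers = {}
p = re.compile(r"<[^>]*?>")  # matches a single tag such as <title> or </title>
for ID, i, j, k in zip(paperID, title_pre, authors_pre, year_pre):
    # Keep only records in which all three elements were found.
    if i and j and k:
        title = p.sub("", str(i[0]))
        authors = p.sub("", str(j[0]))
        year = p.sub("", str(k[0]))
        papers[ID] = [title, authors, year]

The resulting papers dictionary maps each paperID to its title, authors, and year.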
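
As a quick check of the tag-stripping step, here is a minimal sketch that applies the same regular expression to one made-up record. The strings and the ID 1 are hypothetical. The extra .strip() is an addition of this sketch, not part of the code above: the pattern removes only the tags, so any whitespace around the text is left in place.

```python
import re

TAG = re.compile(r"<[^>]*?>")  # same pattern as in the article's code


def strip_tags(markup: str) -> str:
    """Remove every tag, then trim leftover whitespace (the trim is this sketch's addition)."""
    return TAG.sub("", markup).strip()


# Hypothetical elements, written the way str() renders a parsed element.
record = {
    1: [
        strip_tags("<title> A hypothetical title </title>"),
        strip_tags("<author> A. Author </author>"),
        strip_tags("<year> 1995 </year>"),
    ]
}

title, authors, year = record[1]
print(title, authors, year, sep=" | ")
```

Whether to trim is a design choice: keeping the raw text stays closest to the source file, while trimming makes later comparisons and lookups on titles or years less error-prone.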
